RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX

From: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
To: 'Wilco Dijkstra' <Wilco.Dijkstra@arm.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>,
	Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 19 Apr 2021 02:51:39 +0000
Message-ID: <TYAPR01MB6025EB6F9AD5EB730761F46EDF499@TYAPR01MB6025.jpnprd01.prod.outlook.com>
In-Reply-To: <VE1PR08MB5599AFAEFDA55471AF1C648C834E9@VE1PR08MB5599.eurprd08.prod.outlook.com>

Hi Wilco-san,

In this mail, let me focus on the macro "shortcut_for_small_size", which handles
small/medium sizes of less than 512 bytes.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> > SVE code seems faster than ASIMD in small/medium range too [11][12][13].
> 
> That adds quite a lot of code and uses a slow linear chain of comparisons. A small
> loop like used in the memset should work fine to handle copies smaller than
> 256 or 512 bytes (you can handle the zero bytes case for free in this code rather
> than special casing it).
> 

I compared the performance for sizes of less than 512 bytes across the following
five implementation cases.

CASE 1: linear chain
As mentioned in the reply [0], I removed BTI_J [1], but the macro "shortcut_for_small_size"
still uses a linear chain of comparisons [2].
A64FX performance is 4-14 Gbps [3].
The other arch implementations call BTI_J, so their performance is degraded.

[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125079.html
[1] https://github.com/NaohiroTamura/glibc/commit/7d7217b518e59c78582ac4e89cae725cf620877e
[2] https://github.com/NaohiroTamura/glibc/blob/7d7217b518e59c78582ac4e89cae725cf620877e/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[3] https://drive.google.com/file/d/16qo7N05W526H9j7_9qjm-_Q7gZmOXwpY/view

CASE 2: whilelt loop, as in memset
I tested a "whilelt loop" implementation in place of the macro "shortcut_for_small_size".
After testing, I commented out the "whilelt loop" implementation [4].
Compared with CASE 1, A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5].
Please note that the "whilelt loop" implementation cannot be used for memmove,
because it doesn't work for backward copy.
On the other hand, the macro "shortcut_for_small_size" works for backward copy, because
it loads all of the data, up to 512 bytes, into the SVE registers z0 to z7 at once, and
then stores all of it.

[4] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR308-R318
[5] https://drive.google.com/file/d/1xdw7mr0c90VupVkQwelFafQHNkXslCwv/view
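
The memmove point in CASE 2 can be illustrated with a minimal sketch in portable C. This is not the SVE assembly from [2] or [4]: the 64-byte CHUNK stands in for one 512-bit vector, the local array stands in for z0 to z7, and both function names are hypothetical. The chunked loop stores each chunk before loading the next, so an overlapping destination ahead of the source can overwrite bytes that have not been read yet; loading everything first avoids that. The loop also handles a size of zero without a special case, as suggested in the quoted reply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 64        /* assumption: models one 512-bit SVE vector */
#define MAX_SMALL 512   /* assumption: models eight vectors, z0..z7 */

/* Forward chunked copy: chunk i is stored before chunk i+1 is loaded.
   With n == 0 the loop body never runs. */
static void copy_chunk_loop(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = n - off < CHUNK ? n - off : CHUNK;
        unsigned char v[CHUNK];
        memcpy(v, src + off, len);   /* load one "vector" */
        memcpy(dst + off, v, len);   /* store it right away */
    }
}

/* Load all data into the "registers" first, then store all of it.
   Safe for any overlap between src and dst. */
static void copy_load_all_then_store(unsigned char *dst,
                                     const unsigned char *src, size_t n)
{
    unsigned char regs[MAX_SMALL];
    assert(n <= MAX_SMALL);
    memcpy(regs, src, n);            /* all loads happen here */
    memcpy(dst, regs, n);            /* all stores happen afterwards */
}
```

For example, copying 128 bytes to a destination 16 bytes ahead of the source gives the memmove result with copy_load_all_then_store, but not with copy_chunk_loop, because the second chunk is loaded after the first store has already modified it.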

CASE 3: binary tree chain
I updated the macro "shortcut_for_small_size" to use a binary tree chain [6][7].
Compared with CASE 1, sizes of less than 96 bytes degraded from 4.0-6.0 Gbps
to 2.5-5.0 Gbps, but the 512 byte size improved from 14.0 Gbps to 17.5 Gbps [8].

[6] https://github.com/NaohiroTamura/glibc/commit/5c17af8c57561ede5ed2c2af96c9efde4092f02f
[7] https://github.com/NaohiroTamura/glibc/blob/5c17af8c57561ede5ed2c2af96c9efde4092f02f/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L177-L204
[8] https://drive.google.com/file/d/13w8yKdeLpVbp-uJmCttKBKtScya1tXqP/view
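
One possible reading of the CASE 1 versus CASE 3 trade-off is the number of comparisons taken before the copy starts. The sketch below is a simplified model in portable C, not the macro in [2] or [7]: it assumes the dispatch picks how many 64-byte vectors (1 to 8) a size needs, and the names pick_linear and pick_binary are hypothetical. The real macros may be structured differently.

```c
#include <assert.h>
#include <stddef.h>

#define VL 64  /* assumption: vector length in bytes (512-bit SVE) */

/* Linear chain: test n <= 64, n <= 128, ... in order.
   Returns the vector count and reports the comparisons taken. */
static int pick_linear(size_t n, int *cmps)
{
    int c = 0;
    assert(n >= 1 && n <= 8 * VL);
    for (int k = 1; k < 8; k++) {
        c++;
        if (n <= (size_t)k * VL) {
            *cmps = c;
            return k;
        }
    }
    *cmps = c;      /* all seven tests failed: eight vectors */
    return 8;
}

/* Binary tree: each comparison halves the candidate range [lo, hi]. */
static int pick_binary(size_t n, int *cmps)
{
    int lo = 1, hi = 8, c = 0;
    assert(n >= 1 && n <= 8 * VL);
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        c++;
        if (n <= (size_t)mid * VL)
            hi = mid;
        else
            lo = mid + 1;
    }
    *cmps = c;
    return lo;
}
```

In this model the linear chain needs one comparison for sizes up to 64 bytes and seven for 512 bytes, while the binary tree always needs three, which is at least consistent with small sizes getting slower and 512 bytes getting faster.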

CASE 4: binary tree chain, except for sizes up to 64 bytes
I handled sizes of up to 64 bytes separately so as to return quickly [9].
Compared with CASE 3, sizes of less than 64 bytes improved from 2.5 Gbps to
4.0 Gbps, but the 512 byte size degraded from 17.5 Gbps to 16.5 Gbps [10].

[9] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR177-R184
[10] https://drive.google.com/file/d/1lFsjns9g_7fySAsvx_RVS9o6HSrk6ir9/view

CASE 5: binary tree chain, except for sizes up to 128 bytes
I handled sizes of up to 128 bytes separately so as to return quickly [11].
Compared with CASE 4, sizes of less than 128 bytes improved from 4.0-6.0 Gbps
to 4.0-7.0 Gbps, but the 512 byte size degraded from 16.5 Gbps to 16.0 Gbps [12].

[11] https://github.com/NaohiroTamura/glibc/commit/fefc59f01ecfd6a207fe261de5ab133f4409d687#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR184-R195
[12] https://drive.google.com/file/d/1HS277_qQUuEeZqLUo0H2XRlFhOhIdI_o/view

In conclusion, I'd like to adopt the CASE 5 implementation, considering the
performance balance between small sizes (less than 128 bytes) and medium sizes
(close to 512 bytes).

Thanks.
Naohiro
