Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX

From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>,
	Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Subject: Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Wed, 21 Apr 2021 15:02:00 +0000
Message-ID: <VE1PR08MB5599992190ACACB4CBFC673A83479@VE1PR08MB5599.eurprd08.prod.outlook.com>
In-Reply-To: <TYAPR01MB602588E784D1D9C87F7216B6DF479@TYAPR01MB6025.jpnprd01.prod.outlook.com>

Hi Naohiro,

> It's really smart way, isn't it? 😊

Well that's the point of SVE!

> I re-implemented the macro " shortcut_for_small_size" using the whilelo, and
> please check it [1][2] if understood correctly.

Yes, it works fine. You should still remove the check for zero at entry (which is really slow
and unnecessary) and the argument moves. L2 doesn't need the ptrue; all it needs
is MOV dest_ptr, dst.
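
As an illustration of why the zero check is redundant, here is a minimal
portable C model of a WHILELO-predicated small copy. It is a sketch, not the
patch's code: the names whilelo_model, copy_predicated and copy_small_model
are hypothetical, and VL assumes 64-byte (512-bit) vectors as on A64FX. With
n == 0 every lane is inactive, so the predicated load/store pair touches no
memory and no branch on the size is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VL 64 /* assumed vector length in bytes (512-bit SVE) */

/* Model of WHILELO: lane i is active while base + i < limit. */
static void whilelo_model(unsigned char pred[VL], size_t base, size_t limit)
{
    for (size_t i = 0; i < VL; i++)
        pred[i] = (unsigned char)(base + i < limit);
}

/* Model of a predicated load followed by a predicated store:
   only active lanes are read or written. */
static void copy_predicated(unsigned char *dst, const unsigned char *src,
                            size_t base, const unsigned char pred[VL])
{
    for (size_t i = 0; i < VL; i++)
        if (pred[i])
            dst[base + i] = src[base + i];
}

/* Copy 0..2*VL bytes without branching on n. */
static void *copy_small_model(void *dst, const void *src, size_t n)
{
    unsigned char p0[VL], p1[VL];

    whilelo_model(p0, 0, n);  /* lanes for bytes 0..VL-1 */
    whilelo_model(p1, VL, n); /* lanes for bytes VL..2*VL-1 */
    copy_predicated((unsigned char *)dst, (const unsigned char *)src, 0, p0);
    copy_predicated((unsigned char *)dst, (const unsigned char *)src, VL, p1);
    return dst;
}
```

Calling copy_small_model(dst, src, 0) leaves dst untouched, and any size up
to 2*VL copies exactly n bytes.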

> The performance of "whilelo dispatch" [3] is almost same as "binary tree dispatch" [4]
> but I notice that there are gaps at 128 byte and at 256 byte [3].

From what I can see, the new version is faster across the full range. It would be useful to show
both new and old in the same graph rather than separately. You can do that by copying the file
and using different names for the functions. I do this all the time as it allows direct comparison
of several variants in one benchmark run.
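
The idea of giving each variant its own name and timing them in one run can
be sketched in C as below. This is not the glibc benchtests harness: the
functions memcpy_old and memcpy_new are hypothetical stand-ins for the two
renamed implementations, and clock() is used only to keep the sketch
portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical stand-in for the old variant: a plain byte loop. */
static void *memcpy_old(void *d, const void *s, size_t n)
{
    unsigned char *dp = d;
    const unsigned char *sp = s;
    for (size_t i = 0; i < n; i++)
        dp[i] = sp[i];
    return d;
}

/* Hypothetical stand-in for the new variant: defers to the libc memcpy. */
static void *memcpy_new(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Every variant gets a distinct name so all of them fit in one table. */
static const struct { const char *name; memcpy_fn fn; } variants[] = {
    { "memcpy_old", memcpy_old },
    { "memcpy_new", memcpy_new },
};

/* Time iters copies of size bytes; a real benchmark would also have to
   stop the compiler from optimizing the copies away. */
static double time_variant(memcpy_fn fn, size_t size, int iters)
{
    static unsigned char src[512], dst[512];
    assert(size <= sizeof src);
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        fn(dst, src, size);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Print one line per variant so the results can go into the same graph. */
static void run_all(size_t size, int iters)
{
    for (size_t i = 0; i < sizeof variants / sizeof variants[0]; i++)
        printf("%-12s size=%zu: %f s\n", variants[i].name, size,
               time_variant(variants[i].fn, size, iters));
}
```

Calling run_all(256, 1000000) prints one timing per variant for the same
size, which is the side-by-side comparison described above.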

That said, the dip at 256+64 looks fairly substantial. It could be throughput of WHILELO - to test
that you could try commenting out the long WHILELO sequence for the 256-512 byte case and
see whether it improves. If it is WHILELO, it is possible to remove 3x WHILELO from the earlier
cases by moving them after a branch (so that the 256-512 case only needs to execute 5x WHILELO
rather than 8 in total). Also it is worth checking if the 256-512 case beats jumping directly
to L(unroll4) - however note that code isn't optimized yet (e.g. there is no need for complex
software pipelined loops since we can only iterate once!). If all that doesn't help, it may be
best to split into 256-384 and 384-512 so you only need 2x WHILELO.
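
The "2x WHILELO" figure for the split ranges can be checked with a little
arithmetic, sketched in C below. This is one reading of the suggestion, under
two assumptions not stated in the mail: the vector length is 64 bytes
(512-bit SVE), and vectors that every size in a range covers completely can
use an all-true predicate instead of WHILELO. The helper name
variable_predicates is hypothetical. The counts for the unsplit 256-512 path
depend on how the dispatch code is arranged and are not modelled here.

```c
#include <assert.h>
#include <stddef.h>

#define VL 64 /* assumed vector length in bytes (512-bit SVE) */

/* Number of vectors whose predicate depends on the actual size n,
   for sizes between lo and hi bytes. */
static size_t variable_predicates(size_t lo, size_t hi)
{
    size_t full = lo / VL;             /* vectors covered by every size in range */
    size_t total = (hi + VL - 1) / VL; /* vectors needed for the largest size */
    return total - full;
}
```

For 256-384 bytes the first four vectors are always full and only vectors 4
and 5 vary with the size; for 384-512 bytes the first six are always full and
only vectors 6 and 7 vary. Each half therefore needs two size-dependent
predicates.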

> I checked bench-memcpy-random [5], but it measures the performance from the size
> 4K byte to 512K byte.
> How do we know the branch issue for less than 512 byte?

The size is the size of the memory region tested, not the size of the copies. The actual copies
are very small (90% are smaller than 128 bytes). The key is that it doesn't repeat the same copy
over and over so it's hard on the branch predictor just like in a real application.
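
The difference between region size and copy size can be sketched in C as
follows. This is not the code of bench-memcpy-random: the names copy_desc,
make_copies and run_copies are hypothetical, and the length distribution is
made up so that roughly 90% of copies are below 128 bytes. The point is only
that the region is large while each copy is small and differs from the
previous one in size and alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define REGION_SIZE 4096 /* size of the memory region, not of a copy */
#define NUM_COPIES 1024

struct copy_desc { size_t src_off, dst_off, len; };

/* Fill a table with random small copies that stay inside the region. */
static void make_copies(struct copy_desc *c, size_t n, size_t region,
                        unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        /* Illustrative distribution: about 9 in 10 copies are capped
           below 128 bytes, the rest below 512 bytes. */
        size_t max_len = (rand() % 10 < 9) ? 128 : 512;
        c[i].len = (size_t)rand() % max_len;
        c[i].src_off = (size_t)rand() % (region - c[i].len);
        c[i].dst_off = (size_t)rand() % (region - c[i].len);
    }
}

/* Replay the table: consecutive copies differ in size and offset, so the
   size-dispatch branches in memcpy are hard to predict. */
static void run_copies(unsigned char *dst, const unsigned char *src,
                       const struct copy_desc *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        memcpy(dst + c[i].dst_off, src + c[i].src_off, c[i].len);
}
```

Timing run_copies over a table like this exercises the small-size paths with
an unpredictable mix, unlike a loop that repeats one fixed-size copy.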

Cheers,
Wilco
