Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX

From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>,
	Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Subject: Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 19 Apr 2021 14:57:21 +0000
Message-ID: <VE1PR08MB5599169922D816320160F5A383499@VE1PR08MB5599.eurprd08.prod.outlook.com>
In-Reply-To: <TYAPR01MB6025EB6F9AD5EB730761F46EDF499@TYAPR01MB6025.jpnprd01.prod.outlook.com>

Hi Naohiro,

> Let me focus on the macro "shortcut_for_small_size" for small/medium sizes, less than
> 512 bytes, in this mail.

Yes, one subject at a time is a good idea.

> Compared with CASE 1, A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5].
> Please note that the "whilelt loop" implementation cannot be used for memmove,
> because it doesn't work for backward copy.

Indeed, the memmove code would need a similar loop but backwards. However, it sounds like
small loops are not efficient (possibly due to a high taken-branch penalty), so that is not a good option.
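
To illustrate what "a similar loop but backwards" means, here is a minimal scalar C
sketch. It is not the proposed SVE code: the name copy_backward_chunks, the 64-byte
CHUNK size and the temporary buffer standing in for a vector register are all
assumptions made for illustration. Each chunk is loaded in full before it is stored,
and chunks are processed from the highest address down, so the copy stays correct
when dest overlaps src at a higher address.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: one 64-byte "vector" per iteration. */
enum { CHUNK = 64 };

/* Hypothetical helper: copy n bytes from src to dest, walking from the
   end of the buffers towards the start. Safe when dest > src and the
   two regions overlap. */
static void copy_backward_chunks(unsigned char *dest,
                                 const unsigned char *src, size_t n)
{
    unsigned char tmp[CHUNK];   /* stands in for a vector register */
    size_t end = n;

    while (end > 0) {
        /* The last iteration handles the partial chunk at the lowest address. */
        size_t len = end < CHUNK ? end : CHUNK;
        size_t start = end - len;

        memcpy(tmp, src + start, len);    /* "load" the whole chunk first */
        memcpy(dest + start, tmp, len);   /* then "store" it */
        end = start;
    }
}
```

The loop still takes one branch per chunk, which is the cost the paragraph above
is concerned about.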

> In conclusion, I'd like to adopt the CASE 5 implementation, considering the
> performance balance between small sizes (less than 128 bytes) and medium sizes
> (close to 512 bytes).

Yes, something like this would work. I would strip out any unnecessary instructions and merge
multiple cases to avoid branches as much as possible. For example, start memcpy like this:

memcpy:
   cntb        vector_length
   whilelo     p0.b, xzr, n    // gives a free ptrue for N >= VL
   whilelo     p1.b, vector_length, n
   b.last      1f              // last lane of p1 active: N >= 2 * VL (label 1 not shown)
   ld1b        z0.b, p0/z, [src]
   ld1b        z1.b, p1/z, [src, #1, mul vl]
   st1b        z0.b, p0, [dest]
   st1b        z1.b, p1, [dest, #1, mul vl]
   ret

The proposed CASE 5 uses 13 instructions for up to 64 bytes and 19 for up to 128, whereas
the above handles 0-127 bytes (with A64FX's 64-byte vectors) in 9 instructions. You can see
the code is perfectly balanced, with 4 load/store instructions, 3 ALU instructions and 2 branches.
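
For readers less familiar with SVE predicates, the following portable C sketch models
what the listing above does. It is a scalar model under stated assumptions, not the
SVE implementation: VL is fixed at 64 bytes (the A64FX vector length), and the names
model_whilelo and model_small_copy are hypothetical. A lane of "whilelo p.b, base, n"
is active when base + lane < n, and only active lanes are loaded or stored.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte vectors, as on A64FX (512-bit SVE). */
enum { VL = 64 };

/* Scalar model of "whilelo p.b, base, n": lane i is active iff base + i < n. */
static void model_whilelo(uint8_t pred[VL], size_t base, size_t n)
{
    for (size_t i = 0; i < VL; i++)
        pred[i] = (uint8_t)(base + i < n);
}

/* Model of the 0 .. 2*VL-1 byte path. Returns 0 when the "b.last" branch
   would be taken (n >= 2*VL), meaning a larger-size path must handle it. */
static int model_small_copy(uint8_t *dest, const uint8_t *src, size_t n)
{
    uint8_t p0[VL], p1[VL], z0[VL], z1[VL];

    model_whilelo(p0, 0, n);
    model_whilelo(p1, VL, n);
    if (p1[VL - 1])                 /* b.last: last lane of p1 is active */
        return 0;

    /* Both predicated loads happen before either store, as in the listing.
       Inactive lanes are zeroed (/z) and never touch memory. */
    for (size_t i = 0; i < VL; i++) {
        z0[i] = p0[i] ? src[i] : 0;
        z1[i] = p1[i] ? src[VL + i] : 0;
    }

    /* Predicated stores: only active lanes are written. */
    for (size_t i = 0; i < VL; i++) {
        if (p0[i]) dest[i] = z0[i];
        if (p1[i]) dest[VL + i] = z1[i];
    }
    return 1;
}
```

The point of the model is that one straight-line sequence covers every length from
0 to 2*VL-1, with the single b.last branch as the only size check.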

Rather than doing a complex binary search, we can use the same trick to merge the code
for 128-256 and 256-512. So overall we only need 2 comparisons, which we can write like:

   cmp         n, vector_length, lsl 3
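
One possible reading of the two-comparison dispatch, as a small C sketch. The enum,
the function name classify_copy and the exact boundary handling (whether exactly
8 * VL bytes counts as medium) are assumptions, not taken from the patch; the first
check corresponds to the b.last test in the listing above and the second to the cmp
against vector_length shifted left by 3.

```c
#include <stddef.h>

/* Hypothetical size classes for illustration only. */
enum size_class { COPY_SMALL, COPY_MEDIUM, COPY_LARGE };

/* Classify a copy of n bytes for a vector length of vl bytes. */
static enum size_class classify_copy(size_t n, size_t vl)
{
    if (n < 2 * vl)          /* b.last not taken: fewer than 2 vectors */
        return COPY_SMALL;
    if (n <= (vl << 3))      /* cmp n, vector_length, lsl 3 (boundary assumed) */
        return COPY_MEDIUM;
    return COPY_LARGE;
}
```

With vl = 64 this gives small for 0-127 bytes, medium for 128-512 and large above
that, so the medium path needs no further size branches if it uses predicated
loads and stores in the same way as the small path.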

As I mentioned before, it is a really good idea to run bench-memcpy-random, since it
will clearly show issues with branch prediction on small copies. For memcpy and related
functions you want to minimize branches and only use branches that are heavily biased.

Cheers,
Wilco
