public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
To: 'Wilco Dijkstra' <Wilco.Dijkstra@arm.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>,
	Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Thu, 6 May 2021 10:01:15 +0000	[thread overview]
Message-ID: <TYAPR01MB6025DC532F3C09A0E121188CDF589@TYAPR01MB6025.jpnprd01.prod.outlook.com> (raw)
In-Reply-To: <VE1PR08MB55993A255FE21BA012D010C3835F9@VE1PR08MB5599.eurprd08.prod.outlook.com>

Hi Wilco,

Thanks for the comments, I applied all of your comments to both
memcpy/memmove and memset except (3) alignment code for memset.
The latest code became memcpy/memove [1] and memset [2] in the
patch-20210317 [3] branch by evaluating the performance data as shown
below.

[1] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memset_a64fx.S
[3] https://github.com/NaohiroTamura/glibc/commits/patch-20210317

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> I've only looked at memcpy so far. My comments on memcpy:
> 
> (1) Improve the tail code in unroll4/2/1/last to do the reverse of
>     shortcut_for_small_size - basically there is no need for loops or lots of
> branches.
>

I updated the tail code both memcpy/memmove [4] and memset [5], and
replaced small size code of memset [5].
The performance is shown as "whilelo" in Google Sheet Graph for
memcpy/memmove [6] and memset [7].

[4] https://github.com/NaohiroTamura/glibc/commit/f7d9d7b22814affdd89cf291905b9c6601e2031d
[5] https://github.com/NaohiroTamura/glibc/commit/b79d6731f800a56be66c895c035b791ca5176bbb
[6] https://docs.google.com/spreadsheets/d/1Rh-bwF6dpWqoOCbL2epogUPn4I2Emd0NiFgoEOPaujM/edit
[7] https://docs.google.com/spreadsheets/d/1TS0qFhyR_06OyqaRHYAdCKxwvRz7f1T8jI7Pu6x2GIk/edit

> (2) Rather than start with L2, check for n > L2_SIZE && vector_length == 64 and
>     start with the vl_agnostic case. Copies > L2_SIZE will be very rare so it's best
> to
>     handle the common case first.
> 

I changed the order both both memcpy/memmove [8] and memset [9].
The performance is shown as "agnostic1st" in Google Sheet Graph for
memcpy/memmove [6] and memset [7].

[8] https://github.com/NaohiroTamura/glibc/commit/c0d7e39aa4aefe3d7b7d2a8a7c220150a0eb78fe
[9] https://github.com/NaohiroTamura/glibc/commit/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4

> (3) The alignment code can be significantly simplified. Why not just process
>     4 vectors unconditionally and then align the pointers? That avoids all the
>     complex code and is much faster.
> 

In terms of memcpy/memmove, I tried 4 patterns, "simplifiedL2algin"[10], 
" simplifiedL2algin2"[11], "agnosticVLalign"[12], and "noalign"[13] as shown in
Google Sheet Graph [14].

"simplifiedL2algin"[10] simplified to 4 whilelo, " simplifiedL2algin2"[11] simplified
to 2 whilelo or 4 whilelo, "agnosticVLalign"[12] added alignment code to L(vl_agnostic),
and "noalign"[13] removed all alignments.

"agnosticVLalign"[12] and "noalign"[13] didn't improve the performance, so these
commits are kept in the patch-20210317-memcpy-alignment branch [15]

[10] https://github.com/NaohiroTamura/glibc/commit/dd4ede78ec4d74e61a4dc3166fc8586168c4e410
[11] https://github.com/NaohiroTamura/glibc/commit/dd246ff01d59e4e91d10261cd070baae07c0093e
[12] https://github.com/NaohiroTamura/glibc/commit/35b8057d91024bf41595d38d94b2c3c76bdfd6b0
[13] https://github.com/NaohiroTamura/glibc/commit/b1f16f3e738152a5c0f3441201058b48901b4910
[14] https://docs.google.com/spreadsheets/d/1REBslxd56kMDMiXHAtRkBn4IaUO7AVmgvGldJl5qc58/edit
[15] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memcpy-alignment

In terms of memset, I tried 4 patterns too, " VL/CL-align "[16], "CL-align"[17],
"CL-align2"[18] and "noalign"[19] as shown in Google Sheet Graph [20].

" VL/CL-align "[16] simplified to 1 whilelo for VL and 3 whilelo for CL,
"CL-align"[17] simplified to 4 whilelo, "CL-align2"[18] simplified to 2 whilelo or
4 whilelo, and "noalign"[19] removed all alignments.

As shown in Google Sheet Graph [20] all of 4 patters didn't improve the
performance, so these commits are kept in the
patch-20210317-memset-alignment branch [21]

[16] https://github.com/NaohiroTamura/glibc/commit/2405b67a6bb8b380476967e150b35f10e0f25fe3
[17 https://github.com/NaohiroTamura/glibc/commit/a01a8ef08f3b53a691502538dabce3d5941790ff
[18] https://github.com/NaohiroTamura/glibc/commit/c8eb4467acbc97890a4f76f716a88d2dd901e083
[19] https://github.com/NaohiroTamura/glibc/commit/01ff56a9e558d650b09b0053adbc3215d269d65f
[20] https://docs.google.com/spreadsheets/d/1qT0ZkbrrL3fpEyfdjr23cbtanNyPFKN8xDo6E9Mb_YQ/edit
[21] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memset-alginment

> (4) Is there a benefit of aligning src or dst to vector size in the vl_agnostic case?
>     If so, it would be easy to align to a vector first and then if n > L2_SIZE do the
>     remaining 3 vectors to align to a full cacheline.
> 

As tried in (3), "agnosticVLalign"[12] didn't improve the performance.

> (5) I'm not sure I understand the reason for src_notag/dest_notag. However if
>     you want to ignore tags, just change the mov src_ptr, src into AND that
>     clears the tag. There is no reason to both clear the tag and also keep the
>     original pointer and tag.
> 

A64FX has Fujitsu's proprietary enhancement regarding tag address.
I removed dest_notag/src_notag macro and simplified L(dispatch) [22]
"src" address has to be kept to jump to L(last)[23].

[22] https://github.com/NaohiroTamura/glibc/commit/519244f5058d0aa98634bb544bae3358f0b7b07c
[23] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L399

> For memmove I would suggest to merge it with memcpy to save ~100 instructions.
> I don't understand the complexity of the L(dispatch) code - you just need a simple
> 3-instruction overlap check that branches to bwd_unroll8.
> 

I simplified the he L(dispatch) code to 3 instructions[24] in the commit[23]. 

[24] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L368-L370

> I haven't looked at memset, but pretty much all the improvements apply there too.

So please review the latest memset [2].

> >> I think the best option for now is to change BTI_C into NOP if
> >> AARCH64_HAVE_BTI is not set. This avoids creating alignment issues in
> >> existing code (which is written to assume the hint is present) and works for all
> string functions.
> >
> > I updated sysdeps/aarch64/sysdep.h following your advice [1].
> >
> > [1]
> > https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0
> > c82cb70339294386
> 
> I meant using an actual NOP in the #else case so that existing string functions
> won't change. Also note the #defines in the #if and #else need to be indented.
> 

I've read the mail thread regarding BTI, but I think I couldn't fully understand the
problem. BTI seems available from ARMv8.5, and A64FX is ARMv8.2.
Even though distro distributed BTI enabled binary, BTI doesn't work on A64FX.
So BTI_J macro can be removed from A64FX IFUNC code at least, because A64FX
IFUNC code is executed only on A64FX.
Are we discussing the BTI_C code which is not in IFUNC code?

Thanks.
Naohiro


  parent reply	other threads:[~2021-05-06 10:01 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-12 12:52 Wilco Dijkstra
2021-04-12 18:53 ` Florian Weimer
2021-04-13 12:07 ` naohirot
2021-04-14 16:02   ` Wilco Dijkstra
2021-04-15 12:20     ` naohirot
2021-04-20 16:00       ` Wilco Dijkstra
2021-04-27 11:58         ` naohirot
2021-04-29 15:13           ` Wilco Dijkstra
2021-04-30 15:01             ` Szabolcs Nagy
2021-04-30 15:23               ` Wilco Dijkstra
2021-04-30 15:30                 ` Florian Weimer
2021-04-30 15:40                   ` Wilco Dijkstra
2021-05-04  7:56                     ` Szabolcs Nagy
2021-05-04 10:17                       ` Florian Weimer
2021-05-04 10:38                         ` Wilco Dijkstra
2021-05-04 10:42                         ` Szabolcs Nagy
2021-05-04 11:07                           ` Florian Weimer
2021-05-06 10:01             ` naohirot [this message]
2021-05-06 14:26               ` Szabolcs Nagy
2021-05-06 15:09                 ` Florian Weimer
2021-05-06 17:31               ` Wilco Dijkstra
2021-05-07 12:31                 ` naohirot
2021-04-19  2:51     ` naohirot
2021-04-19 14:57       ` Wilco Dijkstra
2021-04-21 10:10         ` naohirot
2021-04-21 15:02           ` Wilco Dijkstra
2021-04-22 13:17             ` naohirot
2021-04-23  0:58               ` naohirot
2021-04-19 12:43     ` naohirot
2021-04-20  3:31     ` naohirot
2021-04-20 14:44       ` Wilco Dijkstra
2021-04-27  9:01         ` naohirot
2021-04-20  5:49     ` naohirot
2021-04-20 11:39       ` Wilco Dijkstra
2021-04-27 11:03         ` naohirot
2021-04-23 13:22     ` naohirot
  -- strict thread matches above, loose matches on Subject: below --
2021-03-17  2:28 Naohiro Tamura
2021-03-29 12:03 ` Szabolcs Nagy
2021-05-10  1:45 ` naohirot
2021-05-14 13:35   ` Szabolcs Nagy
2021-05-19  0:11     ` naohirot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=TYAPR01MB6025DC532F3C09A0E121188CDF589@TYAPR01MB6025.jpnprd01.prod.outlook.com \
    --to=naohirot@fujitsu.com \
    --cc=Szabolcs.Nagy@arm.com \
    --cc=Wilco.Dijkstra@arm.com \
    --cc=libc-alpha@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).