* [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64
@ 2023-02-09 11:43 Wilco Dijkstra
  2023-02-09 12:25 ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 3+ messages in thread

From: Wilco Dijkstra @ 2023-02-09 11:43 UTC (permalink / raw)
To: Adhemerval Zanella; +Cc: 'GNU C Library'

Hi Adhemerval,

> The generic routines still assume that hardware can't issue unaligned
> memory accesses, or that doing so is prohibitively expensive. However,
> I think we are moving in the direction of adding unaligned variants
> when it makes sense.

There is a _STRING_ARCH_unaligned define that can be set per target. It needs
cleaning up since it is used mostly for premature micro-optimizations
(e.g. getenv.c) where using a fixed-size memcpy would be best (it also appears
to have big-endian bugs).

> Another usual tuning is loop unrolling, which depends on the underlying
> hardware. Unfortunately we need to explicitly force gcc to unroll some
> loop constructs (for instance, check sysdeps/powerpc/powerpc64/power4/Makefile),
> so this might be another approach you could use to tune the RISC-V routines.

Compiler unrolling is unlikely to give improved results, especially on GCC,
where the default unroll factor is still 16, which just bloats the code...
So all reasonable unrolling is best done by hand (and doesn't need to be
target-specific).

> The memcpy, memmove, memset, and memcmp routines are a slightly different
> subject. Although the current generic mem routines do use some explicit
> unrolling, they do not take unaligned accesses, vector instructions, or
> special instructions (such as cache-clearing ones) into consideration.
> And these usually make a lot of difference.

Indeed. However, it is also quite difficult to make use of all of these
without a lot of target-specific code and inline assembler. And at that
point you might as well use assembler...

> What I would expect is that maybe we can use a strategy similar to what
> Google is doing with llvm libc, which based its work on the automemcpy
> paper [1]. It means that, for the unaligned variants, each architecture
> reimplements the memory-routine building blocks. Although that project
> focuses on static compilation, I think using hooks rather than assembly
> routines might be a better approach (you can reuse code blocks or try
> different strategies more easily).
>
> [1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf

I'm still not convinced about this strategy - it's hard to beat assembler
with generic code. The way it works in LLVM is that you implement a new set
of builtins that inline an optimal memcpy for a fixed size. But you don't
know the alignment, so this only works on targets that support fast unaligned
accesses. And with different compiler versions/options you get major
performance variations due to code reordering, register allocation
differences, or failure to emit load/store pairs...

I believe it is reasonable to ensure the generic string functions are
efficient so we avoid having to write assembler for every string function.
However, it becomes crazy when the goal is to be as close as possible to the
best assembler version in all cases. Most targets will add assembly versions
for key functions like memcpy, strlen, etc.

Cheers,
Wilco

^ permalink raw reply [flat|nested] 3+ messages in thread
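[To make the fixed-size-memcpy point above concrete, a minimal sketch (illustrative only, not code from getenv.c; load_u32 is a made-up helper name): a memcpy with a constant size is expanded inline by GCC and Clang, avoids the undefined behaviour of dereferencing a misaligned pointer cast, and lets the compiler pick the best access sequence for the target, so no _STRING_ARCH_unaligned guard is needed at the call site.

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read four bytes from a possibly unaligned pointer.  */
static inline uint32_t
load_u32 (const void *p)
{
  uint32_t v;
  memcpy (&v, p, sizeof v);  /* Expands to a single load where that is fast.  */
  return v;
}
]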
* Re: [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64
  2023-02-09 11:43 [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Wilco Dijkstra
@ 2023-02-09 12:25 ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 3+ messages in thread

From: Adhemerval Zanella Netto @ 2023-02-09 12:25 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: 'GNU C Library'

On 09/02/23 08:43, Wilco Dijkstra wrote:
> Hi Adhemerval,
>
>> The generic routines still assume that hardware can't issue unaligned
>> memory accesses, or that doing so is prohibitively expensive. However,
>> I think we are moving in the direction of adding unaligned variants
>> when it makes sense.
>
> There is a _STRING_ARCH_unaligned define that can be set per target. It needs
> cleaning up since it is used mostly for premature micro-optimizations
> (e.g. getenv.c) where using a fixed-size memcpy would be best (it also appears
> to have big-endian bugs).

I will add cleaning this up to my backlog. And it is not ideal, at least for
the RISC-V plans, to have a global flag that sets fast unaligned; maybe we
should move to a per-file flag so the code can be recompiled to provide ifunc
variants.

>> Another usual tuning is loop unrolling, which depends on the underlying
>> hardware. Unfortunately we need to explicitly force gcc to unroll some
>> loop constructs (for instance, check sysdeps/powerpc/powerpc64/power4/Makefile),
>> so this might be another approach you could use to tune the RISC-V routines.
>
> Compiler unrolling is unlikely to give improved results, especially on GCC,
> where the default unroll factor is still 16, which just bloats the code...
> So all reasonable unrolling is best done by hand (and doesn't need to be
> target-specific).

The Makefile snippet I posted uses max-variable-expansions-in-unroller and
max-unroll-times to limit the amount of unrolling. This will most likely need
to be done per architecture and even per CPU (for ifunc variants). But manual
unrolling could be an option as well.

>> The memcpy, memmove, memset, and memcmp routines are a slightly different
>> subject. Although the current generic mem routines do use some explicit
>> unrolling, they do not take unaligned accesses, vector instructions, or
>> special instructions (such as cache-clearing ones) into consideration.
>> And these usually make a lot of difference.
>
> Indeed. However, it is also quite difficult to make use of all of these
> without a lot of target-specific code and inline assembler. And at that
> point you might as well use assembler...
>
>> What I would expect is that maybe we can use a strategy similar to what
>> Google is doing with llvm libc, which based its work on the automemcpy
>> paper [1]. It means that, for the unaligned variants, each architecture
>> reimplements the memory-routine building blocks. Although that project
>> focuses on static compilation, I think using hooks rather than assembly
>> routines might be a better approach (you can reuse code blocks or try
>> different strategies more easily).
>>
>> [1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf
>
> I'm still not convinced about this strategy - it's hard to beat assembler
> with generic code. The way it works in LLVM is that you implement a new set
> of builtins that inline an optimal memcpy for a fixed size. But you don't
> know the alignment, so this only works on targets that support fast unaligned
> accesses. And with different compiler versions/options you get major
> performance variations due to code reordering, register allocation
> differences, or failure to emit load/store pairs...
>
> I believe it is reasonable to ensure the generic string functions are
> efficient so we avoid having to write assembler for every string function.
> However, it becomes crazy when the goal is to be as close as possible to the
> best assembler version in all cases. Most targets will add assembly versions
> for key functions like memcpy, strlen, etc.

The llvm libc does use a lot of arch-specific code and the resulting
implementation is not really generic; but at least it showed that it is
possible to provide competitive mem routines without coding them in assembly.
But afaiu their goals are indeed different, since they focus on static linking
and LTO, where a mem-routine implementation using C and compiler builtins
provides more optimization opportunities.

What I would like is for the generic glibc mem routines to be at least good
enough that an arch maintainer can tune small parts without extra boilerplate.
They most likely won't beat hand-tuned implementations, especially if that
would require builtins and instructions only available with newer compilers;
but I think we can still improve our internal framework to avoid relying on
assembly implementations too much.

We can start by providing unaligned variants of memcpy, memmove, memcmp, and
memset using default word accesses, and then move to the strategy from the
Google paper of decomposing them into blocks tied to compiler builtins. That
would allow an architecture to build with -mcpu or other flags to emit vector
instructions (by limiting the block size used with the builtin, similar to
what we did in strcspn.c to avoid the memset call).

^ permalink raw reply [flat|nested] 3+ messages in thread
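[As an illustration of the block decomposition discussed above, a minimal sketch along the lines of the automemcpy paper (not code from llvm libc or glibc; the helper names and the 32-byte block size are arbitrary choices): fixed-size blocks are expressed through __builtin_memcpy, which the compiler lowers to whatever loads, stores, or vector instructions the -mcpu/-march flags make available.

#include <stddef.h>

/* Copy one fixed-size block; the compiler expands this inline.  */
static inline void
copy_block_32 (char *dst, const char *src)
{
  __builtin_memcpy (dst, src, 32);
}

/* Copy 32..64 bytes with two possibly overlapping blocks (head and tail).  */
static inline void
copy_32_to_64 (char *dst, const char *src, size_t n)
{
  copy_block_32 (dst, src);
  copy_block_32 (dst + n - 32, src + n - 32);
}

/* Copy 64 bytes or more; a full implementation would also handle the
   smaller sizes and align the destination first.  */
static inline void
copy_ge_64 (char *dst, const char *src, size_t n)
{
  while (n > 64)
    {
      copy_block_32 (dst, src);
      dst += 32;
      src += 32;
      n -= 32;
    }
  copy_32_to_64 (dst, src, n);
}
]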
* [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
@ 2023-02-07  0:15 Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Christoph Muellner
  0 siblings, 1 reply; 3+ messages in thread

From: Christoph Muellner @ 2023-02-07  0:15 UTC (permalink / raw)
To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman, DJ Delorie,
	Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich, Heiko Stuebner
Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This RFC series introduces ifunc support for RISC-V and adds optimized
routines for memset(), memcpy()/memmove(), strlen(), strcmp(), strncmp(),
and cpu_relax().

The ifunc mechanism decides based on the following hart features:
- Available extensions
- Cache block size
- Fast unaligned accesses

Since we don't have an interface to get this information from the kernel
(at the moment), this series uses environment variables instead, which is
also why it should not be considered for upstream inclusion and is
explicitly tagged as RFC. The environment variables are:
- RISCV_RT_MARCH (e.g. "rv64gc_zicboz")
- RISCV_RT_CBOZ_BLOCKSIZE (e.g. "64")
- RISCV_RT_CBOM_BLOCKSIZE (e.g. "64")
- RISCV_RT_FAST_UNALIGNED (e.g. "1")

The environment variables are looked up and parsed early during startup,
where other architectures query similar properties from the kernel or the
CPU. The ifunc implementation can use test macros to select a matching
implementation (e.g. HAVE_RV(zbb) or HAVE_FAST_UNALIGNED()).

The following optimized routines exist:
- memset
- memcpy/memmove
- strlen
- strcmp
- strncmp
- cpu_relax

The following optimizations have been applied:
- Extensive loop unrolling
- Zbb's orc.b instruction
- Zbb's ctz instruction
- Zicboz/Zic64b ability to clear a cache block in memory
- Fast unaligned accesses (while keeping the exception guarantees intact)
- Fast overlapping accesses

The patches were developed more than a year ago and have been tested as part
of a vendor SDK since then. One of the areas where this patchset was used is
benchmarking (e.g. SPEC CPU2017). The optimized string functions have been
tested with the glibc tests for that purpose.

The first patch of the series does not strictly belong to this series, but
was required to build and test the SPEC CPU2017 benchmarks.

To build a cross-toolchain that includes these patches, the
riscv-gnu-toolchain or any other cross-toolchain builder can be used.

Christoph Müllner (19):
  Inhibit early libcalls before ifunc support is ready
  riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
  riscv: Add ENTRY_ALIGN() macro
  riscv: Add hart feature run-time detection framework
  riscv: Introduction of ISA extensions
  riscv: Adding ISA string parser for environment variables
  riscv: hart-features: Add fast_unaligned property
  riscv: Add (empty) ifunc framework
  riscv: Add ifunc support for memset
  riscv: Add accelerated memset routines for RV64
  riscv: Add ifunc support for memcpy/memmove
  riscv: Add accelerated memcpy/memmove routines for RV64
  riscv: Add ifunc support for strlen
  riscv: Add accelerated strlen routine
  riscv: Add ifunc support for strcmp
  riscv: Add accelerated strcmp routines
  riscv: Add ifunc support for strncmp
  riscv: Add an optimized strncmp routine
  riscv: Add __riscv_cpu_relax() to allow yielding in busy loops

 csu/libc-start.c                                       |   1 +
 elf/dl-support.c                                       |   1 +
 sysdeps/riscv/dl-machine.h                             |  13 +
 sysdeps/riscv/ldsodefs.h                               |   1 +
 sysdeps/riscv/multiarch/Makefile                       |  24 +
 sysdeps/riscv/multiarch/cpu_relax.c                    |  36 ++
 sysdeps/riscv/multiarch/cpu_relax_impl.S               |  40 ++
 sysdeps/riscv/multiarch/ifunc-impl-list.c              |  70 +++
 sysdeps/riscv/multiarch/init-arch.h                    |  24 +
 sysdeps/riscv/multiarch/memcpy.c                       |  49 ++
 sysdeps/riscv/multiarch/memcpy_generic.c               |  32 ++
 .../riscv/multiarch/memcpy_rv64_unaligned.S            | 475 ++++++++++++++++++
 sysdeps/riscv/multiarch/memmove.c                      |  49 ++
 sysdeps/riscv/multiarch/memmove_generic.c              |  32 ++
 sysdeps/riscv/multiarch/memset.c                       |  52 ++
 sysdeps/riscv/multiarch/memset_generic.c               |  32 ++
 .../riscv/multiarch/memset_rv64_unaligned.S            |  31 ++
 .../multiarch/memset_rv64_unaligned_cboz64.S           | 217 ++++++++
 sysdeps/riscv/multiarch/strcmp.c                       |  47 ++
 sysdeps/riscv/multiarch/strcmp_generic.c               |  32 ++
 sysdeps/riscv/multiarch/strcmp_zbb.S                   | 104 ++++
 .../riscv/multiarch/strcmp_zbb_unaligned.S             | 213 ++++++++
 sysdeps/riscv/multiarch/strlen.c                       |  44 ++
 sysdeps/riscv/multiarch/strlen_generic.c               |  32 ++
 sysdeps/riscv/multiarch/strlen_zbb.S                   | 105 ++++
 sysdeps/riscv/multiarch/strncmp.c                      |  44 ++
 sysdeps/riscv/multiarch/strncmp_generic.c              |  32 ++
 sysdeps/riscv/multiarch/strncmp_zbb.S                  | 119 +++++
 sysdeps/riscv/sys/asm.h                                |  14 +-
 .../unix/sysv/linux/riscv/atomic-machine.h             |   3 +
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c            |  62 +++
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h            |  46 ++
 sysdeps/unix/sysv/linux/riscv/hart-features.c          | 356 +++++++++++++
 sysdeps/unix/sysv/linux/riscv/hart-features.h          |  58 +++
 .../unix/sysv/linux/riscv/isa-extensions.def           |  72 +++
 sysdeps/unix/sysv/linux/riscv/libc-start.c             |  29 ++
 .../unix/sysv/linux/riscv/macro-for-each.h             |  24 +
 37 files changed, 2610 insertions(+), 5 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/Makefile
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
 create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
 create mode 100644 sysdeps/riscv/multiarch/init-arch.h
 create mode 100644 sysdeps/riscv/multiarch/memcpy.c
 create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memmove.c
 create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memset.c
 create mode 100644 sysdeps/riscv/multiarch/memset_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
 create mode 100644 sysdeps/riscv/multiarch/strcmp.c
 create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/strlen.c
 create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S
 create mode 100644 sysdeps/riscv/multiarch/strncmp.c
 create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def
 create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h

-- 
2.39.1

^ permalink raw reply [flat|nested] 3+ messages in thread
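[To make the selection mechanism described in the cover letter concrete, a hedged sketch of how a resolver could combine the test macros, following the libc_ifunc pattern used elsewhere in the series (strcmp is picked only as an example; the actual strcmp.c in the series may look different):

extern __typeof (strcmp) __strcmp_generic attribute_hidden;
extern __typeof (strcmp) __strcmp_zbb attribute_hidden;
extern __typeof (strcmp) __strcmp_zbb_unaligned attribute_hidden;

/* Illustrative only: prefer the Zbb variants when the Zbb extension is
   advertised, and the unaligned one when fast unaligned accesses are too.  */
libc_ifunc (__libc_strcmp,
	    (HAVE_RV(zbb)
	     ? (HAVE_FAST_UNALIGNED()
		? __strcmp_zbb_unaligned
		: __strcmp_zbb)
	     : __strcmp_generic));
]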
* [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  0 siblings, 0 replies; 3+ messages in thread

From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman, DJ Delorie,
	Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich, Heiko Stuebner
Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The implementation of memcpy()/memmove() can be accelerated by loop
unrolling and fast unaligned accesses. Let's provide an implementation
that is optimized accordingly.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile             |   2 +
 sysdeps/riscv/multiarch/ifunc-impl-list.c    |   6 +
 sysdeps/riscv/multiarch/memcpy.c             |   9 +
 .../riscv/multiarch/memcpy_rv64_unaligned.S  | 475 ++++++++++++++++++
 sysdeps/riscv/multiarch/memmove.c            |   9 +
 5 files changed, 501 insertions(+)
 create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 6bc20c4fe0..b08d7d1c8b 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -2,6 +2,8 @@ ifeq ($(subdir),string)
 sysdep_routines += \
   memcpy_generic \
   memmove_generic \
+  memcpy_rv64_unaligned \
+  \
   memset_generic \
   memset_rv64_unaligned \
   memset_rv64_unaligned_cboz64
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index 16e4d7137f..84b3eb25a4 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -36,9 +36,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   size_t i = 0;

   IFUNC_IMPL (i, name, memcpy,
+#if __riscv_xlen == 64
+	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_rv64_unaligned)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))

   IFUNC_IMPL (i, name, memmove,
+#if __riscv_xlen == 64
+	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_rv64_unaligned)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))

   IFUNC_IMPL (i, name, memset,
diff --git a/sysdeps/riscv/multiarch/memcpy.c b/sysdeps/riscv/multiarch/memcpy.c
index cc9185912a..68ac9bbe35 100644
--- a/sysdeps/riscv/multiarch/memcpy.c
+++ b/sysdeps/riscv/multiarch/memcpy.c
@@ -31,7 +31,16 @@ extern __typeof (__redirect_memcpy) __libc_memcpy;

 extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;

+#if __riscv_xlen == 64
+extern __typeof (__redirect_memcpy) __memcpy_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memcpy,
+	    (IS_RV64() && HAVE_FAST_UNALIGNED()
+	     ? __memcpy_rv64_unaligned
+	     : __memcpy_generic));
+#else
 libc_ifunc (__libc_memcpy, __memcpy_generic);
+#endif

 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
diff --git a/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S b/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
new file mode 100644
index 0000000000..372cd0baea
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
@@ -0,0 +1,475 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library. If not, see
+   <https://www.gnu.org/licenses/>. */
+
+#if __riscv_xlen == 64
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+#define dst	a0
+#define src	a1
+#define count	a2
+#define srcend	a3
+#define dstend	a4
+#define tmp1	a5
+#define dst2	t6
+
+#define A_l	a6
+#define A_h	a7
+#define B_l	t0
+#define B_h	t1
+#define C_l	t2
+#define C_h	t3
+#define D_l	t4
+#define D_h	t5
+#define E_l	tmp1
+#define E_h	count
+#define F_l	dst2
+#define F_h	srcend
+
+#ifndef MEMCPY
+# define MEMCPY __memcpy_rv64_unaligned
+#endif
+
+#ifndef MEMMOVE
+# define MEMMOVE __memmove_rv64_unaligned
+#endif
+
+#ifndef COPY97_128
+# define COPY97_128 1
+#endif
+
+/* Assumptions: rv64i, unaligned accesses. */
+
+/* memcpy/memmove is implemented by unrolling copy loops.
+   We have two strategies:
+   1) copy from front/start to back/end ("forward")
+   2) copy from back/end to front/start ("backward")
+   In case of memcpy(), the strategy does not matter for correctness.
+   For memmove() and overlapping buffers we need to use the following strategy:
+   if dst < src && src-dst < count -> copy from front to back
+   if src < dst && dst-src < count -> copy from back to front */
+
+ENTRY_ALIGN (MEMCPY, 6)
+	/* Calculate the end position. */
+	add	srcend, src, count
+	add	dstend, dst, count
+
+	/* Decide how to process. */
+	li	tmp1, 96
+	bgtu	count, tmp1, L(copy_long_forward)
+	li	tmp1, 32
+	bgtu	count, tmp1, L(copy33_96)
+	li	tmp1, 16
+	bleu	count, tmp1, L(copy0_16)
+
+	/* Copy 17-32 bytes. */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, -16(srcend)
+	ld	B_h, -8(srcend)
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, -16(dstend)
+	sd	B_h, -8(dstend)
+	ret
+
+L(copy0_16):
+	li	tmp1, 8
+	bleu	count, tmp1, L(copy0_8)
+	/* Copy 9-16 bytes. */
+	ld	A_l, 0(src)
+	ld	A_h, -8(srcend)
+	sd	A_l, 0(dst)
+	sd	A_h, -8(dstend)
+	ret
+
+	.p2align 3
+L(copy0_8):
+	li	tmp1, 4
+	bleu	count, tmp1, L(copy0_4)
+	/* Copy 5-8 bytes. */
+	lw	A_l, 0(src)
+	lw	B_l, -4(srcend)
+	sw	A_l, 0(dst)
+	sw	B_l, -4(dstend)
+	ret
+
+L(copy0_4):
+	li	tmp1, 2
+	bleu	count, tmp1, L(copy0_2)
+	/* Copy 3-4 bytes. */
+	lh	A_l, 0(src)
+	lh	B_l, -2(srcend)
+	sh	A_l, 0(dst)
+	sh	B_l, -2(dstend)
+	ret
+
+L(copy0_2):
+	li	tmp1, 1
+	bleu	count, tmp1, L(copy0_1)
+	/* Copy 2 bytes. */
+	lh	A_l, 0(src)
+	sh	A_l, 0(dst)
+	ret
+
+L(copy0_1):
+	beqz	count, L(copy0)
+	/* Copy 1 byte. */
+	lb	A_l, 0(src)
+	sb	A_l, 0(dst)
+L(copy0):
+	ret
+
+	.p2align 4
+L(copy33_96):
+	/* Copy 33-96 bytes. */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, 16(src)
+	ld	B_h, 24(src)
+	ld	C_l, -32(srcend)
+	ld	C_h, -24(srcend)
+	ld	D_l, -16(srcend)
+	ld	D_h, -8(srcend)
+
+	li	tmp1, 64
+	bgtu	count, tmp1, L(copy65_96_preloaded)
+
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	sd	C_l, -32(dstend)
+	sd	C_h, -24(dstend)
+	sd	D_l, -16(dstend)
+	sd	D_h, -8(dstend)
+	ret
+
+	.p2align 4
+L(copy65_96_preloaded):
+	/* Copy 65-96 bytes with pre-loaded A, B, C and D. */
+	ld	E_l, 32(src)
+	ld	E_h, 40(src)
+	ld	F_l, 48(src)	/* dst2 will be overwritten. */
+	ld	F_h, 56(src)	/* srcend will be overwritten. */
+
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	sd	E_l, 32(dst)
+	sd	E_h, 40(dst)
+	sd	F_l, 48(dst)
+	sd	F_h, 56(dst)
+	sd	C_l, -32(dstend)
+	sd	C_h, -24(dstend)
+	sd	D_l, -16(dstend)
+	sd	D_h, -8(dstend)
+	ret
+
+#ifdef COPY97_128
+	.p2align 4
+L(copy97_128_forward):
+	/* Copy 97-128 bytes from front to back. */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, 16(src)
+	ld	B_h, 24(src)
+	ld	C_l, -16(srcend)
+	ld	C_h, -8(srcend)
+	ld	D_l, -32(srcend)
+	ld	D_h, -24(srcend)
+	ld	E_l, -48(srcend)
+	ld	E_h, -40(srcend)
+	ld	F_l, -64(srcend)	/* dst2 will be overwritten. */
+	ld	F_h, -56(srcend)	/* srcend will be overwritten. */
+
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	ld	A_l, 32(src)
+	ld	A_h, 40(src)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	ld	B_l, 48(src)
+	ld	B_h, 56(src)
+
+	sd	C_l, -16(dstend)
+	sd	C_h, -8(dstend)
+	sd	D_l, -32(dstend)
+	sd	D_h, -24(dstend)
+	sd	E_l, -48(dstend)
+	sd	E_h, -40(dstend)
+	sd	F_l, -64(dstend)
+	sd	F_h, -56(dstend)
+
+	sd	A_l, 32(dst)
+	sd	A_h, 40(dst)
+	sd	B_l, 48(dst)
+	sd	B_h, 56(dst)
+	ret
+#endif
+
+	.p2align 4
+	/* Copy 97+ bytes from front to back. */
+L(copy_long_forward):
+#ifdef COPY97_128
+	/* Avoid loop if possible. */
+	li	tmp1, 128
+	ble	count, tmp1, L(copy97_128_forward)
+#endif
+
+	/* Copy 16 bytes and then align dst to 16-byte alignment. */
+	ld	D_l, 0(src)
+	ld	D_h, 8(src)
+
+	/* Round down to the previous 16 byte boundary (keep offset of 16). */
+	andi	tmp1, dst, 15
+	andi	dst2, dst, -16
+	sub	src, src, tmp1
+
+	ld	A_l, 16(src)
+	ld	A_h, 24(src)
+	sd	D_l, 0(dst)
+	sd	D_h, 8(dst)
+	ld	B_l, 32(src)
+	ld	B_h, 40(src)
+	ld	C_l, 48(src)
+	ld	C_h, 56(src)
+	ld	D_l, 64(src)
+	ld	D_h, 72(src)
+	addi	src, src, 64
+
+	/* Calculate loop termination position. */
+	addi	tmp1, dstend, -(16+128)
+	bgeu	dst2, tmp1, L(copy64_from_end)
+
+	/* Store 64 bytes in a loop. */
+	.p2align 4
+L(loop64_forward):
+	addi	src, src, 64
+	sd	A_l, 16(dst2)
+	sd	A_h, 24(dst2)
+	ld	A_l, -48(src)
+	ld	A_h, -40(src)
+	sd	B_l, 32(dst2)
+	sd	B_h, 40(dst2)
+	ld	B_l, -32(src)
+	ld	B_h, -24(src)
+	sd	C_l, 48(dst2)
+	sd	C_h, 56(dst2)
+	ld	C_l, -16(src)
+	ld	C_h, -8(src)
+	sd	D_l, 64(dst2)
+	sd	D_h, 72(dst2)
+	ld	D_l, 0(src)
+	ld	D_h, 8(src)
+	addi	dst2, dst2, 64
+	bltu	dst2, tmp1, L(loop64_forward)
+
+L(copy64_from_end):
+	ld	E_l, -64(srcend)
+	ld	E_h, -56(srcend)
+	sd	A_l, 16(dst2)
+	sd	A_h, 24(dst2)
+	ld	A_l, -48(srcend)
+	ld	A_h, -40(srcend)
+	sd	B_l, 32(dst2)
+	sd	B_h, 40(dst2)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	sd	C_l, 48(dst2)
+	sd	C_h, 56(dst2)
+	ld	C_l, -16(srcend)
+	ld	C_h, -8(srcend)
+	sd	D_l, 64(dst2)
+	sd	D_h, 72(dst2)
+	sd	E_l, -64(dstend)
+	sd	E_h, -56(dstend)
+	sd	A_l, -48(dstend)
+	sd	A_h, -40(dstend)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	sd	C_l, -16(dstend)
+	sd	C_h, -8(dstend)
+	ret
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 6)
+	/* Calculate the end position. */
+	add	srcend, src, count
+	add	dstend, dst, count
+
+	/* Decide how to process. */
+	li	tmp1, 96
+	bgtu	count, tmp1, L(move_long)
+	li	tmp1, 32
+	bgtu	count, tmp1, L(copy33_96)
+	li	tmp1, 16
+	bleu	count, tmp1, L(copy0_16)
+
+	/* Copy 17-32 bytes. */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, -16(srcend)
+	ld	B_h, -8(srcend)
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, -16(dstend)
+	sd	B_h, -8(dstend)
+	ret
+
+#ifdef COPY97_128
+	.p2align 4
+L(copy97_128_backward):
+	/* Copy 97-128 bytes from back to front. */
+	ld	A_l, -16(srcend)
+	ld	A_h, -8(srcend)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	ld	C_l, -48(srcend)
+	ld	C_h, -40(srcend)
+	ld	D_l, -64(srcend)
+	ld	D_h, -56(srcend)
+	ld	E_l, -80(srcend)
+	ld	E_h, -72(srcend)
+	ld	F_l, -96(srcend)	/* dst2 will be overwritten. */
+	ld	F_h, -88(srcend)	/* srcend will be overwritten. */
+
+	sd	A_l, -16(dstend)
+	sd	A_h, -8(dstend)
+	ld	A_l, 16(src)
+	ld	A_h, 24(src)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	ld	B_l, 0(src)
+	ld	B_h, 8(src)
+
+	sd	C_l, -48(dstend)
+	sd	C_h, -40(dstend)
+	sd	D_l, -64(dstend)
+	sd	D_h, -56(dstend)
+	sd	E_l, -80(dstend)
+	sd	E_h, -72(dstend)
+	sd	F_l, -96(dstend)
+	sd	F_h, -88(dstend)
+
+	sd	A_l, 16(dst)
+	sd	A_h, 24(dst)
+	sd	B_l, 0(dst)
+	sd	B_h, 8(dst)
+	ret
+#endif
+
+	.p2align 4
+	/* Copy 97+ bytes. */
+L(move_long):
+	/* dst-src is positive if src < dst.
+	   In this case we must copy forward if dst-src >= count.
+	   If dst-src is negative, then we can interpret the difference
+	   as unsigned value to enforce dst-src >= count as well. */
+	sub	tmp1, dst, src
+	beqz	tmp1, L(copy0)
+	bgeu	tmp1, count, L(copy_long_forward)
+
+#ifdef COPY97_128
+	/* Avoid loop if possible. */
+	li	tmp1, 128
+	ble	count, tmp1, L(copy97_128_backward)
+#endif
+
+	/* Copy 16 bytes and then align dst to 16-byte alignment. */
+	ld	D_l, -16(srcend)
+	ld	D_h, -8(srcend)
+
+	/* Round down to the previous 16 byte boundary (keep offset of 16). */
+	andi	tmp1, dstend, 15
+	sub	srcend, srcend, tmp1
+
+	ld	A_l, -16(srcend)
+	ld	A_h, -8(srcend)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	ld	C_l, -48(srcend)
+	ld	C_h, -40(srcend)
+	sd	D_l, -16(dstend)
+	sd	D_h, -8(dstend)
+	ld	D_l, -64(srcend)
+	ld	D_h, -56(srcend)
+	andi	dstend, dstend, -16
+
+	/* Calculate loop termination position. */
+	addi	tmp1, dst, 128
+	bleu	dstend, tmp1, L(copy64_from_start)
+
+	/* Store 64 bytes in a loop. */
+	.p2align 4
+L(loop64_backward):
+	addi	srcend, srcend, -64
+	sd	A_l, -16(dstend)
+	sd	A_h, -8(dstend)
+	ld	A_l, -16(srcend)
+	ld	A_h, -8(srcend)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	sd	C_l, -48(dstend)
+	sd	C_h, -40(dstend)
+	ld	C_l, -48(srcend)
+	ld	C_h, -40(srcend)
+	sd	D_l, -64(dstend)
+	sd	D_h, -56(dstend)
+	ld	D_l, -64(srcend)
+	ld	D_h, -56(srcend)
+	addi	dstend, dstend, -64
+	bgtu	dstend, tmp1, L(loop64_backward)
+
+L(copy64_from_start):
+	ld	E_l, 48(src)
+	ld	E_h, 56(src)
+	sd	A_l, -16(dstend)
+	sd	A_h, -8(dstend)
+	ld	A_l, 32(src)
+	ld	A_h, 40(src)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	ld	B_l, 16(src)
+	ld	B_h, 24(src)
+	sd	C_l, -48(dstend)
+	sd	C_h, -40(dstend)
+	ld	C_l, 0(src)
+	ld	C_h, 8(src)
+	sd	D_l, -64(dstend)
+	sd	D_h, -56(dstend)
+	sd	E_l, 48(dst)
+	sd	E_h, 56(dst)
+	sd	A_l, 32(dst)
+	sd	A_h, 40(dst)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	sd	C_l, 0(dst)
+	sd	C_h, 8(dst)
+	ret
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+
+#endif /* __riscv_xlen == 64 */
diff --git a/sysdeps/riscv/multiarch/memmove.c b/sysdeps/riscv/multiarch/memmove.c
index 581a8327d6..b446a9e036 100644
--- a/sysdeps/riscv/multiarch/memmove.c
+++ b/sysdeps/riscv/multiarch/memmove.c
@@ -31,7 +31,16 @@ extern __typeof (__redirect_memmove) __libc_memmove;

 extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;

+#if __riscv_xlen == 64
+extern __typeof (__redirect_memmove) __memmove_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memmove,
+	    (IS_RV64() && HAVE_FAST_UNALIGNED()
+	     ? __memmove_rv64_unaligned
+	     : __memmove_generic));
+#else
 libc_ifunc (__libc_memmove, __memmove_generic);
+#endif

 # undef memmove
 strong_alias (__libc_memmove, memmove);
-- 
2.39.1

^ permalink raw reply [flat|nested] 3+ messages in thread
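[For readers following the L(move_long) logic in the patch above, here is the direction test rendered in C (illustrative only; copy_forward and copy_backward are hypothetical stand-ins for the two assembly paths): computing dst - src as an unsigned value makes a negative difference wrap to a huge number, so a single unsigned comparison against count selects the forward copy both when dst precedes src and when the buffers do not overlap at all.

#include <stddef.h>
#include <stdint.h>

void copy_forward (void *dst, const void *src, size_t count);
void copy_backward (void *dst, const void *src, size_t count);

static void
move_long (void *dst, const void *src, size_t count)
{
  uintptr_t diff = (uintptr_t) dst - (uintptr_t) src;

  if (diff == 0)
    return;                           /* dst == src: nothing to do.  */
  if (diff >= count)
    copy_forward (dst, src, count);   /* dst < src, or no overlap at all.  */
  else
    copy_backward (dst, src, count);  /* dst starts inside src: copy backward.  */
}
]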
end of thread, other threads:[~2023-02-09 12:25 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-09 11:43 [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Wilco Dijkstra
2023-02-09 12:25 ` Adhemerval Zanella Netto
  -- strict thread matches above, loose matches on Subject: below --
2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Christoph Muellner