From: Yikun Jiang
Date: Thu, 17 Oct 2019 14:57:00 -0000
Subject: Re: [PATCH v2 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
To: Xuelei Zhang
Cc: libc-alpha@sourceware.org, nd@arm.com, Siddhesh Poyarekar, Wilco.Dijkstra@arm.com, jiangyikun@huawei.com
In-Reply-To: <20191017131548.10808-1-zhangxuelei4@huawei.com>
References: <20191017131548.10808-1-zhangxuelei4@huawei.com>
X-SW-Source: 2019-10/txt/msg00529.txt.bz2

> Btw do you have any plans to post other string functions that you can discuss here?
> If so, would these add more ifuncs or improve the generic versions?
> Yes, memcmp, strlen, strnlen, strcpy and memrchr will be included; we will submit the
> patches and test results as soon as possible.

We have now submitted the patches for the string functions, see below:

[PATCH] aarch64: Optimized strnlen for Kunpeng processor
https://sourceware.org/ml/libc-alpha/2019-10/msg00528.html

[PATCH] aarch64: Optimized strlen for Kunpeng processor
https://sourceware.org/ml/libc-alpha/2019-10/msg00527.html

[PATCH] aarch64: Optimized implementation of memrchr
https://sourceware.org/ml/libc-alpha/2019-10/msg00526.html

[PATCH] aarch64: Optimized strcpy for Kunpeng processor.
https://sourceware.org/ml/libc-alpha/2019-10/msg00525.html

[PATCH] aarch64: Optimized memcmp for Kunpeng processor.
https://sourceware.org/ml/libc-alpha/2019-10/msg00524.html

On Thu, Oct 17, 2019 at 9:16 PM, Xuelei Zhang wrote:
>
> This is an optimized implementation of memcpy and memmove for the
> Huawei Kunpeng processor.
>
> Based on the prefetch mechanism of the Kunpeng architecture, the branch
> handling 96 bytes to 2K in memcpy is written without the prfm
> instruction.  Hence, memcpy improves for copies above 128 bytes: about
> 18% for copies above 2K bytes, and about 38% for larger sizes, such as
> around 32M bytes.
>
> For memmove, there are two main changes: i) Q registers are used
> instead of X registers; ii) the dst address is aligned instead of the
> src address, to improve the store operations.  Hence, the memmove
> implementation also improves above 128 bytes: about 30% for 2K to 8M
> bytes, and about 50% for 32M bytes or more.
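
A note for readers on point ii): aligning the destination rather than the
source means the stores in the main loop always hit 16-byte-aligned
addresses, while the loads are allowed to remain unaligned.  In C the idea
looks roughly like the sketch below.  This is only an illustration, assuming
count >= 96; the function name is made up, and the patch itself does the
same thing with Q-register ldp/stp in assembly:

  #include <stdint.h>
  #include <string.h>

  /* Illustration only: forward copy with 16-byte-aligned stores.  */
  static void *
  dst_aligned_copy (void *dstin, const void *srcin, size_t count)
  {
    unsigned char *dst = dstin;
    const unsigned char *src = srcin;

    memcpy (dst, src, 16);            /* head, possibly unaligned */

    size_t skew = 16 - ((uintptr_t) dst & 15);
    dst += skew;                      /* dst is now 16-byte aligned */
    src += skew;                      /* src may remain unaligned */
    size_t left = count - skew;

    while (left > 64)                 /* aligned 64-byte store blocks */
      {
        memcpy (dst, src, 64);
        dst += 64;
        src += 64;
        left -= 64;
      }

    /* Tail: always copy the final 64 bytes relative to the end.  This may
       overlap the last loop iteration, which is harmless for a forward
       copy of non-overlapping buffers.  */
    memcpy ((unsigned char *) dstin + count - 64,
            (const unsigned char *) srcin + count - 64, 64);
    return dstin;
  }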
> ---
>  sysdeps/aarch64/multiarch/Makefile          |   2 +-
>  sysdeps/aarch64/multiarch/ifunc-impl-list.c |   4 +-
>  sysdeps/aarch64/multiarch/memcpy.c          |   5 +-
>  sysdeps/aarch64/multiarch/memcpy_kunpeng.S  | 468 ++++++++++++++++++++++++++++
>  4 files changed, 476 insertions(+), 3 deletions(-)
>  create mode 100644 sysdeps/aarch64/multiarch/memcpy_kunpeng.S
>
> diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
> index 4150b89a90..37ed49982d 100644
> --- a/sysdeps/aarch64/multiarch/Makefile
> +++ b/sysdeps/aarch64/multiarch/Makefile
> @@ -1,6 +1,6 @@
>  ifeq ($(subdir),string)
>  sysdep_routines += memcpy_generic memcpy_thunderx memcpy_thunderx2 \
> -                  memcpy_falkor memmove_falkor \
> +                  memcpy_falkor memcpy_kunpeng memmove_falkor \
>                    memset_generic memset_falkor memset_emag \
>                    memchr_generic memchr_nosimd \
>                    strlen_generic strlen_asimd
> diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> index be13b916e5..dbbe19096a 100644
> --- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> @@ -25,7 +25,7 @@
>  #include
>
>  /* Maximum number of IFUNC implementations.  */
> -#define MAX_IFUNC	4
> +#define MAX_IFUNC	5
>
>  size_t
>  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> @@ -42,11 +42,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx2)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
> +	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_kunpeng)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
>    IFUNC_IMPL (i, name, memmove,
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx2)
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
> +	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_kunpeng)
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
>    IFUNC_IMPL (i, name, memset,
>  	      /* Enable this on non-falkor processors too so that other cores
> diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
> index 13796f987f..b5929e2718 100644
> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -32,9 +32,12 @@ extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
> +extern __typeof (__redirect_memcpy) __memcpy_kunpeng attribute_hidden;
>
>  libc_ifunc (__libc_memcpy,
> -            (IS_THUNDERX (midr)
> +            IS_KUNPENG(midr)
> +            ?__memcpy_kunpeng
> +            : (IS_THUNDERX (midr)
>              ? __memcpy_thunderx
>              : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_ARES (midr)
>                 ? __memcpy_falkor
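
For readers skimming the hunk: with this change Kunpeng is checked first, so
the nested conditional selects roughly as in the restatement below.  This is
only an illustration, not code from the patch; select_memcpy is a made-up
helper, and the branches below __memcpy_falkor are unchanged by the patch
and therefore elided here:

  /* Restatement of the ifunc selection order after this patch.  */
  static __typeof (__redirect_memcpy) *
  select_memcpy (uint64_t midr)
  {
    if (IS_KUNPENG (midr))
      return __memcpy_kunpeng;
    if (IS_THUNDERX (midr))
      return __memcpy_thunderx;
    if (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_ARES (midr))
      return __memcpy_falkor;
    /* ... remaining cases unchanged and not shown in this hunk ... */
    return __memcpy_generic;
  }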
> diff --git a/sysdeps/aarch64/multiarch/memcpy_kunpeng.S b/sysdeps/aarch64/multiarch/memcpy_kunpeng.S
> new file mode 100644
> index 0000000000..385f282224
> --- /dev/null
> +++ b/sysdeps/aarch64/multiarch/memcpy_kunpeng.S
> @@ -0,0 +1,468 @@
> +/* Optimized memcpy and memmove for Huawei Kunpeng processor.
> +   Copyright (C) 2018-2019 Free Software Foundation, Inc.
> +
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +/* Assumptions:
> + *
> + * ARMv8-a, AArch64, unaligned accesses.
> + *
> + */
> +
> +#define dstin	x0
> +#define src	x1
> +#define count	x2
> +#define dst	x3
> +#define srcend	x4
> +#define dstend	x5
> +#define tmp2	x6
> +#define tmp3	x7
> +#define tmp3w	w7
> +#define A_l	x6
> +#define A_lw	w6
> +#define A_h	x7
> +#define A_hw	w7
> +#define B_l	x8
> +#define B_lw	w8
> +#define B_h	x9
> +#define C_l	x10
> +#define C_h	x11
> +#define D_l	x12
> +#define D_h	x13
> +#define E_l	src
> +#define E_h	count
> +#define F_l	srcend
> +#define F_h	dst
> +#define G_l	count
> +#define G_h	dst
> +#define tmp1	x14
> +
> +#define A_q	q0
> +#define B_q	q1
> +#define C_q	q2
> +#define D_q	q3
> +#define E_q	q4
> +#define F_q	q5
> +#define G_q	q6
> +#define H_q	q7
> +#define I_q	q16
> +#define J_q	q17
> +
> +#define A_v	v0
> +#define B_v	v1
> +#define C_v	v2
> +#define D_v	v3
> +#define E_v	v4
> +#define F_v	v5
> +#define G_v	v6
> +#define H_v	v7
> +#define I_v	v16
> +#define J_v	v17
> +
> +#ifndef MEMMOVE
> +# define MEMMOVE memmove
> +#endif
> +#ifndef MEMCPY
> +# define MEMCPY memcpy
> +#endif
> +
> +#if IS_IN (libc)
> +
> +#undef MEMCPY
> +#define MEMCPY __memcpy_kunpeng
> +#undef MEMMOVE
> +#define MEMMOVE __memmove_kunpeng
> +
> +
> +/* Overlapping large forward memmoves use a loop that copies backwards.
> +   Otherwise memcpy is used.  Small moves branch to memcopy16 directly.
> +   The longer memcpy cases fall through to the memcpy head.  */
> +
> +ENTRY_ALIGN (MEMMOVE, 6)
> +
> +	DELOUSE (0)
> +	DELOUSE (1)
> +	DELOUSE (2)
> +
> +	sub	tmp1, dstin, src
> +	cmp	count, 512
> +	ccmp	tmp1, count, 2, hi
> +	b.lo	L(move_long)
> +	cmp	count, 96
> +	ccmp	tmp1, count, 2, hi
> +	b.lo	L(move_middle)
> +
> +END (MEMMOVE)
> +libc_hidden_builtin_def (MEMMOVE)
> +
> +
> +/* Copies are split into 4 main cases: small copies of up to 16 bytes,
> +   medium copies of 17..96 bytes which are fully unrolled, long copies
> +   of 97..2048 bytes which align the dst address without prefetching,
> +   and large copies of more than 2048 bytes which align the destination
> +   and use a load-and-merge approach when the src and dst addresses are
> +   not equally aligned, so that the actual loads and stores are always
> +   aligned.  Large copies use loops processing 64 bytes per iteration
> +   for the unaligned case and 128 bytes per iteration for the aligned
> +   one.  */
> +
> +#define MEMCPY_PREFETCH_LDR 640
> +
> +	.p2align 4
> +ENTRY (MEMCPY)
> +
> +	DELOUSE (0)
> +	DELOUSE (1)
> +	DELOUSE (2)
> +
> +	add	srcend, src, count
> +	cmp	count, 16
> +	b.ls	L(memcopy16)
> +	add	dstend, dstin, count
> +	cmp	count, 96
> +	b.hi	L(memcopy_long)
> +
> +	/* Medium copies: 17..96 bytes.  */
> +	ldr	A_q, [src], #16
> +	and	tmp1, src, 15
> +	ldr	E_q, [srcend, -16]
> +	cmp	count, 64
> +	b.gt	L(memcpy_copy96)
> +	cmp	count, 48
> +	b.le	L(bytes_17_to_48)
> +	/* 49..64 bytes */
> +	ldp	B_q, C_q, [src]
> +	str	E_q, [dstend, -16]
> +	stp	A_q, B_q, [dstin]
> +	str	C_q, [dstin, 32]
> +	ret
> +
> +L(bytes_17_to_48):
> +	/* 17..48 bytes */
> +	cmp	count, 32
> +	b.gt	L(bytes_32_to_48)
> +	/* 17..32 bytes */
> +	str	A_q, [dstin]
> +	str	E_q, [dstend, -16]
> +	ret
> +
> +L(bytes_32_to_48):
> +	/* 32..48 */
> +	ldr	B_q, [src]
> +	str	A_q, [dstin]
> +	str	E_q, [dstend, -16]
> +	str	B_q, [dstin, 16]
> +	ret
> +
> +	.p2align 4
> +	/* Small copies: 0..16 bytes.  */
> +L(memcopy16):
> +	cmp	count, 8
> +	b.lo	L(bytes_0_to_8)
> +	ldr	A_l, [src]
> +	ldr	A_h, [srcend, -8]
> +	add	dstend, dstin, count
> +	str	A_l, [dstin]
> +	str	A_h, [dstend, -8]
> +	ret
> +	.p2align 4
> +
> +L(bytes_0_to_8):
> +	tbz	count, 2, L(bytes_0_to_3)
> +	ldr	A_lw, [src]
> +	ldr	A_hw, [srcend, -4]
> +	add	dstend, dstin, count
> +	str	A_lw, [dstin]
> +	str	A_hw, [dstend, -4]
> +	ret
> +
> +	/* Copy 0..3 bytes.  Use a branchless sequence that copies the same
> +	   byte 3 times if count==1, or the 2nd byte twice if count==2.  */
> +L(bytes_0_to_3):
> +	cbz	count, 1f
> +	lsr	tmp1, count, 1
> +	ldrb	A_lw, [src]
> +	ldrb	A_hw, [srcend, -1]
> +	add	dstend, dstin, count
> +	ldrb	B_lw, [src, tmp1]
> +	strb	B_lw, [dstin, tmp1]
> +	strb	A_hw, [dstend, -1]
> +	strb	A_lw, [dstin]
> +1:
> +	ret
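
The branchless 0..3 byte sequence above (the cbz handles count == 0),
restated in C for readers.  This is only an illustration, not code from the
patch: for count == 1 all three stores land on byte 0, and for count == 2
the middle store simply rewrites byte 1.

  #include <stddef.h>

  /* Branchless copy of 1..3 bytes: load first, middle and last byte,
     then store them; duplicated stores are harmless.  */
  static void
  copy_1_to_3 (unsigned char *dst, const unsigned char *src, size_t count)
  {
    size_t mid = count >> 1;          /* 0 for count 1, 1 for count 2..3 */
    unsigned char first  = src[0];
    unsigned char middle = src[mid];
    unsigned char last   = src[count - 1];

    dst[mid] = middle;
    dst[count - 1] = last;
    dst[0] = first;
  }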
> +
> +	.p2align 4
> +
> +L(memcpy_copy96):
> +	/* Copying 65..96 bytes.  A_q (first 16 bytes) and
> +	   E_q (last 16 bytes) are already loaded.  The size
> +	   is large enough to benefit from aligned loads.  */
> +	bic	src, src, 15
> +	ldp	B_q, C_q, [src]
> +	/* Loaded 64 bytes; the second 16-byte chunk may overlap
> +	   the first chunk by tmp1 bytes.  Stored 16 bytes.  */
> +	sub	dst, dstin, tmp1
> +	add	count, count, tmp1
> +	/* The range of count being [65..96] becomes [65..111]
> +	   after tmp1 [0..15] gets added to it,
> +	   count now is +48.  */
> +	cmp	count, 80
> +	b.gt	L(copy96_medium)
> +	ldr	D_q, [src, 32]
> +	stp	B_q, C_q, [dst, 16]
> +	str	D_q, [dst, 48]
> +	str	A_q, [dstin]
> +	str	E_q, [dstend, -16]
> +	ret
> +
> +	.p2align 4
> +L(copy96_medium):
> +	ldp	D_q, G_q, [src, 32]
> +	cmp	count, 96
> +	b.gt	L(copy96_large)
> +	stp	B_q, C_q, [dst, 16]
> +	stp	D_q, G_q, [dst, 48]
> +	str	A_q, [dstin]
> +	str	E_q, [dstend, -16]
> +	ret
> +
> +L(copy96_large):
> +	ldr	F_q, [src, 64]
> +	str	B_q, [dst, 16]
> +	stp	C_q, D_q, [dst, 32]
> +	stp	G_q, F_q, [dst, 64]
> +	str	A_q, [dstin]
> +	str	E_q, [dstend, -16]
> +	ret
> +
> +	.p2align 4
> +L(memcopy_long):
> +	cmp	count, 2048
> +	b.ls	L(copy2048_large)
> +	ldr	A_q, [src], #16
> +	and	tmp1, src, 15
> +	bic	src, src, 15
> +	ldp	B_q, C_q, [src], #32
> +	sub	dst, dstin, tmp1
> +	add	count, count, tmp1
> +	add	dst, dst, 16
> +	ldp	D_q, E_q, [src], #32
> +	str	A_q, [dstin]
> +
> +	/* Already loaded 64+16 bytes.  Check if at
> +	   least 64 more bytes are left.  */
> +	subs	count, count, 64+64+16
> +	b.lt	L(loop128_exit0)
> +	cmp	count, MEMCPY_PREFETCH_LDR + 64 + 32
> +	b.lt	L(loop128)
> +	sub	count, count, MEMCPY_PREFETCH_LDR + 64 + 32
> +
> +	.p2align 4
> +
> +L(loop128_prefetch):
> +	prfm	pldl1strm, [src, MEMCPY_PREFETCH_LDR]
> +	ldp	F_q, G_q, [src], #32
> +	stp	B_q, C_q, [dst], #32
> +	ldp	H_q, I_q, [src], #32
> +	prfm	pldl1strm, [src, MEMCPY_PREFETCH_LDR]
> +	ldp	B_q, C_q, [src], #32
> +	stp	D_q, E_q, [dst], #32
> +	ldp	D_q, E_q, [src], #32
> +	stp	F_q, G_q, [dst], #32
> +	stp	H_q, I_q, [dst], #32
> +	subs	count, count, 128
> +	b.ge	L(loop128_prefetch)
> +
> +	add	count, count, MEMCPY_PREFETCH_LDR + 64 + 32
> +	.p2align 4
> +L(loop128):
> +	ldp	F_q, G_q, [src], #32
> +	ldp	H_q, I_q, [src], #32
> +	stp	B_q, C_q, [dst], #32
> +	stp	D_q, E_q, [dst], #32
> +	subs	count, count, 64
> +	b.lt	L(loop128_exit1)
> +	ldp	B_q, C_q, [src], #32
> +	ldp	D_q, E_q, [src], #32
> +	stp	F_q, G_q, [dst], #32
> +	stp	H_q, I_q, [dst], #32
> +	subs	count, count, 64
> +	b.ge	L(loop128)
> +L(loop128_exit0):
> +	ldp	F_q, G_q, [srcend, -64]
> +	ldp	H_q, I_q, [srcend, -32]
> +	stp	B_q, C_q, [dst], #32
> +	stp	D_q, E_q, [dst]
> +	stp	F_q, G_q, [dstend, -64]
> +	stp	H_q, I_q, [dstend, -32]
> +	ret
> +L(loop128_exit1):
> +	ldp	B_q, C_q, [srcend, -64]
> +	ldp	D_q, E_q, [srcend, -32]
> +	stp	F_q, G_q, [dst], #32
> +	stp	H_q, I_q, [dst]
> +	stp	B_q, C_q, [dstend, -64]
> +	stp	D_q, E_q, [dstend, -32]
> +	ret
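
A note on the structure of L(loop128_prefetch)/L(loop128) above: each
iteration stores the data loaded in the previous step while already loading
the next block, so loads run ahead of stores.  In C the shape is roughly the
sketch below.  This is only an illustration of the pipelining (copy_pipelined
is a made-up name); the real loop moves 128 bytes per iteration in Q
registers and adds the prfm prefetches:

  #include <string.h>

  typedef struct { unsigned char b[64]; } block64;

  /* Illustration: copy 'blocks' 64-byte blocks (blocks >= 1), issuing the
     load of the next block before the store of the current one.  */
  static void
  copy_pipelined (unsigned char *dst, const unsigned char *src, size_t blocks)
  {
    block64 cur;
    memcpy (&cur, src, sizeof cur);        /* prime: load the first block */
    src += 64;

    while (--blocks)
      {
        block64 next;
        memcpy (&next, src, sizeof next);  /* load ahead */
        memcpy (dst, &cur, sizeof cur);    /* store the previous block */
        cur = next;
        src += 64;
        dst += 64;
      }

    memcpy (dst, &cur, sizeof cur);        /* drain: store the last block */
  }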
> +
> +	/* Long copies: 96..2048 bytes.  */
> +L(copy2048_large):
> +	and	tmp1, dstin, 15
> +	bic	dst, dstin, 15
> +	ldp	D_l, D_h, [src]
> +	sub	src, src, tmp1
> +	add	count, count, tmp1	/* Count is now 16 too large.  */
> +	ldp	A_l, A_h, [src, 16]
> +	stp	D_l, D_h, [dstin]
> +	ldp	B_l, B_h, [src, 32]
> +	ldp	C_l, C_h, [src, 48]
> +	ldp	D_l, D_h, [src, 64]!
> +	subs	count, count, 128 + 16	/* Test and readjust count.  */
> +	b.ls	L(last64)
> +
> +L(loop64):
> +	stp	A_l, A_h, [dst, 16]
> +	ldp	A_l, A_h, [src, 16]
> +	stp	B_l, B_h, [dst, 32]
> +	ldp	B_l, B_h, [src, 32]
> +	stp	C_l, C_h, [dst, 48]
> +	ldp	C_l, C_h, [src, 48]
> +	stp	D_l, D_h, [dst, 64]
> +	ldp	D_l, D_h, [src, 64]
> +	add	dst, dst, 64
> +	add	src, src, 64
> +	subs	count, count, 64
> +	b.hi	L(loop64)
> +
> +	/* Write the last full set of 64 bytes.  The remainder is at most 64
> +	   bytes, so it is safe to always copy 64 bytes from the end even if
> +	   there is just 1 byte left.  */
> +L(last64):
> +	ldp	E_l, E_h, [srcend, -64]
> +	stp	A_l, A_h, [dst, 16]
> +	ldp	A_l, A_h, [srcend, -48]
> +	stp	B_l, B_h, [dst, 32]
> +	ldp	B_l, B_h, [srcend, -32]
> +	stp	C_l, C_h, [dst, 48]
> +	ldp	C_l, C_h, [srcend, -16]
> +	stp	D_l, D_h, [dst, 64]
> +	stp	E_l, E_h, [dstend, -64]
> +	stp	A_l, A_h, [dstend, -48]
> +	stp	B_l, B_h, [dstend, -32]
> +	stp	C_l, C_h, [dstend, -16]
> +	ret
> +
> +	/* Long move: more than 512 bytes; align the dstend.  */
> +	.p2align 4
> +L(move_long):
> +1:
> +	add	srcend, src, count
> +	add	dstend, dstin, count
> +
> +	and	tmp1, dstend, 15
> +	ldr	D_q, [srcend, -16]
> +	sub	srcend, srcend, tmp1
> +	sub	count, count, tmp1
> +	ldp	A_q, B_q, [srcend, -32]
> +	str	D_q, [dstend, -16]
> +	ldp	C_q, D_q, [srcend, -64]!
> +	sub	dstend, dstend, tmp1
> +	subs	count, count, 128
> +	b.ls	2f
> +
> +	.p2align 4
> +1:
> +	subs	count, count, 64
> +	stp	A_q, B_q, [dstend, -32]
> +	ldp	A_q, B_q, [srcend, -32]
> +	stp	C_q, D_q, [dstend, -64]!
> +	ldp	C_q, D_q, [srcend, -64]!
> +	b.hi	1b
> +
> +	/* Write the last full set of 64 bytes.  The remainder is at most 64
> +	   bytes, so it is safe to always copy 64 bytes from the start even if
> +	   there is just 1 byte left.  */
> +2:
> +	ldp	E_q, F_q, [src, 32]
> +	ldp	G_q, H_q, [src]
> +	stp	A_q, B_q, [dstend, -32]
> +	stp	C_q, D_q, [dstend, -64]
> +	stp	E_q, F_q, [dstin, 32]
> +	stp	G_q, H_q, [dstin]
> +3:	ret
> +
> +	/* Middle move: 96..512 bytes.  */
> +	.p2align 4
> +L(move_middle):
> +	cbz	tmp1, 3f
> +	add	srcend, src, count
> +	prfm	PLDL1STRM, [srcend, -64]
> +	add	dstend, dstin, count
> +	and	tmp1, dstend, 15
> +	ldr	D_q, [srcend, -16]
> +	sub	srcend, srcend, tmp1
> +	sub	count, count, tmp1
> +	ldr	A_q, [srcend, -16]
> +	str	D_q, [dstend, -16]
> +	ldr	B_q, [srcend, -32]
> +	ldr	C_q, [srcend, -48]
> +	ldr	D_q, [srcend, -64]!
> +	sub	dstend, dstend, tmp1
> +	subs	count, count, 128
> +	b.ls	2f
> +
> +1:
> +	str	A_q, [dstend, -16]
> +	ldr	A_q, [srcend, -16]
> +	str	B_q, [dstend, -32]
> +	ldr	B_q, [srcend, -32]
> +	str	C_q, [dstend, -48]
> +	ldr	C_q, [srcend, -48]
> +	str	D_q, [dstend, -64]!
> +	ldr	D_q, [srcend, -64]!
> +	subs	count, count, 64
> +	b.hi	1b
> +
> +	/* Write the last full set of 64 bytes.  The remainder is at most 64
> +	   bytes, so it is safe to always copy 64 bytes from the start even if
> +	   there is just 1 byte left.  */
> +2:
> +	ldr	G_q, [src, 48]
> +	str	A_q, [dstend, -16]
> +	ldr	A_q, [src, 32]
> +	str	B_q, [dstend, -32]
> +	ldr	B_q, [src, 16]
> +	str	C_q, [dstend, -48]
> +	ldr	C_q, [src]
> +	str	D_q, [dstend, -64]
> +	str	G_q, [dstin, 48]
> +	str	A_q, [dstin, 32]
> +	str	B_q, [dstin, 16]
> +	str	C_q, [dstin]
> +3:	ret
> +
> +
> +END (MEMCPY)
> +	.section	.rodata
> +	.p2align	4
> +
> +libc_hidden_builtin_def (MEMCPY)
> +#endif
> --
> 2.14.1.windows.1
>
>