From: "Andrew Pinski (QUIC)" <quic_apinski@quicinc.com>
To: Carlos O'Donell <carlos@redhat.com>,
"Andrew Pinski (QUIC)" <quic_apinski@quicinc.com>,
"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
Subject: RE: [PATCH 1/2] Aarch64: Add memcpy for qualcomm's oryon-1 core
Date: Mon, 6 May 2024 15:54:39 +0000 [thread overview]
Message-ID: <DM6PR02MB40585CC4F66F27E8B82C8D4DB81C2@DM6PR02MB4058.namprd02.prod.outlook.com> (raw)
In-Reply-To: <4c6e910a-704b-40af-bd0a-178a8a6018f0@redhat.com>
> -----Original Message-----
> From: Carlos O'Donell <carlos@redhat.com>
> Sent: Monday, May 6, 2024 6:08 AM
> To: Andrew Pinski (QUIC) <quic_apinski@quicinc.com>; libc-
> alpha@sourceware.org
> Subject: Re: [PATCH 1/2] Aarch64: Add memcpy for qualcomm's oryon-1 core
>
> On 5/2/24 21:21, Andrew Pinski wrote:
> > Qualcomm's new core (oryon-1) has a different performance
> > characteristic than other cores. For memcpy, it is faster to use the
> > GPRs to do the copy for large sizes (2x faster). For even larger
> > sizes, it is better to use the nontemporal load/store instructions so
> > we don't pollute the L1/L2 caches.
>
> Fails pre-commit CI:
> https://patchwork.sourceware.org/project/glibc/patch/20240503012140.2
> 111502-1-quic_apinski@quicinc.com/
>
> Please review with Linaro's help.
>
> Adhemerval noted it fails makefile linting.
>
> Please review make check result.s
Yes I noticed that failure after I submitted the patch. What happened was that I had changed the filename of the memcpy right before submitting the patch and forgot to resort Makefile,
Since it was a small patch on top I decided it was not enough to resubmit a new patch and I thought others would notice it was just failing the sort check.
Thanks,
Andrew Pinski
>
>
> > For smaller sizes, the characteristic are very similar to other cores.
> > I used the thunderx memcpy as a starting point and expanded from there.
> >
> > Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
> > ---
> > sysdeps/aarch64/cpu-features.h | 6 +
> > sysdeps/aarch64/multiarch/Makefile | 1 +
> > sysdeps/aarch64/multiarch/ifunc-impl-list.c | 3 +
> > sysdeps/aarch64/multiarch/memcpy.c | 5 +
> > sysdeps/aarch64/multiarch/memcpy_oryon1.S | 301
> ++++++++++++++++++++
> > 5 files changed, 316 insertions(+)
> > create mode 100644 sysdeps/aarch64/multiarch/memcpy_oryon1.S
> >
> > diff --git a/sysdeps/aarch64/cpu-features.h
> > b/sysdeps/aarch64/cpu-features.h index 31782b66f9..bc8d842238
> 100644
> > --- a/sysdeps/aarch64/cpu-features.h
> > +++ b/sysdeps/aarch64/cpu-features.h
> > @@ -1,6 +1,7 @@
> > /* Initialize CPU feature data. AArch64 version.
> > This file is part of the GNU C Library.
> > Copyright (C) 2017-2024 Free Software Foundation, Inc.
> > + Copyright The GNU Toolchain Authors.
> >
> > The GNU C Library is free software; you can redistribute it and/or
> > modify it under the terms of the GNU Lesser General Public @@
> > -56,6 +57,11 @@
> > #define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F'
> \
> > && MIDR_PARTNUM(midr) == 0x001)
> >
> > +#define IS_ORYON1(midr) (MIDR_IMPLEMENTOR(midr) == 'Q'
> \
> > + && (MIDR_PARTNUM(midr) == 0x001 \
> > + || (MIDR_PARTNUM(midr) == 0x002 \
> > + && MIDR_VARIANT(midr) == 0)))
> > +
> > struct cpu_features
> > {
> > uint64_t midr_el1;
> > diff --git a/sysdeps/aarch64/multiarch/Makefile
> > b/sysdeps/aarch64/multiarch/Makefile
> > index e4720b7468..253eee6c79 100644
> > --- a/sysdeps/aarch64/multiarch/Makefile
> > +++ b/sysdeps/aarch64/multiarch/Makefile
> > @@ -8,6 +8,7 @@ sysdep_routines += \
> > memcpy_sve \
> > memcpy_thunderx \
> > memcpy_thunderx2 \
> > + memcpy_oryon1 \
> > memmove_mops \
> > memset_a64fx \
> > memset_emag \
> > diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> > b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> > index ecd0f87de6..65c56b9b41 100644
> > --- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> > @@ -1,5 +1,6 @@
> > /* Enumerate available IFUNC implementations of a function. AARCH64
> version.
> > Copyright (C) 2017-2024 Free Software Foundation, Inc.
> > + Copyright The GNU Toolchain Authors.
> > This file is part of the GNU C Library.
> >
> > The GNU C Library is free software; you can redistribute it and/or
> > @@ -35,6 +36,7 @@ __libc_ifunc_impl_list (const char *name, struct
> libc_ifunc_impl *array,
> > /* Support sysdeps/aarch64/multiarch/memcpy.c, memmove.c and
> memset.c. */
> > IFUNC_IMPL (i, name, memcpy,
> > IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
> > + IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_oryon1)
> > IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
> > #if HAVE_AARCH64_SVE_ASM
> > IFUNC_IMPL_ADD (array, i, memcpy, sve && !bti,
> __memcpy_a64fx)
> > @@ -44,6 +46,7 @@ __libc_ifunc_impl_list (const char *name, struct
> libc_ifunc_impl *array,
> > IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
> > IFUNC_IMPL (i, name, memmove,
> > IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
> > + IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_oryon1)
> > IFUNC_IMPL_ADD (array, i, memmove, !bti,
> __memmove_thunderx2)
> > #if HAVE_AARCH64_SVE_ASM
> > IFUNC_IMPL_ADD (array, i, memmove, sve && !bti,
> > __memmove_a64fx) diff --git a/sysdeps/aarch64/multiarch/memcpy.c
> > b/sysdeps/aarch64/multiarch/memcpy.c
> > index ce53567dab..15c954778b 100644
> > --- a/sysdeps/aarch64/multiarch/memcpy.c
> > +++ b/sysdeps/aarch64/multiarch/memcpy.c
> > @@ -1,5 +1,6 @@
> > /* Multiple versions of memcpy. AARCH64 version.
> > Copyright (C) 2017-2024 Free Software Foundation, Inc.
> > + Copyright The GNU Toolchain Authors.
> > This file is part of the GNU C Library.
> >
> > The GNU C Library is free software; you can redistribute it and/or
> > @@ -34,6 +35,7 @@ extern __typeof (__redirect_memcpy)
> > __memcpy_thunderx2 attribute_hidden; extern __typeof
> > (__redirect_memcpy) __memcpy_a64fx attribute_hidden; extern __typeof
> > (__redirect_memcpy) __memcpy_sve attribute_hidden; extern __typeof
> > (__redirect_memcpy) __memcpy_mops attribute_hidden;
> > +extern __typeof (__redirect_memcpy) __memcpy_oryon1
> attribute_hidden;
> >
> > static inline __typeof (__redirect_memcpy) * select_memcpy_ifunc
> > (void) @@ -50,6 +52,9 @@ select_memcpy_ifunc (void)
> > return prefer_sve_ifuncs ? __memcpy_sve : __memcpy_generic;
> > }
> >
> > + if (IS_ORYON1 (midr))
> > + return __memcpy_oryon1;
> > +
> > if (IS_THUNDERX (midr))
> > return __memcpy_thunderx;
> >
> > diff --git a/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> > b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> > new file mode 100644
> > index 0000000000..21f9e1d4e0
> > --- /dev/null
> > +++ b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> > @@ -0,0 +1,301 @@
> > +/* A oryon-1 core Optimized memcpy implementation for AARCH64.
> > + Copyright (C) 2017-2024 Free Software Foundation, Inc.
> > + Copyright The GNU Toolchain Authors.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library; if not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +#include <sysdep.h>
> > +
> > +/* Assumptions:
> > + *
> > + * ARMv8-a, AArch64, unaligned accesses.
> > + *
> > + */
> > +
> > +#define dstin x0
> > +#define src x1
> > +#define count x2
> > +#define dst x3
> > +#define srcend x4
> > +#define dstend x5
> > +#define A_l x6
> > +#define A_lw w6
> > +#define A_h x7
> > +#define A_hw w7
> > +#define B_l x8
> > +#define B_lw w8
> > +#define B_h x9
> > +#define C_l x10
> > +#define C_h x11
> > +#define D_l x12
> > +#define D_h x13
> > +#define E_l src
> > +#define E_h count
> > +#define F_l srcend
> > +#define F_h dst
> > +#define G_l count
> > +#define G_h dst
> > +#define tmp1 x14
> > +
> > +/* Copies are split into 3 main cases: small copies of up to 16 bytes,
> > + medium copies of 17..96 bytes which are fully unrolled. Large copies
> > + of more than 96 bytes align the destination and use an unrolled loop
> > + processing 64 bytes per iteration.
> > + In order to share code with memmove, small and medium copies read all
> > + data before writing, allowing any kind of overlap. So small, medium
> > + and large backwards memmoves are handled by falling through into
> memcpy.
> > + Overlapping large forward memmoves use a loop that copies backwards.
> > +*/
> > +
> > +ENTRY (__memmove_oryon1)
> > +
> > + PTR_ARG (0)
> > + PTR_ARG (1)
> > + SIZE_ARG (2)
> > +
> > + sub tmp1, dstin, src
> > + cmp count, 96
> > + ccmp tmp1, count, 2, hi
> > + b.lo L(move_long)
> > +
> > + /* Common case falls through into memcpy. */ END
> (__memmove_oryon1)
> > +
> > +ENTRY (__memcpy_oryon1)
> > +
> > + PTR_ARG (0)
> > + PTR_ARG (1)
> > + SIZE_ARG (2)
> > +
> > + add srcend, src, count
> > + add dstend, dstin, count
> > + cmp count, 16
> > + b.ls L(copy16)
> > + cmp count, 96
> > + b.hi L(copy_long)
> > +
> > + /* Medium copies: 17..96 bytes. */
> > + sub tmp1, count, 1
> > + ldp A_l, A_h, [src]
> > + tbnz tmp1, 6, L(copy96)
> > + ldp D_l, D_h, [srcend, -16]
> > + tbz tmp1, 5, 1f
> > + ldp B_l, B_h, [src, 16]
> > + ldp C_l, C_h, [srcend, -32]
> > + stp B_l, B_h, [dstin, 16]
> > + stp C_l, C_h, [dstend, -32]
> > +1:
> > + stp A_l, A_h, [dstin]
> > + stp D_l, D_h, [dstend, -16]
> > + ret
> > +
> > + .p2align 6
> > + /* Small copies: 0..16 bytes. */
> > +L(copy16):
> > + cmp count, 8
> > + b.lo 1f
> > + ldr A_l, [src]
> > + ldr A_h, [srcend, -8]
> > + str A_l, [dstin]
> > + str A_h, [dstend, -8]
> > + ret
> > + .p2align 6
> > +1:
> > + tbz count, 2, 1f
> > + ldr A_lw, [src]
> > + ldr A_hw, [srcend, -4]
> > + str A_lw, [dstin]
> > + str A_hw, [dstend, -4]
> > + ret
> > +
> > + /* Copy 0..3 bytes. Use a branchless sequence that copies the same
> > + byte 3 times if count==1, or the 2nd byte twice if count==2. */
> > +1:
> > + cbz count, 2f
> > + lsr tmp1, count, 1
> > + ldrb A_lw, [src]
> > + ldrb A_hw, [srcend, -1]
> > + ldrb B_lw, [src, tmp1]
> > + strb A_lw, [dstin]
> > + strb B_lw, [dstin, tmp1]
> > + strb A_hw, [dstend, -1]
> > +2: ret
> > +
> > + .p2align 6
> > + /* Copy 64..96 bytes. Copy 64 bytes from the start and
> > + 32 bytes from the end. */
> > +L(copy96):
> > + ldp B_l, B_h, [src, 16]
> > + ldp C_l, C_h, [src, 32]
> > + ldp D_l, D_h, [src, 48]
> > + ldp E_l, E_h, [srcend, -32]
> > + ldp F_l, F_h, [srcend, -16]
> > + stp A_l, A_h, [dstin]
> > + stp B_l, B_h, [dstin, 16]
> > + stp C_l, C_h, [dstin, 32]
> > + stp D_l, D_h, [dstin, 48]
> > + stp E_l, E_h, [dstend, -32]
> > + stp F_l, F_h, [dstend, -16]
> > + ret
> > +
> > + /* Align DST to 16 byte alignment so that we don't cross cache line
> > + boundaries on both loads and stores. There are at least 96 bytes
> > + to copy, so copy 16 bytes unaligned and then align. The loop
> > + copies 64 bytes per iteration and prefetches one iteration ahead.
> > +*/
> > +
> > + .p2align 6
> > +L(copy_long):
> > +
> > + /* On oryon1 cores, large memcpy's are helped by using ldnp/stnp.
> > + This loop is identical to the one below it but using ldnp/stnp
> > + instructions. For loops that are less than 32768 bytes,
> > + the ldnp/stnp does not help and slow the code down so we only
> > + use the ldnp/stnp loop for the largest memcpys. */
> > +
> > + cmp count, #32768
> > + b.lo L(copy_long_without_nontemp)
> > + and tmp1, dstin, 15
> > + bic dst, dstin, 15
> > + ldnp D_l, D_h, [src]
> > + sub src, src, tmp1
> > + add count, count, tmp1 /* Count is now 16 too large. */
> > + ldnp A_l, A_h, [src, 16]
> > + stnp D_l, D_h, [dstin]
> > + ldnp B_l, B_h, [src, 32]
> > + ldnp C_l, C_h, [src, 48]
> > + ldnp D_l, D_h, [src, 64]
> > + add src, src, #64
> > + subs count, count, 128 + 16 /* Test and readjust count. */
> > +
> > +L(nontemp_loop64):
> > + tbz src, #6, 1f
> > +1:
> > + stnp A_l, A_h, [dst, 16]
> > + ldnp A_l, A_h, [src, 16]
> > + stnp B_l, B_h, [dst, 32]
> > + ldnp B_l, B_h, [src, 32]
> > + stnp C_l, C_h, [dst, 48]
> > + ldnp C_l, C_h, [src, 48]
> > + stnp D_l, D_h, [dst, 64]
> > + ldnp D_l, D_h, [src, 64]
> > + add src, src, #64
> > + add dst, dst, #64
> > + subs count, count, 64
> > + b.hi L(nontemp_loop64)
> > + b L(last64)
> > +
> > +L(copy_long_without_nontemp):
> > +
> > + and tmp1, dstin, 15
> > + bic dst, dstin, 15
> > + ldp D_l, D_h, [src]
> > + sub src, src, tmp1
> > + add count, count, tmp1 /* Count is now 16 too large. */
> > + ldp A_l, A_h, [src, 16]
> > + stp D_l, D_h, [dstin]
> > + ldp B_l, B_h, [src, 32]
> > + ldp C_l, C_h, [src, 48]
> > + ldp D_l, D_h, [src, 64]!
> > + subs count, count, 128 + 16 /* Test and readjust count. */
> > + b.ls L(last64)
> > +L(loop64):
> > + stp A_l, A_h, [dst, 16]
> > + ldp A_l, A_h, [src, 16]
> > + stp B_l, B_h, [dst, 32]
> > + ldp B_l, B_h, [src, 32]
> > + stp C_l, C_h, [dst, 48]
> > + ldp C_l, C_h, [src, 48]
> > + stp D_l, D_h, [dst, 64]!
> > + ldp D_l, D_h, [src, 64]!
> > + subs count, count, 64
> > + b.hi L(loop64)
> > +
> > + /* Write the last full set of 64 bytes. The remainder is at most 64
> > + bytes, so it is safe to always copy 64 bytes from the end even if
> > + there is just 1 byte left. */
> > +L(last64):
> > + ldp E_l, E_h, [srcend, -64]
> > + stp A_l, A_h, [dst, 16]
> > + ldp A_l, A_h, [srcend, -48]
> > + stp B_l, B_h, [dst, 32]
> > + ldp B_l, B_h, [srcend, -32]
> > + stp C_l, C_h, [dst, 48]
> > + ldp C_l, C_h, [srcend, -16]
> > + stp D_l, D_h, [dst, 64]
> > + stp E_l, E_h, [dstend, -64]
> > + stp A_l, A_h, [dstend, -48]
> > + stp B_l, B_h, [dstend, -32]
> > + stp C_l, C_h, [dstend, -16]
> > + ret
> > +
> > + .p2align 6
> > +L(move_long):
> > + cbz tmp1, 3f
> > +
> > + add srcend, src, count
> > + add dstend, dstin, count
> > +
> > + /* Align dstend to 16 byte alignment so that we don't cross cache line
> > + boundaries on both loads and stores. There are at least 96 bytes
> > + to copy, so copy 16 bytes unaligned and then align. The loop
> > + copies 64 bytes per iteration and prefetches one iteration ahead.
> > +*/
> > +
> > + and tmp1, dstend, 15
> > + ldp D_l, D_h, [srcend, -16]
> > + sub srcend, srcend, tmp1
> > + sub count, count, tmp1
> > + ldp A_l, A_h, [srcend, -16]
> > + stp D_l, D_h, [dstend, -16]
> > + ldp B_l, B_h, [srcend, -32]
> > + ldp C_l, C_h, [srcend, -48]
> > + ldp D_l, D_h, [srcend, -64]!
> > + sub dstend, dstend, tmp1
> > + subs count, count, 128
> > + b.ls 2f
> > +
> > + nop
> > +1:
> > + stp A_l, A_h, [dstend, -16]
> > + ldp A_l, A_h, [srcend, -16]
> > + stp B_l, B_h, [dstend, -32]
> > + ldp B_l, B_h, [srcend, -32]
> > + stp C_l, C_h, [dstend, -48]
> > + ldp C_l, C_h, [srcend, -48]
> > + stp D_l, D_h, [dstend, -64]!
> > + ldp D_l, D_h, [srcend, -64]!
> > + subs count, count, 64
> > + b.hi 1b
> > +
> > + /* Write the last full set of 64 bytes. The remainder is at most 64
> > + bytes, so it is safe to always copy 64 bytes from the start even if
> > + there is just 1 byte left. */
> > +2:
> > + ldp G_l, G_h, [src, 48]
> > + stp A_l, A_h, [dstend, -16]
> > + ldp A_l, A_h, [src, 32]
> > + stp B_l, B_h, [dstend, -32]
> > + ldp B_l, B_h, [src, 16]
> > + stp C_l, C_h, [dstend, -48]
> > + ldp C_l, C_h, [src]
> > + stp D_l, D_h, [dstend, -64]
> > + stp G_l, G_h, [dstin, 48]
> > + stp A_l, A_h, [dstin, 32]
> > + stp B_l, B_h, [dstin, 16]
> > + stp C_l, C_h, [dstin]
> > +3: ret
> > +
> > +END (__memcpy_oryon1)
>
> --
> Cheers,
> Carlos.
next prev parent reply other threads:[~2024-05-06 15:54 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-03 1:21 Andrew Pinski
2024-05-03 1:21 ` [PATCH 2/2] Aarch64: Add new memset for Qualcomm's 0ryon-1 core Andrew Pinski
2024-05-06 13:07 ` [PATCH 1/2] Aarch64: Add memcpy for qualcomm's oryon-1 core Carlos O'Donell
2024-05-06 15:54 ` Andrew Pinski (QUIC) [this message]
2024-05-08 13:32 ` Carlos O'Donell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DM6PR02MB40585CC4F66F27E8B82C8D4DB81C2@DM6PR02MB4058.namprd02.prod.outlook.com \
--to=quic_apinski@quicinc.com \
--cc=carlos@redhat.com \
--cc=libc-alpha@sourceware.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).