From: Wilco Dijkstra
To: libc-stable@sourceware.org
Cc: nd
Subject: [2.28 COMMITTED] AArch64: Backport memcpy improvements
Date: Wed, 14 Oct 2020 16:54:15 +0000

commit e5dac996b9c5541d5c677565d4102566734202c4
Author: Wilco Dijkstra
Date:   Wed Oct 14 13:56:21 2020 +0100

    AArch64: Use __memcpy_simd on Neoverse N2/V1

    Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as
    the memcpy/memmove ifunc.

    Reviewed-by: Adhemerval Zanella
    (cherry picked from commit e11ed9d2b4558eeacff81557dc9557001af42a6b)

commit 98979f62b88dc781e99db84744646f298fcea62f
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:58:07 2020 +0100

    AArch64: Rename IS_ARES to IS_NEOVERSE_N1

    Rename IS_ARES to IS_NEOVERSE_N1 since that is a bit clearer.

    Reviewed-by: Carlos O'Donell
    (cherry picked from commit 0f6278a8793a5d04ea31878119eccf99f469a02d)

commit fe09348c4e183e390f0a8b806a543f8860b62559
Author: Wilco Dijkstra
Date:   Wed Mar 11 17:15:25 2020 +0000

    [AArch64] Improve integer memcpy

    Further optimize integer memcpy.  Small cases now include copies up
    to 32 bytes.  64-128 byte copies are split into two cases to improve
    performance of 64-96 byte copies.  Comments have been rewritten.

    (cherry picked from commit 700065132744e0dfa6d4d9142d63f6e3a1934726)

commit 722c93572e6344223cab8fbf78d2846a453f2487
Author: Krzysztof Koch
Date:   Tue Nov 5 17:35:18 2019 +0000

    aarch64: Increase small and medium cases for __memcpy_generic

    Increase the upper bound on medium cases from 96 to 128 bytes.
    Now, up to 128 bytes are copied unrolled.

    Increase the upper bound on small cases from 16 to 32 bytes so that
    copies of 17-32 bytes are not impacted by the larger medium case.

    Benchmarking:
    The attached figures show relative timing difference with respect
    to 'memcpy_generic', which is the existing implementation.
    'memcpy_med_128' denotes the version of memcpy_generic with
    only the medium case enlarged. The 'memcpy_med_128_small_32' numbers
    are for the version of memcpy_generic submitted in this patch, which
    has both medium and small cases enlarged. The figures were generated
    using the script from:
    https://www.sourceware.org/ml/libc-alpha/2019-10/msg00563.html

    Depending on the platform, the performance improvement in the
    bench-memcpy-random.c benchmark ranges from 6% to 20% between
    the original and final version of memcpy.S

    Tested against GLIBC testsuite and randomized tests.

    (cherry picked from commit b9f145df85145506f8e61bac38b792584a38d88f)

commit b915da29dab5d8c6b9cdb1ee6fdc1e0ec6ef39e1
Author: Wilco Dijkstra
Date:   Fri Aug 28 17:51:40 2020 +0100

    AArch64: Improve backwards memmove performance

    On some microarchitectures performance of the backwards memmove improves if
    the stores use STR with decreasing addresses.  So change the memmove loop
    in memcpy_advsimd.S to use 2x STR rather than STP.

    Reviewed-by: Adhemerval Zanella
    (cherry picked from commit bd394d131c10c9ec22c6424197b79410042eed99)

commit 4bd28df0b0598c380f4ae63b96eaadc782c9d709
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:55:07 2020 +0100

    AArch64: Add optimized Q-register memcpy

    Add a new memcpy using 128-bit Q registers - this is faster on modern
    cores and reduces codesize.  Similar to the generic memcpy, small cases
    include copies up to 32 bytes.  64-128 byte copies are split into two
    cases to improve performance of 64-96 byte copies.  Large copies align
    the source rather than the destination.

    bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
    so make this memcpy the default on N1 (on Centriq it is 15% faster than
    memcpy_falkor).

    Passes GLIBC regression tests.

    Reviewed-by: Szabolcs Nagy
    (cherry picked from commit 4a733bf375238a6a595033b5785cea7f27d61307)

commit 118fbee7a0dfac0d311b8a7a8f8bd8d1fb6e205b
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:50:02 2020 +0100

    AArch64: Align ENTRY to a cacheline

    Given almost all uses of ENTRY are for string/memory functions,
    align ENTRY to a cacheline to simplify things.

    Reviewed-by: Carlos O'Donell
    (cherry picked from commit 34f0d01d5e43c7dedd002ab47f6266dfb5b79c22)
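Illustrative note (not part of the committed patch): the size classes described in the commit messages can be modelled in portable C. The sketch below is a simplification under stated assumptions: memcpy_model and copy_block are hypothetical names, plain memcpy calls stand in for the LDP/STP and Q-register load/store pairs, and the 16-byte alignment step of the real code is omitted, so it only shows which byte ranges each size class reads and writes.

#include <stddef.h>
#include <string.h>

/* Stand-in for one LDP/STP or LDR/STR Q-register pair of width n.  */
static void copy_block (unsigned char *d, const unsigned char *s, size_t n)
{
  memcpy (d, s, n);
}

/* Hypothetical model of the copy strategy: small copies (0..32 bytes) and
   medium copies (33..128 bytes) load from both ends with possibly
   overlapping stores; large copies run a 64-byte block loop and always
   finish by copying the last 64 bytes from the end.  */
void *memcpy_model (void *dst, const void *src, size_t count)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  if (count <= 32)
    {
      if (count >= 16)
        {               /* 16..32: two possibly overlapping 16-byte copies.  */
          copy_block (d, s, 16);
          copy_block (d + count - 16, s + count - 16, 16);
        }
      else if (count >= 8)
        {               /* 8..15 bytes.  */
          copy_block (d, s, 8);
          copy_block (d + count - 8, s + count - 8, 8);
        }
      else if (count >= 4)
        {               /* 4..7 bytes.  */
          copy_block (d, s, 4);
          copy_block (d + count - 4, s + count - 4, 4);
        }
      else if (count > 0)
        {               /* 1..3 bytes: first, middle and last byte.  */
          d[0] = s[0];
          d[count / 2] = s[count / 2];
          d[count - 1] = s[count - 1];
        }
    }
  else if (count <= 128)
    {                   /* 33..128: 32 bytes from each end, plus another
                           32 bytes from each side once count exceeds 64.  */
      copy_block (d, s, 32);
      copy_block (d + count - 32, s + count - 32, 32);
      if (count > 64)
        {
          copy_block (d + 32, s + 32, 32);
          copy_block (d + count - 64, s + count - 64, 32);
        }
    }
  else
    {                   /* More than 128 bytes: full 64-byte blocks, then the
                           final (possibly overlapping) 64-byte tail.  */
      size_t i = 0;
      while (count - i > 64)
        {
          copy_block (d + i, s + i, 64);
          i += 64;
        }
      copy_block (d + count - 64, s + count - 64, 64);
    }
  return dst;
}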
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index 7e1163e..919b28d 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -33,32 +33,24 @@
 #define A_l	x6
 #define A_lw	w6
 #define A_h	x7
-#define A_hw	w7
 #define B_l	x8
 #define B_lw	w8
 #define B_h	x9
 #define C_l	x10
+#define C_lw	w10
 #define C_h	x11
 #define D_l	x12
 #define D_h	x13
-#define E_l	src
-#define E_h	count
-#define F_l	srcend
-#define F_h	dst
+#define E_l	x14
+#define E_h	x15
+#define F_l	x16
+#define F_h	x17
 #define G_l	count
 #define G_h	dst
+#define H_l	src
+#define H_h	srcend
 #define tmp1	x14
 
-/* Copies are split into 3 main cases: small copies of up to 16 bytes,
-   medium copies of 17..96 bytes which are fully unrolled. Large copies
-   of more than 96 bytes align the destination and use an unrolled loop
-   processing 64 bytes per iteration.
-   In order to share code with memmove, small and medium copies read all
-   data before writing, allowing any kind of overlap. So small, medium
-   and large backwards memmoves are handled by falling through into memcpy.
-   Overlapping large forward memmoves use a loop that copies backwards.
-*/
-
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
@@ -66,108 +58,115 @@
 # define MEMCPY memcpy
 #endif
 
-ENTRY_ALIGN (MEMMOVE, 6)
-
-	DELOUSE (0)
-	DELOUSE (1)
-	DELOUSE (2)
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
 
-	sub	tmp1, dstin, src
-	cmp	count, 96
-	ccmp	tmp1, count, 2, hi
-	b.lo	L(move_long)
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
 
-	/* Common case falls through into memcpy.  */
-END (MEMMOVE)
-libc_hidden_builtin_def (MEMMOVE)
-ENTRY (MEMCPY)
+   Large copies use a software pipelined loop processing 64 bytes per
+   iteration.  The destination pointer is 16-byte aligned to minimize
+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
+   from the end.
+*/
 
+ENTRY_ALIGN (MEMCPY, 6)
 	DELOUSE (0)
 	DELOUSE (1)
 	DELOUSE (2)
 
-	prfm	PLDL1KEEP, [src]
 	add	srcend, src, count
 	add	dstend, dstin, count
-	cmp	count, 16
-	b.ls	L(copy16)
-	cmp	count, 96
+	cmp	count, 128
 	b.hi	L(copy_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
 
-	/* Medium copies: 17..96 bytes.  */
-	sub	tmp1, count, 1
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
 	ldp	A_l, A_h, [src]
-	tbnz	tmp1, 6, L(copy96)
 	ldp	D_l, D_h, [srcend, -16]
-	tbz	tmp1, 5, 1f
-	ldp	B_l, B_h, [src, 16]
-	ldp	C_l, C_h, [srcend, -32]
-	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstend, -32]
-1:
 	stp	A_l, A_h, [dstin]
 	stp	D_l, D_h, [dstend, -16]
 	ret
 
-	.p2align 4
-	/* Small copies: 0..16 bytes.  */
+	/* Copy 8-15 bytes.  */
 L(copy16):
-	cmp	count, 8
-	b.lo	1f
+	tbz	count, 3, L(copy8)
 	ldr	A_l, [src]
 	ldr	A_h, [srcend, -8]
 	str	A_l, [dstin]
 	str	A_h, [dstend, -8]
 	ret
-	.p2align 4
-1:
-	tbz	count, 2, 1f
+
+	.p2align 3
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
 	ldr	A_lw, [src]
-	ldr	A_hw, [srcend, -4]
+	ldr	B_lw, [srcend, -4]
 	str	A_lw, [dstin]
-	str	A_hw, [dstend, -4]
+	str	B_lw, [dstend, -4]
 	ret
 
-	/* Copy 0..3 bytes.  Use a branchless sequence that copies the same
-	   byte 3 times if count==1, or the 2nd byte twice if count==2.  */
-1:
-	cbz	count, 2f
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
 	lsr	tmp1, count, 1
 	ldrb	A_lw, [src]
-	ldrb	A_hw, [srcend, -1]
+	ldrb	C_lw, [srcend, -1]
 	ldrb	B_lw, [src, tmp1]
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
-	strb	A_hw, [dstend, -1]
-2:	ret
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret
 
 	.p2align 4
-	/* Copy 64..96 bytes.  Copy 64 bytes from the start and
-	   32 bytes from the end.  */
-L(copy96):
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp	A_l, A_h, [src]
 	ldp	B_l, B_h, [src, 16]
-	ldp	C_l, C_h, [src, 32]
-	ldp	D_l, D_h, [src, 48]
-	ldp	E_l, E_h, [srcend, -32]
-	ldp	F_l, F_h, [srcend, -16]
+	ldp	C_l, C_h, [srcend, -32]
+	ldp	D_l, D_h, [srcend, -16]
+	cmp	count, 64
+	b.hi	L(copy128)
 	stp	A_l, A_h, [dstin]
 	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstin, 32]
-	stp	D_l, D_h, [dstin, 48]
-	stp	E_l, E_h, [dstend, -32]
-	stp	F_l, F_h, [dstend, -16]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
 	ret
 
-	/* Align DST to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 96 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
+	.p2align 4
+	/* Copy 65..128 bytes.  */
+L(copy128):
+	ldp	E_l, E_h, [src, 32]
+	ldp	F_l, F_h, [src, 48]
+	cmp	count, 96
+	b.ls	L(copy96)
+	ldp	G_l, G_h, [srcend, -64]
+	ldp	H_l, H_h, [srcend, -48]
+	stp	G_l, G_h, [dstend, -64]
+	stp	H_l, H_h, [dstend, -48]
+L(copy96):
+	stp	A_l, A_h, [dstin]
+	stp	B_l, B_h, [dstin, 16]
+	stp	E_l, E_h, [dstin, 32]
+	stp	F_l, F_h, [dstin, 48]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
	ret
 
 	.p2align 4
+	/* Copy more than 128 bytes.  */
 L(copy_long):
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+	ldp	D_l, D_h, [src]
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
-	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
 	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
@@ -176,7 +175,8 @@ L(copy_long):
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 128 + 16	/* Test and readjust count.  */
-	b.ls	L(last64)
+	b.ls	L(copy64_from_end)
+
 L(loop64):
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -189,10 +189,8 @@ L(loop64):
 	subs	count, count, 64
 	b.hi	L(loop64)
 
-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the end even if
-	   there is just 1 byte left.  */
-L(last64):
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
@@ -207,20 +205,42 @@ L(last64):
 	stp	C_l, C_h, [dstend, -16]
 	ret
 
-	.p2align 4
-L(move_long):
-	cbz	tmp1, 3f
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 4)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)
 
 	add	srcend, src, count
 	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(move_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldp	A_l, A_h, [src]
+	ldp	D_l, D_h, [srcend, -16]
+	stp	A_l, A_h, [dstin]
+	stp	D_l, D_h, [dstend, -16]
+	ret
 
-	/* Align dstend to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 96 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
+	.p2align 4
+L(move_long):
+	/* Only use backward copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(copy0)
+	cmp	tmp1, count
+	b.hs	L(copy_long)
 
-	and	tmp1, dstend, 15
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align dst to 16-byte alignment.  */
 	ldp	D_l, D_h, [srcend, -16]
+	and	tmp1, dstend, 15
 	sub	srcend, srcend, tmp1
 	sub	count, count, tmp1
 	ldp	A_l, A_h, [srcend, -16]
@@ -230,10 +250,9 @@ L(move_long):
 	ldp	D_l, D_h, [srcend, -64]!
 	sub	dstend, dstend, tmp1
 	subs	count, count, 128
-	b.ls	2f
+	b.ls	L(copy64_from_start)
 
-	nop
-1:
+L(loop64_backwards):
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [srcend, -16]
 	stp	B_l, B_h, [dstend, -32]
@@ -243,12 +262,10 @@ L(move_long):
 	stp	D_l, D_h, [dstend, -64]!
 	ldp	D_l, D_h, [srcend, -64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(loop64_backwards)
 
-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the start even if
-	   there is just 1 byte left.  */
-2:
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
 	ldp	G_l, G_h, [src, 48]
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [src, 32]
@@ -261,7 +278,7 @@ L(move_long):
 	stp	A_l, A_h, [dstin, 32]
 	stp	B_l, B_h, [dstin, 16]
 	stp	C_l, C_h, [dstin]
-3:	ret
+	ret
 
-END (MEMCPY)
-libc_hidden_builtin_def (MEMCPY)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 57ffdf7..134fe46 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,4 +1,4 @@
 ifeq ($(subdir),string)
-sysdep_routines += memcpy_generic memcpy_thunderx memcpy_thunderx2 \
+sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
		   memcpy_falkor memmove_falkor memset_generic memset_falkor
 endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index e55be80..0ccd141 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -42,10 +42,12 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx2)
	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
+	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
   IFUNC_IMPL (i, name, memmove,
	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
+	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
   IFUNC_IMPL (i, name, memset,
	      /* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 8f5d4e7..e69a1ae 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memcpy) __libc_memcpy;
 
 extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
@@ -36,11 +37,14 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
 libc_ifunc (__libc_memcpy,
             (IS_THUNDERX (midr)
	     ? __memcpy_thunderx
-	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_ARES (midr)
+	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
		? __memcpy_falkor
		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
		   ? __memcpy_thunderx2
-		   : __memcpy_generic))));
+		   : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+		      || IS_NEOVERSE_V1 (midr)
+		      ? __memcpy_simd
+		      : __memcpy_generic)))));
 
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
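Illustrative note (not part of the committed patch): the libc_ifunc selector above keys the choice off the MIDR value. The sketch below uses hypothetical extractor macros that mirror the meaning of glibc's MIDR_IMPLEMENTOR/MIDR_PARTNUM (in MIDR_EL1 the implementer is bits [31:24] and the part number bits [15:4]); it returns a name string rather than a function pointer to keep it self-contained.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical field extractors for a MIDR_EL1 value.  */
#define IMPLEMENTOR(midr)  (((midr) >> 24) & 0xff)
#define PARTNUM(midr)      (((midr) >> 4) & 0xfff)

static const char *select_memcpy (uint64_t midr)
{
  unsigned imp = IMPLEMENTOR (midr);
  unsigned part = PARTNUM (midr);

  if (imp == 'A'              /* Arm Ltd (0x41).  */
      && (part == 0xd0c       /* Neoverse N1.  */
          || part == 0xd49    /* Neoverse N2.  */
          || part == 0xd40))  /* Neoverse V1.  */
    return "__memcpy_simd";
  return "__memcpy_generic";  /* falkor/thunderx branches omitted here.  */
}

int main (void)
{
  /* Example: implementer 0x41 ('A'), part 0xd0c -> Neoverse N1.  */
  uint64_t midr = (0x41ULL << 24) | (0xd0cULL << 4);
  printf ("%s\n", select_memcpy (midr));
  return 0;
}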
diff --git a/sysdeps/aarch64/multiarch/memcpy_advsimd.S b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
new file mode 100644
index 0000000..48bb6d7
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
@@ -0,0 +1,248 @@
+/* Generic optimized memcpy using SIMD.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
+ *
+ */
+
+#define dstin	x0
+#define src	x1
+#define count	x2
+#define dst	x3
+#define srcend	x4
+#define dstend	x5
+#define A_l	x6
+#define A_lw	w6
+#define A_h	x7
+#define B_l	x8
+#define B_lw	w8
+#define B_h	x9
+#define C_lw	w10
+#define tmp1	x14
+
+#define A_q	q0
+#define B_q	q1
+#define C_q	q2
+#define D_q	q3
+#define E_q	q4
+#define F_q	q5
+#define G_q	q6
+#define H_q	q7
+
+
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
+
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
+
+   Large copies use a software pipelined loop processing 64 bytes per
+   iteration.  The destination pointer is 16-byte aligned to minimize
+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
+   from the end.  */
+
+ENTRY (__memcpy_simd)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)
+
+	add	srcend, src, count
+	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(copy_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldr	A_q, [src]
+	ldr	B_q, [srcend, -16]
+	str	A_q, [dstin]
+	str	B_q, [dstend, -16]
+	ret
+
+	/* Copy 8-15 bytes.  */
+L(copy16):
+	tbz	count, 3, L(copy8)
+	ldr	A_l, [src]
+	ldr	A_h, [srcend, -8]
+	str	A_l, [dstin]
+	str	A_h, [dstend, -8]
+	ret
+
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
+	ldr	A_lw, [src]
+	ldr	B_lw, [srcend, -4]
+	str	A_lw, [dstin]
+	str	B_lw, [dstend, -4]
+	ret
+
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
+	lsr	tmp1, count, 1
+	ldrb	A_lw, [src]
+	ldrb	C_lw, [srcend, -1]
+	ldrb	B_lw, [src, tmp1]
+	strb	A_lw, [dstin]
+	strb	B_lw, [dstin, tmp1]
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret
+
+	.p2align 4
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp	A_q, B_q, [src]
+	ldp	C_q, D_q, [srcend, -32]
+	cmp	count, 64
+	b.hi	L(copy128)
+	stp	A_q, B_q, [dstin]
+	stp	C_q, D_q, [dstend, -32]
+	ret
+
+	.p2align 4
+	/* Copy 65..128 bytes.  */
+L(copy128):
+	ldp	E_q, F_q, [src, 32]
+	cmp	count, 96
+	b.ls	L(copy96)
+	ldp	G_q, H_q, [srcend, -64]
+	stp	G_q, H_q, [dstend, -64]
+L(copy96):
+	stp	A_q, B_q, [dstin]
+	stp	E_q, F_q, [dstin, 32]
+	stp	C_q, D_q, [dstend, -32]
+	ret
+
+	/* Align loop64 below to 16 bytes.  */
+	nop
+
+	/* Copy more than 128 bytes.  */
+L(copy_long):
+	/* Copy 16 bytes and then align src to 16-byte alignment.  */
+	ldr	D_q, [src]
+	and	tmp1, src, 15
+	bic	src, src, 15
+	sub	dst, dstin, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldp	A_q, B_q, [src, 16]
+	str	D_q, [dstin]
+	ldp	C_q, D_q, [src, 48]
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(copy64_from_end)
+L(loop64):
+	stp	A_q, B_q, [dst, 16]
+	ldp	A_q, B_q, [src, 80]
+	stp	C_q, D_q, [dst, 48]
+	ldp	C_q, D_q, [src, 112]
+	add	src, src, 64
+	add	dst, dst, 64
+	subs	count, count, 64
+	b.hi	L(loop64)
+
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
+	ldp	E_q, F_q, [srcend, -64]
+	stp	A_q, B_q, [dst, 16]
+	ldp	A_q, B_q, [srcend, -32]
+	stp	C_q, D_q, [dst, 48]
+	stp	E_q, F_q, [dstend, -64]
+	stp	A_q, B_q, [dstend, -32]
+	ret
+
+END (__memcpy_simd)
+libc_hidden_builtin_def (__memcpy_simd)
+
+
+ENTRY (__memmove_simd)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)
+
+	add	srcend, src, count
+	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(move_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small moves: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldr	A_q, [src]
+	ldr	B_q, [srcend, -16]
+	str	A_q, [dstin]
+	str	B_q, [dstend, -16]
+	ret
+
+L(move_long):
+	/* Only use backward copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(move0)
+	cmp	tmp1, count
+	b.hs	L(copy_long)
+
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align srcend to 16-byte alignment.  */
+L(copy_long_backwards):
+	ldr	D_q, [srcend, -16]
+	and	tmp1, srcend, 15
+	bic	srcend, srcend, 15
+	sub	count, count, tmp1
+	ldp	A_q, B_q, [srcend, -32]
+	str	D_q, [dstend, -16]
+	ldp	C_q, D_q, [srcend, -64]
+	sub	dstend, dstend, tmp1
+	subs	count, count, 128
+	b.ls	L(copy64_from_start)
+
+L(loop64_backwards):
+	str	B_q, [dstend, -16]
+	str	A_q, [dstend, -32]
+	ldp	A_q, B_q, [srcend, -96]
+	str	D_q, [dstend, -48]
+	str	C_q, [dstend, -64]!
+	ldp	C_q, D_q, [srcend, -128]
+	sub	srcend, srcend, 64
+	subs	count, count, 64
+	b.hi	L(loop64_backwards)
+
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
+	ldp	E_q, F_q, [src, 32]
+	stp	A_q, B_q, [dstend, -32]
+	ldp	A_q, B_q, [src]
+	stp	C_q, D_q, [dstend, -64]
+	stp	E_q, F_q, [dstin, 32]
+	stp	A_q, B_q, [dstin]
+L(move0):
+	ret
+
+END (__memmove_simd)
+libc_hidden_builtin_def (__memmove_simd)
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index e69d816..b426dad 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memmove) __libc_memmove;
 
 extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
+extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
 
@@ -37,7 +38,10 @@ libc_ifunc (__libc_memmove,
	     ? __memmove_thunderx
	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
		? __memmove_falkor
-		: __memmove_generic)));
+		: (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+		   || IS_NEOVERSE_V1 (midr)
+		   ? __memmove_simd
+		   : __memmove_generic))));
 
 # undef memmove
 strong_alias (__libc_memmove, memmove);
diff --git a/sysdeps/aarch64/sysdep.h b/sysdeps/aarch64/sysdep.h
index 5b30709..509e3e1 100644
--- a/sysdeps/aarch64/sysdep.h
+++ b/sysdeps/aarch64/sysdep.h
@@ -45,7 +45,7 @@
 #define ENTRY(name)					\
   .globl C_SYMBOL_NAME(name);				\
   .type C_SYMBOL_NAME(name),%function;			\
-  .align 4;						\
+  .p2align 6;						\
   C_LABEL(name)					\
   cfi_startproc;					\
   CALL_MCOUNT
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 153d258..fbe1148 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -51,8 +51,12 @@
 
 #define IS_PHECDA(midr) (MIDR_IMPLEMENTOR(midr) == 'h'			      \
                         && MIDR_PARTNUM(midr) == 0x000)
-#define IS_ARES(midr) (MIDR_IMPLEMENTOR(midr) == 'A'			      \
-		       && MIDR_PARTNUM(midr) == 0xd0c)
+#define IS_NEOVERSE_N1(midr) (MIDR_IMPLEMENTOR(midr) == 'A'		      \
+			      && MIDR_PARTNUM(midr) == 0xd0c)
+#define IS_NEOVERSE_N2(midr) (MIDR_IMPLEMENTOR(midr) == 'A'		      \
+			      && MIDR_PARTNUM(midr) == 0xd49)
+#define IS_NEOVERSE_V1(midr) (MIDR_IMPLEMENTOR(midr) == 'A'		      \
+			      && MIDR_PARTNUM(midr) == 0xd40)
 
 struct cpu_features
 {
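Illustrative note (not part of the committed patch): the 6-20% and ~9% figures quoted in the commit messages come from glibc's bench-memcpy-random microbenchmark. A rough standalone harness in the same spirit is sketched below; every name and constant in it is arbitrary, it simply times whatever memcpy the installed libc's ifunc resolver selects, and it is not a substitute for the real benchtests.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

enum { WORKSET = 512 * 1024,   /* working set in bytes */
       MAXCOPY = 512,          /* sizes 0..511 exercise all size classes */
       ITERS   = 1000000 };

static unsigned char srcbuf[WORKSET + MAXCOPY];
static unsigned char dstbuf[WORKSET + MAXCOPY];
static size_t off[ITERS];
static size_t len[ITERS];

int main (void)
{
  /* Pre-generate random offsets and lengths so the timed loop only runs
     memcpy itself.  */
  srand (1);
  for (size_t i = 0; i < ITERS; i++)
    {
      off[i] = (size_t) rand () % WORKSET;
      len[i] = (size_t) rand () % MAXCOPY;
    }

  unsigned long long bytes = 0;
  clock_t t0 = clock ();
  for (size_t i = 0; i < ITERS; i++)
    {
      memcpy (dstbuf + off[i], srcbuf + off[i], len[i]);
      bytes += len[i];
    }
  clock_t t1 = clock ();

  double secs = (double) (t1 - t0) / CLOCKS_PER_SEC;
  printf ("copied %llu bytes in %.3f s (%.1f MB/s)\n",
          bytes, secs, secs > 0 ? bytes / secs / 1e6 : 0.0);
  return 0;
}

Comparing runs against a glibc built with and without these backports gives a rough idea of the effect on a given core.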