From: Wilco Dijkstra
To: "libc-stable@sourceware.org"
CC: nd
Subject: [2.31 COMMITTED] AArch64: Backport memcpy improvements
Date: Wed, 14 Oct 2020 15:36:01 +0000
commit 4bc9918c998085800ecf5bbb3c863e66ea6252a0
Author: Wilco Dijkstra
Date:   Wed Oct 14 13:56:21 2020 +0100

    AArch64: Use __memcpy_simd on Neoverse N2/V1

    Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as
    the memcpy/memmove ifunc.

    Reviewed-by: Adhemerval Zanella
    (cherry picked from commit e11ed9d2b4558eeacff81557dc9557001af42a6b)

commit 4722d1fb9d605e084dd0d9030a1643386c26cfdf
Author: Wilco Dijkstra
Date:   Wed Mar 11 17:15:25 2020 +0000

    [AArch64] Improve integer memcpy

    Further optimize integer memcpy.  Small cases now include copies up
    to 32 bytes.  64-128 byte copies are split into two cases to improve
    performance of 64-96 byte copies.  Comments have been rewritten.

    (cherry picked from commit 700065132744e0dfa6d4d9142d63f6e3a1934726)

commit bea507a3f55604064e2305fdf71e80ee4f43c2d3
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:58:07 2020 +0100

    AArch64: Rename IS_ARES to IS_NEOVERSE_N1

    Rename IS_ARES to IS_NEOVERSE_N1 since that is a bit clearer.

    Reviewed-by: Carlos O'Donell
    (cherry picked from commit 0f6278a8793a5d04ea31878119eccf99f469a02d)

commit d0a5b769027b17a7000ebc58e240ddd98ae0d719
Author: Wilco Dijkstra
Date:   Fri Aug 28 17:51:40 2020 +0100

    AArch64: Improve backwards memmove performance

    On some microarchitectures performance of the backwards memmove improves if
    the stores use STR with decreasing addresses.  So change the memmove loop
    in memcpy_advsimd.S to use 2x STR rather than STP.

    Reviewed-by: Adhemerval Zanella
    (cherry picked from commit bd394d131c10c9ec22c6424197b79410042eed99)

commit 24a30c595958a1b23b620bc3dea62b0ab9d8f480
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:55:07 2020 +0100

    AArch64: Add optimized Q-register memcpy

    Add a new memcpy using 128-bit Q registers - this is faster on modern
    cores and reduces codesize.  Similar to the generic memcpy, small cases
    include copies up to 32 bytes.  64-128 byte copies are split into two
    cases to improve performance of 64-96 byte copies.  Large copies align
    the source rather than the destination.

    bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
    so make this memcpy the default on N1 (on Centriq it is 15% faster than
    memcpy_falkor).

    Passes GLIBC regression tests.

    Reviewed-by: Szabolcs Nagy
    (cherry picked from commit 4a733bf375238a6a595033b5785cea7f27d61307)

commit 88db98fa6e427c05d202af8b3d0f1402df20c44d
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:50:02 2020 +0100

    AArch64: Align ENTRY to a cacheline

    Given almost all uses of ENTRY are for string/memory functions,
    align ENTRY to a cacheline to simplify things.

    Reviewed-by: Carlos O'Donell
    (cherry picked from commit 34f0d01d5e43c7dedd002ab47f6266dfb5b79c22)
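[Editorial note, not part of the patch below: the commit messages rely on two ideas
that a rough C sketch may make easier to follow.  Small copies load a fixed-size
block from the start and another from the end of the buffer, so one code path covers
every length in a range; and memmove only needs a backwards loop when the destination
overlaps the source within 'count' bytes.  The helper names here are invented for
illustration only.]

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy 16..32 bytes: one 16-byte block from the start, one from the end.
   The two blocks overlap when n < 32, which is harmless because both loads
   happen before either store.  */
static void
copy_16_to_32 (char *dst, const char *src, size_t n)
{
  char head[16], tail[16];
  memcpy (head, src, 16);
  memcpy (tail, src + n - 16, 16);
  memcpy (dst, head, 16);
  memcpy (dst + n - 16, tail, 16);
}

/* The memmove overlap test is a single unsigned compare: a backwards copy is
   only needed when dst lies inside [src, src + n).  For dst < src the
   subtraction wraps to a huge value, so the forward path is taken.  dst == src
   needs no copy at all (the assembly branches straight to a return).  */
static int
needs_backwards_copy (uintptr_t dst, uintptr_t src, size_t n)
{
  return dst != src && dst - src < n;
}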
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index d0d47e9..e0b4c45 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -33,11 +33,11 @@
 #define A_l    x6
 #define A_lw   w6
 #define A_h    x7
-#define A_hw   w7
 #define B_l    x8
 #define B_lw   w8
 #define B_h    x9
 #define C_l    x10
+#define C_lw   w10
 #define C_h    x11
 #define D_l    x12
 #define D_h    x13
@@ -51,16 +51,6 @@
 #define H_h    srcend
 #define tmp1   x14

-/* Copies are split into 3 main cases: small copies of up to 32 bytes,
-   medium copies of 33..128 bytes which are fully unrolled. Large copies
-   of more than 128 bytes align the destination and use an unrolled loop
-   processing 64 bytes per iteration.
-   In order to share code with memmove, small and medium copies read all
-   data before writing, allowing any kind of overlap. So small, medium
-   and large backwards memmoves are handled by falling through into memcpy.
-   Overlapping large forward memmoves use a loop that copies backwards.
-*/
-
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
@@ -68,118 +58,115 @@
 # define MEMCPY memcpy
 #endif

-ENTRY_ALIGN (MEMMOVE, 6)
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.

-       DELOUSE (0)
-       DELOUSE (1)
-       DELOUSE (2)
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.

-       sub     tmp1, dstin, src
-       cmp     count, 128
-       ccmp    tmp1, count, 2, hi
-       b.lo    L(move_long)
-
-       /* Common case falls through into memcpy.  */
-END (MEMMOVE)
-libc_hidden_builtin_def (MEMMOVE)
-ENTRY (MEMCPY)
+   Large copies use a software pipelined loop processing 64 bytes per
+   iteration.  The destination pointer is 16-byte aligned to minimize
+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
+   from the end.
+*/

+ENTRY_ALIGN (MEMCPY, 6)
        DELOUSE (0)
        DELOUSE (1)
        DELOUSE (2)

-       prfm    PLDL1KEEP, [src]
        add     srcend, src, count
        add     dstend, dstin, count
-       cmp     count, 32
-       b.ls    L(copy32)
        cmp     count, 128
        b.hi    L(copy_long)
+       cmp     count, 32
+       b.hi    L(copy32_128)

-       /* Medium copies: 33..128 bytes.  */
+       /* Small copies: 0..32 bytes.  */
+       cmp     count, 16
+       b.lo    L(copy16)
        ldp     A_l, A_h, [src]
-       ldp     B_l, B_h, [src, 16]
-       ldp     C_l, C_h, [srcend, -32]
        ldp     D_l, D_h, [srcend, -16]
-       cmp     count, 64
-       b.hi    L(copy128)
        stp     A_l, A_h, [dstin]
-       stp     B_l, B_h, [dstin, 16]
-       stp     C_l, C_h, [dstend, -32]
        stp     D_l, D_h, [dstend, -16]
        ret

-       .p2align 4
-       /* Small copies: 0..32 bytes.  */
-L(copy32):
-       /* 16-32 bytes.  */
-       cmp     count, 16
-       b.lo    1f
-       ldp     A_l, A_h, [src]
-       ldp     B_l, B_h, [srcend, -16]
-       stp     A_l, A_h, [dstin]
-       stp     B_l, B_h, [dstend, -16]
-       ret
-       .p2align 4
-1:
-       /* 8-15 bytes.  */
-       tbz     count, 3, 1f
+       /* Copy 8-15 bytes.  */
+L(copy16):
+       tbz     count, 3, L(copy8)
        ldr     A_l, [src]
        ldr     A_h, [srcend, -8]
        str     A_l, [dstin]
        str     A_h, [dstend, -8]
        ret
-       .p2align 4
-1:
-       /* 4-7 bytes.  */
-       tbz     count, 2, 1f
+
+       .p2align 3
+       /* Copy 4-7 bytes.  */
+L(copy8):
+       tbz     count, 2, L(copy4)
        ldr     A_lw, [src]
-       ldr     A_hw, [srcend, -4]
+       ldr     B_lw, [srcend, -4]
        str     A_lw, [dstin]
-       str     A_hw, [dstend, -4]
+       str     B_lw, [dstend, -4]
        ret

-       /* Copy 0..3 bytes.  Use a branchless sequence that copies the same
-          byte 3 times if count==1, or the 2nd byte twice if count==2.  */
-1:
-       cbz     count, 2f
+       /* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+       cbz     count, L(copy0)
        lsr     tmp1, count, 1
        ldrb    A_lw, [src]
-       ldrb    A_hw, [srcend, -1]
+       ldrb    C_lw, [srcend, -1]
        ldrb    B_lw, [src, tmp1]
        strb    A_lw, [dstin]
        strb    B_lw, [dstin, tmp1]
-       strb    A_hw, [dstend, -1]
-2:     ret
+       strb    C_lw, [dstend, -1]
+L(copy0):
+       ret
+
+       .p2align 4
+       /* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+       ldp     A_l, A_h, [src]
+       ldp     B_l, B_h, [src, 16]
+       ldp     C_l, C_h, [srcend, -32]
+       ldp     D_l, D_h, [srcend, -16]
+       cmp     count, 64
+       b.hi    L(copy128)
+       stp     A_l, A_h, [dstin]
+       stp     B_l, B_h, [dstin, 16]
+       stp     C_l, C_h, [dstend, -32]
+       stp     D_l, D_h, [dstend, -16]
+       ret

        .p2align 4
-       /* Copy 65..128 bytes.  Copy 64 bytes from the start and
-          64 bytes from the end.  */
+       /* Copy 65..128 bytes.  */
 L(copy128):
        ldp     E_l, E_h, [src, 32]
        ldp     F_l, F_h, [src, 48]
+       cmp     count, 96
+       b.ls    L(copy96)
        ldp     G_l, G_h, [srcend, -64]
        ldp     H_l, H_h, [srcend, -48]
+       stp     G_l, G_h, [dstend, -64]
+       stp     H_l, H_h, [dstend, -48]
+L(copy96):
        stp     A_l, A_h, [dstin]
        stp     B_l, B_h, [dstin, 16]
        stp     E_l, E_h, [dstin, 32]
        stp     F_l, F_h, [dstin, 48]
-       stp     G_l, G_h, [dstend, -64]
-       stp     H_l, H_h, [dstend, -48]
        stp     C_l, C_h, [dstend, -32]
        stp     D_l, D_h, [dstend, -16]
        ret

-       /* Align DST to 16 byte alignment so that we don't cross cache line
-          boundaries on both loads and stores.  There are at least 128 bytes
-          to copy, so copy 16 bytes unaligned and then align.  The loop
-          copies 64 bytes per iteration and prefetches one iteration ahead.  */
-
        .p2align 4
+       /* Copy more than 128 bytes.  */
 L(copy_long):
+       /* Copy 16 bytes and then align dst to 16-byte alignment.  */
+       ldp     D_l, D_h, [src]
        and     tmp1, dstin, 15
        bic     dst, dstin, 15
-       ldp     D_l, D_h, [src]
        sub     src, src, tmp1
        add     count, count, tmp1      /* Count is now 16 too large.  */
        ldp     A_l, A_h, [src, 16]
@@ -188,7 +175,8 @@ L(copy_long):
        ldp     C_l, C_h, [src, 48]
        ldp     D_l, D_h, [src, 64]!
        subs    count, count, 128 + 16  /* Test and readjust count.  */
-       b.ls    L(last64)
+       b.ls    L(copy64_from_end)
+
 L(loop64):
        stp     A_l, A_h, [dst, 16]
        ldp     A_l, A_h, [src, 16]
@@ -201,10 +189,8 @@ L(loop64):
        subs    count, count, 64
        b.hi    L(loop64)

-       /* Write the last full set of 64 bytes.  The remainder is at most 64
-          bytes, so it is safe to always copy 64 bytes from the end even if
-          there is just 1 byte left.  */
-L(last64):
+       /* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
        ldp     E_l, E_h, [srcend, -64]
        stp     A_l, A_h, [dst, 16]
        ldp     A_l, A_h, [srcend, -48]
@@ -219,20 +205,42 @@ L(last64):
        stp     C_l, C_h, [dstend, -16]
        ret

-       .p2align 4
-L(move_long):
-       cbz     tmp1, 3f
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 4)
+       DELOUSE (0)
+       DELOUSE (1)
+       DELOUSE (2)

        add     srcend, src, count
        add     dstend, dstin, count
+       cmp     count, 128
+       b.hi    L(move_long)
+       cmp     count, 32
+       b.hi    L(copy32_128)

-       /* Align dstend to 16 byte alignment so that we don't cross cache line
-          boundaries on both loads and stores.  There are at least 128 bytes
-          to copy, so copy 16 bytes unaligned and then align.  The loop
-          copies 64 bytes per iteration and prefetches one iteration ahead.  */
+       /* Small copies: 0..32 bytes.  */
+       cmp     count, 16
+       b.lo    L(copy16)
+       ldp     A_l, A_h, [src]
+       ldp     D_l, D_h, [srcend, -16]
+       stp     A_l, A_h, [dstin]
+       stp     D_l, D_h, [dstend, -16]
+       ret

-       and     tmp1, dstend, 15
+       .p2align 4
+L(move_long):
+       /* Only use backward copy if there is an overlap.  */
+       sub     tmp1, dstin, src
+       cbz     tmp1, L(copy0)
+       cmp     tmp1, count
+       b.hs    L(copy_long)
+
+       /* Large backwards copy for overlapping copies.
+          Copy 16 bytes and then align dst to 16-byte alignment.  */
        ldp     D_l, D_h, [srcend, -16]
+       and     tmp1, dstend, 15
        sub     srcend, srcend, tmp1
        sub     count, count, tmp1
        ldp     A_l, A_h, [srcend, -16]
@@ -242,10 +250,9 @@ L(move_long):
        ldp     D_l, D_h, [srcend, -64]!
        sub     dstend, dstend, tmp1
        subs    count, count, 128
-       b.ls    2f
+       b.ls    L(copy64_from_start)

-       nop
-1:
+L(loop64_backwards):
        stp     A_l, A_h, [dstend, -16]
        ldp     A_l, A_h, [srcend, -16]
        stp     B_l, B_h, [dstend, -32]
@@ -255,12 +262,10 @@ L(move_long):
        stp     D_l, D_h, [dstend, -64]!
        ldp     D_l, D_h, [srcend, -64]!
        subs    count, count, 64
-       b.hi    1b
+       b.hi    L(loop64_backwards)

-       /* Write the last full set of 64 bytes.  The remainder is at most 64
-          bytes, so it is safe to always copy 64 bytes from the start even if
-          there is just 1 byte left.  */
-2:
+       /* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
        ldp     G_l, G_h, [src, 48]
        stp     A_l, A_h, [dstend, -16]
        ldp     A_l, A_h, [src, 32]
@@ -273,7 +278,7 @@ L(move_long):
        stp     A_l, A_h, [dstin, 32]
        stp     B_l, B_h, [dstin, 16]
        stp     C_l, C_h, [dstin]
-3:     ret
+       ret

-END (MEMCPY)
-libc_hidden_builtin_def (MEMCPY)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 8378107..3c5292d 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,5 +1,5 @@
 ifeq ($(subdir),string)
-sysdep_routines += memcpy_generic memcpy_thunderx memcpy_thunderx2 \
+sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
                   memcpy_falkor memmove_falkor \
                   memset_generic memset_falkor memset_emag memset_kunpeng \
                   memchr_generic memchr_nosimd \
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index b7da62c..4b004ac 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -42,11 +42,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx2)
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
+             IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
   IFUNC_IMPL (i, name, memmove,
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx2)
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
+             IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
   IFUNC_IMPL (i, name, memset,
              /* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 2fafefd..799d60c 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memcpy) __libc_memcpy;

 extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
@@ -36,11 +37,14 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
 libc_ifunc (__libc_memcpy,
            (IS_THUNDERX (midr)
            ? __memcpy_thunderx
-           : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_ARES (midr) || IS_KUNPENG920 (midr)
+           : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_KUNPENG920 (midr)
              ? __memcpy_falkor
              : (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
                ? __memcpy_thunderx2
-               : __memcpy_generic))));
+               : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+                  || IS_NEOVERSE_V1 (midr)
+                  ? __memcpy_simd
+                  : __memcpy_generic)))));

 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
diff --git a/sysdeps/aarch64/multiarch/memcpy_advsimd.S b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
new file mode 100644
index 0000000..48bb6d7
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
@@ -0,0 +1,248 @@
+/* Generic optimized memcpy using SIMD.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
+ *
+ */
+
+#define dstin  x0
+#define src    x1
+#define count  x2
+#define dst    x3
+#define srcend x4
+#define dstend x5
+#define A_l    x6
+#define A_lw   w6
+#define A_h    x7
+#define B_l    x8
+#define B_lw   w8
+#define B_h    x9
+#define C_lw   w10
+#define tmp1   x14
+
+#define A_q    q0
+#define B_q    q1
+#define C_q    q2
+#define D_q    q3
+#define E_q    q4
+#define F_q    q5
+#define G_q    q6
+#define H_q    q7
+
+
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
+
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
+
+   Large copies use a software pipelined loop processing 64 bytes per
+   iteration.  The destination pointer is 16-byte aligned to minimize
+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
+   from the end.  */
+
+ENTRY (__memcpy_simd)
+       DELOUSE (0)
+       DELOUSE (1)
+       DELOUSE (2)
+
+       add     srcend, src, count
+       add     dstend, dstin, count
+       cmp     count, 128
+       b.hi    L(copy_long)
+       cmp     count, 32
+       b.hi    L(copy32_128)
+
+       /* Small copies: 0..32 bytes.  */
+       cmp     count, 16
+       b.lo    L(copy16)
+       ldr     A_q, [src]
+       ldr     B_q, [srcend, -16]
+       str     A_q, [dstin]
+       str     B_q, [dstend, -16]
+       ret
+
+       /* Copy 8-15 bytes.  */
+L(copy16):
+       tbz     count, 3, L(copy8)
+       ldr     A_l, [src]
+       ldr     A_h, [srcend, -8]
+       str     A_l, [dstin]
+       str     A_h, [dstend, -8]
+       ret
+
+       /* Copy 4-7 bytes.  */
+L(copy8):
+       tbz     count, 2, L(copy4)
+       ldr     A_lw, [src]
+       ldr     B_lw, [srcend, -4]
+       str     A_lw, [dstin]
+       str     B_lw, [dstend, -4]
+       ret
+
+       /* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+       cbz     count, L(copy0)
+       lsr     tmp1, count, 1
+       ldrb    A_lw, [src]
+       ldrb    C_lw, [srcend, -1]
+       ldrb    B_lw, [src, tmp1]
+       strb    A_lw, [dstin]
+       strb    B_lw, [dstin, tmp1]
+       strb    C_lw, [dstend, -1]
+L(copy0):
+       ret
+
+       .p2align 4
+       /* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+       ldp     A_q, B_q, [src]
+       ldp     C_q, D_q, [srcend, -32]
+       cmp     count, 64
+       b.hi    L(copy128)
+       stp     A_q, B_q, [dstin]
+       stp     C_q, D_q, [dstend, -32]
+       ret
+
+       .p2align 4
+       /* Copy 65..128 bytes.  */
+L(copy128):
+       ldp     E_q, F_q, [src, 32]
+       cmp     count, 96
+       b.ls    L(copy96)
+       ldp     G_q, H_q, [srcend, -64]
+       stp     G_q, H_q, [dstend, -64]
+L(copy96):
+       stp     A_q, B_q, [dstin]
+       stp     E_q, F_q, [dstin, 32]
+       stp     C_q, D_q, [dstend, -32]
+       ret
+
+       /* Align loop64 below to 16 bytes.  */
+       nop
+
+       /* Copy more than 128 bytes.  */
+L(copy_long):
+       /* Copy 16 bytes and then align src to 16-byte alignment.  */
+       ldr     D_q, [src]
+       and     tmp1, src, 15
+       bic     src, src, 15
+       sub     dst, dstin, tmp1
+       add     count, count, tmp1      /* Count is now 16 too large.  */
+       ldp     A_q, B_q, [src, 16]
+       str     D_q, [dstin]
+       ldp     C_q, D_q, [src, 48]
+       subs    count, count, 128 + 16  /* Test and readjust count.  */
+       b.ls    L(copy64_from_end)
+L(loop64):
+       stp     A_q, B_q, [dst, 16]
+       ldp     A_q, B_q, [src, 80]
+       stp     C_q, D_q, [dst, 48]
+       ldp     C_q, D_q, [src, 112]
+       add     src, src, 64
+       add     dst, dst, 64
+       subs    count, count, 64
+       b.hi    L(loop64)
+
+       /* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
+       ldp     E_q, F_q, [srcend, -64]
+       stp     A_q, B_q, [dst, 16]
+       ldp     A_q, B_q, [srcend, -32]
+       stp     C_q, D_q, [dst, 48]
+       stp     E_q, F_q, [dstend, -64]
+       stp     A_q, B_q, [dstend, -32]
+       ret
+
+END (__memcpy_simd)
+libc_hidden_builtin_def (__memcpy_simd)
+
+
+ENTRY (__memmove_simd)
+       DELOUSE (0)
+       DELOUSE (1)
+       DELOUSE (2)
+
+       add     srcend, src, count
+       add     dstend, dstin, count
+       cmp     count, 128
+       b.hi    L(move_long)
+       cmp     count, 32
+       b.hi    L(copy32_128)
+
+       /* Small moves: 0..32 bytes.  */
+       cmp     count, 16
+       b.lo    L(copy16)
+       ldr     A_q, [src]
+       ldr     B_q, [srcend, -16]
+       str     A_q, [dstin]
+       str     B_q, [dstend, -16]
+       ret
+
+L(move_long):
+       /* Only use backward copy if there is an overlap.  */
+       sub     tmp1, dstin, src
+       cbz     tmp1, L(move0)
+       cmp     tmp1, count
+       b.hs    L(copy_long)
+
+       /* Large backwards copy for overlapping copies.
+          Copy 16 bytes and then align srcend to 16-byte alignment.  */
+L(copy_long_backwards):
+       ldr     D_q, [srcend, -16]
+       and     tmp1, srcend, 15
+       bic     srcend, srcend, 15
+       sub     count, count, tmp1
+       ldp     A_q, B_q, [srcend, -32]
+       str     D_q, [dstend, -16]
+       ldp     C_q, D_q, [srcend, -64]
+       sub     dstend, dstend, tmp1
+       subs    count, count, 128
+       b.ls    L(copy64_from_start)
+
+L(loop64_backwards):
+       str     B_q, [dstend, -16]
+       str     A_q, [dstend, -32]
+       ldp     A_q, B_q, [srcend, -96]
+       str     D_q, [dstend, -48]
+       str     C_q, [dstend, -64]!
+       ldp     C_q, D_q, [srcend, -128]
+       sub     srcend, srcend, 64
+       subs    count, count, 64
+       b.hi    L(loop64_backwards)
+
+       /* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
+       ldp     E_q, F_q, [src, 32]
+       stp     A_q, B_q, [dstend, -32]
+       ldp     A_q, B_q, [src]
+       stp     C_q, D_q, [dstend, -64]
+       stp     E_q, F_q, [dstin, 32]
+       stp     A_q, B_q, [dstin]
+L(move0):
+       ret
+
+END (__memmove_simd)
+libc_hidden_builtin_def (__memmove_simd)
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index ed5a47f..46a4cb3 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memmove) __libc_memmove;

 extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
+extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
@@ -40,7 +41,10 @@ libc_ifunc (__libc_memmove,
              ? __memmove_falkor
              : (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
                ? __memmove_thunderx2
-               : __memmove_generic))));
+               : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+                  || IS_NEOVERSE_V1 (midr)
+                  ? __memmove_simd
+                  : __memmove_generic)))));

 # undef memmove
 strong_alias (__libc_memmove, memmove);
diff --git a/sysdeps/aarch64/sysdep.h b/sysdeps/aarch64/sysdep.h
index 604c489..f1feb19 100644
--- a/sysdeps/aarch64/sysdep.h
+++ b/sysdeps/aarch64/sysdep.h
@@ -45,7 +45,7 @@
 #define ENTRY(name)                                            \
   .globl C_SYMBOL_NAME(name);                                  \
   .type C_SYMBOL_NAME(name),%function;                         \
-  .align 4;                                                    \
+  .p2align 6;                                                  \
   C_LABEL(name)                                                \
   cfi_startproc;                                               \
   CALL_MCOUNT
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 1389cea..346d045 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -51,8 +51,12 @@

 #define IS_PHECDA(midr) (MIDR_IMPLEMENTOR(midr) == 'h'                    \
                         && MIDR_PARTNUM(midr) == 0x000)
-#define IS_ARES(midr) (MIDR_IMPLEMENTOR(midr) == 'A'                      \
-                      && MIDR_PARTNUM(midr) == 0xd0c)
+#define IS_NEOVERSE_N1(midr) (MIDR_IMPLEMENTOR(midr) == 'A'               \
+                             && MIDR_PARTNUM(midr) == 0xd0c)
+#define IS_NEOVERSE_N2(midr) (MIDR_IMPLEMENTOR(midr) == 'A'               \
+                             && MIDR_PARTNUM(midr) == 0xd49)
+#define IS_NEOVERSE_V1(midr) (MIDR_IMPLEMENTOR(midr) == 'A'               \
+                             && MIDR_PARTNUM(midr) == 0xd40)

 #define IS_EMAG(midr) (MIDR_IMPLEMENTOR(midr) == 'P'                      \
                       && MIDR_PARTNUM(midr) == 0x000)
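[Editorial note, not part of the patch above: the new IS_NEOVERSE_N2/IS_NEOVERSE_V1
checks follow the existing cpu-features.h pattern of matching the implementer and
part-number fields of MIDR_EL1.  A self-contained C sketch of that decoding follows;
the field layout (implementer in bits 31:24, part number in bits 15:4) and the example
MIDR value are stated as assumptions for illustration.]

#include <stdint.h>
#include <stdio.h>

/* Local stand-ins for the cpu-features.h helpers: extract the implementer
   byte and the 12-bit part number from a MIDR_EL1 value.  */
#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 0xff)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfff)

static const char *
neoverse_name (uint64_t midr)
{
  if (MIDR_IMPLEMENTOR (midr) != 'A')   /* 0x41 is Arm Ltd.  */
    return "not an Arm Ltd. core";
  switch (MIDR_PARTNUM (midr))
    {
    case 0xd0c: return "Neoverse N1";
    case 0xd49: return "Neoverse N2";
    case 0xd40: return "Neoverse V1";
    default:    return "other Arm core";
    }
}

int
main (void)
{
  /* Example MIDR value for a Neoverse N1 (r3p1).  */
  printf ("%s\n", neoverse_name (0x413fd0c1));
  return 0;
}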