From: Wilco Dijkstra
To: libc-stable@sourceware.org
CC: nd
Subject: [2.30 COMMITTED] AArch64: Backport memcpy improvements
Date: Wed, 14 Oct 2020 16:03:04 +0000

commit 24c0d6881503d00d06b4f9141678df2925c0b5a8
Author: Wilco Dijkstra
Date:   Wed Oct 14 13:56:21 2020 +0100

    AArch64: Use __memcpy_simd on Neoverse N2/V1

    Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as
    the memcpy/memmove ifunc.

    Reviewed-by: Adhemerval Zanella
    (cherry picked from commit e11ed9d2b4558eeacff81557dc9557001af42a6b)
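For context, the selection is keyed off the MIDR_EL1 id register.  Below is a
minimal, self-contained C sketch of the check, using the same implementer and
part-number values as the cpu-features.h hunk at the end of this patch; the
helper names and the example MIDR value are made up for illustration and are
not code from glibc:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: decode MIDR_EL1 the way the IS_NEOVERSE_* macros use it.
   The implementer field is bits [31:24], the part number bits [15:4].  */
static unsigned
midr_implementor (uint64_t midr)
{
  return (midr >> 24) & 0xff;
}

static unsigned
midr_partnum (uint64_t midr)
{
  return (midr >> 4) & 0xfff;
}

static int
selects_memcpy_simd (uint64_t midr)
{
  unsigned part = midr_partnum (midr);
  return midr_implementor (midr) == 'A'   /* 0x41, Arm Ltd.  */
         && (part == 0xd0c                /* Neoverse N1.  */
             || part == 0xd49             /* Neoverse N2.  */
             || part == 0xd40);           /* Neoverse V1.  */
}

int
main (void)
{
  uint64_t midr = 0x414fd0c0;   /* Example value: implementer 'A', part 0xd0c.  */
  printf ("use __memcpy_simd: %d\n", selects_memcpy_simd (midr));
  return 0;
}

In glibc itself this decision is made once by the libc_ifunc selector shown in
the memcpy.c and memmove.c hunks further down.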
commit 80259cd098dcbb450d3f9475e5b4d14da649e292
Author: Wilco Dijkstra
Date:   Wed Mar 11 17:15:25 2020 +0000

    [AArch64] Improve integer memcpy

    Further optimize integer memcpy. Small cases now include copies up
    to 32 bytes. 64-128 byte copies are split into two cases to improve
    performance of 64-96 byte copies. Comments have been rewritten.

    (cherry picked from commit 700065132744e0dfa6d4d9142d63f6e3a1934726)

commit 704e18d66d2a8273f9de7e008280bcc8936becc5
Author: Krzysztof Koch
Date:   Tue Nov 5 17:35:18 2019 +0000

    aarch64: Increase small and medium cases for __memcpy_generic

    Increase the upper bound on medium cases from 96 to 128 bytes.
    Now, up to 128 bytes are copied unrolled.

    Increase the upper bound on small cases from 16 to 32 bytes so that
    copies of 17-32 bytes are not impacted by the larger medium case.

    Benchmarking:
    The attached figures show the relative timing difference with respect
    to 'memcpy_generic', which is the existing implementation.
    'memcpy_med_128' denotes the version of memcpy_generic with
    only the medium case enlarged. The 'memcpy_med_128_small_32' numbers
    are for the version of memcpy_generic submitted in this patch, which
    has both the medium and small cases enlarged. The figures were generated
    using the script from:
    https://www.sourceware.org/ml/libc-alpha/2019-10/msg00563.html

    Depending on the platform, the performance improvement in the
    bench-memcpy-random.c benchmark ranges from 6% to 20% between
    the original and final version of memcpy.S.

    Tested against the GLIBC testsuite and randomized tests.

    (cherry picked from commit b9f145df85145506f8e61bac38b792584a38d88f)

commit ad34abcad57d49f1882479a561d42bace1950c3e
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:58:07 2020 +0100

    AArch64: Rename IS_ARES to IS_NEOVERSE_N1

    Rename IS_ARES to IS_NEOVERSE_N1 since that is a bit clearer.

    Reviewed-by: Carlos O'Donell
    (cherry picked from commit 0f6278a8793a5d04ea31878119eccf99f469a02d)

commit 236287f869d60b3a6dd32177dff29aa53b50d4b1
Author: Wilco Dijkstra
Date:   Fri Aug 28 17:51:40 2020 +0100

    AArch64: Improve backwards memmove performance

    On some microarchitectures the performance of the backwards memmove
    improves if the stores use STR with decreasing addresses. So change the
    memmove loop in memcpy_advsimd.S to use 2x STR rather than STP.

    Reviewed-by: Adhemerval Zanella
    (cherry picked from commit bd394d131c10c9ec22c6424197b79410042eed99)
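As a side note, the test that decides between the forward and backward paths
(see L(move_long) in the patch below) can be modelled in C roughly as follows.
This is an illustrative sketch only; move_bytes is a made-up name, and the
real code of course copies in 16- and 64-byte blocks rather than bytes:

#include <stddef.h>
#include <stdint.h>

/* Copy backwards only when the destination overlaps the source from above;
   otherwise the forward (memcpy) order is already safe.  The unsigned
   difference also covers the dst < src case, exactly like the asm.  */
static void
move_bytes (unsigned char *dst, const unsigned char *src, size_t count)
{
  uintptr_t diff = (uintptr_t) dst - (uintptr_t) src;

  if (count == 0 || diff == 0)
    return;                             /* Nothing to do.  */
  if (diff >= count)                    /* No harmful overlap: copy forwards.  */
    {
      for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
      return;
    }
  for (size_t i = count; i > 0; i--)    /* Overlap: copy backwards.  */
    dst[i - 1] = src[i - 1];
}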
commit ade1fa24e36dd9d1821e1c693bf903cb9ffbf1be
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:55:07 2020 +0100

    AArch64: Add optimized Q-register memcpy

    Add a new memcpy using 128-bit Q registers - this is faster on modern
    cores and reduces codesize. Similar to the generic memcpy, small cases
    include copies up to 32 bytes. 64-128 byte copies are split into two
    cases to improve performance of 64-96 byte copies. Large copies align
    the source rather than the destination.

    bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
    so make this memcpy the default on N1 (on Centriq it is 15% faster than
    memcpy_falkor).

    Passes GLIBC regression tests.

    Reviewed-by: Szabolcs Nagy
    (cherry picked from commit 4a733bf375238a6a595033b5785cea7f27d61307)

commit afc53d52dc0f36bf5e0970c9cca4649a7cef72bd
Author: Wilco Dijkstra
Date:   Wed Jul 15 16:50:02 2020 +0100

    AArch64: Align ENTRY to a cacheline

    Given almost all uses of ENTRY are for string/memory functions,
    align ENTRY to a cacheline to simplify things.

    Reviewed-by: Carlos O'Donell
    (cherry picked from commit 34f0d01d5e43c7dedd002ab47f6266dfb5b79c22)
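Before the patch itself, a rough C model of the copy strategy the commits
above describe may help review.  This is a simplification under my own
naming (copy_model is not a glibc function), not the actual algorithm:
small and medium sizes load a fixed-size block from the start and another
from the end so the tail needs no loop, and large sizes run a 64-byte loop
and then unconditionally copy the final 64 bytes from the end.

#include <stddef.h>
#include <string.h>

static void
copy_model (unsigned char *dst, const unsigned char *src, size_t n)
{
  if (n <= 32)
    {
      if (n >= 16)
        {
          memcpy (dst, src, 16);                    /* First 16 bytes.  */
          memcpy (dst + n - 16, src + n - 16, 16);  /* Last 16, may overlap.  */
        }
      else if (n > 0)
        memcpy (dst, src, n);    /* 0..15 bytes; the asm splits this further.  */
      return;
    }
  if (n <= 128)
    {
      memcpy (dst, src, n <= 64 ? 32 : 64);         /* Head.  */
      memcpy (dst + n - 32, src + n - 32, 32);      /* Tail.  */
      if (n > 96)
        memcpy (dst + n - 64, src + n - 64, 32);    /* Extra tail block.  */
      return;
    }
  size_t i = 0;
  while (n - i > 64)                                /* 64 bytes per iteration.  */
    {
      memcpy (dst + i, src + i, 64);
      i += 64;
    }
  memcpy (dst + n - 64, src + n - 64, 64);  /* Always copy 64 from the end.  */
}

The overlapping stores are harmless for memcpy's non-overlapping buffers and
remove all per-byte tail branches, which is the point of the design.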
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index bcfef1c..cc8142d 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -33,32 +33,24 @@
 #define A_l     x6
 #define A_lw    w6
 #define A_h     x7
-#define A_hw    w7
 #define B_l     x8
 #define B_lw    w8
 #define B_h     x9
 #define C_l     x10
+#define C_lw    w10
 #define C_h     x11
 #define D_l     x12
 #define D_h     x13
-#define E_l     src
-#define E_h     count
-#define F_l     srcend
-#define F_h     dst
+#define E_l     x14
+#define E_h     x15
+#define F_l     x16
+#define F_h     x17
 #define G_l     count
 #define G_h     dst
+#define H_l     src
+#define H_h     srcend
 #define tmp1    x14
 
-/* Copies are split into 3 main cases: small copies of up to 16 bytes,
-   medium copies of 17..96 bytes which are fully unrolled.  Large copies
-   of more than 96 bytes align the destination and use an unrolled loop
-   processing 64 bytes per iteration.
-   In order to share code with memmove, small and medium copies read all
-   data before writing, allowing any kind of overlap.  So small, medium
-   and large backwards memmoves are handled by falling through into memcpy.
-   Overlapping large forward memmoves use a loop that copies backwards.
-*/
-
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
@@ -66,108 +58,115 @@
 # define MEMCPY memcpy
 #endif
 
-ENTRY_ALIGN (MEMMOVE, 6)
-
-        DELOUSE (0)
-        DELOUSE (1)
-        DELOUSE (2)
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
 
-        sub     tmp1, dstin, src
-        cmp     count, 96
-        ccmp    tmp1, count, 2, hi
-        b.lo    L(move_long)
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
 
-        /* Common case falls through into memcpy.  */
-END (MEMMOVE)
-libc_hidden_builtin_def (MEMMOVE)
-ENTRY (MEMCPY)
+   Large copies use a software pipelined loop processing 64 bytes per
+   iteration.  The destination pointer is 16-byte aligned to minimize
+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
+   from the end.
+*/
 
+ENTRY_ALIGN (MEMCPY, 6)
         DELOUSE (0)
         DELOUSE (1)
         DELOUSE (2)
 
-        prfm    PLDL1KEEP, [src]
         add     srcend, src, count
         add     dstend, dstin, count
-        cmp     count, 16
-        b.ls    L(copy16)
-        cmp     count, 96
+        cmp     count, 128
        b.hi    L(copy_long)
+        cmp     count, 32
+        b.hi    L(copy32_128)
 
-        /* Medium copies: 17..96 bytes.  */
-        sub     tmp1, count, 1
+        /* Small copies: 0..32 bytes.  */
+        cmp     count, 16
+        b.lo    L(copy16)
         ldp     A_l, A_h, [src]
-        tbnz    tmp1, 6, L(copy96)
        ldp     D_l, D_h, [srcend, -16]
-        tbz     tmp1, 5, 1f
-        ldp     B_l, B_h, [src, 16]
-        ldp     C_l, C_h, [srcend, -32]
-        stp     B_l, B_h, [dstin, 16]
-        stp     C_l, C_h, [dstend, -32]
-1:
        stp     A_l, A_h, [dstin]
        stp     D_l, D_h, [dstend, -16]
        ret
 
-        .p2align 4
-        /* Small copies: 0..16 bytes.  */
+        /* Copy 8-15 bytes.  */
 L(copy16):
-        cmp     count, 8
-        b.lo    1f
+        tbz     count, 3, L(copy8)
        ldr     A_l, [src]
        ldr     A_h, [srcend, -8]
        str     A_l, [dstin]
        str     A_h, [dstend, -8]
        ret
-        .p2align 4
-1:
-        tbz     count, 2, 1f
+
+        .p2align 3
+        /* Copy 4-7 bytes.  */
+L(copy8):
+        tbz     count, 2, L(copy4)
        ldr     A_lw, [src]
-        ldr     A_hw, [srcend, -4]
+        ldr     B_lw, [srcend, -4]
        str     A_lw, [dstin]
-        str     A_hw, [dstend, -4]
+        str     B_lw, [dstend, -4]
        ret
 
-        /* Copy 0..3 bytes.  Use a branchless sequence that copies the same
-           byte 3 times if count==1, or the 2nd byte twice if count==2.  */
-1:
-        cbz     count, 2f
+        /* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+        cbz     count, L(copy0)
        lsr     tmp1, count, 1
        ldrb    A_lw, [src]
-        ldrb    A_hw, [srcend, -1]
+        ldrb    C_lw, [srcend, -1]
        ldrb    B_lw, [src, tmp1]
        strb    A_lw, [dstin]
        strb    B_lw, [dstin, tmp1]
-        strb    A_hw, [dstend, -1]
-2:      ret
+        strb    C_lw, [dstend, -1]
+L(copy0):
+        ret
 
        .p2align 4
-        /* Copy 64..96 bytes.  Copy 64 bytes from the start and
-           32 bytes from the end.  */
-L(copy96):
+        /* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+        ldp     A_l, A_h, [src]
        ldp     B_l, B_h, [src, 16]
-        ldp     C_l, C_h, [src, 32]
-        ldp     D_l, D_h, [src, 48]
-        ldp     E_l, E_h, [srcend, -32]
-        ldp     F_l, F_h, [srcend, -16]
+        ldp     C_l, C_h, [srcend, -32]
+        ldp     D_l, D_h, [srcend, -16]
+        cmp     count, 64
+        b.hi    L(copy128)
        stp     A_l, A_h, [dstin]
        stp     B_l, B_h, [dstin, 16]
-        stp     C_l, C_h, [dstin, 32]
-        stp     D_l, D_h, [dstin, 48]
-        stp     E_l, E_h, [dstend, -32]
-        stp     F_l, F_h, [dstend, -16]
+        stp     C_l, C_h, [dstend, -32]
+        stp     D_l, D_h, [dstend, -16]
        ret
 
-        /* Align DST to 16 byte alignment so that we don't cross cache line
-           boundaries on both loads and stores.  There are at least 96 bytes
-           to copy, so copy 16 bytes unaligned and then align.  The loop
-           copies 64 bytes per iteration and prefetches one iteration ahead.  */
+        .p2align 4
+        /* Copy 65..128 bytes.  */
+L(copy128):
+        ldp     E_l, E_h, [src, 32]
+        ldp     F_l, F_h, [src, 48]
+        cmp     count, 96
+        b.ls    L(copy96)
+        ldp     G_l, G_h, [srcend, -64]
+        ldp     H_l, H_h, [srcend, -48]
+        stp     G_l, G_h, [dstend, -64]
+        stp     H_l, H_h, [dstend, -48]
+L(copy96):
+        stp     A_l, A_h, [dstin]
+        stp     B_l, B_h, [dstin, 16]
+        stp     E_l, E_h, [dstin, 32]
+        stp     F_l, F_h, [dstin, 48]
+        stp     C_l, C_h, [dstend, -32]
+        stp     D_l, D_h, [dstend, -16]
+        ret
 
        .p2align 4
+        /* Copy more than 128 bytes.  */
 L(copy_long):
+        /* Copy 16 bytes and then align dst to 16-byte alignment.  */
+        ldp     D_l, D_h, [src]
        and     tmp1, dstin, 15
        bic     dst, dstin, 15
-        ldp     D_l, D_h, [src]
        sub     src, src, tmp1
        add     count, count, tmp1      /* Count is now 16 too large.  */
        ldp     A_l, A_h, [src, 16]
@@ -176,7 +175,8 @@ L(copy_long):
        ldp     C_l, C_h, [src, 48]
        ldp     D_l, D_h, [src, 64]!
        subs    count, count, 128 + 16  /* Test and readjust count.  */
-        b.ls    L(last64)
+        b.ls    L(copy64_from_end)
+
 L(loop64):
        stp     A_l, A_h, [dst, 16]
        ldp     A_l, A_h, [src, 16]
@@ -189,10 +189,8 @@ L(loop64):
        subs    count, count, 64
        b.hi    L(loop64)
 
-        /* Write the last full set of 64 bytes.  The remainder is at most 64
-           bytes, so it is safe to always copy 64 bytes from the end even if
-           there is just 1 byte left.  */
-L(last64):
+        /* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
        ldp     E_l, E_h, [srcend, -64]
        stp     A_l, A_h, [dst, 16]
        ldp     A_l, A_h, [srcend, -48]
@@ -207,20 +205,42 @@ L(last64):
        stp     C_l, C_h, [dstend, -16]
        ret
 
-        .p2align 4
-L(move_long):
-        cbz     tmp1, 3f
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 4)
+        DELOUSE (0)
+        DELOUSE (1)
+        DELOUSE (2)
 
        add     srcend, src, count
        add     dstend, dstin, count
+        cmp     count, 128
+        b.hi    L(move_long)
+        cmp     count, 32
+        b.hi    L(copy32_128)
+
+        /* Small copies: 0..32 bytes.  */
+        cmp     count, 16
+        b.lo    L(copy16)
+        ldp     A_l, A_h, [src]
+        ldp     D_l, D_h, [srcend, -16]
+        stp     A_l, A_h, [dstin]
+        stp     D_l, D_h, [dstend, -16]
+        ret
 
-        /* Align dstend to 16 byte alignment so that we don't cross cache line
-           boundaries on both loads and stores.  There are at least 96 bytes
-           to copy, so copy 16 bytes unaligned and then align.  The loop
-           copies 64 bytes per iteration and prefetches one iteration ahead.  */
+        .p2align 4
+L(move_long):
+        /* Only use backward copy if there is an overlap.  */
+        sub     tmp1, dstin, src
+        cbz     tmp1, L(copy0)
+        cmp     tmp1, count
+        b.hs    L(copy_long)
 
-        and     tmp1, dstend, 15
+        /* Large backwards copy for overlapping copies.
+           Copy 16 bytes and then align dst to 16-byte alignment.  */
        ldp     D_l, D_h, [srcend, -16]
+        and     tmp1, dstend, 15
        sub     srcend, srcend, tmp1
        sub     count, count, tmp1
        ldp     A_l, A_h, [srcend, -16]
@@ -230,10 +250,9 @@ L(move_long):
        ldp     D_l, D_h, [srcend, -64]!
        sub     dstend, dstend, tmp1
        subs    count, count, 128
-        b.ls    2f
+        b.ls    L(copy64_from_start)
 
-        nop
-1:
+L(loop64_backwards):
        stp     A_l, A_h, [dstend, -16]
        ldp     A_l, A_h, [srcend, -16]
        stp     B_l, B_h, [dstend, -32]
@@ -243,12 +262,10 @@ L(move_long):
        stp     D_l, D_h, [dstend, -64]!
        ldp     D_l, D_h, [srcend, -64]!
        subs    count, count, 64
-        b.hi    1b
+        b.hi    L(loop64_backwards)
 
-        /* Write the last full set of 64 bytes.  The remainder is at most 64
-           bytes, so it is safe to always copy 64 bytes from the start even if
-           there is just 1 byte left.  */
-2:
+        /* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
        ldp     G_l, G_h, [src, 48]
        stp     A_l, A_h, [dstend, -16]
        ldp     A_l, A_h, [src, 32]
@@ -261,7 +278,7 @@ L(move_long):
        stp     A_l, A_h, [dstin, 32]
        stp     B_l, B_h, [dstin, 16]
        stp     C_l, C_h, [dstin]
-3:      ret
+        ret
 
-END (MEMCPY)
-libc_hidden_builtin_def (MEMCPY)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 4150b89..b2b4250 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,5 +1,5 @@
 ifeq ($(subdir),string)
-sysdep_routines += memcpy_generic memcpy_thunderx memcpy_thunderx2 \
+sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
                   memcpy_falkor memmove_falkor \
                   memset_generic memset_falkor memset_emag \
                   memchr_generic memchr_nosimd \
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 10ff7d4..9704f42 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -42,11 +42,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx2)
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
+              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
   IFUNC_IMPL (i, name, memmove,
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx2)
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
+              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
   IFUNC_IMPL (i, name, memset,
              /* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index f79f84c..1528d89 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memcpy) __libc_memcpy;
 
 extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
@@ -36,11 +37,14 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
 libc_ifunc (__libc_memcpy,
            (IS_THUNDERX (midr)
             ? __memcpy_thunderx
-             : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_ARES (midr)
+             : (IS_FALKOR (midr) || IS_PHECDA (midr)
                ? __memcpy_falkor
                : (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
                  ? __memcpy_thunderx2
-                  : __memcpy_generic))));
+                  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+                     || IS_NEOVERSE_V1 (midr)
+                     ? __memcpy_simd
+                     : __memcpy_generic)))));
 
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
diff --git a/sysdeps/aarch64/multiarch/memcpy_advsimd.S b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
new file mode 100644
index 0000000..48bb6d7
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
@@ -0,0 +1,248 @@
+/* Generic optimized memcpy using SIMD.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
+ *
+ */
+
+#define dstin   x0
+#define src     x1
+#define count   x2
+#define dst     x3
+#define srcend  x4
+#define dstend  x5
+#define A_l     x6
+#define A_lw    w6
+#define A_h     x7
+#define B_l     x8
+#define B_lw    w8
+#define B_h     x9
+#define C_lw    w10
+#define tmp1    x14
+
+#define A_q     q0
+#define B_q     q1
+#define C_q     q2
+#define D_q     q3
+#define E_q     q4
+#define F_q     q5
+#define G_q     q6
+#define H_q     q7
+
+
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
+
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
+
+   Large copies use a software pipelined loop processing 64 bytes per
+   iteration.  The destination pointer is 16-byte aligned to minimize
+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
+   from the end.  */
+
+ENTRY (__memcpy_simd)
+        DELOUSE (0)
+        DELOUSE (1)
+        DELOUSE (2)
+
+        add     srcend, src, count
+        add     dstend, dstin, count
+        cmp     count, 128
+        b.hi    L(copy_long)
+        cmp     count, 32
+        b.hi    L(copy32_128)
+
+        /* Small copies: 0..32 bytes.  */
+        cmp     count, 16
+        b.lo    L(copy16)
+        ldr     A_q, [src]
+        ldr     B_q, [srcend, -16]
+        str     A_q, [dstin]
+        str     B_q, [dstend, -16]
+        ret
+
+        /* Copy 8-15 bytes.  */
+L(copy16):
+        tbz     count, 3, L(copy8)
+        ldr     A_l, [src]
+        ldr     A_h, [srcend, -8]
+        str     A_l, [dstin]
+        str     A_h, [dstend, -8]
+        ret
+
+        /* Copy 4-7 bytes.  */
+L(copy8):
+        tbz     count, 2, L(copy4)
+        ldr     A_lw, [src]
+        ldr     B_lw, [srcend, -4]
+        str     A_lw, [dstin]
+        str     B_lw, [dstend, -4]
+        ret
+
+        /* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+        cbz     count, L(copy0)
+        lsr     tmp1, count, 1
+        ldrb    A_lw, [src]
+        ldrb    C_lw, [srcend, -1]
+        ldrb    B_lw, [src, tmp1]
+        strb    A_lw, [dstin]
+        strb    B_lw, [dstin, tmp1]
+        strb    C_lw, [dstend, -1]
+L(copy0):
+        ret
+
+        .p2align 4
+        /* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+        ldp     A_q, B_q, [src]
+        ldp     C_q, D_q, [srcend, -32]
+        cmp     count, 64
+        b.hi    L(copy128)
+        stp     A_q, B_q, [dstin]
+        stp     C_q, D_q, [dstend, -32]
+        ret
+
+        .p2align 4
+        /* Copy 65..128 bytes.  */
+L(copy128):
+        ldp     E_q, F_q, [src, 32]
+        cmp     count, 96
+        b.ls    L(copy96)
+        ldp     G_q, H_q, [srcend, -64]
+        stp     G_q, H_q, [dstend, -64]
+L(copy96):
+        stp     A_q, B_q, [dstin]
+        stp     E_q, F_q, [dstin, 32]
+        stp     C_q, D_q, [dstend, -32]
+        ret
+
+        /* Align loop64 below to 16 bytes.  */
+        nop
+
+        /* Copy more than 128 bytes.  */
+L(copy_long):
+        /* Copy 16 bytes and then align src to 16-byte alignment.  */
+        ldr     D_q, [src]
+        and     tmp1, src, 15
+        bic     src, src, 15
+        sub     dst, dstin, tmp1
+        add     count, count, tmp1      /* Count is now 16 too large.  */
+        ldp     A_q, B_q, [src, 16]
+        str     D_q, [dstin]
+        ldp     C_q, D_q, [src, 48]
+        subs    count, count, 128 + 16  /* Test and readjust count.  */
+        b.ls    L(copy64_from_end)
+L(loop64):
+        stp     A_q, B_q, [dst, 16]
+        ldp     A_q, B_q, [src, 80]
+        stp     C_q, D_q, [dst, 48]
+        ldp     C_q, D_q, [src, 112]
+        add     src, src, 64
+        add     dst, dst, 64
+        subs    count, count, 64
+        b.hi    L(loop64)
+
+        /* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
+        ldp     E_q, F_q, [srcend, -64]
+        stp     A_q, B_q, [dst, 16]
+        ldp     A_q, B_q, [srcend, -32]
+        stp     C_q, D_q, [dst, 48]
+        stp     E_q, F_q, [dstend, -64]
+        stp     A_q, B_q, [dstend, -32]
+        ret
+
+END (__memcpy_simd)
+libc_hidden_builtin_def (__memcpy_simd)
+
+
+ENTRY (__memmove_simd)
+        DELOUSE (0)
+        DELOUSE (1)
+        DELOUSE (2)
+
+        add     srcend, src, count
+        add     dstend, dstin, count
+        cmp     count, 128
+        b.hi    L(move_long)
+        cmp     count, 32
+        b.hi    L(copy32_128)
+
+        /* Small moves: 0..32 bytes.  */
+        cmp     count, 16
+        b.lo    L(copy16)
+        ldr     A_q, [src]
+        ldr     B_q, [srcend, -16]
+        str     A_q, [dstin]
+        str     B_q, [dstend, -16]
+        ret
+
+L(move_long):
+        /* Only use backward copy if there is an overlap.  */
+        sub     tmp1, dstin, src
+        cbz     tmp1, L(move0)
+        cmp     tmp1, count
+        b.hs    L(copy_long)
+
+        /* Large backwards copy for overlapping copies.
+           Copy 16 bytes and then align srcend to 16-byte alignment.  */
+L(copy_long_backwards):
+        ldr     D_q, [srcend, -16]
+        and     tmp1, srcend, 15
+        bic     srcend, srcend, 15
+        sub     count, count, tmp1
+        ldp     A_q, B_q, [srcend, -32]
+        str     D_q, [dstend, -16]
+        ldp     C_q, D_q, [srcend, -64]
+        sub     dstend, dstend, tmp1
+        subs    count, count, 128
+        b.ls    L(copy64_from_start)
+
+L(loop64_backwards):
+        str     B_q, [dstend, -16]
+        str     A_q, [dstend, -32]
+        ldp     A_q, B_q, [srcend, -96]
+        str     D_q, [dstend, -48]
+        str     C_q, [dstend, -64]!
+        ldp     C_q, D_q, [srcend, -128]
+        sub     srcend, srcend, 64
+        subs    count, count, 64
+        b.hi    L(loop64_backwards)
+
+        /* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
+        ldp     E_q, F_q, [src, 32]
+        stp     A_q, B_q, [dstend, -32]
+        ldp     A_q, B_q, [src]
+        stp     C_q, D_q, [dstend, -64]
+        stp     E_q, F_q, [dstin, 32]
+        stp     A_q, B_q, [dstin]
+L(move0):
+        ret
+
+END (__memmove_simd)
+libc_hidden_builtin_def (__memmove_simd)
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index f3d341b..60a0b25 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memmove) __libc_memmove;
 
 extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
+extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
@@ -40,7 +41,10 @@ libc_ifunc (__libc_memmove,
                ? __memmove_falkor
                : (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
                  ? __memmove_thunderx2
-                  : __memmove_generic))));
+                  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+                     || IS_NEOVERSE_V1 (midr)
+                     ? __memmove_simd
+                     : __memmove_generic)))));
 
 # undef memmove
 strong_alias (__libc_memmove, memmove);
diff --git a/sysdeps/aarch64/sysdep.h b/sysdeps/aarch64/sysdep.h
index d3ff685..f995544 100644
--- a/sysdeps/aarch64/sysdep.h
+++ b/sysdeps/aarch64/sysdep.h
@@ -45,7 +45,7 @@
 #define ENTRY(name)                                             \
   .globl C_SYMBOL_NAME(name);                                   \
   .type C_SYMBOL_NAME(name),%function;                          \
-  .align 4;                                                     \
+  .p2align 6;                                                   \
   C_LABEL(name)                                                 \
   cfi_startproc;                                                \
   CALL_MCOUNT
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 1273911..0877013 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -51,8 +51,12 @@
 
 #define IS_PHECDA(midr) (MIDR_IMPLEMENTOR(midr) == 'h'                       \
                         && MIDR_PARTNUM(midr) == 0x000)
-#define IS_ARES(midr) (MIDR_IMPLEMENTOR(midr) == 'A'                         \
-                       && MIDR_PARTNUM(midr) == 0xd0c)
+#define IS_NEOVERSE_N1(midr) (MIDR_IMPLEMENTOR(midr) == 'A'                  \
+                              && MIDR_PARTNUM(midr) == 0xd0c)
+#define IS_NEOVERSE_N2(midr) (MIDR_IMPLEMENTOR(midr) == 'A'                  \
+                              && MIDR_PARTNUM(midr) == 0xd49)
+#define IS_NEOVERSE_V1(midr) (MIDR_IMPLEMENTOR(midr) == 'A'                  \
+                              && MIDR_PARTNUM(midr) == 0xd40)
 
 #define IS_EMAG(midr) (MIDR_IMPLEMENTOR(midr) == 'P'                         \
                        && MIDR_PARTNUM(midr) == 0x000)
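One detail worth calling out from the new L(copy4) path above is the
branchless 0..3 byte copy.  A C rendering of the idea, as I read it
(copy_upto3 is an illustrative name, not glibc code): for count == 1 the
same byte is written three times, for count == 2 the middle byte aliases
the last one, and count == 3 touches three distinct bytes, so only the
count == 0 test needs a branch.

#include <stddef.h>

static void
copy_upto3 (unsigned char *dst, const unsigned char *src, size_t count)
{
  if (count == 0)
    return;
  size_t mid = count >> 1;        /* 0 for count == 1, else 1.  */
  unsigned char a = src[0];
  unsigned char b = src[mid];
  unsigned char c = src[count - 1];
  dst[0] = a;
  dst[mid] = b;
  dst[count - 1] = c;
}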