From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 53921 invoked by alias); 29 Jun 2017 14:32:18 -0000 Mailing-List: contact newlib-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: newlib-owner@sourceware.org Received: (qmail 53673 invoked by uid 89); 29 Jun 2017 14:32:17 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-24.5 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,RCVD_IN_DNSWL_NONE,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.2 spammy=cbnz, endorse X-HELO: EUR01-DB5-obe.outbound.protection.outlook.com Received: from mail-db5eur01on0049.outbound.protection.outlook.com (HELO EUR01-DB5-obe.outbound.protection.outlook.com) (104.47.2.49) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 29 Jun 2017 14:32:13 +0000 Received: from AM5PR0802MB2610.eurprd08.prod.outlook.com (10.175.46.18) by AM5PR0802MB2611.eurprd08.prod.outlook.com (10.175.46.19) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.1220.11; Thu, 29 Jun 2017 14:32:09 +0000 Received: from AM5PR0802MB2610.eurprd08.prod.outlook.com ([fe80::a81e:e7d3:8fc6:7bb5]) by AM5PR0802MB2610.eurprd08.prod.outlook.com ([fe80::a81e:e7d3:8fc6:7bb5%18]) with mapi id 15.01.1220.013; Thu, 29 Jun 2017 14:32:09 +0000 From: Wilco Dijkstra To: "newlib@sourceware.org" CC: nd Subject: [AArch64] Optimized memcmp Date: Thu, 29 Jun 2017 14:32:00 -0000 Message-ID: authentication-results: arm.com; dkim=none (message not signed) header.d=none;arm.com; dmarc=none action=none header.from=arm.com; x-ms-publictraffictype: Email x-microsoft-exchange-diagnostics: 1;AM5PR0802MB2611;7:CnEij8I4ZQ2HZPMKjaaqx7kZmjjMYidNP3FS3eVU5jOxmkYiNFxA41Z/5LjfZJtE7OSpuKGlNgUJdafeBkp4FQUb2UPSM+5sLS0Pyiw9eNoQHpv8od6lgQ3FOkIQQ6+arrpT/RhsZ3GvdykmeCxzJk48qKWsZMwlrBdpKRUsoYn9ZBdvvNzYRSFA+eOLPAv7PSan+RbeJ+IbyFarsGZx4eqbyrGJQ9FKEucw+fFY/RKeAqatEq1Zv9mIy0ZhdEgMvlzgRbpBTtuGjokPzRkfchSPJzXBPkdimQ+OMwyHpV80TM5q1Mxm9pNmCFZAvr5/uSmslRVE8BUZg2KQgFGBjV44+WpxGh0PZMuuHwSOsBYfTrSmrYpfLCTwE3FuPrRPEg5CceIHsOKZ0b0upA6OZLq0Y9WpgImLz+jWjJzd1qhU54O7KgFx3xolxBko9XjY069ZanvXOL14Fz3ZF0aMRtiklJRFDxGcLpaWDUkUpvyk3RdrSJO4vrvGzRF8hf0YfaUzEJi9XW5R8rKPCsUnf9OR05c7XMKPbkKgkUErxp0O9td8MJiDMEfTO3VlJEqkB3vuKwkVRglWJMLnR3oNSbQargHNybxyCSfXBNbiZ0+s+GWxwcHpOPcMZz5rUx+6hfb2d4r2k6/GzySw3Nhga7J2FdH97n8rxgM1MRC/KbXfcXVITdD5eizvfxW3Rd8E0N/p/BiOcLHiDvU0blMhv6LRRLjA1YLPVcuFxW2XkgmugUPL5jAJ67H1uDcwxL/t/WO/oiTGSkQUa1Cud1BvbUDdP3zbZJzA0rBMtyPCgYI= x-ms-office365-filtering-correlation-id: 912e22c1-48d6-4780-1f7b-08d4befba022 x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: UriScan:;BCL:0;PCL:0;RULEID:(300000500095)(300135000095)(300000501095)(300135300095)(300000502095)(300135100095)(22001)(2017030254075)(48565401081)(300000503095)(300135400095)(201703131423075)(201703031133081)(201702281549075)(300000504095)(300135200095)(300000505095)(300135600095)(300000506095)(300135500095);SRVR:AM5PR0802MB2611; x-ms-traffictypediagnostic: AM5PR0802MB2611: nodisclaimer: True x-microsoft-antispam-prvs: x-exchange-antispam-report-test: UriScan:(180628864354917)(236129657087228)(148574349560750)(158140799945019)(247924648384137); x-exchange-antispam-report-cfa-test: BCL:0;PCL:0;RULEID:(100000700101)(100105000095)(100000701101)(100105300095)(100000702101)(100105100095)(6040450)(601004)(2401047)(5005006)(8121501046)(93006095)(93001095)(100000703101)(100105400095)(3002001)(10201501046)(6055026)(6041248)(20161123555025)(20161123564025)(20161123562025)(20161123560025)(201703131423075)(201702281528075)(201703061421075)(201703061406153)(20161123558100)(6072148)(100000704101)(100105200095)(100000705101)(100105500095);SRVR:AM5PR0802MB2611;BCL:0;PCL:0;RULEID:(100000800101)(100110000095)(100000801101)(100110300095)(100000802101)(100110100095)(100000803101)(100110400095)(100000804101)(100110200095)(100000805101)(100110500095);SRVR:AM5PR0802MB2611; x-forefront-prvs: 0353563E2B x-forefront-antispam-report: SFV:NSPM;SFS:(10009020)(6009001)(39850400002)(39410400002)(39840400002)(39400400002)(39450400003)(39860400002)(54534003)(377424004)(478600001)(7736002)(50986999)(54356999)(9686003)(1730700003)(189998001)(3280700002)(33656002)(74316002)(305945005)(5660300001)(66066001)(5640700003)(25786009)(8676002)(2906002)(6506006)(53936002)(7696004)(3660700001)(5250100002)(81166006)(8936002)(2900100001)(14454004)(3846002)(99286003)(6916009)(102836003)(72206003)(6116002)(2351001)(86362001)(38730400002)(6436002)(110136004)(4326008)(55016002)(2501003)(19627235001);DIR:OUT;SFP:1101;SCL:1;SRVR:AM5PR0802MB2611;H:AM5PR0802MB2610.eurprd08.prod.outlook.com;FPR:;SPF:None;MLV:sfv;LANG:en; spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: arm.com X-MS-Exchange-CrossTenant-originalarrivaltime: 29 Jun 2017 14:32:09.7804 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: f34e5979-57d9-4aaa-ad4d-b122a662184d X-MS-Exchange-Transport-CrossTenantHeadersStamped: AM5PR0802MB2611 X-SW-Source: 2017/txt/msg00524.txt.bz2 This is an optimized memcmp for AArch64. This is a complete rewrite using a different algorithm. The previous version split into cases where both inputs were aligned, the inputs were mutually aligned and unaligned using a byte loop. The new version combines all these cases, while small inputs of less than 8 bytes are handled separately. This allows the main code to be sped up using unaligned loads since there are now at least 8 bytes to be compared. After the first 8 bytes, align the first input. This ensures each iteration does at most one unaligned access and mutually aligned inputs behave as aligned. After the main loop, process the last 8 bytes using unaligned accesses. This improves performance of (mutually) aligned cases by 25% and=20 unaligned by >500% (yes >6 times faster) on large inputs. ChangeLog: 2017-06-28 Wilco Dijkstra * newlib/libc/machine/aarch64/memcmp.S (memcmp):=20 Rewrite of optimized memcmp. GLIBC benchtests/bench-memcmp.c performance comparison for Cortex-A53: Length 1, alignment 1/ 1: 153% Length 1, alignment 1/ 1: 119% Length 1, alignment 1/ 1: 154% Length 2, alignment 2/ 2: 121% Length 2, alignment 2/ 2: 140% Length 2, alignment 2/ 2: 121% Length 3, alignment 3/ 3: 105% Length 3, alignment 3/ 3: 105% Length 3, alignment 3/ 3: 105% Length 4, alignment 4/ 4: 155% Length 4, alignment 4/ 4: 154% Length 4, alignment 4/ 4: 161% Length 5, alignment 5/ 5: 173% Length 5, alignment 5/ 5: 173% Length 5, alignment 5/ 5: 173% Length 6, alignment 6/ 6: 145% Length 6, alignment 6/ 6: 145% Length 6, alignment 6/ 6: 145% Length 7, alignment 7/ 7: 125% Length 7, alignment 7/ 7: 125% Length 7, alignment 7/ 7: 125% Length 8, alignment 8/ 8: 111% Length 8, alignment 8/ 8: 130% Length 8, alignment 8/ 8: 124% Length 9, alignment 9/ 9: 160% Length 9, alignment 9/ 9: 160% Length 9, alignment 9/ 9: 150% Length 10, alignment 10/10: 170% Length 10, alignment 10/10: 137% Length 10, alignment 10/10: 150% Length 11, alignment 11/11: 160% Length 11, alignment 11/11: 160% Length 11, alignment 11/11: 160% Length 12, alignment 12/12: 146% Length 12, alignment 12/12: 168% Length 12, alignment 12/12: 156% Length 13, alignment 13/13: 167% Length 13, alignment 13/13: 167% Length 13, alignment 13/13: 173% Length 14, alignment 14/14: 167% Length 14, alignment 14/14: 168% Length 14, alignment 14/14: 168% Length 15, alignment 15/15: 168% Length 15, alignment 15/15: 173% Length 15, alignment 15/15: 173% Length 1, alignment 0/ 0: 134% Length 1, alignment 0/ 0: 127% Length 1, alignment 0/ 0: 119% Length 2, alignment 0/ 0: 94% Length 2, alignment 0/ 0: 94% Length 2, alignment 0/ 0: 106% Length 3, alignment 0/ 0: 82% Length 3, alignment 0/ 0: 87% Length 3, alignment 0/ 0: 82% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 122% Length 5, alignment 0/ 0: 127% Length 5, alignment 0/ 0: 119% Length 5, alignment 0/ 0: 127% Length 6, alignment 0/ 0: 103% Length 6, alignment 0/ 0: 100% Length 6, alignment 0/ 0: 100% Length 7, alignment 0/ 0: 82% Length 7, alignment 0/ 0: 91% Length 7, alignment 0/ 0: 87% Length 8, alignment 0/ 0: 111% Length 8, alignment 0/ 0: 124% Length 8, alignment 0/ 0: 124% Length 9, alignment 0/ 0: 136% Length 9, alignment 0/ 0: 136% Length 9, alignment 0/ 0: 136% Length 10, alignment 0/ 0: 136% Length 10, alignment 0/ 0: 135% Length 10, alignment 0/ 0: 136% Length 11, alignment 0/ 0: 136% Length 11, alignment 0/ 0: 136% Length 11, alignment 0/ 0: 135% Length 12, alignment 0/ 0: 136% Length 12, alignment 0/ 0: 136% Length 12, alignment 0/ 0: 136% Length 13, alignment 0/ 0: 135% Length 13, alignment 0/ 0: 136% Length 13, alignment 0/ 0: 136% Length 14, alignment 0/ 0: 136% Length 14, alignment 0/ 0: 136% Length 14, alignment 0/ 0: 136% Length 15, alignment 0/ 0: 136% Length 15, alignment 0/ 0: 136% Length 15, alignment 0/ 0: 136% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 115% Length 32, alignment 0/ 0: 127% Length 32, alignment 7/ 2: 395% Length 32, alignment 0/ 0: 127% Length 32, alignment 0/ 0: 127% Length 8, alignment 0/ 0: 111% Length 8, alignment 0/ 0: 124% Length 8, alignment 0/ 0: 124% Length 64, alignment 0/ 0: 128% Length 64, alignment 6/ 4: 475% Length 64, alignment 0/ 0: 131% Length 64, alignment 0/ 0: 134% Length 16, alignment 0/ 0: 128% Length 16, alignment 0/ 0: 119% Length 16, alignment 0/ 0: 128% Length 128, alignment 0/ 0: 129% Length 128, alignment 5/ 6: 475% Length 128, alignment 0/ 0: 130% Length 128, alignment 0/ 0: 129% Length 32, alignment 0/ 0: 126% Length 32, alignment 0/ 0: 126% Length 32, alignment 0/ 0: 126% Length 256, alignment 0/ 0: 127% Length 256, alignment 4/ 8: 545% Length 256, alignment 0/ 0: 126% Length 256, alignment 0/ 0: 128% Length 64, alignment 0/ 0: 171% Length 64, alignment 0/ 0: 171% Length 64, alignment 0/ 0: 174% Length 512, alignment 0/ 0: 126% Length 512, alignment 3/10: 585% Length 512, alignment 0/ 0: 126% Length 512, alignment 0/ 0: 127% Length 128, alignment 0/ 0: 129% Length 128, alignment 0/ 0: 128% Length 128, alignment 0/ 0: 129% Length 1024, alignment 0/ 0: 125% Length 1024, alignment 2/12: 611% Length 1024, alignment 0/ 0: 126% Length 1024, alignment 0/ 0: 126% Length 256, alignment 0/ 0: 128% Length 256, alignment 0/ 0: 127% Length 256, alignment 0/ 0: 128% Length 2048, alignment 0/ 0: 125% Length 2048, alignment 1/14: 625% Length 2048, alignment 0/ 0: 125% Length 2048, alignment 0/ 0: 125% Length 512, alignment 0/ 0: 126% Length 512, alignment 0/ 0: 127% Length 512, alignment 0/ 0: 127% Length 4096, alignment 0/ 0: 125% Length 4096, alignment 0/16: 125% Length 4096, alignment 0/ 0: 125% Length 4096, alignment 0/ 0: 125% Length 1024, alignment 0/ 0: 126% Length 1024, alignment 0/ 0: 126% Length 1024, alignment 0/ 0: 126% Length 8192, alignment 0/ 0: 125% Length 8192, alignment 63/18: 636% Length 8192, alignment 0/ 0: 125% Length 8192, alignment 0/ 0: 125% Length 16, alignment 1/ 2: 317% Length 16, alignment 1/ 2: 317% Length 16, alignment 1/ 2: 317% Length 32, alignment 2/ 4: 395% Length 32, alignment 2/ 4: 395% Length 32, alignment 2/ 4: 398% Length 64, alignment 3/ 6: 475% Length 64, alignment 3/ 6: 475% Length 64, alignment 3/ 6: 477% Length 128, alignment 4/ 8: 479% Length 128, alignment 4/ 8: 479% Length 128, alignment 4/ 8: 479% Length 256, alignment 5/10: 543% Length 256, alignment 5/10: 539% Length 256, alignment 5/10: 543% Length 512, alignment 6/12: 585% Length 512, alignment 6/12: 585% Length 512, alignment 6/12: 585% Length 1024, alignment 7/14: 611% Length 1024, alignment 7/14: 611% Length 1024, alignment 7/14: 611% diff --git a/newlib/libc/machine/aarch64/memcmp.S b/newlib/libc/machine/aar= ch64/memcmp.S index 09be4c34417c16ccb5c87a5e8f1a4c08c314843c..1ffb79eb3af7fa841673ecb5a3a= 65b1e980219c2 100644 --- a/newlib/libc/machine/aarch64/memcmp.S +++ b/newlib/libc/machine/aarch64/memcmp.S @@ -1,220 +1,140 @@ -/* memcmp - compare memory - - Copyright (c) 2013, Linaro Limited - Copyright (c) 2017, Samsung Austin R&D Center - All rights reserved. - - Redistribution and use in source and binary forms, with or without - modification, are permitted provided that the following conditions are = met: - * Redistributions of source code must retain the above copyright - notice, this list of conditions and the following disclaimer. - * Redistributions in binary form must reproduce the above copyright - notice, this list of conditions and the following disclaimer in t= he - documentation and/or other materials provided with the distributi= on. - * Neither the name of the Linaro nor the - names of its contributors may be used to endorse or promote produ= cts - derived from this software without specific prior written permiss= ion. - - THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS - "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT - LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR - A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT - HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, - SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT - LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, - DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY - THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ +/* + * Copyright (c) 2017 ARM Ltd + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. The name of the company may not be used to endorse or promote + * products derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY ARM LTD ``AS IS'' AND ANY EXPRESS OR IMPLI= ED + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL ARM LTD BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTA= L, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED + * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ =20 #if (defined (__OPTIMIZE_SIZE__) || defined (PREFER_SIZE_OVER_SPEED)) /* See memcmp-stub.c */ #else + /* Assumptions: * - * ARMv8-a, AArch64 + * ARMv8-a, AArch64, unaligned accesses. */ =20 - .macro def_fn f p2align=3D0 - .text - .p2align \p2align - .global \f - .type \f, %function -\f: - .endm - /* Parameters and result. */ #define src1 x0 #define src2 x1 #define limit x2 -#define result x0 +#define result w0 =20 /* Internal variables. */ #define data1 x3 #define data1w w3 #define data2 x4 #define data2w w4 -#define has_nul x5 -#define diff x6 -#define endloop x7 -#define tmp1 x8 -#define tmp2 x9 -#define tmp3 x10 -#define pos x11 -#define limit_wd x12 -#define mask x13 +#define tmp1 x5 =20 -def_fn memcmp p2align=3D6 - cbz limit, .Lret0 - eor tmp1, src1, src2 - tst tmp1, #7 - b.ne .Lmisaligned8 - ands tmp1, src1, #7 - b.ne .Lmutual_align - add limit_wd, limit, #7 - lsr limit_wd, limit_wd, #3 - /* Start of performance-critical section -- one 64B cache line. */ -.Lloop_aligned: - ldr data1, [src1], #8 - ldr data2, [src2], #8 -.Lstart_realigned: - subs limit_wd, limit_wd, #1 - eor diff, data1, data2 /* Non-zero if differences found. */ - csinv endloop, diff, xzr, ne /* Last Dword or differences. */ - cbz endloop, .Lloop_aligned - /* End of performance-critical section -- one 64B cache line. */ - - /* Not reached the limit, must have found a diff. */ - cbnz limit_wd, .Lnot_limit - - /* Limit % 8 =3D=3D 0 =3D> all bytes significant. */ - ands limit, limit, #7 - b.eq .Lnot_limit - - lsl limit, limit, #3 /* Bits -> bytes. */ - mov mask, #~0 -#ifdef __AARCH64EB__ - lsr mask, mask, limit -#else - lsl mask, mask, limit -#endif - bic data1, data1, mask - bic data2, data2, mask + .macro def_fn f p2align=3D0 + .text + .p2align \p2align + .global \f + .type \f, %function +\f: + .endm =20 - orr diff, diff, mask -.Lnot_limit: +/* Small inputs of less than 8 bytes are handled separately. This allows = the + main code to be sped up using unaligned loads since there are now at le= ast + 8 bytes to be compared. If the first 8 bytes are equal, align src1. + This ensures each iteration does at most one unaligned access even if b= oth + src1 and src2 are unaligned, and mutually aligned inputs behave as if + aligned. After the main loop, process the last 8 bytes using unaligned + accesses. */ =20 -#ifndef __AARCH64EB__ - rev diff, diff +def_fn memcmp p2align=3D6 + subs limit, limit, 8 + b.lo .Lless8 + + /* Limit >=3D 8, so check first 8 bytes using unaligned loads. */ + ldr data1, [src1], 8 + ldr data2, [src2], 8 + and tmp1, src1, 7 + add limit, limit, tmp1 + cmp data1, data2 + bne .Lreturn + + /* Align src1 and adjust src2 with bytes not yet done. */ + sub src1, src1, tmp1 + sub src2, src2, tmp1 + + subs limit, limit, 8 + b.ls .Llast_bytes + + /* Loop performing 8 bytes per iteration using aligned src1. + Limit is pre-decremented by 8 and must be larger than zero. + Exit if <=3D 8 bytes left to do or if the data is not equal. */ + .p2align 4 +.Lloop8: + ldr data1, [src1], 8 + ldr data2, [src2], 8 + subs limit, limit, 8 + ccmp data1, data2, 0, hi /* NZCV =3D 0b0000. */ + b.eq .Lloop8 + + cmp data1, data2 + bne .Lreturn + + /* Compare last 1-8 bytes using unaligned access. */ +.Llast_bytes: + ldr data1, [src1, limit] + ldr data2, [src2, limit] + + /* Compare data bytes and set return value to 0, -1 or 1. */ +.Lreturn: +#ifndef __AARCH64EB__ rev data1, data1 rev data2, data2 #endif - /* The MS-non-zero bit of DIFF marks either the first bit - that is different, or the end of the significant data. - Shifting left now will bring the critical information into the - top bits. */ - clz pos, diff - lsl data1, data1, pos - lsl data2, data2, pos - /* But we need to zero-extend (char is unsigned) the value and then - perform a signed 32-bit subtraction. */ - lsr data1, data1, #56 - sub result, data1, data2, lsr #56 - ret - -.Lmutual_align: - /* Sources are mutually aligned, but are not currently at an - alignment boundary. Round down the addresses and then mask off - the bytes that precede the start point. */ - bic src1, src1, #7 - bic src2, src2, #7 - add limit, limit, tmp1 /* Adjust the limit for the extra. */ - lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */ - ldr data1, [src1], #8 - neg tmp1, tmp1 /* Bits to alignment -64. */ - ldr data2, [src2], #8 - mov tmp2, #~0 -#ifdef __AARCH64EB__ - /* Big-endian. Early bytes are at MSB. */ - lsl tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */ -#else - /* Little-endian. Early bytes are at LSB. */ - lsr tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */ -#endif - add limit_wd, limit, #7 - orr data1, data1, tmp2 - orr data2, data2, tmp2 - lsr limit_wd, limit_wd, #3 - b .Lstart_realigned - -.Lret0: - mov result, #0 - ret - - .p2align 6 -.Lmisaligned8: - - cmp limit, #8 - b.lo .LmisalignedLt8 - -.LunalignedGe8 : - - /* Load the first dword with both src potentially unaligned. */ - ldr data1, [src1] - ldr data2, [src2] - - eor diff, data1, data2 /* Non-zero if differences found. */ - cbnz diff, .Lnot_limit - - /* Sources are not aligned: align one of the sources. */ - - and tmp1, src1, #0x7 - orr tmp3, xzr, #0x8 - sub pos, tmp3, tmp1 - - /* Increment SRC pointers by POS so SRC1 is word-aligned. */ - add src1, src1, pos - add src2, src2, pos - - sub limit, limit, pos - lsr limit_wd, limit, #3 - - cmp limit_wd, #0 - - /* save #bytes to go back to be able to read 8byte at end - pos=3Dnegative offset position to read 8 bytes when len%8 !=3D 0 */ - and limit, limit, #7 - sub pos, limit, #8 - - b .Lstart_part_realigned - - .p2align 5 -.Lloop_part_aligned: - ldr data1, [src1], #8 - ldr data2, [src2], #8 - subs limit_wd, limit_wd, #1 -.Lstart_part_realigned: - eor diff, data1, data2 /* Non-zero if differences found. */ - cbnz diff, .Lnot_limit - b.ne .Lloop_part_aligned - - /* process leftover bytes: read the leftover bytes, starting with - negative offset - so we can load 8 bytes. */ - ldr data1, [src1, pos] - ldr data2, [src2, pos] - eor diff, data1, data2 /* Non-zero if differences found. */ - b .Lnot_limit - -.LmisalignedLt8: - sub limit, limit, #1 -1: - ldrb data1w, [src1], #1 - ldrb data2w, [src2], #1 - subs limit, limit, #1 - ccmp data1w, data2w, #0, cs /* NZCV =3D 0b0000. */ - b.eq 1b - sub result, data1, data2 + cmp data1, data2 +.Lret_eq: + cset result, ne + cneg result, result, lo + ret + + .p2align 4 + /* Compare up to 8 bytes. Limit is [-8..-1]. */ +.Lless8: + adds limit, limit, 4 + b.lo .Lless4 + ldr data1w, [src1], 4 + ldr data2w, [src2], 4 + cmp data1w, data2w + b.ne .Lreturn + sub limit, limit, 4 +.Lless4: + adds limit, limit, 4 + beq .Lret_eq +.Lbyte_loop: + ldrb data1w, [src1], 1 + ldrb data2w, [src2], 1 + subs limit, limit, 1 + ccmp data1w, data2w, 0, ne /* NZCV =3D 0b0000. */ + b.eq .Lbyte_loop + sub result, data1w, data2w ret - .size memcmp, . - memcmp =20 + .size memcmp, . - memcmp #endif