From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: newlib@sourceware.org
Cc: nd
Subject: Re: [AArch64] Optimized memcmp (cortex-strings)
Date: Thu, 10 Aug 2017 16:16:00 -0000
Hi,

Could someone push this into cortex-strings as well please? The patch is
identical to the one below, except that this block should be removed:

#if (defined (__OPTIMIZE_SIZE__) || defined (PREFER_SIZE_OVER_SPEED))
/* See memcmp-stub.c  */
#else
...
#endif

Wilco

From: Wilco Dijkstra
Sent: 29 June 2017 15:32
To: newlib@sourceware.org
Cc: nd
Subject: [AArch64] Optimized memcmp

This is an optimized memcmp for AArch64.  It is a complete rewrite using a
different algorithm.  The previous version split into separate cases for
aligned inputs, mutually aligned inputs, and unaligned inputs handled with
a byte loop.  The new version combines all these cases, while small inputs
of less than 8 bytes are handled separately.

This allows the main code to be sped up using unaligned loads since there
are now at least 8 bytes to be compared.  After the first 8 bytes, align
the first input.  This ensures each iteration does at most one unaligned
access and mutually aligned inputs behave as aligned.  After the main loop,
process the last 8 bytes using unaligned accesses.

This improves performance of (mutually) aligned cases by 25% and unaligned
by >500% (yes, >6 times faster) on large inputs.

ChangeLog:
2017-06-28  Wilco Dijkstra

        * newlib/libc/machine/aarch64/memcmp.S (memcmp):
        Rewrite of optimized memcmp.
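(For illustration only -- not part of the patch -- here is a rough C sketch
of the strategy described above.  Unaligned 64-bit loads are modelled with
memcpy, a little-endian target is assumed (matching the non-__AARCH64EB__
path), and the name memcmp_sketch is made up; the real implementation is the
assembly in the diff below.)

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint64_t load64(const unsigned char *p)   /* unaligned 64-bit load */
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static int cmp_words(uint64_t a, uint64_t b)
{
    /* Byte-reverse so the first differing byte in memory order becomes the
       most significant byte; an unsigned compare then gives the sign.  */
    a = __builtin_bswap64(a);
    b = __builtin_bswap64(b);
    return (a > b) - (a < b);
}

int memcmp_sketch(const void *s1, const void *s2, size_t limit)
{
    const unsigned char *src1 = s1, *src2 = s2;

    if (limit < 8) {                       /* small inputs: plain byte loop */
        while (limit--) {
            int diff = *src1++ - *src2++;
            if (diff != 0)
                return diff;
        }
        return 0;
    }

    /* First 8 bytes with unaligned loads.  */
    uint64_t data1 = load64(src1), data2 = load64(src2);
    if (data1 != data2)
        return cmp_words(data1, data2);

    /* Align src1; src2 moves by the same amount and may stay unaligned.  */
    size_t skip = 8 - ((uintptr_t)src1 & 7);
    src1 += skip;
    src2 += skip;
    limit -= skip;

    /* Main loop: 8 bytes per iteration, at most one unaligned access.  */
    while (limit > 8) {
        data1 = load64(src1);
        data2 = load64(src2);
        if (data1 != data2)
            return cmp_words(data1, data2);
        src1 += 8;
        src2 += 8;
        limit -= 8;
    }

    /* Last 1-8 bytes: re-read the final 8 bytes with unaligned loads.  */
    data1 = load64(src1 + limit - 8);
    data2 = load64(src2 + limit - 8);
    return data1 == data2 ? 0 : cmp_words(data1, data2);
}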
GLIBC benchtests/bench-memcmp.c performance comparison for Cortex-A53:

Length    1, alignment  1/ 1:  153%
Length    1, alignment  1/ 1:  119%
Length    1, alignment  1/ 1:  154%
Length    2, alignment  2/ 2:  121%
Length    2, alignment  2/ 2:  140%
Length    2, alignment  2/ 2:  121%
Length    3, alignment  3/ 3:  105%
Length    3, alignment  3/ 3:  105%
Length    3, alignment  3/ 3:  105%
Length    4, alignment  4/ 4:  155%
Length    4, alignment  4/ 4:  154%
Length    4, alignment  4/ 4:  161%
Length    5, alignment  5/ 5:  173%
Length    5, alignment  5/ 5:  173%
Length    5, alignment  5/ 5:  173%
Length    6, alignment  6/ 6:  145%
Length    6, alignment  6/ 6:  145%
Length    6, alignment  6/ 6:  145%
Length    7, alignment  7/ 7:  125%
Length    7, alignment  7/ 7:  125%
Length    7, alignment  7/ 7:  125%
Length    8, alignment  8/ 8:  111%
Length    8, alignment  8/ 8:  130%
Length    8, alignment  8/ 8:  124%
Length    9, alignment  9/ 9:  160%
Length    9, alignment  9/ 9:  160%
Length    9, alignment  9/ 9:  150%
Length   10, alignment 10/10:  170%
Length   10, alignment 10/10:  137%
Length   10, alignment 10/10:  150%
Length   11, alignment 11/11:  160%
Length   11, alignment 11/11:  160%
Length   11, alignment 11/11:  160%
Length   12, alignment 12/12:  146%
Length   12, alignment 12/12:  168%
Length   12, alignment 12/12:  156%
Length   13, alignment 13/13:  167%
Length   13, alignment 13/13:  167%
Length   13, alignment 13/13:  173%
Length   14, alignment 14/14:  167%
Length   14, alignment 14/14:  168%
Length   14, alignment 14/14:  168%
Length   15, alignment 15/15:  168%
Length   15, alignment 15/15:  173%
Length   15, alignment 15/15:  173%
Length    1, alignment  0/ 0:  134%
Length    1, alignment  0/ 0:  127%
Length    1, alignment  0/ 0:  119%
Length    2, alignment  0/ 0:   94%
Length    2, alignment  0/ 0:   94%
Length    2, alignment  0/ 0:  106%
Length    3, alignment  0/ 0:   82%
Length    3, alignment  0/ 0:   87%
Length    3, alignment  0/ 0:   82%
Length    4, alignment  0/ 0:  115%
Length    4, alignment  0/ 0:  115%
Length    4, alignment  0/ 0:  122%
Length    5, alignment  0/ 0:  127%
Length    5, alignment  0/ 0:  119%
Length    5, alignment  0/ 0:  127%
Length    6, alignment  0/ 0:  103%
Length    6, alignment  0/ 0:  100%
Length    6, alignment  0/ 0:  100%
Length    7, alignment  0/ 0:   82%
Length    7, alignment  0/ 0:   91%
Length    7, alignment  0/ 0:   87%
Length    8, alignment  0/ 0:  111%
Length    8, alignment  0/ 0:  124%
Length    8, alignment  0/ 0:  124%
Length    9, alignment  0/ 0:  136%
Length    9, alignment  0/ 0:  136%
Length    9, alignment  0/ 0:  136%
Length   10, alignment  0/ 0:  136%
Length   10, alignment  0/ 0:  135%
Length   10, alignment  0/ 0:  136%
Length   11, alignment  0/ 0:  136%
Length   11, alignment  0/ 0:  136%
Length   11, alignment  0/ 0:  135%
Length   12, alignment  0/ 0:  136%
Length   12, alignment  0/ 0:  136%
Length   12, alignment  0/ 0:  136%
Length   13, alignment  0/ 0:  135%
Length   13, alignment  0/ 0:  136%
Length   13, alignment  0/ 0:  136%
Length   14, alignment  0/ 0:  136%
Length   14, alignment  0/ 0:  136%
Length   14, alignment  0/ 0:  136%
Length   15, alignment  0/ 0:  136%
Length   15, alignment  0/ 0:  136%
Length   15, alignment  0/ 0:  136%
Length    4, alignment  0/ 0:  115%
Length    4, alignment  0/ 0:  115%
Length    4, alignment  0/ 0:  115%
Length   32, alignment  0/ 0:  127%
Length   32, alignment  7/ 2:  395%
Length   32, alignment  0/ 0:  127%
Length   32, alignment  0/ 0:  127%
Length    8, alignment  0/ 0:  111%
Length    8, alignment  0/ 0:  124%
Length    8, alignment  0/ 0:  124%
Length   64, alignment  0/ 0:  128%
Length   64, alignment  6/ 4:  475%
Length   64, alignment  0/ 0:  131%
Length   64, alignment  0/ 0:  134%
Length   16, alignment  0/ 0:  128%
Length   16, alignment  0/ 0:  119%
Length   16, alignment  0/ 0:  128%
Length  128, alignment  0/ 0:  129%
Length  128, alignment  5/ 6:  475%
Length  128, alignment  0/ 0:  130%
Length  128, alignment  0/ 0:  129%
Length   32, alignment  0/ 0:  126%
Length   32, alignment  0/ 0:  126%
Length   32, alignment  0/ 0:  126%
Length  256, alignment  0/ 0:  127%
Length  256, alignment  4/ 8:  545%
Length  256, alignment  0/ 0:  126%
Length  256, alignment  0/ 0:  128%
Length   64, alignment  0/ 0:  171%
Length   64, alignment  0/ 0:  171%
Length   64, alignment  0/ 0:  174%
Length  512, alignment  0/ 0:  126%
Length  512, alignment  3/10:  585%
Length  512, alignment  0/ 0:  126%
Length  512, alignment  0/ 0:  127%
Length  128, alignment  0/ 0:  129%
Length  128, alignment  0/ 0:  128%
Length  128, alignment  0/ 0:  129%
Length 1024, alignment  0/ 0:  125%
Length 1024, alignment  2/12:  611%
Length 1024, alignment  0/ 0:  126%
Length 1024, alignment  0/ 0:  126%
Length  256, alignment  0/ 0:  128%
Length  256, alignment  0/ 0:  127%
Length  256, alignment  0/ 0:  128%
Length 2048, alignment  0/ 0:  125%
Length 2048, alignment  1/14:  625%
Length 2048, alignment  0/ 0:  125%
Length 2048, alignment  0/ 0:  125%
Length  512, alignment  0/ 0:  126%
Length  512, alignment  0/ 0:  127%
Length  512, alignment  0/ 0:  127%
Length 4096, alignment  0/ 0:  125%
Length 4096, alignment  0/16:  125%
Length 4096, alignment  0/ 0:  125%
Length 4096, alignment  0/ 0:  125%
Length 1024, alignment  0/ 0:  126%
Length 1024, alignment  0/ 0:  126%
Length 1024, alignment  0/ 0:  126%
Length 8192, alignment  0/ 0:  125%
Length 8192, alignment 63/18:  636%
Length 8192, alignment  0/ 0:  125%
Length 8192, alignment  0/ 0:  125%
Length   16, alignment  1/ 2:  317%
Length   16, alignment  1/ 2:  317%
Length   16, alignment  1/ 2:  317%
Length   32, alignment  2/ 4:  395%
Length   32, alignment  2/ 4:  395%
Length   32, alignment  2/ 4:  398%
Length   64, alignment  3/ 6:  475%
Length   64, alignment  3/ 6:  475%
Length   64, alignment  3/ 6:  477%
Length  128, alignment  4/ 8:  479%
Length  128, alignment  4/ 8:  479%
Length  128, alignment  4/ 8:  479%
Length  256, alignment  5/10:  543%
Length  256, alignment  5/10:  539%
Length  256, alignment  5/10:  543%
Length  512, alignment  6/12:  585%
Length  512, alignment  6/12:  585%
Length  512, alignment  6/12:  585%
Length 1024, alignment  7/14:  611%
Length 1024, alignment  7/14:  611%
Length 1024, alignment  7/14:  611%

diff --git a/newlib/libc/machine/aarch64/memcmp.S b/newlib/libc/machine/aarch64/memcmp.S
index 09be4c34417c16ccb5c87a5e8f1a4c08c314843c..1ffb79eb3af7fa841673ecb5a3a65b1e980219c2 100644
--- a/newlib/libc/machine/aarch64/memcmp.S
+++ b/newlib/libc/machine/aarch64/memcmp.S
@@ -1,220 +1,140 @@
-/* memcmp - compare memory
-
-   Copyright (c) 2013, Linaro Limited
-   Copyright (c) 2017, Samsung Austin R&D Center
-   All rights reserved.
-
-   Redistribution and use in source and binary forms, with or without
-   modification, are permitted provided that the following conditions are met:
-       * Redistributions of source code must retain the above copyright
-         notice, this list of conditions and the following disclaimer.
-       * Redistributions in binary form must reproduce the above copyright
-         notice, this list of conditions and the following disclaimer in the
-         documentation and/or other materials provided with the distribution.
-       * Neither the name of the Linaro nor the
-         names of its contributors may be used to endorse or promote products
-         derived from this software without specific prior written permission.
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-   A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT
-   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  */
+/*
+ * Copyright (c) 2017 ARM Ltd
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The name of the company may not be used to endorse or promote
+ *    products derived from this software without specific prior written
+ *    permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY ARM LTD ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL ARM LTD BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
 
 #if (defined (__OPTIMIZE_SIZE__) || defined (PREFER_SIZE_OVER_SPEED))
 /* See memcmp-stub.c  */
 #else
+
 /* Assumptions:
  *
- * ARMv8-a, AArch64
+ * ARMv8-a, AArch64, unaligned accesses.
  */
 
-       .macro def_fn f p2align=0
-       .text
-       .p2align \p2align
-       .global \f
-       .type \f, %function
-\f:
-       .endm
-
 /* Parameters and result.  */
 #define src1           x0
 #define src2           x1
 #define limit          x2
-#define result         x0
+#define result         w0
 
 /* Internal variables.  */
 #define data1          x3
 #define data1w         w3
 #define data2          x4
 #define data2w         w4
-#define has_nul                x5
-#define diff           x6
-#define endloop                x7
-#define tmp1           x8
-#define tmp2           x9
-#define tmp3           x10
-#define pos            x11
-#define limit_wd       x12
-#define mask           x13
+#define tmp1           x5
 
-def_fn memcmp p2align=6
-       cbz     limit, .Lret0
-       eor     tmp1, src1, src2
-       tst     tmp1, #7
-       b.ne    .Lmisaligned8
-       ands    tmp1, src1, #7
-       b.ne    .Lmutual_align
-       add     limit_wd, limit, #7
-       lsr     limit_wd, limit_wd, #3
-       /* Start of performance-critical section  -- one 64B cache line.  */
-.Lloop_aligned:
-       ldr     data1, [src1], #8
-       ldr     data2, [src2], #8
-.Lstart_realigned:
-       subs    limit_wd, limit_wd, #1
-       eor     diff, data1, data2      /* Non-zero if differences found.  */
-       csinv   endloop, diff, xzr, ne  /* Last Dword or differences.  */
-       cbz     endloop, .Lloop_aligned
-       /* End of performance-critical section  -- one 64B cache line.  */
-
-       /* Not reached the limit, must have found a diff.  */
-       cbnz    limit_wd, .Lnot_limit
-
-       /* Limit % 8 == 0 => all bytes significant.  */
-       ands    limit, limit, #7
-       b.eq    .Lnot_limit
-
-       lsl     limit, limit, #3        /* Bits -> bytes.  */
-       mov     mask, #~0
-#ifdef __AARCH64EB__
-       lsr     mask, mask, limit
-#else
-       lsl     mask, mask, limit
-#endif
-       bic     data1, data1, mask
-       bic     data2, data2, mask
+       .macro def_fn f p2align=0
+       .text
+       .p2align \p2align
+       .global \f
+       .type \f, %function
+\f:
+       .endm
 
-       orr     diff, diff, mask
-.Lnot_limit:
+/* Small inputs of less than 8 bytes are handled separately.  This allows the
+   main code to be sped up using unaligned loads since there are now at least
+   8 bytes to be compared.  If the first 8 bytes are equal, align src1.
+   This ensures each iteration does at most one unaligned access even if both
+   src1 and src2 are unaligned, and mutually aligned inputs behave as if
+   aligned.  After the main loop, process the last 8 bytes using unaligned
+   accesses.  */
 
-#ifndef        __AARCH64EB__
-       rev     diff, diff
+def_fn memcmp p2align=6
+       subs    limit, limit, 8
+       b.lo    .Lless8
+
+       /* Limit >= 8, so check first 8 bytes using unaligned loads.  */
+       ldr     data1, [src1], 8
+       ldr     data2, [src2], 8
+       and     tmp1, src1, 7
+       add     limit, limit, tmp1
+       cmp     data1, data2
+       bne     .Lreturn
+
+       /* Align src1 and adjust src2 with bytes not yet done.  */
+       sub     src1, src1, tmp1
+       sub     src2, src2, tmp1
+
+       subs    limit, limit, 8
+       b.ls    .Llast_bytes
+
+       /* Loop performing 8 bytes per iteration using aligned src1.
+          Limit is pre-decremented by 8 and must be larger than zero.
+          Exit if <= 8 bytes left to do or if the data is not equal.  */
+       .p2align 4
+.Lloop8:
+       ldr     data1, [src1], 8
+       ldr     data2, [src2], 8
+       subs    limit, limit, 8
+       ccmp    data1, data2, 0, hi  /* NZCV = 0b0000.  */
+       b.eq    .Lloop8
+
+       cmp     data1, data2
+       bne     .Lreturn
+
+       /* Compare last 1-8 bytes using unaligned access.  */
+.Llast_bytes:
+       ldr     data1, [src1, limit]
+       ldr     data2, [src2, limit]
+
+       /* Compare data bytes and set return value to 0, -1 or 1.  */
+.Lreturn:
+#ifndef __AARCH64EB__
        rev     data1, data1
       rev     data2, data2
 #endif
-       /* The MS-non-zero bit of DIFF marks either the first bit
-          that is different, or the end of the significant data.
-          Shifting left now will bring the critical information into the
-          top bits.  */
-       clz     pos, diff
-       lsl     data1, data1, pos
-       lsl     data2, data2, pos
-       /* But we need to zero-extend (char is unsigned) the value and then
-          perform a signed 32-bit subtraction.  */
-       lsr     data1, data1, #56
-       sub     result, data1, data2, lsr #56
-       ret
-
-.Lmutual_align:
-       /* Sources are mutually aligned, but are not currently at an
-          alignment boundary.  Round down the addresses and then mask off
-          the bytes that precede the start point.  */
-       bic     src1, src1, #7
-       bic     src2, src2, #7
-       add     limit, limit, tmp1      /* Adjust the limit for the extra.  */
-       lsl     tmp1, tmp1, #3          /* Bytes beyond alignment -> bits.  */
-       ldr     data1, [src1], #8
-       neg     tmp1, tmp1              /* Bits to alignment -64.  */
-       ldr     data2, [src2], #8
-       mov     tmp2, #~0
-#ifdef __AARCH64EB__
-       /* Big-endian.  Early bytes are at MSB.  */
-       lsl     tmp2, tmp2, tmp1        /* Shift (tmp1 & 63).  */
-#else
-       /* Little-endian.  Early bytes are at LSB.  */
-       lsr     tmp2, tmp2, tmp1        /* Shift (tmp1 & 63).  */
-#endif
-       add     limit_wd, limit, #7
-       orr     data1, data1, tmp2
-       orr     data2, data2, tmp2
-       lsr     limit_wd, limit_wd, #3
-       b       .Lstart_realigned
-
-.Lret0:
-       mov     result, #0
-       ret
-
-       .p2align 6
-.Lmisaligned8:
-
-       cmp     limit, #8
-       b.lo    .LmisalignedLt8
-
-.LunalignedGe8 :
-
-       /* Load the first dword with both src potentially unaligned.  */
-       ldr     data1, [src1]
-       ldr     data2, [src2]
-
-       eor     diff, data1, data2      /* Non-zero if differences found.  */
-       cbnz    diff, .Lnot_limit
-
-       /* Sources are not aligned: align one of the sources.  */
-
-       and     tmp1, src1, #0x7
-       orr     tmp3, xzr, #0x8
-       sub     pos, tmp3, tmp1
-
-       /* Increment SRC pointers by POS so SRC1 is word-aligned.  */
-       add     src1, src1, pos
-       add     src2, src2, pos
-
-       sub     limit, limit, pos
-       lsr     limit_wd, limit, #3
-
-       cmp limit_wd, #0
-
-       /* save #bytes to go back to be able to read 8byte at end
-          pos=negative offset position to read 8 bytes when len%8 != 0 */
-       and     limit, limit, #7
-       sub     pos, limit, #8
-
-       b       .Lstart_part_realigned
-
-       .p2align 5
-.Lloop_part_aligned:
-       ldr     data1, [src1], #8
-       ldr     data2, [src2], #8
-       subs    limit_wd, limit_wd, #1
-.Lstart_part_realigned:
-       eor     diff, data1, data2      /* Non-zero if differences found.  */
-       cbnz    diff, .Lnot_limit
-       b.ne    .Lloop_part_aligned
-
-       /* process leftover bytes: read the leftover bytes, starting with
-          negative offset - so we can load 8 bytes.  */
-       ldr     data1, [src1, pos]
-       ldr     data2, [src2, pos]
-       eor     diff, data1, data2      /* Non-zero if differences found.  */
-       b .Lnot_limit
-
-.LmisalignedLt8:
-       sub     limit, limit, #1
-1:
-       ldrb    data1w, [src1], #1
-       ldrb    data2w, [src2], #1
-       subs    limit, limit, #1
-       ccmp    data1w, data2w, #0, cs  /* NZCV = 0b0000.  */
-       b.eq    1b
-       sub     result, data1, data2
+       cmp     data1, data2
+.Lret_eq:
+       cset    result, ne
+       cneg    result, result, lo
+       ret
+
+       .p2align 4
+       /* Compare up to 8 bytes.  Limit is [-8..-1].  */
+.Lless8:
+       adds    limit, limit, 4
+       b.lo    .Lless4
+       ldr     data1w, [src1], 4
+       ldr     data2w, [src2], 4
+       cmp     data1w, data2w
+       b.ne    .Lreturn
+       sub     limit, limit, 4
+.Lless4:
+       adds    limit, limit, 4
+       beq     .Lret_eq
+.Lbyte_loop:
+       ldrb    data1w, [src1], 1
+       ldrb    data2w, [src2], 1
+       subs    limit, limit, 1
+       ccmp    data1w, data2w, 0, ne   /* NZCV = 0b0000.  */
+       b.eq    .Lbyte_loop
+       sub     result, data1w, data2w
        ret
-       .size memcmp, . - memcmp
 
+       .size   memcmp, . - memcmp
 #endif
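(Also not part of the patch: a hypothetical standalone check that compares
such an implementation against the C library memcmp over a length/alignment
sweep similar to the benchmark above.  It calls the memcmp_sketch from the C
sketch earlier; the assembly routine could be linked in under another name
instead.  Only the sign of the result is compared, as memcmp only guarantees
the sign.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int memcmp_sketch(const void *, const void *, size_t);  /* from the sketch above */

static int sign(int x) { return (x > 0) - (x < 0); }

int main(void)
{
    enum { BUF = 8192 + 64 };
    static unsigned char a[BUF], b[BUF];

    /* Lengths 1..16, then powers of two up to 8192; all 8x8 alignments.  */
    for (size_t len = 1; len <= 8192; len = len < 16 ? len + 1 : len * 2)
        for (size_t al1 = 0; al1 < 8; al1++)
            for (size_t al2 = 0; al2 < 8; al2++) {
                for (size_t i = 0; i < len; i++)
                    a[al1 + i] = b[al2 + i] = (unsigned char)rand();

                /* Equal case.  */
                if (sign(memcmp_sketch(a + al1, b + al2, len)) !=
                    sign(memcmp(a + al1, b + al2, len))) {
                    printf("mismatch (equal case) len=%zu\n", len);
                    return 1;
                }

                /* Force a difference in the last byte.  */
                b[al2 + len - 1] ^= 0x80;
                if (sign(memcmp_sketch(a + al1, b + al2, len)) !=
                    sign(memcmp(a + al1, b + al2, len))) {
                    printf("mismatch len=%zu align=%zu/%zu\n", len, al1, al2);
                    return 1;
                }
            }
    puts("all comparisons agree");
    return 0;
}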