From: Wilco Dijkstra
To: "naohirot@fujitsu.com"
CC: 'GNU C Library'
Subject: [PATCH v2] AArch64: Improve A64FX memset
Date: Fri, 9 Jul 2021 12:23:34 +0000
List-Id: Libc-alpha mailing list

Hi Naohiro,

Here is version 2 which should improve things a lot:

v2: Improve handling of last 512 bytes which improves medium sized memsets.
    Further reduce codesize by removing unnecessary unrolling of dc zva.
    Speed up huge memsets of zero and non-zero.

Reduce the codesize of the A64FX memset by simplifying the small memset code,
better handling of alignment and
last 8 vectors as well as removing redundant
instructions and branches. The size for memset goes down from 1032 to 376 bytes.
For large zeroing memsets use DC ZVA, which almost doubles performance. Large
non-zero memsets use the unroll8 loop which is about 10% faster.

Passes GLIBC regress, OK for commit?

---


diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
index ce54e5418b08c8bc0ecc7affff68a59272ba6397..2737f0cba3e1a9ac887cd8072f6122f4852a9f94 100644
--- a/sysdeps/aarch64/multiarch/memset_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -30,11 +30,7 @@
 #define L2_SIZE (8*1024*1024) // L2 8MB - 1MB
 #define CACHE_LINE_SIZE 256
 #define PF_DIST_L1 (CACHE_LINE_SIZE * 16) // Prefetch distance L1
-#define ZF_DIST (CACHE_LINE_SIZE * 21) // Zerofill distance
-#define rest x8
 #define vector_length x9
-#define vl_remainder x10 // vector_length remainder
-#define cl_remainder x11 // CACHE_LINE_SIZE remainder
 
 #if HAVE_AARCH64_SVE_ASM
 # if IS_IN (libc)
@@ -42,224 +38,126 @@
 
 	.arch armv8.2-a+sve
 
-	.macro dc_zva times
-	dc	zva, tmp1
-	add	tmp1, tmp1, CACHE_LINE_SIZE
-	.if \times-1
-	dc_zva "(\times-1)"
-	.endif
-	.endm
-
 	.macro st1b_unroll first=0, last=7
-	st1b	z0.b, p0, [dst, #\first, mul vl]
+	st1b	z0.b, p0, [dst, \first, mul vl]
 	.if \last-\first
 	st1b_unroll "(\first+1)", \last
 	.endif
 	.endm
 
-	.macro shortcut_for_small_size exit
-	// if rest <= vector_length * 2
+
+#undef BTI_C
+#define BTI_C
+
+ENTRY (MEMSET)
+	PTR_ARG (0)
+	SIZE_ARG (2)
+
+	dup	z0.b, valw
 	whilelo	p0.b, xzr, count
+	cntb	vector_length
 	whilelo	p1.b, vector_length, count
+	st1b	z0.b, p0, [dstin, 0, mul vl]
+	st1b	z0.b, p1, [dstin, 1, mul vl]
 	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
 	ret
-1:	// if rest > vector_length * 8
-	cmp	count, vector_length, lsl 3	// vector_length * 8
-	b.hi	\exit
-	// if rest <= vector_length * 4
-	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, count
-	incb	tmp1
-	whilelo	p3.b, tmp1, count
-	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	ret
-1:	// if rest <= vector_length * 8
-	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, count
-	incb	tmp1
-	whilelo	p5.b, tmp1, count
-	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	incb	tmp1	// vector_length * 5
-	incb	tmp1	// vector_length * 6
-	whilelo	p6.b, tmp1, count
-	incb	tmp1
-	whilelo	p7.b, tmp1, count
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	st1b	z0.b, p6, [dstin, #6, mul vl]
-	st1b	z0.b, p7, [dstin, #7, mul vl]
-	ret
-	.endm
 
-ENTRY (MEMSET)
-
-	PTR_ARG (0)
-	SIZE_ARG (2)
+	// count >= vector_length * 2
+	.p2align 4
+1:	add	dst, dstin, count
+	cmp	count, vector_length, lsl 2
+	b.hi	1f
+	st1b	z0.b, p0, [dst, -2, mul vl]
+	st1b	z0.b, p0, [dst, -1, mul vl]
+	ret
 
-	cbnz	count, 1f
+	// count > vector_length * 4
+1:	cmp	count, vector_length, lsl 3
+	b.hi	L(vl_agnostic)
+	st1b	z0.b, p0, [dstin, 2, mul vl]
+	st1b	z0.b, p0, [dstin, 3, mul vl]
+	st1b	z0.b, p0, [dst, -4, mul vl]
+	st1b	z0.b, p0, [dst, -3, mul vl]
+	st1b	z0.b, p0, [dst, -2, mul vl]
+	st1b	z0.b, p0, [dst, -1, mul vl]
 	ret
-1:	dup	z0.b, valw
-	cntb	vector_length
-	// shortcut for less than vector_length * 8
-	// gives a free ptrue to p0.b for n >= vector_length
-	shortcut_for_small_size L(vl_agnostic)
-	// end of shortcut
 
-L(vl_agnostic): // VL Agnostic
-	mov	rest, count
+	// count >= vector_length * 8
+	.p2align 4
+L(vl_agnostic):
 	mov	dst, dstin
-	add	dstend, dstin, count
-	// if rest >= L2_SIZE && vector_length == 64 then L(L2)
 	mov	tmp1, 64
-	cmp	rest, L2_SIZE
-	ccmp	vector_length, tmp1, 0, cs
-	b.eq	L(L2)
-	// if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
-	cmp	rest, L1_SIZE
+	// if count >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
+	cmp	count, L1_SIZE
 	ccmp	vector_length, tmp1, 0, cs
 	b.eq	L(L1_prefetch)
 
-L(unroll32):
-	lsl	tmp1, vector_length, 3	// vector_length * 8
-	lsl	tmp2, vector_length, 5	// vector_length * 32
-	.p2align 3
-1:	cmp	rest, tmp2
-	b.cc	L(unroll8)
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	sub	rest, rest, tmp2
-	b	1b
-
+	// count >= 8 * vector_length
 L(unroll8):
 	lsl	tmp1, vector_length, 3
-	.p2align 3
-1:	cmp	rest, tmp1
-	b.cc	L(last)
-	st1b_unroll
+	sub	count, count, tmp1
+	lsl	tmp2, vector_length, 1
+	.p2align 4
+1:	subs	count, count, tmp1
+	st1b_unroll 0, 7
 	add	dst, dst, tmp1
-	sub	rest, rest, tmp1
-	b	1b
-
-L(last):
-	whilelo	p0.b, xzr, rest
-	whilelo	p1.b, vector_length, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, rest
-	incb	tmp1
-	whilelo	p3.b, tmp1, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, rest
-	incb	tmp1
-	whilelo	p5.b, tmp1, rest
-	incb	tmp1
-	whilelo	p6.b, tmp1, rest
-	incb	tmp1
-	whilelo	p7.b, tmp1, rest
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	st1b	z0.b, p4, [dst, #4, mul vl]
-	st1b	z0.b, p5, [dst, #5, mul vl]
-	st1b	z0.b, p6, [dst, #6, mul vl]
-	st1b	z0.b, p7, [dst, #7, mul vl]
+	b.hi	1b
+
+	add	dst, dst, count
+	add	count, count, tmp1
+	cmp	count, tmp2
+	b.ls	2f
+	add	tmp2, vector_length, vector_length, lsl 2
+	cmp	count, tmp2
+	b.ls	5f
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z0.b, p0, [dst, 1, mul vl]
+	st1b	z0.b, p0, [dst, 2, mul vl]
+5:	st1b	z0.b, p0, [dst, 3, mul vl]
+	st1b	z0.b, p0, [dst, 4, mul vl]
+	st1b	z0.b, p0, [dst, 5, mul vl]
+2:	st1b	z0.b, p0, [dst, 6, mul vl]
+	st1b	z0.b, p0, [dst, 7, mul vl]
 	ret
 
-L(L1_prefetch): // if rest >= L1_SIZE
+	// count >= L1_SIZE
 	.p2align 3
+L(L1_prefetch):
+	cmp	count, L2_SIZE
+	b.hs	L(L2)
 1:	st1b_unroll 0, 3
 	prfm	pstl1keep, [dst, PF_DIST_L1]
 	st1b_unroll 4, 7
 	prfm	pstl1keep, [dst, PF_DIST_L1 + CACHE_LINE_SIZE]
 	add	dst, dst, CACHE_LINE_SIZE * 2
-	sub	rest, rest, CACHE_LINE_SIZE * 2
-	cmp	rest, L1_SIZE
-	b.ge	1b
-	cbnz	rest, L(unroll32)
-	ret
+	sub	count, count, CACHE_LINE_SIZE * 2
+	cmp	count, PF_DIST_L1
+	b.hs	1b
+	b	L(unroll8)
 
+	// count >= L2_SIZE
 L(L2):
-	// align dst address at vector_length byte boundary
-	sub	tmp1, vector_length, 1
-	ands	tmp2, dst, tmp1
-	// if vl_remainder == 0
-	b.eq	1f
-	sub	vl_remainder, vector_length, tmp2
-	// process remainder until the first vector_length boundary
-	whilelt	p2.b, xzr, vl_remainder
-	st1b	z0.b, p2, [dst]
-	add	dst, dst, vl_remainder
-	sub	rest, rest, vl_remainder
-	// align dstin address at CACHE_LINE_SIZE byte boundary
-1:	mov	tmp1, CACHE_LINE_SIZE
-	ands	tmp2, dst, CACHE_LINE_SIZE - 1
-	// if cl_remainder == 0
-	b.eq	L(L2_dc_zva)
-	sub	cl_remainder, tmp1, tmp2
-	// process remainder until the first CACHE_LINE_SIZE boundary
-	mov	tmp1, xzr	// index
-2:	whilelt	p2.b, tmp1, cl_remainder
-	st1b	z0.b, p2, [dst, tmp1]
-	incb	tmp1
-	cmp	tmp1, cl_remainder
-	b.lo	2b
-	add	dst, dst, cl_remainder
-	sub	rest, rest, cl_remainder
-
-L(L2_dc_zva):
-	// zero fill
-	mov	tmp1, dst
-	dc_zva (ZF_DIST / CACHE_LINE_SIZE) - 1
-	mov	zva_len, ZF_DIST
-	add	tmp1, zva_len, CACHE_LINE_SIZE * 2
-	// unroll
-	.p2align 3
-1:	st1b_unroll 0, 3
-	add	tmp2, dst, zva_len
-	dc	zva, tmp2
-	st1b_unroll 4, 7
-	add	tmp2, tmp2, CACHE_LINE_SIZE
-	dc	zva, tmp2
-	add	dst, dst, CACHE_LINE_SIZE * 2
-	sub	rest, rest, CACHE_LINE_SIZE * 2
-	cmp	rest, tmp1	// ZF_DIST + CACHE_LINE_SIZE * 2
-	b.ge	1b
-	cbnz	rest, L(unroll8)
-	ret
+	tst	valw, 255
+	b.ne	L(unroll8)
+	// align dst to CACHE_LINE_SIZE byte boundary
+	and	tmp1, dst, CACHE_LINE_SIZE - 1
+	sub	tmp1, tmp1, CACHE_LINE_SIZE
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z0.b, p0, [dst, 1, mul vl]
+	st1b	z0.b, p0, [dst, 2, mul vl]
+	st1b	z0.b, p0, [dst, 3, mul vl]
+	sub	dst, dst, tmp1
+	add	count, count, tmp1
+
+	// clear cachelines using DC ZVA
+	sub	count, count, CACHE_LINE_SIZE * 4
+	.p2align 4
+1:	dc	zva, dst
+	add	dst, dst, CACHE_LINE_SIZE
+	subs	count, count, CACHE_LINE_SIZE
+	b.hs	1b
+	add	count, count, CACHE_LINE_SIZE * 4
+	b	L(unroll8)
 
 END (MEMSET)
 libc_hidden_builtin_def (MEMSET)
 
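
Note for readers of the patch: the new small-size path (count up to 8 * vector_length) appears to rely on overlapping stores issued from both ends of the buffer rather than on one predicate per vector. Below is a rough C model of that idea, not the glibc code itself. It assumes a fixed 64-byte vector length, and the names store_pred, store_full, small_memset_model and check_small_memset_model are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 64-byte vectors (512-bit SVE); the assembly reads this with cntb. */
#define VL    ((size_t) 64)
#define GUARD ((size_t) 16)

/* Models a predicated st1b where the predicate comes from whilelo (base, count):
   only bytes whose index is below count are written.  */
static void store_pred (unsigned char *dstin, size_t base, size_t count,
                        unsigned char v)
{
  for (size_t i = base; i < base + VL && i < count; i++)
    dstin[i] = v;
}

/* Models st1b with an all-true predicate: one full vector.  */
static void store_full (unsigned char *p, unsigned char v)
{
  memset (p, v, VL);
}

/* Hypothetical model of the small-size path; valid for count <= 8 * VL.  */
void small_memset_model (unsigned char *dstin, unsigned char v, size_t count)
{
  /* Larger sizes branch to L(vl_agnostic) in the patch; not modelled here.  */
  assert (count <= 8 * VL);

  /* First two vectors, predicated so nothing past count is written.  */
  store_pred (dstin, 0, count, v);
  store_pred (dstin, VL, count, v);
  if (count < 2 * VL)
    return;

  unsigned char *dstend = dstin + count;
  if (count <= 4 * VL)
    {
      /* Last two vectors, anchored at the end; they may overlap the first two.  */
      store_full (dstend - 2 * VL, v);
      store_full (dstend - 1 * VL, v);
      return;
    }

  /* 4 * VL < count <= 8 * VL: four vectors from the start, four from the end.  */
  store_full (dstin + 2 * VL, v);
  store_full (dstin + 3 * VL, v);
  for (size_t k = 4; k >= 1; k--)
    store_full (dstend - k * VL, v);
}

/* Brute-force check: every count in [0, 8 * VL] must fill exactly count bytes
   and leave the guard bytes on both sides untouched.  Returns 1 on success.  */
int check_small_memset_model (void)
{
  static unsigned char buf[GUARD + 8 * VL + GUARD];

  for (size_t count = 0; count <= 8 * VL; count++)
    {
      memset (buf, 0xAA, sizeof buf);
      small_memset_model (buf + GUARD, 0x5C, count);
      for (size_t i = 0; i < sizeof buf; i++)
        {
          int inside = i >= GUARD && i < GUARD + count;
          if (buf[i] != (inside ? 0x5C : 0xAA))
            return 0;
        }
    }
  return 1;
}
```

Calling check_small_memset_model () exercises every size the model covers. The point of the end-anchored stores is that no per-size predicate or tail loop is needed: once count >= 2 * VL the first predicate is all-true, so it can be reused for every remaining store.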