From: Wilco Dijkstra
To: "naohirot@fujitsu.com"
CC: 'GNU C Library', Szabolcs Nagy
Subject: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 12 Apr 2021 12:52:05 +0000
List-Id: Libc-alpha mailing list

Hi,

I have a few comments about memcpy design (the principles apply equally to memset):

1. Overall the code is too large due to enormous unroll factors

Our current memcpy is about 300 bytes (that includes memmove); this memcpy is ~12 times larger!
This hurts performance because the code does not fit in the I-cache for common copies.
On a modern OoO core you need very little unrolling since ALU operations and branches
become essentially free while the CPU executes loads and stores. So rather than unrolling
by 32-64 times, try 4 times - you just need enough to hide the taken branch latency.

2. I don't see any special handling for small copies

Even if you want to hyper-optimize gigabyte-sized copies, small copies are still extremely common,
so you always want to handle those as quickly (and with as little code) as possible. Special-casing
small copies does not slow down the huge copies - the reverse is more likely, since the large-copy
path no longer needs to handle small cases.

3. Check whether using SVE helps small/medium copies

Run the memcpy-random benchmark to see whether it is faster to use SVE for small cases or just the SIMD
copy on your uarch.

4. Avoid making the code too general or too specialized

I see both in the code - trying to deal with different cacheline sizes and different vector lengths,
and also splitting these out into separate cases. If you depend on a particular cacheline size, specialize
the code for that and check the size in the ifunc selector (as various memsets do already). If you want to
handle multiple vector sizes, just use a register for the increment rather than repeating the same code
several times for each vector length.

5. Odd prefetches

I have a hard time believing that first prefetching the data to be written, then clearing it using DC ZVA (???),
then prefetching the same data a 2nd time, before finally writing the loaded data, helps performance...
Generally hardware prefetchers are able to do exactly the right thing since memcpy is trivial to prefetch.
So what is the performance gain of each prefetch/clear step? What is the difference between memcpy
and memmove performance (given memmove doesn't do any of this)?

Cheers,
Wilco
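
To make points 1 and 2 concrete, here is a minimal C sketch of the general shape being suggested:
small sizes are dispatched first, and the bulk loop is unrolled only 4 times over 16-byte blocks.
This is an illustration only, not the glibc implementation and not code from the patch under review.
The names sketch_memcpy, blk16, load16, store16 and copy64 are hypothetical, and the 16/32/64-byte
thresholds are assumptions chosen for readability rather than tuned values.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte block type. memcpy with a constant size is
   typically lowered by the compiler to plain loads and stores. */
typedef struct { unsigned char b[16]; } blk16;

static inline blk16 load16(const unsigned char *p)
{
    blk16 v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store16(unsigned char *p, blk16 v)
{
    memcpy(p, &v, sizeof v);
}

/* Copy exactly 64 bytes: all four loads are issued before the stores. */
static inline void copy64(unsigned char *dst, const unsigned char *src)
{
    blk16 a = load16(src);
    blk16 b = load16(src + 16);
    blk16 c = load16(src + 32);
    blk16 d = load16(src + 48);
    store16(dst, a);
    store16(dst + 16, b);
    store16(dst + 32, c);
    store16(dst + 48, d);
}

/* Sketch of a memcpy with small-copy special cases (assumed thresholds).
   Source and destination must not overlap, as for memcpy. */
void *sketch_memcpy(void *restrict dstv, const void *restrict srcv, size_t n)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;

    if (n < 16) {
        /* Tiny copies: a simple byte loop keeps the sketch short. */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
        return dstv;
    }
    if (n <= 32) {
        /* 16..32 bytes: first and last 16 bytes, which may overlap. */
        blk16 head = load16(src);
        blk16 tail = load16(src + n - 16);
        store16(dst, head);
        store16(dst + n - 16, tail);
        return dstv;
    }
    if (n <= 64) {
        /* 33..64 bytes: first 32 and last 32 bytes, which may overlap. */
        blk16 a = load16(src);
        blk16 b = load16(src + 16);
        blk16 c = load16(src + n - 32);
        blk16 d = load16(src + n - 16);
        store16(dst, a);
        store16(dst + 16, b);
        store16(dst + n - 32, c);
        store16(dst + n - 16, d);
        return dstv;
    }

    /* More than 64 bytes: a loop unrolled 4x (64 bytes per iteration). */
    size_t i = 0;
    while (n - i > 64) {
        copy64(dst + i, src + i);
        i += 64;
    }
    /* The final 64 bytes are copied from the end and may overlap
       bytes the loop already wrote; this avoids a separate tail loop. */
    copy64(dst + n - 64, src + n - 64);
    return dstv;
}
```

The overlapping head/tail copies are what keep the small cases branch-light: each size class is
handled by a fixed number of loads and stores with no inner loop. Whether these thresholds or a
4x unroll are right for a given core is something only benchmarking on that core can settle.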