From: "naohirot@fujitsu.com"
To: 'Wilco Dijkstra'
CC: 'GNU C Library', Szabolcs Nagy
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 19 Apr 2021 12:43:50 +0000
List-Id: Libc-alpha mailing list

Hi Wilco-san,

Let me focus on L1_prefetch in this mail.

> From: Wilco Dijkstra
> > Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> > This unroll configuration recorded the highest performance.

When I tested "4 unrolls", I modified the source code [1][2] from the mail [0]
as follows:

In the case of memcpy,
  I commented out L(unroll8) and L(unroll2), and left L(unroll4), L(unroll1)
  and L(last).
In the case of memmove,
  I commented out L(bwd_unroll8) and L(bwd_unroll2), and left L(bwd_unroll4),
  L(bwd_unroll1) and L(bwd_last).
In the case of memset,
  I commented out L(unroll32), L(unroll8) and L(unroll2), and left L(unroll4),
  L(unroll1) and L(last).

[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S

> > In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> > The performance degraded minus 5 to 15 Gbps/sec at the peak.
>
> So this is the L(L1_vl_64) loop right? I guess the problem is the large number of

No, this is NOT the L(L1_vl_64) loop, but L(vl_agnostic).

> prefetches and all the extra code that is not strictly required (you can remove 5
> redundant mov/cmp instructions from the loop). Also assuming prefetching helps
> here (the good memmove results suggest it's not needed), prefetching directly
> into L1 should be better than first into L2 and then into L1. So I don't see a good
> reason why 4x unrolling would have to be any slower.

I tried removing L(L1_prefetch) from both memcpy and memset, and I also tried
removing only the L2 prefetch instructions (prfm pstl2keep and pldl2keep) in
L(L1_prefetch) from both memcpy and memset.

In the case of memcpy, both removing L(L1_prefetch) [3] and removing the L2
prefetch instructions from L(L1_prefetch) increased the performance in the size
range 64KB-4MB from 18-20 GB/sec [4] to 20-22 GB/sec [5].

[3] https://github.com/NaohiroTamura/glibc/commit/22612299247e64dbffd62aa186513bde7328d104
[4] https://drive.google.com/file/d/1hGWz4eAYWc1ktdw74rzDPxtQQ48P0-Hv/view
[5] https://drive.google.com/file/d/11Pt1mWSCN2LBPHxXUE-rs7Q6JhtBfpyQ/view

In the case of memset, removing L(L1_prefetch) [6] decreased the performance in
the size range 128KB-4MB from 22-24 GB/sec [7] to 20-22 GB/sec [8].
But removing only the L2 prefetch instruction (prfm pstl2keep) in
L(L1_prefetch) [9] kept the performance in the size range 128KB-4MB at
22-24 GB/sec [10].

[6] https://github.com/NaohiroTamura/glibc/blob/22612299247e64dbffd62aa186513bde7328d104/sysdeps/aarch64/multiarch/memset_a64fx.S#L146-L163
    I commented out L146-L163, but didn't commit the change because it
    decreased the performance.
[7] https://drive.google.com/file/d/1MT1d2aBxSoYrzQuRZtv4U9NCXV4ZwHsJ/view
[8] https://drive.google.com/file/d/1qUzYklLvgXTZbP1wm9n4VryF3bgUOplo/view
[9] https://github.com/NaohiroTamura/glibc/commit/cc478c96bac051c9b98b9d9a1ae6f38326f77645
[10] https://drive.google.com/file/d/1bPKHFWyhzNWXX7A_S6_UpZ2BwP2QAJK4/view

In conclusion, I will remove L(L1_prefetch) from memcpy [3] and remove the L2
prefetch instruction (prfm pstl2keep) from L(L1_prefetch) in memset [9].

Thanks.
Naohiro
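
For readers who don't have the assembly at hand, the "4 unrolls" test
configuration in this mail (a 4x tier, then a 1x tier, then a tail) can be
sketched in portable C. This is only an illustration of the tier structure,
not the A64FX code: CHUNK, PREFETCH_DIST, USE_L1_PREFETCH, copy_unroll4 and
copy_matches are hypothetical names, the real routines use SVE loads and
stores rather than memcpy calls, and the exact instruction that the GCC/Clang
builtin __builtin_prefetch emits depends on the compiler and target.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes: CHUNK stands in for one vector register's worth of
   bytes, PREFETCH_DIST for how far ahead a prefetch would reach. */
#define CHUNK 64
#define PREFETCH_DIST 1024

/* Set to 1 to issue a read prefetch in the main loop. The default of 0
   reflects the mail's conclusion for memcpy (prefetch block removed). */
#ifndef USE_L1_PREFETCH
#define USE_L1_PREFETCH 0
#endif

/* Copy n bytes through three tiers: 4 chunks per iteration, then 1 chunk
   per iteration, then the remaining bytes. Buffers must not overlap. */
void *copy_unroll4(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* "unroll4" tier */
    while (n >= 4 * CHUNK) {
#if USE_L1_PREFETCH && defined(__GNUC__)
        /* Only prefetch addresses that are still inside the source. */
        if (n >= PREFETCH_DIST + 4 * CHUNK)
            __builtin_prefetch(s + PREFETCH_DIST, 0, 3);
#endif
        memcpy(d, s, CHUNK);
        memcpy(d + CHUNK, s + CHUNK, CHUNK);
        memcpy(d + 2 * CHUNK, s + 2 * CHUNK, CHUNK);
        memcpy(d + 3 * CHUNK, s + 3 * CHUNK, CHUNK);
        d += 4 * CHUNK;
        s += 4 * CHUNK;
        n -= 4 * CHUNK;
    }

    /* "unroll1" tier */
    while (n >= CHUNK) {
        memcpy(d, s, CHUNK);
        d += CHUNK;
        s += CHUNK;
        n -= CHUNK;
    }

    /* "last" tier: fewer than CHUNK bytes remain */
    if (n > 0)
        memcpy(d, s, n);

    return dst;
}

/* Returns 1 if copy_unroll4 copies exactly n bytes (n <= 4096) and leaves
   the byte just past the destination range untouched, 0 otherwise. */
int copy_matches(size_t n)
{
    static unsigned char src[4096], dst[4097];

    if (n > sizeof src)
        return 0;
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 31u + 7u);
    memset(dst, 0xAA, sizeof dst);

    copy_unroll4(dst, src, n);
    return memcmp(dst, src, n) == 0 && dst[n] == 0xAA;
}
```

Building once with -DUSE_L1_PREFETCH=0 and once with -DUSE_L1_PREFETCH=1 and
timing copy_unroll4 over a range of sizes is one way to repeat the kind of
with/without-prefetch comparison described above on other hardware; no
particular result should be expected from this sketch.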