From: "naohirot@fujitsu.com"
To: Wilco Dijkstra
CC: 'GNU C Library', Szabolcs Nagy
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Fri, 23 Apr 2021 00:58:27 +0000
List-Id: Libc-alpha mailing list

Hi Wilco-san,

Let me make one correction: I forgot that p0.b is available as a free PTRUE.

> From: Tamura, Naohiro/田村 直広
> > If it is WHILELO,
> > it is possible to remove 3x WHILELO from the earlier cases by moving
> > them after a branch (so that the 256-512 case only needs to execute 5x
> > WHILELO rather than 8 into total).
>
> As shown in Graph 2 in Google Sheet [2], this approach didn't make the dip small,
> because I assume that we can reduce two WHILELO, but we needed to add two
> PTRUE.

I didn't have to add the two PTRUE instructions, because p0.b is already free to use.
As shown in Graph 4 in Google Sheet [2], this approach without the two added PTRUE
instructions made the dip slightly smaller, but the improvement is smaller than with
the last approach [4], shown in Graph 3.
So the conclusion does not seem to change.

[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing

The code without the two added PTRUE instructions looks like the following diff.

$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..c3779d0147 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
 1:      // if rest > vector_length * 8
         cmp     n, vector_length, lsl 3 // vector_length * 8
         b.hi    \exit
+        cmp     n, vector_length, lsl 2 // vector_length * 4
+        b.hi    1f
         // if rest <= vector_length * 4
         lsl     tmp1, vector_length, 1  // vector_length * 2
         whilelo p2.b, tmp1, n
         incb    tmp1
         whilelo p3.b, tmp1, n
-        b.last  1f
         ld1b    z0.b, p0/z, [src, #0, mul vl]
         ld1b    z1.b, p1/z, [src, #1, mul vl]
         ld1b    z2.b, p2/z, [src, #2, mul vl]
@@ -165,16 +166,16 @@
         whilelo p7.b, tmp1, n
         ld1b    z0.b, p0/z, [src, #0, mul vl]
         ld1b    z1.b, p1/z, [src, #1, mul vl]
-        ld1b    z2.b, p2/z, [src, #2, mul vl]
-        ld1b    z3.b, p3/z, [src, #3, mul vl]
+        ld1b    z2.b, p0/z, [src, #2, mul vl]
+        ld1b    z3.b, p0/z, [src, #3, mul vl]
         ld1b    z4.b, p4/z, [src, #4, mul vl]
         ld1b    z5.b, p5/z, [src, #5, mul vl]
         ld1b    z6.b, p6/z, [src, #6, mul vl]
         ld1b    z7.b, p7/z, [src, #7, mul vl]
         st1b    z0.b, p0, [dest, #0, mul vl]
         st1b    z1.b, p1, [dest, #1, mul vl]
-        st1b    z2.b, p2, [dest, #2, mul vl]
-        st1b    z3.b, p3, [dest, #3, mul vl]
+        st1b    z2.b, p0, [dest, #2, mul vl]
+        st1b    z3.b, p0, [dest, #3, mul vl]
         st1b    z4.b, p4, [dest, #4, mul vl]
         st1b    z5.b, p5, [dest, #5, mul vl]
         st1b    z6.b, p6, [dest, #6, mul vl]

> I changed the code [1] like the following diff.
>
> $ git diff
> diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> index 6d0ae1cd1f..2ae1f4e3b9 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> @@ -139,12 +139,13 @@
>  1:      // if rest > vector_length * 8
>          cmp     n, vector_length, lsl 3 // vector_length * 8
>          b.hi    \exit
> +        cmp     n, vector_length, lsl 2 // vector_length * 4
> +        b.hi    1f
>          // if rest <= vector_length * 4
>          lsl     tmp1, vector_length, 1  // vector_length * 2
>          whilelo p2.b, tmp1, n
>          incb    tmp1
>          whilelo p3.b, tmp1, n
> -        b.last  1f
>          ld1b    z0.b, p0/z, [src, #0, mul vl]
>          ld1b    z1.b, p1/z, [src, #1, mul vl]
>          ld1b    z2.b, p2/z, [src, #2, mul vl]
> @@ -155,6 +156,8 @@
>          st1b    z3.b, p3, [dest, #3, mul vl]
>          ret
>  1:      // if rest <= vector_length * 8
> +        ptrue   p2.b
> +        ptrue   p3.b
>          lsl     tmp1, vector_length, 2  // vector_length * 4
>          whilelo p4.b, tmp1, n
>          incb    tmp1

> > If all that doesn't help, it may be best to split into 256-384 and
> > 384-512 so you only need 2x WHILELO.
>
> This way [4] made the dip small as shown in Graph3 in Google Sheet [2].
> So it seems that this is the way we should take.
>
> [4] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86

Thanks.
Naohiro
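
P.S. For anyone following the predicate counting in this thread, below is a small Python model of it. It is only a sketch and not glibc code. It assumes a 64-byte vector length (512-bit SVE, which is what the "256-512" byte range above implies), and the helper names `whilelo`, `predicates`, and `partial_predicates` are made up for illustration. It shows which of p0..p7 are guaranteed all-true or all-false within a size range, and therefore which ones still need a WHILELO.

```python
VL = 64  # assumed vector length in bytes (512-bit SVE); an assumption of this sketch


def whilelo(start, n, vl=VL):
    """Model of 'whilelo pK.b, start, n': number of active byte lanes."""
    return max(0, min(vl, n - start))


def predicates(n, vl=VL):
    """Active lane counts for p0..p7, where pK covers bytes [K*vl, (K+1)*vl)."""
    return [whilelo(k * vl, n, vl) for k in range(8)]


def partial_predicates(lo, hi, vl=VL):
    """Indices of predicates that may be partial for some n in (lo, hi].

    A predicate K is known all-true if (K+1)*vl <= lo, and known all-false
    if K*vl >= hi. Only the remaining ones need a WHILELO in that size range.
    """
    return [k for k in range(8)
            if not (k + 1) * vl <= lo and not k * vl >= hi]


if __name__ == "__main__":
    print(predicates(300))               # lanes active per predicate for n = 300
    print(partial_predicates(256, 512))  # one combined 256-512 case
    print(partial_predicates(256, 384))  # split case, lower half
    print(partial_predicates(384, 512))  # split case, upper half
```

In this model the combined 256-512 case leaves four predicates (p4..p7) that depend on n, while each half of the 256-384 / 384-512 split leaves only two, which is consistent with the "2x WHILELO" remark quoted above. The predicates covering the first four vectors are always full in these ranges, which is why an already all-true predicate can stand in for them.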