From mboxrd@z Thu Jan 1 00:00:00 1970
From: "naohirot@fujitsu.com"
To: 'Wilco Dijkstra'
CC: 'GNU C Library', Szabolcs Nagy
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 19 Apr 2021 02:51:39 +0000
List-Id: Libc-alpha mailing list

Hi Wilco-san,

Let me focus in this mail on the macro "shortcut_for_small_size" for
small/medium sizes, that is, less than 512 bytes.

> From: Wilco Dijkstra
> > Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> > SVE code seems faster than ASIMD in small/medium range too [11][12][13].
>
> That adds quite a lot of code and uses a slow linear chain of comparisons. A small
> loop like used in the memset should work fine to handle copies smaller than
> 256 or 512 bytes (you can handle the zero bytes case for free in this code rather
> than special casing it).
>

I compared the performance for sizes less than 512 bytes across the
following five implementation cases.

CASE 1: linear chain
As mentioned in the reply [0], I removed BTI_J [1], but the macro
"shortcut_for_small_size" remains a linear chain [2].
A64FX performance is 4-14 Gbps [3].
The other arch implementations call BTI_J, so performance is degraded.

[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125079.html
[1] https://github.com/NaohiroTamura/glibc/commit/7d7217b518e59c78582ac4e89cae725cf620877e
[2] https://github.com/NaohiroTamura/glibc/blob/7d7217b518e59c78582ac4e89cae725cf620877e/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[3] https://drive.google.com/file/d/16qo7N05W526H9j7_9qjm-_Q7gZmOXwpY/view

CASE 2: whilelt loop as in memset
I tested a "whilelt loop" implementation in place of the macro
"shortcut_for_small_size". After testing, I commented out the
"whilelt loop" implementation [4].
Compared with CASE 1, A64FX performance degraded from 4-14 Gbps to
3-10 Gbps [5].
Please note that the "whilelt loop" implementation cannot be used for
memmove, because it doesn't work for backward copy.
On the other hand, the macro "shortcut_for_small_size" works for backward
copy, because it loads all of the data, up to 512 bytes, into the z0 to z7
SVE registers at once, and then stores all of the data.
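
To illustrate the backward-copy point, here is a minimal C sketch, not the
actual SVE code. It assumes a 64-byte chunk standing in for one SVE register
and models the "whilelt loop" as a forward loop that loads and stores one
chunk per iteration. The names copy_forward_loop, copy_load_all_then_store,
CHUNK and MAX_SMALL are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 64       /* assumed bytes per vector register (illustrative) */
#define MAX_SMALL 512  /* eight chunks, mirroring z0 to z7 */

/* Model of a forward loop: load one chunk, store it, then advance.
   If dst overlaps src at a higher address, an early store can overwrite
   source bytes that a later iteration has not loaded yet. */
static void copy_forward_loop(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    unsigned char v[CHUNK];
    for (size_t i = 0; i < n; i += CHUNK) {
        size_t len = (n - i < CHUNK) ? n - i : CHUNK;
        memcpy(v, src + i, len);   /* "load" one chunk */
        memcpy(dst + i, v, len);   /* "store" one chunk */
    }
}

/* Model of the load-all-then-store-all approach: every source byte is read
   before any destination byte is written, so overlap in either direction
   is harmless. The caller must keep n within MAX_SMALL. */
static void copy_load_all_then_store(unsigned char *dst,
                                     const unsigned char *src, size_t n)
{
    unsigned char regs[MAX_SMALL];
    assert(n <= MAX_SMALL);
    memcpy(regs, src, n);          /* load everything first */
    memcpy(dst, regs, n);          /* then store everything */
}
```

With a buffer holding the values 0, 1, 2, ... and a 128-byte copy from offset
0 to offset 32, the forward loop overwrites source bytes 32 to 95 before its
second iteration loads them, while the load-all variant produces the same
result as memmove.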

[4] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR308-R318
[5] https://drive.google.com/file/d/1xdw7mr0c90VupVkQwelFafQHNkXslCwv/view

CASE 3: binary tree chain
I updated the macro "shortcut_for_small_size" to use a binary tree chain
[6][7].
Compared with CASE 1, sizes less than 96 bytes degraded from 4.0-6.0 Gbps
to 2.5-5.0 Gbps, but the 512 byte size improved from 14.0 Gbps to
17.5 Gbps [8].

[6] https://github.com/NaohiroTamura/glibc/commit/5c17af8c57561ede5ed2c2af96c9efde4092f02f
[7] https://github.com/NaohiroTamura/glibc/blob/5c17af8c57561ede5ed2c2af96c9efde4092f02f/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L177-L204
[8] https://drive.google.com/file/d/13w8yKdeLpVbp-uJmCttKBKtScya1tXqP/view

CASE 4: binary tree chain except up to 64 bytes
I handled sizes up to 64 bytes separately so that they return quickly [9].
Compared with CASE 3, sizes less than 64 bytes improved from 2.5 Gbps to
4.0 Gbps, but the 512 byte size degraded from 17.5 Gbps to 16.5 Gbps [10].

[9] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR177-R184
[10] https://drive.google.com/file/d/1lFsjns9g_7fySAsvx_RVS9o6HSrk6ir9/view

CASE 5: binary tree chain except up to 128 bytes
I handled sizes up to 128 bytes separately so that they return quickly [11].
Compared with CASE 4, sizes less than 128 bytes improved from 4.0-6.0 Gbps
to 4.0-7.0 Gbps, but the 512 byte size degraded from 16.5 Gbps to
16.0 Gbps [12].
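
The trade-off between the linear chain of CASE 1 and the binary tree chain
of CASE 3 can be pictured by counting size comparisons. The following C
sketch is only a model under stated assumptions, not the macro itself: it
assumes eight buckets of 64 bytes (one per register z0 to z7) and a linear
chain that tests the smallest bucket first. The names dispatch_linear,
dispatch_tree and VL are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VL 64  /* assumed bytes per SVE register (illustrative) */

/* Linear chain: test n <= 64, n <= 128, ... in order.
   Returns the number of vectors needed (1..8), or 0 if n > 512.
   *cmps receives the number of comparisons performed. */
static int dispatch_linear(size_t n, int *cmps)
{
    *cmps = 0;
    for (int k = 1; k <= 8; k++) {
        ++*cmps;
        if (n <= (size_t)k * VL)
            return k;
    }
    return 0;  /* not a small size */
}

/* Binary tree: halve the candidate range with each comparison, so every
   size in 1..512 costs exactly three comparisons. */
static int dispatch_tree(size_t n, int *cmps)
{
    int lo = 1, hi = 8;
    assert(n >= 1 && n <= 8 * VL);
    *cmps = 0;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        ++*cmps;
        if (n <= (size_t)mid * VL)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}
```

In this model a 1-byte copy costs one comparison in the linear chain but
three in the tree, while a 512-byte copy costs eight in the linear chain but
three in the tree. That is consistent with the direction of the measured
changes. Handling the smallest buckets before entering the tree, as in CASE 4
and CASE 5, moves cost back from the small sizes to the large ones.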

[11] https://github.com/NaohiroTamura/glibc/commit/fefc59f01ecfd6a207fe261de5ab133f4409d687#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR184-R195
[12] https://drive.google.com/file/d/1HS277_qQUuEeZqLUo0H2XRlFhOhIdI_o/view

In conclusion, I'd like to adopt the CASE 5 implementation, considering the
performance balance between small sizes (less than 128 bytes) and medium
sizes (close to 512 bytes).

Thanks.
Naohiro