From: Wilco Dijkstra
To: "naohirot@fujitsu.com"
CC: 'GNU C Library', Szabolcs Nagy
Subject: Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Wed, 14 Apr 2021 16:02:50 +0000
List-Id: Libc-alpha mailing list

Hi Naohiro,

Thanks for the comprehensive reply, especially the graphs are quite useful!
(I'd avoid adding generic_memcpy/memmove though since those are unoptimized C
implementations).

> OK, I'll try to remove unnecessary code which doesn't contribute performance gain
> based on benchtests performance data.

Yes that is a good idea - you could also check whether the software pipelining actually
helps on an OoO core (it shouldn't) since that contributes a lot to the complexity and the
amount of code and unrolling required.

It is also possible to remove a lot of unnecessary code - eg. rather than use 2 instructions
per prefetch, merge the constant offset in the prefetch instruction itself (since they allow
up to 32KB offset). There are also lots of branches that skip a few instructions if a value is
zero, this is often counterproductive due to adding branch mispredictions.

> Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> This unroll configuration recorded the highest performance.

> In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> The performance degraded minus 5 to 15 Gbps/sec at the peak.

So this is the L(L1_vl_64) loop right? I guess the problem is the large number of
prefetches and all the extra code that is not strictly required (you can remove 5
redundant mov/cmp instructions from the loop). Also assuming prefetching helps
here (the good memmove results suggest it's not needed), prefetching directly
into L1 should be better than first into L2 and then into L1. So I don't see a good
reason why 4x unrolling would have to be any slower.

> Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> SVE code seems faster than ASIMD in small/medium range too [11][12][13].

That adds quite a lot of code and uses a slow linear chain of comparisons.
A small loop like used in the memset should work fine to handle copies smaller
than 256 or 512 bytes (you can handle the zero bytes case for free in this code
rather than special casing it).

> For small/medium copies, I needed to remove BTI macro from ASM ENTRY in order
> to see the distinct performance difference between ASIMD and SVE.
> I'll post the patch [14] with the A64FX second patch.

I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless. We always emit
it so that it works seamlessly when BTI is enabled.

> And also somehow on A64FX as well as on ThunderX2 machine, memcpy-random
> didn't start due to mprotect error.

Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra space, so
changing +4096 into getpagesize () will work.

> Without DC_VZA and L2 prefetch, memcpy and memset performance degraded over 4MB.

> DC_VZA and L2 prefetch have to be pair, only DC_VZA or only L2 prefetch doesn't get any improvement.

That seems odd. Was that using the L1 prefetch with the L2 distance? It seems to me one of the L1 or L2
prefetches is unnecessary. Also why would the DC_ZVA need to be done so early? It seems to me that
cleaning the cacheline just before you write it works best since that avoids accidentally replacing it.

> Without DC_VZA and L2 prefetch, memmove didn't degraded over 4MB.
>
> The reason why I didn't implement DC_VZA and L2 prefetch is that memmove calls memcpy in
> most cases, and memmove code only handles backward copy.
> Maybe most of memmove-large benchtest cases are backward copy, I need to check.

Most of the memmove tests do indeed overlap (so DC_ZVA does not work). However it also shows
that it performs well across the L2 cache size range without any prefetch or DC_ZVA.

Cheers,
Wilco
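
PS: On the mprotect failure mentioned above, here is a minimal sketch of the
idea, not the actual benchtest code. It assumes the buffer length is built by
adding a fixed slack to the requested size; the helper names
round_up_to_page and size_with_slack are hypothetical. Replacing a hard-coded
4096 with getpagesize () and rounding the total up keeps the length a whole
number of pages, which matters on systems with 64KB pages since mprotect
works on page boundaries.

```c
#define _DEFAULT_SOURCE 1   /* make getpagesize () visible under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: round SIZE up to the next multiple of PAGE.
   PAGE must be a power of two, which holds for real page sizes.  */
static size_t
round_up_to_page (size_t size, size_t page)
{
  assert (page != 0 && (page & (page - 1)) == 0);
  return (size + page - 1) & ~(page - 1);
}

/* Hypothetical helper: requested size plus one page of slack, with the
   slack taken from getpagesize () instead of a hard-coded 4096, and the
   total rounded up to a whole number of pages.  */
static size_t
size_with_slack (size_t size)
{
  size_t page = (size_t) getpagesize ();
  return round_up_to_page (size + page, page);
}
```

With a length computed this way, the end of the buffer always falls on a page
boundary, whatever the page size of the machine running the benchmark.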