From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wilco Dijkstra
To: "Zhangxuelei (Derek)", Szabolcs Nagy, "libc-alpha@sourceware.org", "yikunkero@gmail.com", jiangyikun
CC: nd
Subject: Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
Date: Tue, 29 Oct 2019 15:34:00 -0000
References: <8DC571DDDE171B4094D3D33E9685917BD854D1@DGGEMI529-MBX.china.huawei.com>
In-Reply-To: <8DC571DDDE171B4094D3D33E9685917BD854D1@DGGEMI529-MBX.china.huawei.com>
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
X-SW-Source: 2019-10/txt/msg00891.txt.bz2

Hi Derek,

>> Well these results show a very significant 4% win for Falkor memcpy! It seems strange to only optimize
>> for large sizes when the vast majority of copies in real code are very small (note the distribution of the
>> sizes and alignment for the random benchmark come from SPEC).
>
> Sure, we agree the falkor memcpy has 4% win on small size. However, at the beginning we start to Kunpeng
> optimized the memcpy, one of the most important case is database case, which really need more improvement
> on large size.

Being fast on small copies is not mutually exclusive with being fast on large copies - there are
different code paths for these cases. I'd find it hard to believe databases do huge memcopies,
that would be stupid!

> And what confusing us now is that, we removed dst_unaligned code in memcpy according to the previous comments,
> which did not affect performance after testing in memcpy cases. But in the case when uses memmove function and
> enters the memcpy part, unaligned cases is significantly slower than aligned case according to the results of the first
> half part of memmove-walk as shown in the bottom. So do you think we should still remove dst_unaligned code?

Well it seems to me the issue is related to prefetching/caching. memcpy-walk walks both src and dst
backwards, memmove-walk is pretty much identical for the non-overlap case but it does a forward walk
on dst and a backward one on src. There shouldn't be any performance difference between the two cases.
On most microarchitectures I see no difference between these walks, memmove-walk and memcpy-walk
basically have identical performance both for aligned and unaligned cases (alignment doesn't matter
for large copies at all).

> We analyse the reason is of more judgement in the begin of memmove and may weak processor ability to handle
> this case, and so dst_unaligned make difference.

The extra instructions at the start couldn't possibly make a difference apart from slowing down
the small memmoves.

Wilco
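
For reference, the two walk patterns described above can be modelled as sequences of (src, dst) offsets. This is only an illustrative sketch: the buffer layout (src in the lower half, dst in the upper half), the sizes, and the helper name walk_offsets are assumptions made for the example and are not taken from the glibc benchtests sources. It shows that, in the non-overlapping case, both walks touch the same source and destination blocks and differ only in the order in which dst is visited.

```python
def walk_offsets(buf_size, copy_len, src_forward, dst_forward):
    """Yield (src, dst) byte offsets for successive non-overlapping copies.

    Hypothetical layout: src blocks live in the lower half of the buffer,
    dst blocks in the upper half. Each side is walked forwards or backwards.
    """
    half = buf_size // 2
    count = half // copy_len          # number of copies that fit in one half
    last = (count - 1) * copy_len     # offset of the last block within a half
    for i in range(count):
        step = i * copy_len
        src = step if src_forward else last - step
        dst = half + (step if dst_forward else last - step)
        yield src, dst


# memcpy-walk as described in the mail: src and dst both walk backwards.
memcpy_walk = list(walk_offsets(64, 16, src_forward=False, dst_forward=False))

# memmove-walk (non-overlap case): src walks backwards, dst walks forwards.
memmove_walk = list(walk_offsets(64, 16, src_forward=False, dst_forward=True))

print("memcpy-walk: ", memcpy_walk)
print("memmove-walk:", memmove_walk)

# Both walks cover exactly the same source and destination blocks.
same_src = {s for s, _ in memcpy_walk} == {s for s, _ in memmove_walk}
same_dst = {d for _, d in memcpy_walk} == {d for _, d in memmove_walk}
print("same blocks touched:", same_src and same_dst)
```

Since the same blocks are copied either way, any timing gap between the two walks would have to come from how the hardware handles the access order (for example prefetching), not from the amount of work done.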