From: Wilco Dijkstra
To: "Zhangxuelei (Derek)", Szabolcs Nagy, "libc-alpha@sourceware.org", "yikunkero@gmail.com"
CC: nd
Subject: Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng processor
Date: Wed, 16 Oct 2019 16:20:00 -0000
In-Reply-To: <8DC571DDDE171B4094D3D33E9685917BD84F01@DGGEMI529-MBX.china.huawei.com>

Hi Derek,

> We do vary based on ThunderX2 for its good performance for large copies, which is needed by us firstly.

But it's not clear whether that is a win over the Falkor
version according to memcpy-walk output. Consider e.g.:

               __memcpy_thunderx   __memcpy_thunderx2  __memcpy_falkor     __memcpy_kunpeng    __memcpy_generic
length=32768:  7629.73 (-46.76%)   4473.26 ( 13.95%)   3947.03 ( 24.08%)   5201.79 ( -0.06%)   5198.66
length=32784:  7666.55 (-44.44%)   4705.93 ( 11.34%)   3782.04 ( 28.75%)   4759.99 ( 10.32%)   5307.92
length=32769:  7589.72 (-42.41%)   4776.54 ( 10.37%)   3889.61 ( 27.02%)   4789.02 ( 10.14%)   5329.43
length=32783:  7502.45 (-44.25%)   4688.97 (  9.85%)   3858.48 ( 25.81%)   4714.48 (  9.36%)   5201.10
length=32770:  7438.46 (-43.39%)   4647.77 ( 10.41%)   3848.94 ( 25.81%)   4680.76 (  9.77%)   5187.71
length=32782:  7225.10 (-38.35%)   4609.10 ( 11.74%)   3855.37 ( 26.17%)   4643.70 ( 11.08%)   5222.19
length=32771:  7326.40 (-42.87%)   4587.85 ( 10.53%)   3828.78 ( 25.33%)   4580.34 ( 10.68%)   5127.85
length=32781:  7261.12 (-41.38%)   4548.17 ( 11.44%)   3851.30 ( 25.01%)   4584.78 ( 10.73%)   5135.97
length=32772:  7178.11 (-41.83%)   4510.12 ( 10.89%)   3802.44 ( 24.87%)   4521.19 ( 10.67%)   5061.20
length=32780:  7186.99 (-42.34%)   4481.01 ( 11.25%)   3835.00 ( 24.00%)   4532.33 ( 10.23%)   5049.00
length=32773:  7089.60 (-38.93%)   4482.70 ( 12.15%)   3830.79 ( 24.93%)   4487.18 ( 12.07%)   5102.88
length=32779:  7076.27 (-25.65%)   4498.21 ( 20.13%)   3881.99 ( 31.07%)   5371.42 (  4.62%)   5631.73
length=32774:  8362.27 (-48.81%)   5190.76 (  7.63%)   3845.61 ( 31.57%)   5176.08 (  7.89%)   5619.46
length=32778:  8186.09 (-48.41%)   5109.29 (  7.37%)   3861.96 ( 29.98%)   5211.88 (  5.51%)   5515.68
length=32775:  8186.51 (-49.43%)   5096.24 (  6.98%)   3846.12 ( 29.80%)   5128.88 (  6.38%)   5478.44
length=32777:  8038.38 (-49.89%)   5001.09 (  6.75%)   3837.79 ( 28.44%)   5095.39 (  4.99%)   5362.88

Here the Falkor variant is 20-30% faster... It doesn't help that the existing benchmarks don't report
an average across all the inputs for each ifunc... A colleague is working on a script to visualise
benchmark results in a graph, which should make these comparisons much easier.
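To illustrate the kind of per-ifunc average I mean, here is a minimal Python sketch. It assumes the benchmark output has already been reduced to one timing per ifunc per length; the helper names `mean_times` and `speedup_vs` are made up for this example and are not part of benchtests. The three sample rows are copied from the memcpy-walk output above.

```python
from statistics import mean

# Column order as printed by the benchmark; the last column is the baseline.
IFUNCS = ["thunderx", "thunderx2", "falkor", "kunpeng", "generic"]

# Three sample rows from the memcpy-walk output above (one timing per ifunc).
ROWS = {
    32768: [7629.73, 4473.26, 3947.03, 5201.79, 5198.66],
    32784: [7666.55, 4705.93, 3782.04, 4759.99, 5307.92],
    32769: [7589.72, 4776.54, 3889.61, 4789.02, 5329.43],
}


def mean_times(rows):
    """Return the arithmetic mean timing of each ifunc across all lengths."""
    columns = zip(*rows.values())  # transpose: one tuple of timings per ifunc
    return {name: mean(col) for name, col in zip(IFUNCS, columns)}


def speedup_vs(means, baseline="generic"):
    """Return each ifunc's improvement over the baseline, in percent.

    Uses the same sign convention as the table: positive means faster.
    """
    base = means[baseline]
    return {name: 100.0 * (base - t) / base for name, t in means.items()}


if __name__ == "__main__":
    means = mean_times(ROWS)
    for name, pct in speedup_vs(means).items():
        print(f"__memcpy_{name}: {means[name]:9.2f} ({pct:6.2f}%)")
```

An arithmetic mean is only one option; a geometric mean may be more appropriate if the inputs span very different sizes.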
> And we do find the detect of ThunderX2 version for 96 to 2M bytes copy, at least when running on the
> Kunpeng arch, even the falkor version is not much better.

Well, it looks like the dst_unaligned code (which deals with a specific issue on ThunderX2) is completely
unnecessary on Kunpeng, since the unaligned cases in e.g. Falkor and generic aren't slower than the
aligned cases. So I'd suggest removing this code - it adds a lot of code, making memcpy unnecessarily large.

> Therefore, branch was written and we used generic copy, that 64 bytes loop, dst aligned, without prefetch,
> and it works. We also have simply tried Q register replacing X register in this branch, but it didn't make more sense.

Yes, using Q register copy is best on modern micro-architectures (hence the idea to do this even in
the generic version).

> And here is the result of memcpy-random benchmarks :

                 __memcpy_thunderx   __memcpy_thunderx2  __memcpy_falkor     __memcpy_kunpeng    __memcpy_generic
max-size=4096:   32558.90 ( -2.08%)  31987.80 ( -0.29%)  30474.30 (  4.46%)  31666.30 (  0.72%)  31896.60
max-size=8192:   31796.80 ( -1.18%)  31423.90 (  0.01%)  29974.40 (  4.62%)  30917.90 (  1.62%)  31427.40
max-size=16384:  33122.80 ( -1.05%)  32058.30 (  2.20%)  30470.40 (  7.05%)  31727.90 (  3.21%)  32779.90
max-size=32768:  32530.10 ( -1.22%)  31912.80 (  0.71%)  29960.80 (  6.78%)  31567.60 (  1.78%)  32139.40
max-size=65536:  33373.60 ( -0.40%)  32476.30 (  2.30%)  30957.70 (  6.87%)  32137.00 (  3.32%)  33240.10

Well, these results show a very significant 4-7% win for the Falkor memcpy! It seems strange to only
optimize for large sizes when the vast majority of copies in real code are very small (note that the
distribution of the sizes and alignment for the random benchmark comes from SPEC).

+ENTRY_ALIGN (MEMMOVE, 6)
...
+   sub     tmp1, dstin, src
+   cmp     count, 512
+   ccmp    tmp1, count, 2, hi
+   b.lo    L(move_long)
+   cmp     count, 96
+   ccmp    tmp1, count, 2, hi
+   b.lo    L(move_middle)

This has the effect of slowing down all small memmoves and no-overlap memmoves (i.e. 99% of calls).
Is there a reason to special-case 96-512? I don't see an obvious difference between the cases; there is
one extra prefetch, but it is outside the loop. Even if it helps somehow, why not do the test for >512 in
the move_long code? That removes 4 instructions (3 and a NOP) from the memmove fallthrough path.

Btw, do you have any plans to post other string functions that you can discuss here? If so, would these
add more ifuncs or improve the generic versions?

Cheers,
Wilco
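
P.S. For reference, the entry sequence quoted above can be modelled roughly as follows. This is a sketch in Python of my reading of the cmp/ccmp/b.lo pairs, not code from the patch: the function names and the "fallthrough" label are made up for illustration, and the 64-bit mask stands in for the width of an X register.

```python
MASK64 = (1 << 64) - 1  # model the wrap-around of a 64-bit X register


def overlaps_forward(dst, src, count):
    """Model `sub tmp1, dstin, src` followed by an unsigned tmp1 < count test.

    True when dst lies in [src, src + count), i.e. a forward copy would
    overwrite source bytes before they have been read.
    """
    tmp1 = (dst - src) & MASK64  # a negative difference wraps to a huge value
    return tmp1 < count


def dispatch(dst, src, count):
    """Return the (illustrative) label the quoted sequence would reach."""
    overlap = overlaps_forward(dst, src, count)
    # cmp count, 512 ; ccmp tmp1, count, 2, hi ; b.lo L(move_long)
    # If count <= 512 the ccmp sets the carry flag, so b.lo is not taken.
    if count > 512 and overlap:
        return "move_long"
    # cmp count, 96 ; ccmp tmp1, count, 2, hi ; b.lo L(move_middle)
    if count > 96 and overlap:
        return "move_middle"
    # Small or non-overlapping moves still pay for both tests above.
    return "fallthrough"
```

For example, `dispatch(0x1000, 0x2000, 4096)` and `dispatch(0x2010, 0x2000, 64)` both reach the fallthrough path, but only after both compare pairs have executed, which is the cost I am pointing out above.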