From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-106032-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 21222 invoked by alias); 16 Oct 2019 16:20:31 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: "Zhangxuelei (Derek)" <zhangxuelei4@huawei.com>, Szabolcs Nagy
	<Szabolcs.Nagy@arm.com>, "libc-alpha@sourceware.org"
	<libc-alpha@sourceware.org>, "yikunkero@gmail.com" <yikunkero@gmail.com>
CC: nd <nd@arm.com>
Subject: Re: [PATCH 2/2] aarch64: Optimized memcpy and memmove for Kunpeng
 processor
Date: Wed, 16 Oct 2019 16:20:00 -0000
Message-ID:
 <VI1PR0801MB21271AA3ABE21DB26500415783920@VI1PR0801MB2127.eurprd08.prod.outlook.com>
References:
 <8DC571DDDE171B4094D3D33E9685917BD84F01@DGGEMI529-MBX.china.huawei.com>
In-Reply-To:
 <8DC571DDDE171B4094D3D33E9685917BD84F01@DGGEMI529-MBX.china.huawei.com>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-SW-Source: 2019-10/txt/msg00495.txt.bz2

Hi Derek,

> We do vary based on ThunderX2 for its good performance for large copies,
> which is needed by us firstly.

But it's not clear whether that is a win over the Falkor version according
to the memcpy-walk output. Consider e.g.:

                    __memcpy_thunderx   __memcpy_thunderx2  __memcpy_falkor     __memcpy_kunpeng    __memcpy_generic
     length=32768:  7629.73 (-46.76%)   4473.26 ( 13.95%)   3947.03 ( 24.08%)   5201.79 ( -0.06%)   5198.66
     length=32784:  7666.55 (-44.44%)   4705.93 ( 11.34%)   3782.04 ( 28.75%)   4759.99 ( 10.32%)   5307.92
     length=32769:  7589.72 (-42.41%)   4776.54 ( 10.37%)   3889.61 ( 27.02%)   4789.02 ( 10.14%)   5329.43
     length=32783:  7502.45 (-44.25%)   4688.97 (  9.85%)   3858.48 ( 25.81%)   4714.48 (  9.36%)   5201.10
     length=32770:  7438.46 (-43.39%)   4647.77 ( 10.41%)   3848.94 ( 25.81%)   4680.76 (  9.77%)   5187.71
     length=32782:  7225.10 (-38.35%)   4609.10 ( 11.74%)   3855.37 ( 26.17%)   4643.70 ( 11.08%)   5222.19
     length=32771:  7326.40 (-42.87%)   4587.85 ( 10.53%)   3828.78 ( 25.33%)   4580.34 ( 10.68%)   5127.85
     length=32781:  7261.12 (-41.38%)   4548.17 ( 11.44%)   3851.30 ( 25.01%)   4584.78 ( 10.73%)   5135.97
     length=32772:  7178.11 (-41.83%)   4510.12 ( 10.89%)   3802.44 ( 24.87%)   4521.19 ( 10.67%)   5061.20
     length=32780:  7186.99 (-42.34%)   4481.01 ( 11.25%)   3835.00 ( 24.00%)   4532.33 ( 10.23%)   5049.00
     length=32773:  7089.60 (-38.93%)   4482.70 ( 12.15%)   3830.79 ( 24.93%)   4487.18 ( 12.07%)   5102.88
     length=32779:  7076.27 (-25.65%)   4498.21 ( 20.13%)   3881.99 ( 31.07%)   5371.42 (  4.62%)   5631.73
     length=32774:  8362.27 (-48.81%)   5190.76 (  7.63%)   3845.61 ( 31.57%)   5176.08 (  7.89%)   5619.46
     length=32778:  8186.09 (-48.41%)   5109.29 (  7.37%)   3861.96 ( 29.98%)   5211.88 (  5.51%)   5515.68
     length=32775:  8186.51 (-49.43%)   5096.24 (  6.98%)   3846.12 ( 29.80%)   5128.88 (  6.38%)   5478.44
     length=32777:  8038.38 (-49.89%)   5001.09 (  6.75%)   3837.79 ( 28.44%)   5095.39 (  4.99%)   5362.88

Here the Falkor variant is 20-30% faster...

It doesn't help that the existing benchmarks don't report an average across
all the inputs for each ifunc...
A colleague is working on a script to visualise benchmark results in a graph,
which should make these comparisons much easier.
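
As a rough illustration of the kind of per-ifunc summary meant here, a
minimal Python sketch follows. It is not the script mentioned above and it
does not parse real benchtests output: the SAMPLE data is just the first
four rows of the memcpy-walk table above typed in by hand, and the names
SAMPLE, improvement and summarise are made up for this example. The
percentage is taken relative to __memcpy_generic, which matches the
figures in parentheses in the table.

```python
from statistics import mean

# Column order as in the benchmark tables in this mail.
IFUNCS = ["__memcpy_thunderx", "__memcpy_thunderx2", "__memcpy_falkor",
          "__memcpy_kunpeng", "__memcpy_generic"]

# Hand-copied sample: first four rows of the memcpy-walk table,
# length -> timings in IFUNCS order (lower is better).
SAMPLE = {
    32768: [7629.73, 4473.26, 3947.03, 5201.79, 5198.66],
    32784: [7666.55, 4705.93, 3782.04, 4759.99, 5307.92],
    32769: [7589.72, 4776.54, 3889.61, 4789.02, 5329.43],
    32783: [7502.45, 4688.97, 3858.48, 4714.48, 5201.10],
}


def improvement(timing, baseline):
    """Percentage by which `timing` beats `baseline` (negative = slower)."""
    return 100.0 * (baseline - timing) / baseline


def summarise(rows, baseline="__memcpy_generic"):
    """Return {ifunc: (mean timing, improvement of that mean vs baseline mean)}."""
    # Transpose the rows so that each ifunc maps to its column of timings.
    columns = dict(zip(IFUNCS, zip(*rows.values())))
    base_mean = mean(columns[baseline])
    return {name: (mean(vals), improvement(mean(vals), base_mean))
            for name, vals in columns.items()}


if __name__ == "__main__":
    for name, (avg, pct) in summarise(SAMPLE).items():
        print(f"{name:20s} {avg:10.2f} ({pct:6.2f}%)")
```

Note this compares the mean timings (a ratio of means) rather than
averaging the per-row percentages; which of the two is more useful is a
design choice.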

> And we do find the detect of ThunderX2 version for 96 to 2M bytes copy, at
> least when running on the Kunpeng arch, even the falkor version is not much
> better.

Well, it looks like the dst_unaligned code (which deals with a specific issue
on ThunderX2) is completely unnecessary on Kunpeng, since the unaligned cases
in e.g. Falkor and generic aren't slower than the aligned cases. So I'd
suggest removing this code: it adds a lot of code, making memcpy
unnecessarily large.

> Therefore, branch was written and we used generic copy, that 64 bytes loop,
> dst aligned, without prefetch, and it works. We also have simply tried Q
> register replacing X register in this branch, but it didn't make more sense.

Yes, using Q register copy is best on modern micro-architectures (hence the
idea to do this even in the generic version).

> And here is the result of memcpy-random benchmarks :

                   __memcpy_thunderx   __memcpy_thunderx2  __memcpy_falkor     __memcpy_kunpeng    __memcpy_generic
   max-size=4096:  32558.90 ( -2.08%)  31987.80 ( -0.29%)  30474.30 (  4.46%)  31666.30 (  0.72%)  31896.60
   max-size=8192:  31796.80 ( -1.18%)  31423.90 (  0.01%)  29974.40 (  4.62%)  30917.90 (  1.62%)  31427.40
  max-size=16384:  33122.80 ( -1.05%)  32058.30 (  2.20%)  30470.40 (  7.05%)  31727.90 (  3.21%)  32779.90
  max-size=32768:  32530.10 ( -1.22%)  31912.80 (  0.71%)  29960.80 (  6.78%)  31567.60 (  1.78%)  32139.40
  max-size=65536:  33373.60 ( -0.40%)  32476.30 (  2.30%)  30957.70 (  6.87%)  32137.00 (  3.32%)  33240.10

Well, these results show a very significant 4% win for the Falkor memcpy! It
seems strange to only optimize for large sizes when the vast majority of
copies in real code are very small (note that the distribution of sizes and
alignments for the random benchmark comes from SPEC).

+ENTRY_ALIGN (MEMMOVE, 6)
...
+	sub	tmp1, dstin, src
+	cmp	count, 512
+	ccmp	tmp1, count, 2, hi
+	b.lo	L(move_long)
+	cmp	count, 96
+	ccmp	tmp1, count, 2, hi
+	b.lo	L(move_middle)

This has the effect of slowing down all small memmoves and no-overlap
memmoves (i.e. 99% of calls).
Is there a reason to special-case 96-512? I don't see an obvious difference
between the cases; there is one extra prefetch, but it is outside the loop.
Even if it helps somehow, why not do the test for >512 in the move_long code?
That removes 4 instructions (3 and a NOP) from the memmove fallthrough path.
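
To make the suggestion concrete, here is a small Python model of the two
dispatch orders. It is only a sketch of the control flow, not the patch
or any glibc code: the label "fallthrough" and the function names are
invented for illustration, and the reading of the cmp/ccmp/b.lo sequence
(branch taken when count is above the threshold and the unsigned value
dstin - src is below count) is my interpretation of the snippet above.

```python
MASK64 = (1 << 64) - 1  # model 64-bit register wrap-around


def overlaps_forward(dst, src, count):
    """Model `sub tmp1, dstin, src` plus an unsigned `tmp1 < count` test."""
    return ((dst - src) & MASK64) < count


def dispatch_posted(dst, src, count):
    """Order used in the posted patch: two size/overlap tests on the entry path."""
    if count > 512 and overlaps_forward(dst, src, count):
        return "move_long"
    if count > 96 and overlaps_forward(dst, src, count):
        return "move_middle"
    return "fallthrough"  # hypothetical label for the common path


def dispatch_suggested(dst, src, count):
    """Suggested order: one test on the entry path, size split after the branch."""
    if count > 96 and overlaps_forward(dst, src, count):
        # The >512 check now runs only for calls that already left the common path.
        return "move_long" if count > 512 else "move_middle"
    return "fallthrough"


if __name__ == "__main__":
    src = 1 << 20
    for count in (64, 200, 600):
        for delta in (-64, 64):
            print(count, delta,
                  dispatch_posted(src + delta, src, count),
                  dispatch_suggested(src + delta, src, count))
```

Both functions select the same path for every input; the difference is
only that the common case executes one compare-and-branch sequence
instead of two.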

Btw do you have any plans to post other string functions that you can discuss
here? If so, would these add more ifuncs or improve the generic versions?

Cheers,
Wilco