From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-63525-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 2025 invoked by alias); 28 Sep 2015 09:35:33 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 2013 invoked by uid 89); 28 Sep 2015 09:35:32 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
From: "Wilco Dijkstra" <wdijkstr@arm.com>
To: =?iso-8859-2?Q?'Ond=F8ej_B=EDlka'?= <neleai@seznam.cz>
Cc: "'GNU C Library'" <libc-alpha@sourceware.org>
References: <002901d0f794$66138480$323a8d80$@com> <20150927084319.GA22368@domone>
In-Reply-To: <20150927084319.GA22368@domone>
Subject: RE: [PATCH][AArch64] Optimized memcpy/memmove
Date: Mon, 28 Sep 2015 09:35:00 -0000
Message-ID: <003701d0f9d0$fe013470$fa039d50$@com>
MIME-Version: 1.0
X-MC-Unique: 4wIJrdUcTl-p1qyoc0oedg-1
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2015-09/txt/msg00679.txt.bz2

> Ondřej Bílka wrote:
> On Fri, Sep 25, 2015 at 02:16:33PM +0100, Wilco Dijkstra wrote:
> > Further optimize memcpy/memmove for AArch64. Copies are split into 3 main cases: small
> > copies of up to 16 bytes, medium copies of 17..96 bytes which are fully unrolled. Large
> > copies of more than 96 bytes align the destination and use an unrolled loop processing
> > 64 bytes per iteration. In order to share code with memmove, small and medium copies
> > read all data before writing, allowing any kind of overlap. All memmoves except for the
> > large backwards case fall into memcpy for optimal performance.
> > On a random copy test memcpy/memmove are 40% faster on A57 and 28% on A53.
> >
>
> Looks OK at a high level. I didn't inspect this patch in detail, but you
> should test it with dryrun to see the real impact on performance.

Thanks. I haven't looked at dryrun yet, but that is on my todo list. With more
accurate stats it may be possible to tweak it a bit further.

> I would here simply alias memcpy to memmove as there is minimal
> performance impact when you do check only for sizes larger than 96
> bytes.

That is indeed an option. However, the entry check for memmove takes 1-2 cycles
on most CPUs, and it means more executed branches and a larger I-cache footprint
for memcpy, so I'd have to be absolutely sure it doesn't slow down memcpy.
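
To make the trade-off concrete, here is a rough C sketch of the scheme
described above. It is only an illustration, not the patch itself: the real
routines are AArch64 assembly, and the names sketch_memmove, copy_upto_96,
copy_large_forward and copy_large_backward are made up for this example. The
96-byte threshold and the 64-byte loop step come from the patch description;
the temporary buffers merely stand in for registers, and the destination
alignment step is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Small/medium case (hypothetical helper): read everything, then write
   everything. Because no store happens before the last load, any overlap
   between src and dst is harmless. */
static void copy_upto_96(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char tmp[96];      /* stands in for registers */
    memcpy(tmp, src, n);        /* all loads first */
    memcpy(dst, tmp, n);        /* then all stores */
}

/* Large forward copy, 64 bytes per iteration. Safe when dst is below src
   or the buffers do not overlap. */
static void copy_large_forward(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
    unsigned char tmp[64];
    size_t i = 0;
    for (; i + 64 <= n; i += 64) {
        memcpy(tmp, src + i, 64);
        memcpy(dst + i, tmp, 64);
    }
    copy_upto_96(dst + i, src + i, n - i);   /* tail of fewer than 64 bytes */
}

/* Large backward copy, used only when dst lies inside [src, src + n). */
static void copy_large_backward(unsigned char *dst, const unsigned char *src,
                                size_t n)
{
    unsigned char tmp[64];
    size_t end = n;
    for (; end >= 64; end -= 64) {
        memcpy(tmp, src + end - 64, 64);
        memcpy(dst + end - 64, tmp, 64);
    }
    copy_upto_96(dst, src, end);             /* head of fewer than 64 bytes */
}

/* Sketch of memmove: the overlap check is only reached for n > 96. */
void *sketch_memmove(void *dstv, const void *srcv, size_t n)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;

    if (n <= 96) {
        copy_upto_96(dst, src, n);
        return dstv;
    }
    /* Entry check: the unsigned difference is below n exactly when
       src <= dst < src + n, the one case that needs a backward copy. */
    if ((uintptr_t)dst - (uintptr_t)src < n)
        copy_large_backward(dst, src, n);
    else
        copy_large_forward(dst, src, n);
    return dstv;
}
```

Aliasing memcpy to memmove would mean every large memcpy also executes the
compare and branch in sketch_memmove, which is the extra cost in question.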

Wilco