From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-ports-return-4051-listarch-libc-ports=sources.redhat.com@sourceware.org>
Received: (qmail 15630 invoked by alias); 18 Apr 2013 11:56:35 -0000
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-ports.sourceware.org>
List-Subscribe: <mailto:libc-ports-subscribe@sourceware.org>
List-Post: <mailto:libc-ports@sourceware.org>
List-Help: <mailto:libc-ports-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-ports-owner@sourceware.org
Received: (qmail 15620 invoked by uid 89); 18 Apr 2013 11:56:34 -0000
X-Spam-SWARE-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,SPF_NEUTRAL,TW_CP autolearn=no version=3.3.1
Received: from popelka.ms.mff.cuni.cz (HELO popelka.ms.mff.cuni.cz) (195.113.20.131)    by sourceware.org (qpsmtpd/0.84/v0.84-167-ge50287c) with ESMTP; Thu, 18 Apr 2013 11:56:33 +0000
Received: from domone.kolej.mff.cuni.cz (popelka.ms.mff.cuni.cz [195.113.20.131])	by popelka.ms.mff.cuni.cz (Postfix) with ESMTPS id 599A349A37;	Thu, 18 Apr 2013 13:56:28 +0200 (CEST)
Received: by domone.kolej.mff.cuni.cz (Postfix, from userid 1000)	id B07206046E; Thu, 18 Apr 2013 13:56:04 +0200 (CEST)
Date: Thu, 18 Apr 2013 11:56:00 -0000
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: Will Newton <will.newton@linaro.org>
Cc: =?iso-8859-1?Q?M=E5ns_Rullg=E5rd?= <mans@mansr.com>,	libc-ports@sourceware.org, Patch Tracking <patches@linaro.org>
Subject: Re: [PATCH] ARM: Add Cortex-A15 optimized NEON and VFP memcpy routines, with IFUNC.
Message-ID: <20130418115604.GA31357@domone.kolej.mff.cuni.cz>
References: <516BCEE5.9070809@linaro.org> <yw1x8v4k6rcc.fsf@unicorn.mansr.com> <CANu=DmjJUZ319+7_M8cyxMga_rYxbGb_QSs87Q29JBdkKX_97g@mail.gmail.com> <20130418093900.GA3653@domone.kolej.mff.cuni.cz> <CANu=DmiVS4y6Cmdw_K8Gpbp=LjkaQ8Pf6eDvjBfsTcKLmcue3g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CANu=DmiVS4y6Cmdw_K8Gpbp=LjkaQ8Pf6eDvjBfsTcKLmcue3g@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-SW-Source: 2013-04/txt/msg00092.txt.bz2

On Thu, Apr 18, 2013 at 10:47:26AM +0100, Will Newton wrote:
> On 18 April 2013 10:39, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Mon, Apr 15, 2013 at 11:38:49AM +0100, Will Newton wrote:
> >> On 15 April 2013 11:06, Måns Rullgård <mans@mansr.com> wrote:
> >>
> >> Hi Måns,
> >>
> >> >> Add a high performance memcpy routine optimized for Cortex-A15 with
> >> >> variants for use in the presence of NEON and VFP hardware, selected
> >> >> at runtime using indirect function support.
> >> >
> >> > How does this perform on Cortex-A9?
> >>
> >> The code is also faster on A9 although the gains are not quite as
> >> pronounced. A set of numbers is attached (they linewrap pretty
> >> horribly inline).
> >>
> >>
> > I forget to ask where to get benchmark source. Without it there is no
> > way to tell if it was done correctly.
> > You must randomly vary sizes in range n..2n and also vary alignments.
> 
> The benchmark is taken from the cortex-strings package:
> 
> https://launchpad.net/cortex-strings
> 
> I wrote a wrapper around the benchmark to vary alignment in {1, 2, 4,
> 8} and a variety of block lengths between 8 and 200.
> 
Could you post the wrapper?

The only thing I could find there is the following, if that is what you meant:
http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/tests/test-memcpy.c

If that is the case, then the benchmark contains several serious mistakes, and the data
generated by it cannot be accepted.

I have attached a modification of a simple benchmark used by gcc. Could you try
it and post the results, to be sure?

First, put your NEON implementation into the file neon.s, with the
function name memcpy_neon.
Then run
./memcpy_test 64 6000000000 gcc 

The mistakes in http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/tests/test-memcpy.c
follow.

The first is that the original benchmark does not vary sizes and alignments.

The second is that the timing is done in a loop over the same data (see the code below).
Even if you vary the lengths, this loop will undo all your work on
randomizing inputs.
Every branch becomes predicted. All data is kept in cache.

These conditions make the measured performance far from the performance on real
inputs.

  for (i = 0; i < 32; ++i)
    {
      HP_TIMING_NOW (start);
      CALL (impl, dst, src, len);
      HP_TIMING_NOW (stop);
      HP_TIMING_BEST (best_time, start, stop);
    }
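
For contrast with the loop above, here is a rough sketch of a timing loop
that avoids the first two problems. It is not the attached benchmark and
not code from cortex-strings; every name in it (memcpy_fn, copy_args,
random_args, bench_random) is made up for illustration, the 16 MiB buffer
size is an arbitrary choice, and it assumes a POSIX system for
clock_gettime. Lengths are drawn from n..2n, source and destination
offsets are drawn at random (which varies both alignment and the cache
lines touched), and the clock is read once around the whole run, so the
result is an average.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime (POSIX assumption) */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical names throughout; none of this comes from cortex-strings.  */
typedef void *(*memcpy_fn) (void *, const void *, size_t);

struct copy_args { size_t len, src_off, dst_off; };

/* Draw the arguments for one call: length uniform in n..2n, and
   independent source and destination offsets inside a buffer of BUF
   bytes.  Random offsets vary the alignment as well as the location.  */
static struct copy_args
random_args (size_t n, size_t buf)
{
  struct copy_args a;
  size_t room = buf - 2 * n;  /* offsets that keep any copy in bounds */
  a.len = n + (size_t) rand () % (n + 1);
  a.src_off = (size_t) rand () % room;
  a.dst_off = (size_t) rand () % room;
  return a;
}

/* Average nanoseconds per call of FN over CALLS randomized copies, or a
   negative value if allocation fails.  Arguments are generated before
   the clock starts, so rand () is not part of the measurement.  */
static double
bench_random (memcpy_fn fn, size_t n, size_t calls, unsigned seed)
{
  const size_t buf = (size_t) 1 << 24;  /* 16 MiB, an arbitrary choice */
  assert (n > 0 && calls > 0 && 2 * n < buf);

  char *src = malloc (buf);
  char *dst = malloc (buf);
  struct copy_args *args = malloc (calls * sizeof *args);
  if (src == NULL || dst == NULL || args == NULL)
    {
      free (src); free (dst); free (args);
      return -1.0;
    }
  memset (src, 1, buf);
  memset (dst, 2, buf);

  srand (seed);
  for (size_t i = 0; i < calls; i++)
    args[i] = random_args (n, buf);

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < calls; i++)
    fn (dst + args[i].dst_off, src + args[i].src_off, args[i].len);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  /* Read one copied byte so the copies cannot be discarded as dead.  */
  volatile char sink = dst[args[calls - 1].dst_off];
  (void) sink;

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  free (src); free (dst); free (args);
  return ns / (double) calls;
}
```

Something like bench_random (memcpy, 64, 1000000, 1) would then give the
mean cost per call for lengths 64..128. The indirect call adds the same
overhead to every implementation being compared. Where RAND_MAX is small,
the offsets cover only part of the buffer, so a better generator would be
needed there.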

The third problem is that the benchmark takes the minimum over the times.
This obviously measures the minimal time, not the average time.

This is a statistically unsound practice. Any article that used the minimum in a
benchmark would immediately get rejected on review.

The reason is easy to see. Consider this function:

if (rand()%4<1) 
 sleep(1);
else 
 sleep(15000);

According to the minimum metric it is 100 times faster than the one below,
even though the opposite is true.

if (rand()%2<1)
 sleep(100);
else
 sleep(200);
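
To make the comparison concrete, the two functions can be simulated
without actually sleeping. This is only a sketch; the names time_a,
time_b, summary and summarize are made up for illustration, and the
times are in the same units as the sleep() arguments above. The expected
time of the first function is 0.25*1 + 0.75*15000 = 11250.25, and of the
second 0.5*100 + 0.5*200 = 150, so on average the first is about 75
times slower, while its minimum (1) is 100 times smaller than the
second's minimum (100).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simulated running times of the two functions above (hypothetical
   helper names), in the same units as the sleep () arguments.  */
static double time_a (void) { return rand () % 4 < 1 ? 1.0 : 15000.0; }
static double time_b (void) { return rand () % 2 < 1 ? 100.0 : 200.0; }

struct summary { double min, mean; };

/* Take RUNS samples from SAMPLE and report both the minimum and the
   mean, so the two metrics can be compared side by side.  */
static struct summary
summarize (double (*sample) (void), size_t runs)
{
  assert (runs > 0);
  struct summary s;
  double total;

  s.min = total = sample ();
  for (size_t i = 1; i < runs; i++)
    {
      double t = sample ();
      if (t < s.min)
        s.min = t;
      total += t;
    }
  s.mean = total / (double) runs;
  return s;
}
```

Calling summarize (time_a, 100000) and summarize (time_b, 100000) and
comparing the min and mean fields shows the two metrics ranking the
functions in opposite orders.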