On Mon, 2017-03-27 at 11:52 +0100, Ramana Radhakrishnan wrote:
>
> > Also Adhemerval Zanella did some benchmarking that showed the
> > prefetching done in the thunderx version might be appropriate for the
> > generic version.  However if you look at the prefetching we only do it
> > every other time through the loop.  This is because the loop copies 64
> > bytes and the ThunderX cache line size is 128 bytes.  If other aarch64
> > chips have a 64 byte cache line they might want a different prefetching
> > setup.
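
To make the every-other-iteration point concrete, here is a rough C
sketch.  It is not the actual ThunderX assembly: copy_with_prefetch and
its line_size parameter are made up for illustration, the one-line-ahead
prefetch distance is an assumption, and __builtin_prefetch assumes GCC
or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 64  /* bytes copied per loop iteration, as in the quoted text */

/* Hypothetical helper: copies n bytes in 64-byte chunks and returns how
   many prefetches were issued.  line_size is the assumed cache line size
   and must be a multiple of CHUNK. */
static size_t copy_with_prefetch(void *dst, const void *src,
                                 size_t n, size_t line_size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t prefetches = 0;
    size_t off = 0;

    assert(line_size >= CHUNK && line_size % CHUNK == 0);

    for (; off + CHUNK <= n; off += CHUNK) {
        /* Prefetch once per cache line, not once per chunk.  With a
           128-byte line this fires on every other iteration; with a
           64-byte line it fires on every iteration.  Skip the prefetch
           when the next line would start at or past the end. */
        if (off % line_size == 0 && off + line_size < n) {
            __builtin_prefetch(s + off + line_size, 0, 0);
            prefetches++;
        }
        memcpy(d + off, s + off, CHUNK);
    }

    /* Copy whatever is left after the last full chunk. */
    memcpy(d + off, s + off, n - off);
    return prefetches;
}
```

For a 1024-byte copy this issues 7 prefetches with line_size 128 and 15
with line_size 64, which is the kind of difference a chip with 64-byte
lines would need to account for.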

> Can you link to the benchmark numbers, workloads and what systems ?
> 
> Ramana

The only reference I have to Adhemerval's results is at:

https://sourceware.org/ml/libc-alpha/2017-02/msg00118.html

Attached are my latest results on ThunderX with the IFUNC numbers from
the glibc memcpy performance benchmarks.  They include the new
bench-memcpy-random benchmark, which doesn't show much difference.  It is
really bench-memcpy-large that stands out.

Steve Ellcey
sellcey@cavium.com