On Mon, 2017-03-27 at 11:52 +0100, Ramana Radhakrishnan wrote:
> 
> > Also Adhemerval Zanella did some benchmarking that showed the
> > prefetching done in the thunderx version might be appropriate for the
> > generic version.  However if you look at the prefetching we only do it
> > every other time through the loop.  This is because the loop copies 64
> > bytes and the ThunderX cache line size is 128 bytes.  If other aarch64
> > chips have a 64 byte cache line they might want a different prefetching
> > setup.
> Can you link to the benchmark numbers, workloads and what systems ?
> 
> Ramana

The only reference I have to Adhemerval's results is at:

https://sourceware.org/ml/libc-alpha/2017-02/msg00118.html

Attached are my latest results on ThunderX with the IFUNC numbers from
the glibc memcpy performance benchmarks.  They include the new
bench-memcpy-random benchmark, which doesn't show much difference.  It is
really bench-memcpy-large that stands out.

Steve Ellcey
sellcey@cavium.com
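
For illustration only, here is a minimal C sketch of the "prefetch every
other time through the loop" idea from the quoted text: the loop copies 64
bytes per iteration, so with a 128 byte cache line a prefetch is only
needed on every second iteration, while a 64 byte line would need one on
every iteration.  This is not the glibc thunderx code.  The function name
copy_with_prefetch, the line_size parameter, and the prefetch distance of
one cache line ahead are all assumptions made for the example, and
__builtin_prefetch is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 64 };  /* bytes copied per loop iteration, as in the text */

/* Hypothetical helper: copy n bytes in 64 byte chunks and issue one
   prefetch per cache line of the source.  Returns the number of
   prefetches issued so the pattern can be checked.  n must be a
   multiple of CHUNK and line_size a multiple of CHUNK.  */
static size_t
copy_with_prefetch (unsigned char *dst, const unsigned char *src,
                    size_t n, size_t line_size)
{
  size_t prefetches = 0;

  assert (n % CHUNK == 0 && line_size % CHUNK == 0);

  for (size_t off = 0; off < n; off += CHUNK)
    {
      /* Only prefetch when this chunk starts a new cache line, and
         only if the next line is still inside the source buffer.
         The one-line-ahead distance is an arbitrary assumption.  */
      if (off % line_size == 0 && off + line_size < n)
        {
          __builtin_prefetch (src + off + line_size, 0, 0);
          prefetches++;
        }
      memcpy (dst + off, src + off, CHUNK);
    }
  return prefetches;
}
```

Calling copy_with_prefetch with line_size 128 prefetches on every other
iteration; calling it with line_size 64 prefetches on every iteration,
which is the kind of different setup a chip with 64 byte cache lines
might want.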