This patch improves the MIPS assembly implementations of memcpy. It adds two optimizations: prefetching of data for subsequent iterations of the memcpy loop, and a pipelined expansion of the unaligned memcpy loop. Together these speed up MIPS memcpy by about 10%.

The prefetching part is straightforward: it prefetches a cache line (32 bytes) for the +1 iteration in the unaligned case and the +2 iteration in the aligned case. The rationale is that a prefetch takes about the same time to fetch its data as 1 iteration of the unaligned loop or 2 iterations of the aligned loop. These parameters were tuned on a modern MIPS processor.

The pipelined expansion of the unaligned loop is implemented in the same fashion as the existing expansion of the aligned loop. The assembly is tricky, but it works.

These changes are almost 3 years old and have been thoroughly tested in CodeSourcery MIPS toolchains. Retested against current trunk with no regressions for the n32, n64 and o32 ABIs.

OK to apply?

--
Maxim Kuvyrkov
Mentor Graphics
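As an aside for reviewers less familiar with software prefetching: the prefetch-ahead scheme described above is roughly equivalent to the following C sketch. It uses GCC's `__builtin_prefetch`; the copy routine, the `CACHE_LINE` constant, and the prefetch distance of 2 lines are illustrative stand-ins, not the actual assembly or the tuned values from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: prefetch a cache line that a later
   iteration will read, analogous to the +1/+2 iteration prefetch
   distance in the assembly.  The real implementation is hand-written
   MIPS assembly, not a C loop over memcpy.  */
#define CACHE_LINE 32

static void copy_with_prefetch(char *dst, const char *src, size_t n)
{
    size_t i = 0;

    for (; i + CACHE_LINE <= n; i += CACHE_LINE) {
        /* Hint the hardware to start fetching the source line two
           iterations ahead (second arg 0 = read access).  By the time
           the loop reaches it, the data should already be in cache.  */
        __builtin_prefetch(src + i + 2 * CACHE_LINE, 0, 0);
        memcpy(dst + i, src + i, CACHE_LINE);
    }

    /* Copy the remaining tail bytes, if any.  */
    if (i < n)
        memcpy(dst + i, src + i, n - i);
}
```

Note that a prefetch past the end of the buffer is harmless here: `__builtin_prefetch` is only a hint and does not fault on invalid addresses.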