On 18/09/2017 10:54, Florian Weimer wrote:
> On 09/13/2017 03:12 PM, Tulio Magno Quites Machado Filho wrote:
>>> So I think the implementation constraint on the mem* functions is wrong.
>>> It leads to a slower implementation of the mem* function for most of
>>> userspace which does not access device memory, and even for device
>>> memory, it is probably not what you want.
>> Makes sense.  But as there is nothing in the standard allowing or prohibiting
>> the usage of mem* functions to access caching-inhibited memory, I thought it
>> would make sense to provide functions that are as generic as possible.
>
> But I have shown that you aren't doing that because of the GCC optimization
> which inlines the memset call.
>
> But I won't continue this conversation as I don't see it particularly useful
> to anyone.  In the end, you are the architecture maintainers, and you should
> do what you think is best.
>
> Thanks,
> Florian

I think one way to provide a slightly better memcpy implementation for
POWER8, and still be able to work around the problem of unaligned accesses
to non-cacheable memory, is to use tunables.

The branch azanella/memcpy-power8 [1] has a POWER8 memcpy optimization which
uses unaligned loads and stores. I created it some time ago but never
actually sent it upstream. It shows better performance on both bench-memcpy
and bench-memcpy-random (about 10% on the latter) and mixed results on
bench-memcpy-large (which is mainly dominated by memory throughput, and in
the environment I am using, a shared PowerKVM instance, the results do not
seem to be reliable).

It could use some tuning, especially on the ranges I used for unrolling the
loads/stores. It also does not handle unaligned accesses that cross a page
boundary (which tend to be quite slow on current hardware, but are also
uncommon with the usual current page size of 64k).

This first patch does not enable this option as a default for POWER8; it
just adds it to the string tests as an option. The second patch changes the
selection to:

1. If glibc is configured with tunables, set the new implementation as the
   default for ISA 2.07 (power8).

2. Also, if tunables are active, add the parameter glibc.tune.aligned_memopt
   to disable the selection of the new implementation.

So programs that rely on aligned loads can set:

  GLIBC_TUNABLES=glibc.tune.aligned_memopt=1

And then the memcpy ifunc selection would pick the power7 one, which uses
only aligned loads and stores.

This is an RFC patch, and if the idea sounds reasonable to the powerpc arch
maintainers I can work on finishing the patch with more comments and send it
upstream.

I tried to apply the same unaligned idea to memset and memmove, but I could
not get any real improvement in either.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/memcpy-power8
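
To make the selection rule in points 1 and 2 concrete, here is a rough
standalone sketch in C. It is not the code from the branch: the names
select_memcpy, tunable_is_one and the enum are hypothetical, the tunables
string is parsed by hand instead of through glibc's internal tunables
machinery, and every pre-2.07 choice is collapsed into the power7 variant.
It only assumes that GLIBC_TUNABLES is a colon-separated list of name=value
pairs and that the proposed tunable is set to exactly "1".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical labels for the two implementations discussed above. */
enum memcpy_impl {
    MEMCPY_POWER7_ALIGNED,   /* aligned loads/stores only */
    MEMCPY_POWER8_UNALIGNED  /* proposed unaligned loads/stores */
};

/* Return true if TUNABLES (colon-separated name=value pairs, may be NULL)
   contains an entry that is exactly NAME=1.  Hand-rolled for illustration;
   glibc itself uses its own internal parser. */
bool tunable_is_one(const char *tunables, const char *name)
{
    assert(name != NULL);
    if (tunables == NULL)
        return false;

    size_t name_len = strlen(name);
    const char *p = tunables;
    while (*p != '\0') {
        const char *end = strchr(p, ':');
        size_t len = end != NULL ? (size_t)(end - p) : strlen(p);

        if (len == name_len + 2
            && strncmp(p, name, name_len) == 0
            && p[name_len] == '='
            && p[name_len + 1] == '1')
            return true;

        p += len;          /* skip this entry */
        if (*p == ':')     /* and the separator, if any */
            p++;
    }
    return false;
}

/* Sketch of the proposed rule: on ISA 2.07 hardware pick the unaligned
   variant unless glibc.tune.aligned_memopt=1 asks for aligned accesses.
   HAS_ISA_2_07 stands in for the hwcap check a real ifunc resolver does. */
enum memcpy_impl select_memcpy(bool has_isa_2_07, const char *tunables)
{
    if (has_isa_2_07
        && !tunable_is_one(tunables, "glibc.tune.aligned_memopt"))
        return MEMCPY_POWER8_UNALIGNED;
    return MEMCPY_POWER7_ALIGNED;
}
```

With this sketch, select_memcpy(true, getenv("GLIBC_TUNABLES")) returns the
unaligned variant by default and falls back to the aligned one when the
tunable is set, which is the behaviour a program touching caching-inhibited
memory would opt into.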