On Thu, May 18, 2017 at 1:59 PM, Erich Elsen wrote:
> Hi H.J.,
>
> I was on vacation, sorry for the slow reply.  The updated benchmark
> still shows the same behavior, thanks.
>
> I'll try my hand at creating a patch that makes that variable
> __x86_shared_non_temporal_threshold a tunable.  It will be necessary
> to do internal experiments anyway.
>

__x86_shared_non_temporal_threshold was set to 6 times the per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on an 8-core processor.  For a processor with more than 8 cores, the
threshold is too low.  Set __x86_shared_non_temporal_threshold to 3/4
of the total shared cache size so that it is unchanged on 8-core
processors.  On processors with fewer than 8 cores, the threshold is
lower.

Any comments?

-- 
H.J.
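
To make the arithmetic concrete: 6 per-core shares out of 8 is 6/8 = 3/4 of the total shared cache, which is why the two rules agree on an 8-core processor. The following is only a minimal sketch of that comparison, not the actual glibc code. The function names and the per_core_share parameter are hypothetical, and it assumes the total shared cache is simply the per-core share times the core count.

```c
#include <stddef.h>

/* Hypothetical helper: the old rule, 6 times the per-core share of
   the shared cache, independent of the number of cores.  */
size_t
old_threshold (size_t per_core_share)
{
  return per_core_share * 6;
}

/* Hypothetical helper: the proposed rule, 3/4 of the total shared
   cache, assuming total = per_core_share * cores.  */
size_t
new_threshold (size_t per_core_share, unsigned int cores)
{
  return per_core_share * cores * 3 / 4;
}
```

With a 1 MiB per-core share, both rules give 6 MiB on 8 cores. The proposed rule gives 3 MiB on 4 cores and 12 MiB on 16 cores, while the old rule stays at 6 MiB in every case.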