* memcpy performance on skylake server
@ 2021-07-06  8:17 Ji, Cheng
  2021-07-14 12:58 ` Adhemerval Zanella
  0 siblings, 1 reply; 5+ messages in thread

From: Ji, Cheng @ 2021-07-06  8:17 UTC (permalink / raw)
To: libc-help

Hello,

I found that memcpy is slower on Skylake server CPUs during our optimization
work, and I can't really explain what we got and need some guidance here.

The problem is that memcpy is noticeably slower than a simple for loop when
copying large chunks of data. This genuinely sounds like an amateur mistake
in our testing code, but here's what we have tried:

* The test data is large enough: 1 GB.
* We noticed a change quite a while ago regarding Skylake and AVX-512:
  https://patchwork.ozlabs.org/project/glibc/patch/20170418183712.GA22211@intel.com/
* We updated glibc from 2.17 to the latest 2.33; memcpy became about 5%
  faster, but it is still slower than a simple loop.
* We tested on multiple bare-metal machines with different CPUs (Xeon Gold
  6132, Gold 6252, Silver 4114), as well as a virtual machine on Google
  Cloud; the result is reproducible.
* On an older-generation Xeon E5-2630 v3, memcpy is about 50% faster than
  the simple loop. On my desktop (i7-7700K), memcpy is also significantly
  faster.
* numactl is used to ensure everything is running on a single core.
* The code is compiled by gcc 10.3.

The numbers on a Xeon Gold 6132, with glibc 2.33:
simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s
simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
simple_copy 3.68 seconds, 5.44 GiB/s, 5.71 GB/s

The result is worse with the system-provided glibc 2.17:
simple_memcpy 4.38 seconds, 4.57 GiB/s, 4.79 GB/s
simple_copy 3.68 seconds, 5.43 GiB/s, 5.70 GB/s
simple_memcpy 4.38 seconds, 4.56 GiB/s, 4.78 GB/s
simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s

The code to generate this result (compiled with g++ -O2 -g, run with:
numactl --membind 0 --physcpubind 0 -- ./a.out):

=====

#include <chrono>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

class TestCase {
    using clock_t = std::chrono::high_resolution_clock;
    using sec_t = std::chrono::duration<double, std::ratio<1>>;

public:
    static constexpr size_t NUM_VALUES = 128 * (1 << 20); // 128 Mi values * 8 bytes = 1 GiB

    void init() {
        vals_.resize(NUM_VALUES);
        for (size_t i = 0; i < NUM_VALUES; ++i) {
            vals_[i] = i;
        }
        dest_.resize(NUM_VALUES);
    }

    void run(std::string name,
             std::function<void(const int64_t *, int64_t *, size_t)> &&func) {
        // ignore the result from the first run
        func(vals_.data(), dest_.data(), vals_.size());
        constexpr size_t count = 20;
        auto start = clock_t::now();
        for (size_t i = 0; i < count; ++i) {
            func(vals_.data(), dest_.data(), vals_.size());
        }
        auto end = clock_t::now();
        double duration = std::chrono::duration_cast<sec_t>(end - start).count();
        printf("%s %.2f seconds, %.2f GiB/s, %.2f GB/s\n", name.data(), duration,
               sizeof(int64_t) * NUM_VALUES / double(1 << 30) * count / duration,
               sizeof(int64_t) * NUM_VALUES / double(1e9) * count / duration);
    }

private:
    std::vector<int64_t> vals_;
    std::vector<int64_t> dest_;
};

void simple_memcpy(const int64_t *src, int64_t *dest, size_t n) {
    memcpy(dest, src, n * sizeof(int64_t));
}

void simple_copy(const int64_t *src, int64_t *dest, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dest[i] = src[i];
    }
}

int main(int, char **) {
    TestCase c;
    c.init();

    c.run("simple_memcpy", simple_memcpy);
    c.run("simple_copy", simple_copy);
    c.run("simple_memcpy", simple_memcpy);
    c.run("simple_copy", simple_copy);
}

=====

The assembly of simple_copy generated by gcc is very simple:

Dump of assembler code for function _Z11simple_copyPKlPlm:
   0x0000000000401440 <+0>:     mov    %rdx,%rcx
   0x0000000000401443 <+3>:     test   %rdx,%rdx
   0x0000000000401446 <+6>:     je     0x401460 <_Z11simple_copyPKlPlm+32>
   0x0000000000401448 <+8>:     xor    %eax,%eax
   0x000000000040144a <+10>:    nopw   0x0(%rax,%rax,1)
   0x0000000000401450 <+16>:    mov    (%rdi,%rax,8),%rdx
   0x0000000000401454 <+20>:    mov    %rdx,(%rsi,%rax,8)
   0x0000000000401458 <+24>:    inc    %rax
   0x000000000040145b <+27>:    cmp    %rax,%rcx
   0x000000000040145e <+30>:    jne    0x401450 <_Z11simple_copyPKlPlm+16>
   0x0000000000401460 <+32>:    retq

When compiling with -O3, gcc vectorizes the loop using xmm0; simple_copy is
around 1% faster.

I took a brief look at the glibc source code. Though I don't have enough
knowledge to understand it yet, I'm curious about the underlying mechanism.
Thanks.

Cheng

^ permalink raw reply  [flat|nested] 5+ messages in thread
* Re: memcpy performance on skylake server
  2021-07-06  8:17 memcpy performance on skylake server Ji, Cheng
@ 2021-07-14 12:58 ` Adhemerval Zanella
  2021-07-14 13:26   ` H.J. Lu
  0 siblings, 1 reply; 5+ messages in thread

From: Adhemerval Zanella @ 2021-07-14 12:58 UTC (permalink / raw)
To: Ji, Cheng, Libc-help, H.J. Lu

On 06/07/2021 05:17, Ji, Cheng via Libc-help wrote:
> Hello,
>
> I found that memcpy is slower on skylake server CPUs during our
> optimization work, and I can't really explain what we got and need some
> guidance here.
>
> The problem is that memcpy is noticeably slower than a simple for loop when
> copying large chunks of data.

[...]

> When compiling with -O3, gcc vectorized the loop using xmm0, the
> simple_copy is around 1% faster.

Usually differences of that magnitude fall either within noise or may be
something related to OS jitter.

> I took a brief look at the glibc source code. Though I don't have enough
> knowledge to understand it yet, I'm curious about the underlying mechanism.
> Thanks.

H.J., do you have any idea what might be happening here?
* Re: memcpy performance on skylake server
  2021-07-14 12:58 ` Adhemerval Zanella
@ 2021-07-14 13:26   ` H.J. Lu
  2021-07-15  7:32     ` Ji, Cheng
  0 siblings, 1 reply; 5+ messages in thread

From: H.J. Lu @ 2021-07-14 13:26 UTC (permalink / raw)
To: Adhemerval Zanella; +Cc: Ji, Cheng, Libc-help

On Wed, Jul 14, 2021 at 5:58 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
> On 06/07/2021 05:17, Ji, Cheng via Libc-help wrote:
> > I found that memcpy is slower on skylake server CPUs during our
> > optimization work, and I can't really explain what we got and need some
> > guidance here.

[...]

> H.J, do you have any idea what might be happening here?

From the Intel optimization guide:

2.2.2 Non-Temporal Stores on Skylake Server Microarchitecture

Because of the change in the size of each bank of last-level cache on
Skylake Server microarchitecture, if an application, library, or driver
only considers the last-level cache to determine the size of on-chip
cache per core, it may see a reduction with Skylake Server
microarchitecture and may use non-temporal stores with smaller blocks of
memory writes. Since non-temporal stores evict cache lines back to
memory, this may result in an increase in the number of subsequent cache
misses and memory bandwidth demands on Skylake Server microarchitecture,
compared to the previous Intel Xeon processor family.

Also, because of a change in the handling of accesses resulting from
non-temporal stores by Skylake Server microarchitecture, the resources
within each core remain busy for a longer duration compared to similar
accesses on the previous Intel Xeon processor family. As a result, if a
series of such instructions are executed, there is a potential that the
processor may run out of resources and stall, thus limiting the memory
write bandwidth from each core.

The increase in cache misses due to overuse of non-temporal stores and
the limit on the memory write bandwidth per core for non-temporal stores
may result in reduced performance for some applications.

-- 
H.J.
* Re: memcpy performance on skylake server
  2021-07-14 13:26   ` H.J. Lu
@ 2021-07-15  7:32     ` Ji, Cheng
  2021-07-15 16:51       ` Patrick McGehearty
  0 siblings, 1 reply; 5+ messages in thread

From: Ji, Cheng @ 2021-07-15  7:32 UTC (permalink / raw)
To: H.J. Lu; +Cc: Adhemerval Zanella, Libc-help

Thanks for the information. We did some quick experiments. Indeed, using
normal temporal stores is ~20% faster than using non-temporal stores in
this case.

Cheng

On Wed, Jul 14, 2021 at 9:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> From the Intel optimization guide:
>
> 2.2.2 Non-Temporal Stores on Skylake Server Microarchitecture

[...]
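For anyone wanting to try the same experiment without rebuilding anything: glibc exposes the size threshold at which memcpy switches to non-temporal stores as a tunable. A sketch of how one might raise it far above the working set so the temporal path is taken instead; this assumes a glibc new enough to support the `glibc.cpu.x86_non_temporal_threshold` tunable (check your version's manual), and `./a.out` stands in for the benchmark above:

```shell
# Raise the non-temporal threshold to 4 GiB (value in bytes) so a
# 1 GiB memcpy stays on the ordinary cached-store path, then rerun
# the benchmark pinned to one core as before.
GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=0x100000000 \
    numactl --membind 0 --physcpubind 0 -- ./a.out
```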
* Re: memcpy performance on skylake server
  2021-07-15  7:32     ` Ji, Cheng
@ 2021-07-15 16:51       ` Patrick McGehearty
  0 siblings, 0 replies; 5+ messages in thread

From: Patrick McGehearty @ 2021-07-15 16:51 UTC (permalink / raw)
To: Ji, Cheng, H.J. Lu; +Cc: Libc-help

A more in-depth discussion of tuning non-temporal stores for x86 can be
found at:
http://patches-tcwg.linaro.org/patch/41797/

- Patrick McGehearty

On 7/15/2021 2:32 AM, Ji, Cheng via Libc-help wrote:
> Thanks for the information. We did some quick experiments. Indeed, using
> normal temporal stores is ~20% faster than using non-temporal stores in
> this case.

[...]
end of thread, other threads:[~2021-07-15 16:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-06  8:17 memcpy performance on skylake server Ji, Cheng
2021-07-14 12:58 ` Adhemerval Zanella
2021-07-14 13:26   ` H.J. Lu
2021-07-15  7:32     ` Ji, Cheng
2021-07-15 16:51       ` Patrick McGehearty