public inbox for libc-help@sourceware.org
 help / color / mirror / Atom feed
* memcpy performance on skylake server
@ 2021-07-06  8:17 Ji, Cheng
  2021-07-14 12:58 ` Adhemerval Zanella
  0 siblings, 1 reply; 5+ messages in thread
From: Ji, Cheng @ 2021-07-06  8:17 UTC (permalink / raw)
  To: libc-help

Hello,

I found that memcpy is slower on Skylake server CPUs during our
optimization work, and I can't really explain the results, so I need
some guidance here.

The problem is that memcpy is noticeably slower than a simple for loop
when copying large chunks of data. This genuinely sounds like an
amateur mistake in our testing code, but here's what we have tried:

* The test data is large enough: 1 GiB.
* We noticed a change quite a while ago regarding Skylake and AVX-512:
https://patchwork.ozlabs.org/project/glibc/patch/20170418183712.GA22211@intel.com/
* We updated glibc from 2.17 to the latest 2.33; memcpy became about 5%
faster but is still slower than a simple loop.
* We tested on multiple bare-metal machines with different CPUs (Xeon
Gold 6132, Gold 6252, Silver 4114) as well as a virtual machine on
Google Cloud; the result is reproducible.
* On an older-generation Xeon E5-2630 v3, memcpy is about 50% faster
than the simple loop. On my desktop (i7-7700K) memcpy is also
significantly faster.
* numactl is used to ensure everything runs on a single core.
* The code is compiled with gcc 10.3.

The numbers on a Xeon Gold 6132, with glibc 2.33:
simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s
simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
simple_copy 3.68 seconds, 5.44 GiB/s, 5.71 GB/s

The result is worse with the system-provided glibc 2.17:
simple_memcpy 4.38 seconds, 4.57 GiB/s, 4.79 GB/s
simple_copy 3.68 seconds, 5.43 GiB/s, 5.70 GB/s
simple_memcpy 4.38 seconds, 4.56 GiB/s, 4.78 GB/s
simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s


The code to generate this result (compiled with g++ -O2 -g, run with: numactl
--membind 0 --physcpubind 0 -- ./a.out)
=====

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

class TestCase {
    using clock_t = std::chrono::high_resolution_clock;
    using sec_t = std::chrono::duration<double, std::ratio<1>>;

public:
    // 128 Mi values * 8 bytes = 1 GiB
    static constexpr size_t NUM_VALUES = 128 * (1 << 20);

    void init() {
        vals_.resize(NUM_VALUES);
        for (size_t i = 0; i < NUM_VALUES; ++i) {
            vals_[i] = i;
        }
        dest_.resize(NUM_VALUES);
    }

    void run(std::string name,
             std::function<void(const int64_t *, int64_t *, size_t)> &&func) {
        // warm-up run; its result is ignored
        func(vals_.data(), dest_.data(), vals_.size());
        constexpr size_t count = 20;
        auto start = clock_t::now();
        for (size_t i = 0; i < count; ++i) {
            func(vals_.data(), dest_.data(), vals_.size());
        }
        auto end = clock_t::now();
        double duration =
            std::chrono::duration_cast<sec_t>(end - start).count();
        printf("%s %.2f seconds, %.2f GiB/s, %.2f GB/s\n", name.data(),
               duration,
               sizeof(int64_t) * NUM_VALUES / double(1 << 30) * count / duration,
               sizeof(int64_t) * NUM_VALUES / 1e9 * count / duration);
    }

private:
    std::vector<int64_t> vals_;
    std::vector<int64_t> dest_;
};

void simple_memcpy(const int64_t *src, int64_t *dest, size_t n) {
    memcpy(dest, src, n * sizeof(int64_t));
}

void simple_copy(const int64_t *src, int64_t *dest, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dest[i] = src[i];
    }
}

int main(int, char **) {
    TestCase c;
    c.init();

    c.run("simple_memcpy", simple_memcpy);
    c.run("simple_copy", simple_copy);
    c.run("simple_memcpy", simple_memcpy);
    c.run("simple_copy", simple_copy);
}

=====

The assembly of simple_copy generated by gcc is very simple:
Dump of assembler code for function _Z11simple_copyPKlPlm:
   0x0000000000401440 <+0>:     mov    %rdx,%rcx
   0x0000000000401443 <+3>:     test   %rdx,%rdx
   0x0000000000401446 <+6>:     je     0x401460 <_Z11simple_copyPKlPlm+32>
   0x0000000000401448 <+8>:     xor    %eax,%eax
   0x000000000040144a <+10>:    nopw   0x0(%rax,%rax,1)
   0x0000000000401450 <+16>:    mov    (%rdi,%rax,8),%rdx
   0x0000000000401454 <+20>:    mov    %rdx,(%rsi,%rax,8)
   0x0000000000401458 <+24>:    inc    %rax
   0x000000000040145b <+27>:    cmp    %rax,%rcx
   0x000000000040145e <+30>:    jne    0x401450 <_Z11simple_copyPKlPlm+16>
   0x0000000000401460 <+32>:    retq

When compiling with -O3, gcc vectorizes the loop using xmm registers;
simple_copy is then around 1% faster.

I took a brief look at the glibc source code. Though I don't have enough
knowledge to understand it yet, I'm curious about the underlying mechanism.
Thanks.

Cheng

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: memcpy performance on skylake server
  2021-07-06  8:17 memcpy performance on skylake server Ji, Cheng
@ 2021-07-14 12:58 ` Adhemerval Zanella
  2021-07-14 13:26   ` H.J. Lu
  0 siblings, 1 reply; 5+ messages in thread
From: Adhemerval Zanella @ 2021-07-14 12:58 UTC (permalink / raw)
  To: Ji, Cheng, Libc-help, H.J. Lu



On 06/07/2021 05:17, Ji, Cheng via Libc-help wrote:
> [...]
>
> When compiling with -O3, gcc vectorized the loop using xmm0, the
> simple_loop is around 1% faster.

Differences of that magnitude usually fall within measurement noise or
may be caused by OS jitter.

> 
> I took a brief look at the glibc source code. Though I don't have enough
> knowledge to understand it yet, I'm curious about the underlying mechanism.
> Thanks.

H.J, do you have any idea what might be happening here? 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: memcpy performance on skylake server
  2021-07-14 12:58 ` Adhemerval Zanella
@ 2021-07-14 13:26   ` H.J. Lu
  2021-07-15  7:32     ` Ji, Cheng
  0 siblings, 1 reply; 5+ messages in thread
From: H.J. Lu @ 2021-07-14 13:26 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Ji, Cheng, Libc-help

On Wed, Jul 14, 2021 at 5:58 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
> [...]
>
> H.J, do you have any idea what might be happening here?

From the Intel optimization guide:

2.2.2 Non-Temporal Stores on Skylake Server Microarchitecture

Because of the change in the size of each bank of last level cache on
Skylake Server microarchitecture, if an application, library, or
driver only considers the last level cache to determine the size of
on-chip cache per core, it may see a reduction with Skylake Server
microarchitecture and may use non-temporal store with smaller blocks
of memory writes. Since non-temporal stores evict cache lines back to
memory, this may result in an increase in the number of subsequent
cache misses and memory bandwidth demands on Skylake Server
microarchitecture, compared to the previous Intel Xeon processor
family.

Also, because of a change in the handling of accesses resulting from
non-temporal stores by Skylake Server microarchitecture, the resources
within each core remain busy for a longer duration compared to similar
accesses on the previous Intel Xeon processor family. As a result, if
a series of such instructions is executed, there is a potential that
the processor may run out of resources and stall, thus limiting the
memory write bandwidth from each core.

The increase in cache misses due to overuse of non-temporal stores and
the limit on the memory write bandwidth per core for non-temporal
stores may result in reduced performance for some applications.

-- 
H.J.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: memcpy performance on skylake server
  2021-07-14 13:26   ` H.J. Lu
@ 2021-07-15  7:32     ` Ji, Cheng
  2021-07-15 16:51       ` Patrick McGehearty
  0 siblings, 1 reply; 5+ messages in thread
From: Ji, Cheng @ 2021-07-15  7:32 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Adhemerval Zanella, Libc-help

Thanks for the information. We did some quick experiments. Indeed, using
normal temporal stores is ~20% faster than using non-temporal stores in
this case.
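For reference, glibc exposes the size at which memcpy switches to
non-temporal stores as a tunable, so a similar experiment can be run
without custom copy code. Assuming glibc 2.29 or later (where the
tunable is spelled glibc.cpu.x86_non_temporal_threshold), something
like:

```shell
# Raise glibc's non-temporal threshold (in bytes) above the benchmark's
# 1 GiB copy size, so memcpy keeps using regular temporal stores.
export GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=0x80000000
# Then rerun the benchmark pinned to one core as before:
#   numactl --membind 0 --physcpubind 0 -- ./a.out
```

(On glibc 2.26-2.28 the prefix was glibc.tune instead of glibc.cpu.)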

Cheng

On Wed, Jul 14, 2021 at 9:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> [...]
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: memcpy performance on skylake server
  2021-07-15  7:32     ` Ji, Cheng
@ 2021-07-15 16:51       ` Patrick McGehearty
  0 siblings, 0 replies; 5+ messages in thread
From: Patrick McGehearty @ 2021-07-15 16:51 UTC (permalink / raw)
  To: Ji, Cheng, H.J. Lu; +Cc: Libc-help

More in-depth discussion of tuning non-temporal stores for x86
can be found at:
http://patches-tcwg.linaro.org/patch/41797/

- Patrick McGehearty


On 7/15/2021 2:32 AM, Ji, Cheng via Libc-help wrote:
> Thanks for the information. We did some quick experiments. Indeed, using
> normal temporal stores is ~20% faster than using non-temporal stores in
> this case.
>
> Cheng
>
> [...]


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-07-15 16:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-06  8:17 memcpy performance on skylake server Ji, Cheng
2021-07-14 12:58 ` Adhemerval Zanella
2021-07-14 13:26   ` H.J. Lu
2021-07-15  7:32     ` Ji, Cheng
2021-07-15 16:51       ` Patrick McGehearty

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).