public inbox for libc-help@sourceware.org
From: "H.J. Lu" <hjl.tools@gmail.com>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: "Ji, Cheng" <jicheng1017@gmail.com>,
	Libc-help <libc-help@sourceware.org>
Subject: Re: memcpy performance on skylake server
Date: Wed, 14 Jul 2021 06:26:49 -0700	[thread overview]
Message-ID: <CAMe9rOrpWoKJrSX4CPFO2by_SMpj7nc4k_zvzwb+bv=8zsCvQQ@mail.gmail.com> (raw)
In-Reply-To: <6ee56912-dbe1-181e-6981-8d286c0325f3@linaro.org>

On Wed, Jul 14, 2021 at 5:58 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 06/07/2021 05:17, Ji, Cheng via Libc-help wrote:
> > Hello,
> >
> > I found that memcpy is slower on Skylake server CPUs during our
> > optimization work, and I can't really explain what we observed, so I'm
> > looking for some guidance here.
> >
> > The problem is that memcpy is noticeably slower than a simple for loop when
> > copying large chunks of data. This genuinely sounds like an amateur mistake
> > in our testing code but here's what we have tried:
> >
> > * The test data is large enough: 1 GiB.
> > * We noticed a change quite a while ago regarding Skylake and AVX512:
> > https://patchwork.ozlabs.org/project/glibc/patch/20170418183712.GA22211@intel.com/
> > * We updated glibc from 2.17 to the latest 2.33; we did see memcpy get
> > about 5% faster, but it is still slower than a simple loop.
> > * We tested on multiple bare-metal machines with different CPUs (Xeon Gold
> > 6132, Gold 6252, Silver 4114) as well as a virtual machine on Google Cloud;
> > the result is reproducible.
> > * On an older-generation Xeon E5-2630 v3, memcpy is about 50% faster than
> > the simple loop. On my desktop (i7-7700K) memcpy is also significantly
> > faster.
> > * numactl is used to ensure everything runs on a single core.
> > * The code is compiled with gcc 10.3.
> >
> > The numbers on a Xeon Gold 6132, with glibc 2.33:
> > simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
> > simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s
> > simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
> > simple_copy 3.68 seconds, 5.44 GiB/s, 5.71 GB/s
> >
> > The result is worse with the system-provided glibc 2.17:
> > simple_memcpy 4.38 seconds, 4.57 GiB/s, 4.79 GB/s
> > simple_copy 3.68 seconds, 5.43 GiB/s, 5.70 GB/s
> > simple_memcpy 4.38 seconds, 4.56 GiB/s, 4.78 GB/s
> > simple_copy 3.68 seconds, 5.44 GiB/s, 5.70 GB/s
> >
> >
> > The code to generate this result (compiled with g++ -O2 -g, run with:
> > numactl --membind 0 --physcpubind 0 -- ./a.out):
> > =====
> >
> > #include <chrono>
> > #include <cstdio>
> > #include <cstring>
> > #include <functional>
> > #include <string>
> > #include <vector>
> >
> > class TestCase {
> >     using clock_t = std::chrono::high_resolution_clock;
> >     using sec_t = std::chrono::duration<double, std::ratio<1>>;
> >
> > public:
> >     static constexpr size_t NUM_VALUES = 128 * (1 << 20); // 128 Mi * 8 bytes = 1 GiB
> >
> >     void init() {
> >         vals_.resize(NUM_VALUES);
> >         for (size_t i = 0; i < NUM_VALUES; ++i) {
> >             vals_[i] = i;
> >         }
> >         dest_.resize(NUM_VALUES);
> >     }
> >
> >     void run(std::string name, std::function<void(const int64_t *, int64_t *, size_t)> &&func) {
> >         // ignore the result from first run
> >         func(vals_.data(), dest_.data(), vals_.size());
> >         constexpr size_t count = 20;
> >         auto start = clock_t::now();
> >         for (size_t i = 0; i < count; ++i) {
> >             func(vals_.data(), dest_.data(), vals_.size());
> >         }
> >         auto end = clock_t::now();
> >         double duration = std::chrono::duration_cast<sec_t>(end - start).count();
> >         printf("%s %.2f seconds, %.2f GiB/s, %.2f GB/s\n", name.data(), duration,
> >                sizeof(int64_t) * NUM_VALUES / double(1 << 30) * count / duration,
> >                sizeof(int64_t) * NUM_VALUES / double(1e9) * count / duration);
> >     }
> >
> > private:
> >     std::vector<int64_t> vals_;
> >     std::vector<int64_t> dest_;
> > };
> >
> > void simple_memcpy(const int64_t *src, int64_t *dest, size_t n) {
> >     memcpy(dest, src, n * sizeof(int64_t));
> > }
> >
> > void simple_copy(const int64_t *src, int64_t *dest, size_t n) {
> >     for (size_t i = 0; i < n; ++i) {
> >         dest[i] = src[i];
> >     }
> > }
> >
> > int main(int, char **) {
> >     TestCase c;
> >     c.init();
> >
> >     c.run("simple_memcpy", simple_memcpy);
> >     c.run("simple_copy", simple_copy);
> >     c.run("simple_memcpy", simple_memcpy);
> >     c.run("simple_copy", simple_copy);
> > }
> >
> > =====
> >
> > The assembly of simple_copy generated by gcc is very simple:
> > Dump of assembler code for function _Z11simple_copyPKlPlm:
> >    0x0000000000401440 <+0>:     mov    %rdx,%rcx
> >    0x0000000000401443 <+3>:     test   %rdx,%rdx
> >    0x0000000000401446 <+6>:     je     0x401460 <_Z11simple_copyPKlPlm+32>
> >    0x0000000000401448 <+8>:     xor    %eax,%eax
> >    0x000000000040144a <+10>:    nopw   0x0(%rax,%rax,1)
> >    0x0000000000401450 <+16>:    mov    (%rdi,%rax,8),%rdx
> >    0x0000000000401454 <+20>:    mov    %rdx,(%rsi,%rax,8)
> >    0x0000000000401458 <+24>:    inc    %rax
> >    0x000000000040145b <+27>:    cmp    %rax,%rcx
> >    0x000000000040145e <+30>:    jne    0x401450 <_Z11simple_copyPKlPlm+16>
> >    0x0000000000401460 <+32>:    retq
> >
> > When compiling with -O3, gcc vectorized the loop using xmm0, and
> > simple_copy is around 1% faster.
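
A loop vectorized this way copies two int64_t per iteration through 128-bit
xmm loads and stores. A rough hand-written equivalent using SSE2 intrinsics
(an illustrative sketch, not the exact code gcc generates) looks like this:

#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_storeu_si128

// Illustrative vectorized copy, similar in spirit to what gcc -O3 emits for
// simple_copy; the compiler's actual code layout and unrolling will differ.
void simple_copy_sse2(const int64_t *src, int64_t *dest, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        // 128-bit (two int64_t) unaligned load and store via xmm registers.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dest + i), v);
    }
    for (; i < n; ++i) {  // scalar tail for odd element counts
        dest[i] = src[i];
    }
}

For a 1 GiB working set both the scalar and the vectorized loop are
essentially memory-bound, which is consistent with the ~1% difference
reported above.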
>
> Usually differences of that magnitude fall within measurement noise or may be
> related to OS jitter.
>
> >
> > I took a brief look at the glibc source code. Though I don't have enough
> > knowledge to understand it yet, I'm curious about the underlying mechanism.
> > Thanks.
>
> H.J., do you have any idea what might be happening here?

From the Intel optimization guide:

2.2.2 Non-Temporal Stores on Skylake Server Microarchitecture
Because of the change in the size of each bank of last level cache on
Skylake Server microarchitecture, if an application, library, or driver
only considers the last level cache to determine the size of on-chip
cache per core, it may see a reduction with Skylake Server
microarchitecture and may use non-temporal store with smaller blocks of
memory writes. Since non-temporal stores evict cache lines back to
memory, this may result in an increase in the number of subsequent cache
misses and memory bandwidth demands on Skylake Server microarchitecture,
compared to the previous Intel Xeon processor family.

Also, because of a change in the handling of accesses resulting from
non-temporal stores by Skylake Server microarchitecture, the resources
within each core remain busy for a longer duration compared to similar
accesses on the previous Intel Xeon processor family. As a result, if a
series of such instructions are executed, there is a potential that the
processor may run out of resources and stall, thus limiting the memory
write bandwidth from each core.

The increase in cache misses due to overuse of non-temporal stores and
the limit on the memory write bandwidth per core for non-temporal stores
may result in reduced performance for some applications.
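
Concretely, this matters here because glibc's x86 memcpy switches from
ordinary stores to non-temporal (streaming) stores once a copy exceeds a
size threshold derived from the shared last-level cache size, and a 1 GiB
copy is far above it. In a simplified SSE2 sketch (illustrative only, not
glibc's actual implementation, and assuming a 16-byte-aligned destination),
the difference is essentially just the store intrinsic used:

#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence

// Simplified non-temporal copy. glibc's real memcpy also handles alignment,
// overlap, and ERMS; this only illustrates the store type the guide refers to.
void nt_copy(const int64_t *src, int64_t *dest, size_t n) {
    size_t i = 0;
    // Non-temporal stores write around the cache, straight to memory.
    // _mm_stream_si128 requires a 16-byte-aligned destination; the
    // std::vector buffers in the benchmark above should satisfy that.
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_stream_si128(reinterpret_cast<__m128i *>(dest + i), v);
    }
    for (; i < n; ++i) {  // scalar tail for odd element counts
        dest[i] = src[i];
    }
    _mm_sfence();  // order the streaming stores before later accesses
}

If this is the path being hit, recent glibc also exposes the threshold as a
tunable (glibc.cpu.x86_non_temporal_threshold, settable through the
GLIBC_TUNABLES environment variable); raising it well above 1 GiB and
re-running the benchmark is a quick way to check whether the non-temporal
path accounts for the numbers above.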

-- 
H.J.

Thread overview:
2021-07-06  8:17 Ji, Cheng
2021-07-14 12:58 ` Adhemerval Zanella
2021-07-14 13:26   ` H.J. Lu [this message]
2021-07-15  7:32     ` Ji, Cheng
2021-07-15 16:51       ` Patrick McGehearty
