From: "H.J. Lu"
Date: Wed, 14 Jul 2021 06:26:49 -0700
Subject: Re: memcpy performance on skylake server
To: Adhemerval Zanella
Cc: "Ji, Cheng", Libc-help

On Wed, Jul 14, 2021 at 5:58 AM Adhemerval Zanella wrote:
>
>
> On 06/07/2021 05:17, Ji, Cheng via Libc-help wrote:
> > Hello,
> >
> > I found that memcpy is slower on skylake server CPUs during our
> > optimization work, and I can't really explain what we got and need some
> > guidance here.
> >
> > The problem is that memcpy is noticeably slower than a simple for loop
> > when copying large chunks of data. This genuinely sounds like an amateur
> > mistake in our testing code, but here's what we have tried:
> >
> > * The test data is large enough: 1 GB.
> > * We noticed a change quite a while ago regarding Skylake and AVX512:
> >   https://patchwork.ozlabs.org/project/glibc/patch/20170418183712.GA22211@intel.com/
> > * We updated glibc from 2.17 to the latest 2.33; memcpy did get about 5%
> >   faster, but it is still slower than the simple loop.
> > * We tested on multiple bare-metal machines with different CPUs (Xeon Gold
> >   6132, Gold 6252, Silver 4114) as well as a virtual machine on Google
> >   Cloud; the result is reproducible.
> > * On an older-generation Xeon E5-2630 v3, memcpy is about 50% faster than
> >   the simple loop. On my desktop (i7-7700k) memcpy is also significantly
> >   faster.
> > * numactl is used to ensure everything is running on a single core.
> > * The code is compiled with gcc 10.3.
> >
> > The numbers on a Xeon Gold 6132, with glibc 2.33:
> > simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
> > simple_copy   3.68 seconds, 5.44 GiB/s, 5.70 GB/s
> > simple_memcpy 4.18 seconds, 4.79 GiB/s, 5.02 GB/s
> > simple_copy   3.68 seconds, 5.44 GiB/s, 5.71 GB/s
> >
> > The result is worse with the system-provided glibc 2.17:
> > simple_memcpy 4.38 seconds, 4.57 GiB/s, 4.79 GB/s
> > simple_copy   3.68 seconds, 5.43 GiB/s, 5.70 GB/s
> > simple_memcpy 4.38 seconds, 4.56 GiB/s, 4.78 GB/s
> > simple_copy   3.68 seconds, 5.44 GiB/s, 5.70 GB/s
> >
> > The code to generate this result (compiled with g++ -O2 -g, run with:
> > numactl --membind 0 --physcpubind 0 -- ./a.out):
> > =====
> >
> > #include <chrono>
> > #include <cstdint>
> > #include <cstdio>
> > #include <cstring>
> > #include <functional>
> > #include <string>
> > #include <vector>
> >
> > class TestCase {
> >   using clock_t = std::chrono::high_resolution_clock;
> >   using sec_t = std::chrono::duration<double, std::ratio<1>>;
> >
> > public:
> >   static constexpr size_t NUM_VALUES = 128 * (1 << 20); // 128 Mi values * 8 bytes = 1 GiB
> >
> >   void init() {
> >     vals_.resize(NUM_VALUES);
> >     for (size_t i = 0; i < NUM_VALUES; ++i) {
> >       vals_[i] = i;
> >     }
> >     dest_.resize(NUM_VALUES);
> >   }
> >
> >   void run(std::string name,
> >            std::function<void(const int64_t *, int64_t *, size_t)> &&func) {
> >     // ignore the result from the first run
> >     func(vals_.data(), dest_.data(), vals_.size());
> >     constexpr size_t count = 20;
> >     auto start = clock_t::now();
> >     for (size_t i = 0; i < count; ++i) {
> >       func(vals_.data(), dest_.data(), vals_.size());
> >     }
> >     auto end = clock_t::now();
> >     double duration = std::chrono::duration_cast<sec_t>(end - start).count();
> >     printf("%s %.2f seconds, %.2f GiB/s, %.2f GB/s\n", name.data(), duration,
> >            sizeof(int64_t) * NUM_VALUES / double(1 << 30) * count / duration,
> >            sizeof(int64_t) * NUM_VALUES / double(1e9) * count / duration);
> >   }
> >
> > private:
> >   std::vector<int64_t> vals_;
> >   std::vector<int64_t> dest_;
> > };
> >
> > void simple_memcpy(const int64_t *src, int64_t *dest, size_t n) {
> >   memcpy(dest, src, n * sizeof(int64_t));
> > }
> >
> > void simple_copy(const int64_t *src, int64_t *dest, size_t n) {
> >   for (size_t i = 0; i < n; ++i) {
> >     dest[i] = src[i];
> >   }
> > }
> >
> > int main(int, char **) {
> >   TestCase c;
> >   c.init();
> >
> >   c.run("simple_memcpy", simple_memcpy);
> >   c.run("simple_copy", simple_copy);
> >   c.run("simple_memcpy", simple_memcpy);
> >   c.run("simple_copy", simple_copy);
> > }
> >
> > =====
> >
> > The assembly of simple_copy generated by gcc is very simple:
> > Dump of assembler code for function _Z11simple_copyPKlPlm:
> >    0x0000000000401440 <+0>:  mov    %rdx,%rcx
> >    0x0000000000401443 <+3>:  test   %rdx,%rdx
> >    0x0000000000401446 <+6>:  je     0x401460 <_Z11simple_copyPKlPlm+32>
> >    0x0000000000401448 <+8>:  xor    %eax,%eax
> >    0x000000000040144a <+10>: nopw   0x0(%rax,%rax,1)
> >    0x0000000000401450 <+16>: mov    (%rdi,%rax,8),%rdx
> >    0x0000000000401454 <+20>: mov    %rdx,(%rsi,%rax,8)
> >    0x0000000000401458 <+24>: inc    %rax
> >    0x000000000040145b <+27>: cmp    %rax,%rcx
> >    0x000000000040145e <+30>: jne    0x401450 <_Z11simple_copyPKlPlm+16>
> >    0x0000000000401460 <+32>: retq
> >
> > When compiling with -O3, gcc vectorizes the loop using xmm0, and
> > simple_copy is around 1% faster.
>
> Usually differences of that magnitude fall either within noise or may be
> something related to OS jitter.
>
> > I took a brief look at the glibc source code.
> > Though I don't have enough knowledge to understand it yet, I'm curious
> > about the underlying mechanism.
> > Thanks.
>
> H.J., do you have any idea what might be happening here?

From the Intel optimization guide:

2.2.2 Non-Temporal Stores on Skylake Server Microarchitecture

Because of the change in the size of each bank of last level cache on Skylake
Server microarchitecture, if an application, library, or driver only considers
the last level cache to determine the size of on-chip cache per core, it may
see a reduction with Skylake Server microarchitecture and may use non-temporal
stores with smaller blocks of memory writes. Since non-temporal stores evict
cache lines back to memory, this may result in an increase in the number of
subsequent cache misses and memory bandwidth demands on Skylake Server
microarchitecture, compared to the previous Intel Xeon processor family.

Also, because of a change in the handling of accesses resulting from
non-temporal stores by Skylake Server microarchitecture, the resources within
each core remain busy for a longer duration compared to similar accesses on
the previous Intel Xeon processor family. As a result, if a series of such
instructions is executed, there is a potential that the processor may run out
of resources and stall, thus limiting the memory write bandwidth from each
core.

The increase in cache misses due to overuse of non-temporal stores and the
limit on the memory write bandwidth per core for non-temporal stores may
result in reduced performance for some applications.

-- 
H.J.
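
Below is a minimal standalone sketch (an illustration, not glibc's
implementation) contrasting regular 256-bit stores with non-temporal
streaming stores, the kind of store glibc's vectorized memcpy switches to
once a copy exceeds its non-temporal threshold. The function names and the
20-iteration timing loop are made up here to mirror the benchmark earlier in
the thread; build with g++ -O2 -std=c++17 -mavx.
=====

#include <immintrin.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Plain 32-byte loads/stores: written data passes through the cache hierarchy.
void copy_regular(const int64_t *src, int64_t *dest, size_t n) {
  for (size_t i = 0; i + 4 <= n; i += 4) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(dest + i), v);
  }
}

// Streaming (MOVNTDQ) stores: bypass the cache; destination must be 32-byte aligned.
void copy_nontemporal(const int64_t *src, int64_t *dest, size_t n) {
  for (size_t i = 0; i + 4 <= n; i += 4) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src + i));
    _mm256_stream_si256(reinterpret_cast<__m256i *>(dest + i), v);
  }
  _mm_sfence();  // order the streaming stores before anything that follows
}

int main() {
  constexpr size_t n = 128 * (1 << 20);  // 1 GiB of int64_t, as in the benchmark above
  const size_t bytes = n * sizeof(int64_t);
  auto *src = static_cast<int64_t *>(std::aligned_alloc(32, bytes));
  auto *dest = static_cast<int64_t *>(std::aligned_alloc(32, bytes));
  for (size_t i = 0; i < n; ++i) src[i] = static_cast<int64_t>(i);

  auto bench = [&](const char *name,
                   void (*fn)(const int64_t *, int64_t *, size_t)) {
    fn(src, dest, n);  // ignore the first run, as in the original test
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 20; ++i) fn(src, dest, n);
    std::chrono::duration<double> dt =
        std::chrono::high_resolution_clock::now() - start;
    std::printf("%s %.2f GiB/s\n", name,
                20.0 * bytes / double(1 << 30) / dt.count());
  };

  bench("regular stores:     ", copy_regular);
  bench("non-temporal stores:", copy_nontemporal);

  std::free(src);
  std::free(dest);
}

=====

If the streaming-store path is indeed what slows simple_memcpy down on these
machines, raising glibc's crossover point with the
glibc.cpu.x86_non_temporal_threshold tunable (set via the GLIBC_TUNABLES
environment variable on recent releases such as 2.33) should bring memcpy
back in line with the simple loop for this 1 GiB copy.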