From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <wangshuo47@huawei.com>
Received: from szxga05-in.huawei.com (szxga05-in.huawei.com [45.249.212.191])
 by sourceware.org (Postfix) with ESMTPS id 8AD673858D33
 for <libc-alpha@sourceware.org>; Mon, 11 Jan 2021 08:42:51 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 8AD673858D33
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=huawei.com
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=wangshuo47@huawei.com
Received: from DGGEMS406-HUB.china.huawei.com (unknown [172.30.72.59])
 by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4DDnHK27yMzj5t7
 for <libc-alpha@sourceware.org>; Mon, 11 Jan 2021 16:41:17 +0800 (CST)
Received: from huawei.com (10.174.176.87) by DGGEMS406-HUB.china.huawei.com
 (10.3.19.206) with Microsoft SMTP Server id 14.3.498.0; Mon, 11 Jan 2021
 16:42:09 +0800
From: Shuo Wang <wangshuo47@huawei.com>
To: <hjl.tools@gmail.com>, <libc-alpha@sourceware.org>
CC: <hushiyuan@huawei.com>, <liqingqing3@huawei.com>
Subject: x86-64: memcpy performance reduced when running in virtual machine
Date: Mon, 11 Jan 2021 16:41:57 +0800
Message-ID: <20210111084157.15188-1-wangshuo47@huawei.com>
X-Mailer: git-send-email 2.19.0.windows.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
X-Originating-IP: [10.174.176.87]
X-CFilter-Loop: Reflected
X-Spam-Status: No, score=-7.3 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Jan 2021 08:42:57 -0000

There is also a performance reduction in the VM, compared with the host, when
memcpy enters __memmove_avx_unaligned_erms.
>memcpy performance is reduced when running in a virtual machine compared with the host.
>This is the test result:
>-----------------------
>|       | host |  vm  | 
>|cycle: |  78  | 1503 |
>-----------------------
>
>From perf, we believe that the host and the VM take the same branch:
>[host]
>  78.61%  libc-2.28.so     [.] __memmove_sse2_unaligned_erms
>  12.85%  [kernel]         [k] nmi
>   6.38%  hot_host_memcpy  [.] main
>   
>[virtual machine]
>  98.64%  libc-2.28.so   [.] __memmove_sse2_unaligned_erms
>   0.17%  hot_vm_memcpy  [.] main
>   
>This is our demo:
>#include <unistd.h>
>#include <stdlib.h>
>#include <stdio.h>
>#include <string.h>
>
>static __inline__ unsigned long long rdtsc(void)
>{
>  unsigned hi, lo;
>  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
>  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
>}
>
>int main(int argc, char **argv)
>{
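>    int i, defs, lm_optb;
>    if (argc == 3) {
>        defs = atoi(argv[1]);     /* copy size in bytes */
>        lm_optb = atoi(argv[2]);  /* number of memcpy calls */
>    } else {
>        printf("error input!\n");
>        return 1;
>    }
>    if (defs <= 0 || lm_optb <= 0) {  /* avoid division by zero below */
>        printf("error input!\n");
>        return 1;
>    }
>    char *src = (char *)valloc(defs);
>    char *dest = (char *)valloc(defs);
>    if (src == NULL || dest == NULL) {
>        printf("valloc failed!\n");
>        return 1;
>    }
>    int opts = defs;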
>
>    memset(src, 1, defs);
>    memset(dest, 1, defs);
>
>    unsigned long long begin, end;
>    begin = rdtsc();
>
>//while (1) {
>    for (i = 0; i < lm_optb; i++) {
>        (void) memcpy(dest, src, opts);
>    }
>//}
>
>    end = rdtsc();
>    printf("all cycle = %llu, percall = %llu\n", end - begin, (end - begin) / lm_optb);
>
>    return (0);
>}
>
>This is the test log:
># taskset -c 2 ./host_memcpy 1024 1024000
>all cycle = 80149652, percall = 78
># taskset -c 2 ./host_memcpy 1024 1024000
>all cycle = 93075200, percall = 90
>
># taskset -c 2 ./vm_memcpy 1024 1024000
>all cycle = 1539990968, percall = 1503
># taskset -c 2 ./vm_memcpy 1024 1024000
>all cycle = 1541243316, percall = 1505
>
>We build it with:
># gcc -g -O0 memcpy.c -o host_memcpy
># gcc -g -O0 memcpy.c -o vm_memcpy
>
>
>The environment information is as follows:
>[host]
>- kernel version: 4.18.0
>- glibc version: 2.28
>- gcc version: 8.3.1
>- qemu version: 2.12.0
>- libvirtd version: 4.5.0
>
># lscpu
>Architecture:        x86_64
>CPU op-mode(s):      32-bit, 64-bit
>Byte Order:          Little Endian
>CPU(s):              60
>On-line CPU(s) list: 0-59
>Thread(s) per core:  2
>Core(s) per socket:  15
>Socket(s):           8
>NUMA node(s):        8
>Vendor ID:           GenuineIntel
>CPU family:          6
>Model:               62
>Model name:          Intel(R) Xeon(R) CPU E7-8870 v2 @ 2.30GHz
>Stepping:            7
>CPU MHz:             2294.529
>CPU max MHz:         2300.0000
>CPU min MHz:         1200.0000
>BogoMIPS:            4589.07
>Virtualization:      VT-x
>L1d cache:           32K
>L1i cache:           32K
>L2 cache:            256K
>L3 cache:            30720K
>NUMA node0 CPU(s):   0-14,30-44
>NUMA node1 CPU(s):   15-29,45-59
>Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear flush_l1d
>
>[virtual machine]
>- kernel version: 4.18.0
>- glibc version: 2.28
>- gcc version: 8.3.1
>- qemu version: 2.12.0
>- libvirtd version: 4.5.0
>
># lscpu
>Architecture:        x86_64
>CPU op-mode(s):      32-bit, 64-bit
>Byte Order:          Little Endian
>CPU(s):              4
>On-line CPU(s) list: 0-3
>Thread(s) per core:  1
>Core(s) per socket:  1
>Socket(s):           4
>NUMA node(s):        1
>Vendor ID:           GenuineIntel
>CPU family:          6
>Model:               62
>Model name:          Intel(R) Xeon(R) CPU E7-8870 v2 @ 2.30GHz
>Stepping:            7
>CPU MHz:             2294.468
>BogoMIPS:            4588.93
>Hypervisor vendor:   KVM
>Virtualization type: full
>L1d cache:           32K
>L1i cache:           32K
>L2 cache:            4096K
>L3 cache:            16384K
>NUMA node0 CPU(s):   0-3
>Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms xsaveopt arat umip md_clear arch_capabilities
>
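
As a cross-check of the rdtsc numbers quoted above, here is a minimal sketch
(not part of the original demo) that times the same kind of memcpy loop with
the ISO C11 timespec_get() wall clock and reports the fastest batch. The helper
names now_ns and memcpy_min_ns_per_call are hypothetical, and taking the
minimum over several batches is an assumption about how to filter out noise
from interrupts; the measurements above did not do this.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock time in nanoseconds, via ISO C11 timespec_get(). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: run `batches` batches of `calls` memcpy calls of
 * `size` bytes each and return the fastest batch as nanoseconds per call.
 * Returns -1.0 on invalid arguments, allocation failure, or a bad copy.
 */
double memcpy_min_ns_per_call(size_t size, unsigned calls, unsigned batches)
{
    if (size == 0 || calls == 0 || batches == 0)
        return -1.0;

    char *src = malloc(size);
    char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so page faults happen before timing starts. */
    memset(src, 1, size);
    memset(dst, 2, size);

    /* Call through a volatile pointer so the compiler cannot inline
       or drop the copy, even when optimization is enabled. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;

    uint64_t best = UINT64_MAX;
    for (unsigned b = 0; b < batches; b++) {
        uint64_t t0 = now_ns();
        for (unsigned i = 0; i < calls; i++)
            copy(dst, src, size);
        uint64_t dt = now_ns() - t0;
        if (dt < best)
            best = dt;  /* keep the least disturbed batch */
    }

    /* Sanity check: the destination must now match the source. */
    int ok = (memcmp(dst, src, size) == 0);
    free(src);
    free(dst);
    return ok ? (double)best / calls : -1.0;
}
```

Calling memcpy_min_ns_per_call(1024, 1024000, 5) from a small main() on both
the host and the VM, pinned with taskset as in the test log, would give a
nanoseconds-per-call figure to set against the percall cycle counts.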