From: "H.J. Lu"
Date: Mon, 26 Jul 2021 11:50:06 -0700
Subject: Re: [PATCH] x86-64: Add Avoid_Short_Distance_REP_MOVSB
To: Noah Goldstein
Cc: GNU C Library
References: <20210726120055.1089971-1-hjl.tools@gmail.com>
List-Id: Libc-alpha mailing list

On Mon, Jul 26, 2021 at 10:20 AM Noah Goldstein wrote:
>
> On Mon, Jul 26, 2021 at 8:02 AM H.J. Lu via Libc-alpha wrote:
>>
>> commit 3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5
>> Author: H.J. Lu
>> Date:   Sat Jan 25 14:19:40 2020 -0800
>>
>>     x86-64: Avoid rep movsb with short distance [BZ #27130]
>>
>> introduced some regressions on Intel processors without Fast Short REP
>> MOV (FSRM).  Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
>> short distance only on Intel processors with FSRM.
>> bench-memmove-large on Skylake server shows that cycles of
>> __memmove_evex_unaligned_erms are improved for the following data size:
>>
>>                                       before     after    Improvement
>> length=4127,  align1=3, align2=0:     479.38    343.00    28%
>> length=4223,  align1=9, align2=5:     405.62    335.50    17%
>> length=8223,  align1=3, align2=0:     786.12    495.00    37%
>> length=8319,  align1=9, align2=5:     256.69    170.38    33%
>> length=16415, align1=3, align2=0:    1436.88    839.50    41%
>> length=16511, align1=9, align2=5:    1375.50    840.62    39%
>> length=32799, align1=3, align2=0:    2890.00   1850.62    36%
>> length=32895, align1=9, align2=5:    2891.38   1948.62    32%
>>
>> There are no regression on Ice Lake server.
>
> On Tigerlake I see some strange results for the random tests:
>
> "ifuncs": ["__memcpy_avx_unaligned", "__memcpy_avx_unaligned_erms",
>            "__memcpy_evex_unaligned", "__memcpy_evex_unaligned_erms",
>            "__memcpy_ssse3_back", "__memcpy_ssse3",
>            "__memcpy_avx512_no_vzeroupper", "__memcpy_avx512_unaligned",
>            "__memcpy_avx512_unaligned_erms", "__memcpy_sse2_unaligned",
>            "__memcpy_sse2_unaligned_erms", "__memcpy_erms"],
>
> Without the Patch
> "length": 4096,
> "timings": [117793, 118814, 95009.2, 140061, 209016, 162007, 112210,
>             113011, 139953, 106604, 106483, 116845]
>
> With the patch
> "length": 4096,
> "timings": [136386, 95256.7, 134947, 102466, 182687, 163942, 110546,
>             127766, 98344.5, 107647, 109190, 118613]
>
> It seems like some of the erms versions are heavily pessimized while the
> non-erms versions are significantly benefited. I think it has to do with
> the change in alignment of L(less_vec) though I am not certain.

I also saw it on Tiger Lake.  Please try this patch on top of my patch.

> Are you seeing the same performance changes on Skylake/Icelake server?

I will check it out.
>>
>> ---
>>  sysdeps/x86/cacheinfo.h                                    | 7 +++++++
>>  sysdeps/x86/cpu-features.c                                 | 5 +++++
>>  .../x86/include/cpu-features-preferred_feature_index_1.def | 1 +
>>  sysdeps/x86/sysdep.h                                       | 3 +++
>>  sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S      | 5 +++++
>>  5 files changed, 21 insertions(+)
>>
>> diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
>> index eba8dbc4a6..174ea38f5b 100644
>> --- a/sysdeps/x86/cacheinfo.h
>> +++ b/sysdeps/x86/cacheinfo.h
>> @@ -49,6 +49,9 @@ long int __x86_rep_stosb_threshold attribute_hidden = 2048;
>>  /* Threshold to stop using Enhanced REP MOVSB.  */
>>  long int __x86_rep_movsb_stop_threshold attribute_hidden;
>>
>> +/* String/memory function control.  */
>> +int __x86_string_control attribute_hidden;
>> +
>>  static void
>>  init_cacheinfo (void)
>>  {
>> @@ -71,5 +74,9 @@ init_cacheinfo (void)
>>    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
>>    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
>>    __x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
>> +
>> +  if (CPU_FEATURES_ARCH_P (cpu_features, Avoid_Short_Distance_REP_MOVSB))
>> +    __x86_string_control
>> +      |= X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB;
>>  }
>>  #endif
>> diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
>> index 706a172ba9..645bba6314 100644
>> --- a/sysdeps/x86/cpu-features.c
>> +++ b/sysdeps/x86/cpu-features.c
>> @@ -555,6 +555,11 @@ init_cpu_features (struct cpu_features *cpu_features)
>>             cpu_features->preferred[index_arch_Prefer_AVX2_STRCMP]
>>               |= bit_arch_Prefer_AVX2_STRCMP;
>>         }
>> +
>> +      /* Avoid avoid short distance REP MOVSB on processor with FSRM.  */
>> +      if (CPU_FEATURES_CPU_P (cpu_features, FSRM))
>> +        cpu_features->preferred[index_arch_Avoid_Short_Distance_REP_MOVSB]
>> +          |= bit_arch_Avoid_Short_Distance_REP_MOVSB;
>>      }
>>    /* This spells out "AuthenticAMD" or "HygonGenuine".  */
>>    else if ((ebx == 0x68747541 && ecx == 0x444d4163 && edx == 0x69746e65)
>> diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
>> index 133aab19f1..d7c93f00c5 100644
>> --- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
>> +++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
>> @@ -33,3 +33,4 @@ BIT (Prefer_No_AVX512)
>>  BIT (MathVec_Prefer_No_AVX512)
>>  BIT (Prefer_FSRM)
>>  BIT (Prefer_AVX2_STRCMP)
>> +BIT (Avoid_Short_Distance_REP_MOVSB)
>> diff --git a/sysdeps/x86/sysdep.h b/sysdeps/x86/sysdep.h
>> index 51c069bfe1..35cb90d507 100644
>> --- a/sysdeps/x86/sysdep.h
>> +++ b/sysdeps/x86/sysdep.h
>> @@ -57,6 +57,9 @@ enum cf_protection_level
>>  #define STATE_SAVE_MASK \
>>    ((1 << 1) | (1 << 2) | (1 << 3) | (1 << 5) | (1 << 6) | (1 << 7))
>>
>> +/* Avoid short distance REP MOVSB.  */
>> +#define X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB (1 << 0)
>> +
>>  #ifdef __ASSEMBLER__
>>
>>  /* Syntactic details of assembler.  */
>> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
>> index a783da5de2..9f02624375 100644
>> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
>> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
>> @@ -325,12 +325,16 @@ L(movsb):
>>  	/* Avoid slow backward REP MOVSB.  */
>>  	jb	L(more_8x_vec_backward)
>>  # if AVOID_SHORT_DISTANCE_REP_MOVSB
>> +	andl	$X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
>> +	jz	3f
>>  	movq	%rdi, %rcx
>>  	subq	%rsi, %rcx
>>  	jmp	2f
>>  # endif
>>  1:
>>  # if AVOID_SHORT_DISTANCE_REP_MOVSB
>> +	andl	$X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
>> +	jz	3f
>>  	movq	%rsi, %rcx
>>  	subq	%rdi, %rcx
>>  2:
>> @@ -338,6 +342,7 @@ L(movsb):
>>  	   is N*4GB + [1..63] with N >= 0.  */
>>  	cmpl	$63, %ecx
>>  	jbe	L(more_2x_vec)	/* Avoid "rep movsb" if ECX <= 63.  */
>> +3:
>>  # endif
>>  	mov	%RDX_LP, %RCX_LP
>>  	rep movsb
>> --
>> 2.31.1
>>

-- 
H.J.

Attachment: p.diff (text/x-patch)

diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 9f02624375..5e882b0cde 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -330,6 +330,12 @@ L(movsb):
 	movq	%rdi, %rcx
 	subq	%rsi, %rcx
 	jmp	2f
+	/* data16 cs nopw 0x0(%rax,%rax,1) */
+	.byte 0x66, 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
+	/* data16 cs nopw 0x0(%rax,%rax,1) */
+	.byte 0x66, 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
+	/* nopw 0x0(%rax,%rax,1) */
+	.byte 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
 # endif
 1:
 # if AVOID_SHORT_DISTANCE_REP_MOVSB
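
For readers who find the assembly hard to follow, the decision the patched L(movsb) path makes can be sketched in C. This is only an illustration under stated assumptions: `skip_rep_movsb` is a hypothetical helper that does not exist in glibc, and it takes the already computed 64-bit distance between the two buffers rather than choosing which pointer is subtracted from which. Only the macro name and its value come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same name and value as the define the patch adds to sysdeps/x86/sysdep.h.  */
#define X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB (1 << 0)

/* Hypothetical helper, not part of glibc.  Returns true when the copy
   should skip "rep movsb" and take the vector path instead.
   STRING_CONTROL stands in for __x86_string_control; DISTANCE is the
   64-bit difference between the two buffer addresses.  */
static bool
skip_rep_movsb (int string_control, uint64_t distance)
{
  /* Control bit clear: corresponds to "jz 3f", go straight to rep movsb.  */
  if ((string_control
       & X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB) == 0)
    return false;

  /* Only the low 32 bits are compared ("cmpl $63, %ecx"), so a distance
     of N*4GB plus a small remainder is also treated as short.  */
  return (uint32_t) distance <= 63;
}
```

With the bit set, a distance of 32 bytes or of 4GB + 10 bytes skips rep movsb, while a distance of 64 bytes does not; with the bit clear, rep movsb is always used on this path.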