From: "H.J. Lu"
Date: Fri, 30 Apr 2021 16:50:56 -0700
Subject: Re: [PATCH] x86: Set rep_movsb_threshold to 2112 on processors with FSRM
To: Noah Goldstein
Cc: GNU C Library
References: <20210430182442.3612464-1-hjl.tools@gmail.com>

On Fri, Apr 30, 2021 at 3:59 PM Noah Goldstein wrote:
>
> On Fri, Apr 30, 2021 at 4:43 PM H.J. Lu via Libc-alpha
> wrote:
> >
> > The glibc memcpy benchmark on Intel Core i7-1065G7 (Ice Lake) showed
> > that REP MOVSB became faster after 2112 bytes:
> >
> >                                    Vector + REP MOVSB    REP MOVSB
> > length=2112, align1=0, align2=0:         24.20             24.40
> > length=2112, align1=1, align2=0:         26.07             23.13
> > length=2112, align1=0, align2=1:         27.18             28.13
> > length=2112, align1=1, align2=1:         26.23             25.16
> > length=2176, align1=0, align2=0:         23.18             22.52
> > length=2176, align1=2, align2=0:         25.45             22.52
> > length=2176, align1=0, align2=2:         27.14             27.82
> > length=2176, align1=2, align2=2:         22.73             25.56
> > length=2240, align1=0, align2=0:         24.62             24.25
> > length=2240, align1=3, align2=0:         29.77             27.15
> > length=2240, align1=0, align2=3:         35.55             29.93
> > length=2240, align1=3, align2=3:         34.49             25.15
> > length=2304, align1=0, align2=0:         34.75             26.64
> > length=2304, align1=4, align2=0:         32.09             22.63
> > length=2304, align1=0, align2=4:         28.43             31.24
> >
>
> Do you know what is happening at:
> length=2304, align1=0, align2=4:         28.43             31.24
>
> Seems align2 > align1 gets worse for larger sizes as well.
>
> I.e from my Icelake (1 run):
>                                     w/o Patch    w/ Patch
> length=3648, align1=0, align2=25:     72.91        83.99
> length=3712, align1=0, align2=26:     72.52        84.92
> length=3776, align1=0, align2=27:     76.16        84.02
> length=3840, align1=0, align2=28:     75.44        90.13
> length=3904, align1=0, align2=29:     81.62        84.43
> length=3968, align1=0, align2=30:     76.82        93.39
> length=4032, align1=0, align2=31:     80.01        89.89
> length=4096, align1=0, align2=32:     72.89        97.50

On Intel Core i7-1165G7 (Tiger Lake), I got

                                     Vector    REP MOVSB
length=2112, align1=0, align2=1:      45.21      43.69
length=2176, align1=0, align2=2:      45.20      45.82
length=2240, align1=0, align2=3:      53.46      51.41
length=2304, align1=0, align2=4:      47.65      48.28
length=2368, align1=0, align2=5:      50.10      48.27
length=2432, align1=0, align2=6:      50.10      50.70
length=2496, align1=0, align2=7:      52.54      50.71
length=2560, align1=0, align2=8:      52.83      53.15
length=2624, align1=0, align2=9:      64.98      53.15
length=2688, align1=0, align2=10:     54.98      55.59
length=2752, align1=0, align2=11:     57.43      55.59
length=2816, align1=0, align2=12:     57.87      58.03
length=2880, align1=0, align2=13:     61.97      58.03
length=2944, align1=0, align2=14:     71.97      60.79
length=3008, align1=0, align2=15:     65.02      60.47
length=3072, align1=0, align2=16:     65.39      63.23
length=3136, align1=0, align2=17:     72.54      62.92
length=3200, align1=0, align2=18:     69.91      65.39
length=3264, align1=0, align2=19:     73.38      65.39
length=3328, align1=0, align2=20:     73.50      67.81
length=3392, align1=0, align2=21:     76.81      67.81
length=3456, align1=0, align2=22:     76.88      70.79
length=3520, align1=0, align2=23:     80.30      71.07
length=3584, align1=0, align2=24:     80.25      72.69
length=3648, align1=0, align2=25:     84.58      73.01
length=3712, align1=0, align2=26:     85.02      75.13
length=3776, align1=0, align2=27:     87.51      75.14
length=3840, align1=0, align2=28:     87.20      77.67
length=3904, align1=0, align2=29:     89.82      77.99
length=3968, align1=0, align2=30:    102.59      80.45
length=4032, align1=0, align2=31:     92.38      80.60
length=4096, align1=0, align2=32:     85.69      84.69

REP MOVSB is a little bit faster.

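To put a rough number on "a little bit faster", here is a minimal sketch (not part of the patch) that compares the two columns for a handful of rows copied from the Tiger Lake table above. The struct, array, and function names are made up for illustration, and the row selection is arbitrary, so treat the result as indicative only.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark row: copy length and the two timings from the table.
   Hypothetical type, introduced only for this sketch.  */
struct bench_row
{
  size_t length;
  double vector_time;     /* "Vector" column */
  double rep_movsb_time;  /* "REP MOVSB" column */
};

/* A few rows copied from the Tiger Lake table (align1=0 in all of them).  */
static const struct bench_row tiger_lake_rows[] =
{
  { 2112,  45.21, 43.69 },
  { 2176,  45.20, 45.82 },
  { 2304,  47.65, 48.28 },
  { 2624,  64.98, 53.15 },
  { 3072,  65.39, 63.23 },
  { 3968, 102.59, 80.45 },
  { 4096,  85.69, 84.69 },
};

/* Number of entries in tiger_lake_rows.  */
static const size_t tiger_lake_nrows
  = sizeof tiger_lake_rows / sizeof tiger_lake_rows[0];

/* Return how many of the N rows have REP MOVSB strictly faster
   (i.e. a smaller time) than the vector loop.  */
size_t
count_rep_movsb_wins (const struct bench_row *rows, size_t n)
{
  size_t wins = 0;
  for (size_t i = 0; i < n; i++)
    if (rows[i].rep_movsb_time < rows[i].vector_time)
      wins++;
  return wins;
}

/* Return the mean of vector_time / rep_movsb_time over the N rows.
   A value above 1.0 means REP MOVSB is faster on average.  */
double
mean_speedup (const struct bench_row *rows, size_t n)
{
  assert (n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += rows[i].vector_time / rows[i].rep_movsb_time;
  return sum / (double) n;
}
```

Calling count_rep_movsb_wins (tiger_lake_rows, tiger_lake_nrows) and mean_speedup (tiger_lake_rows, tiger_lake_nrows) from a small main gives the win count and the average ratio for the sampled rows; feeding in the full table would be the obvious next step.
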
> > Use REP MOVSB for data size > 2112 bytes in memcpy on processors with
> > fast short REP MOVSB (FSRM).
> >
> >         * sysdeps/x86/dl-cacheinfo.h (dl_init_cacheinfo): Set
> >         rep_movsb_threshold to 2112 on processors with fast short REP
> >         MOVSB (FSRM).
> > ---
> >  sysdeps/x86/dl-cacheinfo.h | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index d9944250fc..3f04fb5019 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -871,7 +871,10 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >    if (CPU_FEATURE_USABLE_P (cpu_features, AVX512F)
> >        && !CPU_FEATURE_PREFERRED_P (cpu_features, Prefer_No_AVX512))
> >      {
> > -      rep_movsb_threshold = 2048 * (64 / 16);
> > +      if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
> > +        rep_movsb_threshold = 2112;
> > +      else
> > +        rep_movsb_threshold = 2048 * (64 / 16);
> >  #if HAVE_TUNABLES
> >        minimum_rep_movsb_threshold = 64 * 8;
> >  #endif
> > @@ -879,7 +882,10 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >    else if (CPU_FEATURE_PREFERRED_P (cpu_features,
> >                                      AVX_Fast_Unaligned_Load))
> >      {
> > -      rep_movsb_threshold = 2048 * (32 / 16);
> > +      if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
> > +        rep_movsb_threshold = 2112;
> > +      else
> > +        rep_movsb_threshold = 2048 * (32 / 16);
> >  #if HAVE_TUNABLES
> >        minimum_rep_movsb_threshold = 32 * 8;
> >  #endif
> > --
> > 2.31.1
> >

--
H.J.