From mboxrd@z Thu Jan 1 00:00:00 1970
From: "H.J. Lu"
Date: Sat, 6 Nov 2021 12:10:35 -0700
Subject: Re: [PATCH v4 5/5] x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h
To: Noah Goldstein
Cc: GNU C Library
List-Id: Libc-alpha mailing list
References: <20211101054952.2349590-1-goldstein.w.n@gmail.com> <20211106183322.3129442-1-goldstein.w.n@gmail.com> <20211106183322.3129442-5-goldstein.w.n@gmail.com>
In-Reply-To: <20211106183322.3129442-5-goldstein.w.n@gmail.com>

On Sat, Nov 6, 2021 at 11:36 AM Noah Goldstein via Libc-alpha wrote:
>
> No bug.
>
> This patch doubles the rep_movsb_threshold when using ERMS.  Based on
> benchmarks the vector copy loop, especially now that it handles 4k
> aliasing, is better for these medium-range sizes.
>
> On Skylake with ERMS:
>
> Size, Align1, Align2, dst>src, (rep movsb) / (vec copy)
> 4096,      0,      0,       0, 0.975
> 4096,      0,      0,       1, 0.953
> 4096,     12,      0,       0, 0.969
> 4096,     12,      0,       1, 0.872
> 4096,     44,      0,       0, 0.979
> 4096,     44,      0,       1, 0.83
> 4096,      0,     12,       0, 1.006
> 4096,      0,     12,       1, 0.989
> 4096,      0,     44,       0, 0.739
> 4096,      0,     44,       1, 0.942
> 4096,     12,     12,       0, 1.009
> 4096,     12,     12,       1, 0.973
> 4096,     44,     44,       0, 0.791
> 4096,     44,     44,       1, 0.961
> 4096,   2048,      0,       0, 0.978
> 4096,   2048,      0,       1, 0.951
> 4096,   2060,      0,       0, 0.986
> 4096,   2060,      0,       1, 0.963
> 4096,   2048,     12,       0, 0.971
> 4096,   2048,     12,       1, 0.941
> 4096,   2060,     12,       0, 0.977
> 4096,   2060,     12,       1, 0.949
> 8192,      0,      0,       0, 0.85
> 8192,      0,      0,       1, 0.845
> 8192,     13,      0,       0, 0.937
> 8192,     13,      0,       1, 0.939
> 8192,     45,      0,       0, 0.932
> 8192,     45,      0,       1, 0.927
> 8192,      0,     13,       0, 0.621
> 8192,      0,     13,       1, 0.62
> 8192,      0,     45,       0, 0.53
> 8192,      0,     45,       1, 0.516
> 8192,     13,     13,       0, 0.664
> 8192,     13,     13,       1, 0.659
> 8192,     45,     45,       0, 0.593
> 8192,     45,     45,       1, 0.575
> 8192,   2048,      0,       0, 0.854
> 8192,   2048,      0,       1, 0.834
> 8192,   2061,      0,       0, 0.863
> 8192,   2061,      0,       1, 0.857
> 8192,   2048,     13,       0, 0.63
> 8192,   2048,     13,       1, 0.629
> 8192,   2061,     13,       0, 0.627
> 8192,   2061,     13,       1, 0.62
> ---
>  sysdeps/x86/dl-cacheinfo.h   |  8 +++++---
>  sysdeps/x86/dl-tunables.list | 26 +++++++++++++++-----------
>  2 files changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index e6c94dfd02..2e43e67e4f 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -866,12 +866,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
>    unsigned int minimum_rep_movsb_threshold;
>  #endif
> -  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  */
> +  /* NB: The default REP MOVSB threshold is 4096 * (VEC_SIZE / 16) for
> +     VEC_SIZE == 64 or 32.  For VEC_SIZE == 16, the default REP MOVSB
> +     threshold is 2048 * (VEC_SIZE / 16).  */
>    unsigned int rep_movsb_threshold;
>    if (CPU_FEATURE_USABLE_P (cpu_features, AVX512F)
>        && !CPU_FEATURE_PREFERRED_P (cpu_features, Prefer_No_AVX512))
>      {
> -      rep_movsb_threshold = 2048 * (64 / 16);
> +      rep_movsb_threshold = 4096 * (64 / 16);
>  #if HAVE_TUNABLES
>        minimum_rep_movsb_threshold = 64 * 8;
>  #endif
> @@ -879,7 +881,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    else if (CPU_FEATURE_PREFERRED_P (cpu_features,
>                                      AVX_Fast_Unaligned_Load))
>      {
> -      rep_movsb_threshold = 2048 * (32 / 16);
> +      rep_movsb_threshold = 4096 * (32 / 16);
>  #if HAVE_TUNABLES
>        minimum_rep_movsb_threshold = 32 * 8;
>  #endif
> diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
> index dd6e1d65c9..419313804d 100644
> --- a/sysdeps/x86/dl-tunables.list
> +++ b/sysdeps/x86/dl-tunables.list
> @@ -32,17 +32,21 @@ glibc {
>    }
>    x86_rep_movsb_threshold {
>      type: SIZE_T
> -    # Since there is overhead to set up REP MOVSB operation, REP MOVSB
> -    # isn't faster on short data.  The memcpy micro benchmark in glibc
> -    # shows that 2KB is the approximate value above which REP MOVSB
> -    # becomes faster than SSE2 optimization on processors with Enhanced
> -    # REP MOVSB.  Since larger register size can move more data with a
> -    # single load and store, the threshold is higher with larger register
> -    # size.  Note: Since the REP MOVSB threshold must be greater than 8
> -    # times of vector size and the default value is 2048 * (vector size
> -    # / 16), the default value and the minimum value must be updated at
> -    # run-time.  NB: Don't set the default value since we can't tell if
> -    # the tunable value is set by user or not [BZ #27069].
> +    # Since there is overhead to set up REP MOVSB operation, REP
> +    # MOVSB isn't faster on short data.  The memcpy micro benchmark
> +    # in glibc shows that 2KB is the approximate value above which
> +    # REP MOVSB becomes faster than SSE2 optimization on processors
> +    # with Enhanced REP MOVSB.  Since larger register size can move
> +    # more data with a single load and store, the threshold is
> +    # higher with larger register size.  Micro benchmarks show AVX
> +    # REP MOVSB becomes faster approximately at 8KB.  The AVX512
> +    # threshold is extrapolated to 16KB.  For machines with FSRM the
> +    # threshold is universally set at 2112 bytes.  Note: Since the
> +    # REP MOVSB threshold must be greater than 8 times of vector
> +    # size and the default value is 4096 * (vector size / 16), the
> +    # default value and the minimum value must be updated at
> +    # run-time.  NB: Don't set the default value since we can't tell
> +    # if the tunable value is set by user or not [BZ #27069].
>      minval: 1
>    }
>    x86_rep_stosb_threshold {
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu

Thanks.

--
H.J.