Subject: Re: [PATCH 2/2] x86: Add thresholds for "rep movsb/stosb" to tunables
To: "H.J. Lu", libc-alpha@sourceware.org
References: <20200703175220.1178840-1-hjl.tools@gmail.com> <20200703175220.1178840-3-hjl.tools@gmail.com>
From: Carlos O'Donell
Organization: Red Hat
Message-ID: <8cef5b4a-cdda-6eaa-a859-5a410560a4ce@redhat.com>
Date: Fri, 3 Jul 2020 15:49:21 -0400
In-Reply-To: <20200703175220.1178840-3-hjl.tools@gmail.com>

On 7/3/20 1:52 PM, H.J. Lu wrote:
> Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
> to update thresholds for "rep movsb" and "rep stosb" at run-time.
>
> Note that the user specified threshold for "rep movsb" smaller than the
> minimum threshold will be ignored.

Post v2 please. Almost there.
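For v2 it may also be worth showing in the commit message how to exercise
the new knobs. With this patch applied, something along these lines should
move both thresholds at run-time (hypothetical values and test program;
GLIBC_TUNABLES takes a colon-separated list of name=value pairs, and the
tunable names come from the dl-tunables.list hunk below):

  GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=4096:glibc.cpu.x86_rep_stosb_threshold=4096 \
    ./your-test-program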
> ---
>  manual/tunables.texi                          | 14 +++++++
>  sysdeps/x86/cacheinfo.c                       | 20 ++++++++++
>  sysdeps/x86/cpu-features.h                    |  4 ++
>  sysdeps/x86/dl-cacheinfo.c                    | 38 +++++++++++++++++++
>  sysdeps/x86/dl-tunables.list                  |  6 +++
>  .../multiarch/memmove-vec-unaligned-erms.S    | 16 +-------
>  .../multiarch/memset-vec-unaligned-erms.S     | 12 +-----
>  7 files changed, 84 insertions(+), 26 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index ec18b10834..61edd62425 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -396,6 +396,20 @@ to set threshold in bytes for non temporal store.
>  This tunable is specific to i386 and x86-64.
>  @end deftp
>
> +@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
> +The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
> +to set threshold in bytes to start using "rep movsb".
> +
> +This tunable is specific to i386 and x86-64.
> +@end deftp
> +
> +@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
> +The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
> +to set threshold in bytes to start using "rep stosb".
> +
> +This tunable is specific to i386 and x86-64.
> +@end deftp
> +
>  @deftp Tunable glibc.cpu.x86_ibt
>  The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
>  indirect branch tracking (IBT) should be enabled.  Accepted values are
> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> index 8c4c7f9972..bb536d96ef 100644
> --- a/sysdeps/x86/cacheinfo.c
> +++ b/sysdeps/x86/cacheinfo.c
> @@ -41,6 +41,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
>  /* Threshold to use non temporal store.  */
>  long int __x86_shared_non_temporal_threshold attribute_hidden;
>
> +/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
> +   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
> +   memcpy micro benchmark in glibc shows that 2KB is the approximate
> +   value above which REP MOVSB becomes faster than SSE2 optimization
> +   on processors with Enhanced REP MOVSB.  Since larger register size
> +   can move more data with a single load and store, the threshold is
> +   higher with larger register size.  */
> +long int __x86_rep_movsb_threshold attribute_hidden = 2048;
> +
> +/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
> +   up REP STOSB operation, REP STOSB isn't faster on short data.  The
> +   memset micro benchmark in glibc shows that 2KB is the approximate
> +   value above which REP STOSB becomes faster on processors with
> +   Enhanced REP STOSB.  Since the stored value is fixed, larger register
> +   size has minimal impact on threshold.  */
> +long int __x86_rep_stosb_threshold attribute_hidden = 2048;
> +
>  #ifndef __x86_64__
>  /* PREFETCHW support flag for use in memory and string routines.  */
>  int __x86_prefetchw attribute_hidden;
> @@ -117,6 +134,9 @@ init_cacheinfo (void)
>    __x86_shared_non_temporal_threshold
>      = cpu_features->non_temporal_threshold;
>
> +  __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
> +  __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
> +

OK. Update the globals from the cpu_features values.

I would really like to see some kind of "assert (cpu_features->initialized);"
so that we know we didn't break the startup sequence unintentionally.
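Something like this minimal sketch is what I have in mind. It is
hand-written, and it assumes we reuse an existing "already set up"
signal such as basic.kind (which only leaves arch_kind_unknown once
init_cpu_features has run) rather than adding a new "initialized" flag:

  #include <assert.h>

  /* At the top of init_cacheinfo, before reading the thresholds.  */
  const struct cpu_features *cpu_features = __get_cpu_features ();

  /* Catch any reordering of the startup sequence: the thresholds are
     only meaningful after init_cpu_features has filled in
     cpu_features.  */
  assert (cpu_features->basic.kind != arch_kind_unknown);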
>  #ifndef __x86_64__
>    __x86_prefetchw = cpu_features->prefetchw;
>  #endif
> diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
> index 3aaed33cbc..002e12e11f 100644
> --- a/sysdeps/x86/cpu-features.h
> +++ b/sysdeps/x86/cpu-features.h
> @@ -128,6 +128,10 @@ struct cpu_features
>    /* PREFETCHW support flag for use in memory and string routines.  */
>    unsigned long int prefetchw;
>  #endif
> +  /* Threshold to use "rep movsb".  */
> +  unsigned long int rep_movsb_threshold;
> +  /* Threshold to use "rep stosb".  */
> +  unsigned long int rep_stosb_threshold;

OK.

>  };
>
>  /* Used from outside of glibc to get access to the CPU features
> diff --git a/sysdeps/x86/dl-cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
> index 8e2a6f552c..aff9bd1067 100644
> --- a/sysdeps/x86/dl-cacheinfo.c
> +++ b/sysdeps/x86/dl-cacheinfo.c
> @@ -860,6 +860,31 @@ __init_cacheinfo (void)
>       total shared cache size.  */
>    unsigned long int non_temporal_threshold = (shared * threads * 3 / 4);
>
> +  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
> +  unsigned long int minimum_rep_movsb_threshold;
> +  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  See
> +     comments for __x86_rep_movsb_threshold in cacheinfo.c.  */
> +  unsigned long int rep_movsb_threshold;
> +  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
> +      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
> +    {
> +      rep_movsb_threshold = 2048 * (64 / 16);
> +      minimum_rep_movsb_threshold = 64 * 8;
> +    }
> +  else if (CPU_FEATURES_ARCH_P (cpu_features,
> +                                AVX_Fast_Unaligned_Load))
> +    {
> +      rep_movsb_threshold = 2048 * (32 / 16);
> +      minimum_rep_movsb_threshold = 32 * 8;
> +    }
> +  else
> +    {
> +      rep_movsb_threshold = 2048 * (16 / 16);
> +      minimum_rep_movsb_threshold = 16 * 8;
> +    }
> +  /* NB: See comments for __x86_rep_stosb_threshold in cacheinfo.c.  */
> +  unsigned long int rep_stosb_threshold = 2048;
> +
>  #if HAVE_TUNABLES
>    long int tunable_size;
>    tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
> @@ -871,11 +896,19 @@
>    tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
>    if (tunable_size != 0)
>      non_temporal_threshold = tunable_size;
> +  tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
> +  if (tunable_size > minimum_rep_movsb_threshold)
> +    rep_movsb_threshold = tunable_size;

OK. Good, we only set rep_movsb_threshold if it's greater than the minimum.

> +  tunable_size = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
> +  if (tunable_size != 0)
> +    rep_stosb_threshold = tunable_size;

This should be min=1, default=2048 in dl-tunables.list, which would remove
this code since the range is not dynamic. The point of the tunables
framework is to remove such boilerplate for range and default processing,
and for clearing parameters for security settings.

>  #endif
>
>    cpu_features->data_cache_size = data;
>    cpu_features->shared_cache_size = shared;
>    cpu_features->non_temporal_threshold = non_temporal_threshold;
> +  cpu_features->rep_movsb_threshold = rep_movsb_threshold;
> +  cpu_features->rep_stosb_threshold = rep_stosb_threshold;
>
>  #if HAVE_TUNABLES
>    TUNABLE_UPDATE (x86_data_cache_size, long int,
> @@ -884,5 +917,10 @@
>                    shared, 0, (long int) -1);
>    TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
>                    non_temporal_threshold, 0, (long int) -1);
> +  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
> +                  rep_movsb_threshold, minimum_rep_movsb_threshold,
> +                  (long int) -1);

OK. Store the new value and the computed minimum.
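For the record, my mental model of the call above is roughly the
following hand-written sketch (the semantics I expect, not the actual
tunables implementation; names here are made up for illustration):

  /* Model of TUNABLE_UPDATE (tunable, type, value, min, max): store
     the CPU-derived value and narrow the recorded range, so the
     minimum computed at startup keeps constraining later lookups.  */
  struct model_tunable
  {
    long int val;
    long int min;
    long int max;
  };

  static void
  model_tunable_update (struct model_tunable *t, long int val,
                        long int min, long int max)
  {
    t->min = min;
    t->max = max;
    /* Only accept the new value if it lies within the new range.  */
    if (val >= min && val <= max)
      t->val = val;
  }

If TUNABLE_UPDATE does anything materially different from that, the v2
commit message should spell it out.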
> +  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
> +                  rep_stosb_threshold, 0, (long int) -1);

This one can be deleted.

>  #endif
>  }
> diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
> index 251b926ce4..43bf6c2389 100644
> --- a/sysdeps/x86/dl-tunables.list
> +++ b/sysdeps/x86/dl-tunables.list
> @@ -30,6 +30,12 @@ glibc {
>    x86_non_temporal_threshold {
>      type: SIZE_T
>    }
> +  x86_rep_movsb_threshold {
> +    type: SIZE_T
> +  }
> +  x86_rep_stosb_threshold {
> +    type: SIZE_T min: 1 default: 2048
> +  }
>    x86_data_cache_size {
>      type: SIZE_T
>    }
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index 74953245aa..bd5dc1a3f3 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -56,17 +56,6 @@
>  # endif
>  #endif
>
> -/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
> -   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
> -   memcpy micro benchmark in glibc shows that 2KB is the approximate
> -   value above which REP MOVSB becomes faster than SSE2 optimization
> -   on processors with Enhanced REP MOVSB.  Since larger register size
> -   can move more data with a single load and store, the threshold is
> -   higher with larger register size.  */
> -#ifndef REP_MOVSB_THRESHOLD
> -# define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
> -#endif

OK.

> -
>  #ifndef PREFETCH
>  # define PREFETCH(addr) prefetcht0 addr
>  #endif
> @@ -253,9 +242,6 @@ L(movsb):
>          leaq    (%rsi,%rdx), %r9
>          cmpq    %r9, %rdi
>  /* Avoid slow backward REP MOVSB.  */
> -# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
> -#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
> -# endif

OK.

>          jb      L(more_8x_vec_backward)
>  1:
>          mov     %RDX_LP, %RCX_LP
> @@ -331,7 +317,7 @@ L(between_2_3):
>
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  L(movsb_more_2x_vec):
> -        cmpq    $REP_MOVSB_THRESHOLD, %rdx
> +        cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP

OK.

>          ja      L(movsb)
>  #endif
>  L(more_2x_vec):
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index af2299709c..2bfc95de05 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -58,16 +58,6 @@
>  # endif
>  #endif
>
> -/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
> -   up REP STOSB operation, REP STOSB isn't faster on short data.  The
> -   memset micro benchmark in glibc shows that 2KB is the approximate
> -   value above which REP STOSB becomes faster on processors with
> -   Enhanced REP STOSB.  Since the stored value is fixed, larger register
> -   size has minimal impact on threshold.  */
> -#ifndef REP_STOSB_THRESHOLD
> -# define REP_STOSB_THRESHOLD 2048
> -#endif
> -
>  #ifndef SECTION
>  # error SECTION is not defined!
>  #endif
> @@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
>          ret
>
>  L(stosb_more_2x_vec):
> -        cmpq    $REP_STOSB_THRESHOLD, %rdx
> +        cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP

OK.

>          ja      L(stosb)
>  #endif
>  L(more_2x_vec):
> --

Cheers,
Carlos.