From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20240519004347.2759850-1-goldstein.w.n@gmail.com> <20240524173851.2483952-1-goldstein.w.n@gmail.com> <20240524173851.2483952-2-goldstein.w.n@gmail.com>
In-Reply-To: <20240524173851.2483952-2-goldstein.w.n@gmail.com>
From: "H.J. Lu"
Date: Wed, 29 May 2024 15:53:20 -0700
Subject: Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
To: Noah Goldstein
Cc: libc-alpha@sourceware.org, carlos@systemhalted.org
Content-Type: text/plain; charset="UTF-8"

On Fri, May 24, 2024 at 10:39 AM Noah Goldstein wrote:
>
> The tuning for non-temporal stores for memset vs memcpy is not always
> the same. This includes both the exact value and whether non-temporal
> stores are profitable at all for a given arch.
>
> This patch add `x86_memset_non_temporal_threshold`. Currently we
> disable non-temporal stores for non Intel vendors as the only
> benchmarks showing its benefit have been on Intel hardware.
> ---
>  manual/tunables.texi                             | 16 +++++++++++++++-
>  sysdeps/x86/cacheinfo.h                          |  8 +++++++-
>  sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
>  sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
>  sysdeps/x86/dl-tunables.list                     |  3 +++
>  sysdeps/x86/include/cpu-features.h               |  4 +++-
>  .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
>  7 files changed, 49 insertions(+), 6 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index baaf751721..8dd02d8149 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
>  glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
>  glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
> +glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> @@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
>  @cindex shared_cache_size tunables
>  @cindex tunables, shared_cache_size
>  @cindex non_temporal_threshold tunables
> -@cindex tunables, non_temporal_threshold
> +@cindex memset_non_temporal_threshold tunables
> +@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold
>
>  @deftp {Tunable namespace} glibc.cpu
>  Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
> @@ -574,6 +576,18 @@ like memmove and memcpy.
>  This tunable is specific to i386 and x86-64.
>  @end deftp
>
> +@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
> +The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
> +the user to set threshold in bytes for non temporal store in
> +memset. Non temporal stores give a hint to the hardware to move data
> +directly to memory without displacing other data from the cache. This
> +tunable is used by some platforms to determine when to use non
> +temporal stores memset.
> +
> +This tunable is specific to i386 and x86-64.
> +@end deftp
> +
> +
>  @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
>  The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
>  set threshold in bytes to start using "rep movsb". The value must be
> diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
> index ab73556772..83491607c7 100644
> --- a/sysdeps/x86/cacheinfo.h
> +++ b/sysdeps/x86/cacheinfo.h
> @@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
>  long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
>  long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
>
> -/* Threshold to use non temporal store. */
> +/* Threshold to use non temporal store in memmove. */
>  long int __x86_shared_non_temporal_threshold attribute_hidden;
>
> +/* Threshold to use non temporal store in memset. */
> +long int __x86_memset_non_temporal_threshold attribute_hidden;
> +
>  /* Threshold to use Enhanced REP MOVSB. */
>  long int __x86_rep_movsb_threshold attribute_hidden = 2048;
>
> @@ -77,6 +80,9 @@ init_cacheinfo (void)
>    __x86_shared_non_temporal_threshold
>      = cpu_features->non_temporal_threshold;
>
> +  __x86_memset_non_temporal_threshold
> +      = cpu_features->memset_non_temporal_threshold;
> +
>    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
>    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
>    __x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 5a98f70364..d375a7cba6 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
>      rep_movsb_threshold = 2112;
>
> +  /* Non-temporal stores in memset have only been tested on Intel hardware.
> +     Until we benchmark data on other x86 processor, disable non-temporal
> +     stores in memset. */
> +  unsigned long int memset_non_temporal_threshold = SIZE_MAX;
> +  if (cpu_features->basic.kind == arch_kind_intel)
> +      memset_non_temporal_threshold = non_temporal_threshold;
> +
>    /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
>       cases slower than the vectorized path (and for some alignments,
>       it is really slow, check BZ #30994). */
> @@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        && tunable_size <= maximum_non_temporal_threshold)
>      non_temporal_threshold = tunable_size;
>
> +  tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
> +  if (tunable_size > minimum_non_temporal_threshold
> +      && tunable_size <= maximum_non_temporal_threshold)
> +    memset_non_temporal_threshold = tunable_size;
> +
>    tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
>    if (tunable_size > minimum_rep_movsb_threshold)
>      rep_movsb_threshold = tunable_size;
> @@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
>                             minimum_non_temporal_threshold,
>                             maximum_non_temporal_threshold);
> +  TUNABLE_SET_WITH_BOUNDS (
> +      x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
> +      minimum_non_temporal_threshold, maximum_non_temporal_threshold);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                             minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> @@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->data_cache_size = data;
>    cpu_features->shared_cache_size = shared;
>    cpu_features->non_temporal_threshold = non_temporal_threshold;
> +  cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
>    cpu_features->rep_movsb_threshold = rep_movsb_threshold;
>    cpu_features->rep_stosb_threshold = rep_stosb_threshold;
>    cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
> index ceafde9481..49eeb5f70a 100644
> --- a/sysdeps/x86/dl-diagnostics-cpu.c
> +++ b/sysdeps/x86/dl-diagnostics-cpu.c
> @@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
>                              cpu_features->shared_cache_size);
>    print_cpu_features_value ("non_temporal_threshold",
>                              cpu_features->non_temporal_threshold);
> +  print_cpu_features_value ("memset_non_temporal_threshold",
> +                            cpu_features->memset_non_temporal_threshold);
>    print_cpu_features_value ("rep_movsb_threshold",
>                              cpu_features->rep_movsb_threshold);
>    print_cpu_features_value ("rep_movsb_stop_threshold",
> diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
> index 7d82da0dec..a0a1299592 100644
> --- a/sysdeps/x86/dl-tunables.list
> +++ b/sysdeps/x86/dl-tunables.list
> @@ -30,6 +30,9 @@ glibc {
>      x86_non_temporal_threshold {
>        type: SIZE_T
>      }
> +    x86_memset_non_temporal_threshold {
> +      type: SIZE_T
> +    }
>      x86_rep_movsb_threshold {
>        type: SIZE_T
>        # Since there is overhead to set up REP MOVSB operation, REP
> diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> index cd7bd27cf3..aaae44f0e1 100644
> --- a/sysdeps/x86/include/cpu-features.h
> +++ b/sysdeps/x86/include/cpu-features.h
> @@ -944,8 +944,10 @@ struct cpu_features
>    /* Shared cache size for use in memory and string routines, typically
>       L2 or L3 size. */
>    unsigned long int shared_cache_size;
> -  /* Threshold to use non temporal store. */
> +  /* Threshold to use non temporal store in memmove. */
>    unsigned long int non_temporal_threshold;
> +  /* Threshold to use non temporal store in memset. */
> +  unsigned long int memset_non_temporal_threshold;
>    /* Threshold to use "rep movsb". */
>    unsigned long int rep_movsb_threshold;
>    /* Threshold to stop using "rep movsb". */
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 637caadb40..88bf08e4f4 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -24,9 +24,9 @@
>     5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
>        4 VEC stores and store 4 * VEC at a time until done.
>     6. On machines ERMS feature, if size is range
> -         [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +         [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
>           then REP STOSB will be used.
> -   7. If size >= __x86_shared_non_temporal_threshold, use a
> +   7. If size >= __x86_memset_non_temporal_threshold, use a
>           non-temporal stores. */
>
>  #include
> @@ -318,7 +318,7 @@ L(return_vzeroupper):
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding. */
>  L(stosb_local):
> -       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       cmp     __x86_memset_non_temporal_threshold(%rip), %RDX_LP
>         jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
> --
> 2.34.1
>

LGTM.

Reviewed-by: H.J. Lu

Thanks.

-- 
H.J.
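
For readers following the thread, the threshold selection added in the
dl-cacheinfo.h hunks can be sketched as standalone C. This is only an
illustration under stated assumptions, not the glibc code: the names
cpu_vendor and pick_memset_nt_threshold are hypothetical, "0 means the
tunable was not set" is a convention invented for the sketch, and
UINT64_MAX stands in for the SIZE_MAX used in the patch. The bounds are
the min/max values shown in the tunables.texi hunk.

```c
#include <stdint.h>

/* Hypothetical stand-in for the vendor check on cpu_features->basic.kind.  */
enum cpu_vendor { VENDOR_INTEL, VENDOR_OTHER };

/* Bounds as printed in the tunables.texi hunk of the patch.  */
#define MIN_NT_THRESHOLD UINT64_C (0x4040)
#define MAX_NT_THRESHOLD UINT64_C (0xfffffffffffffff)

/* Sketch of the selection order in the patch:
   1. default to "never" (UINT64_MAX here, SIZE_MAX in the patch);
   2. on Intel, inherit the memmove non-temporal threshold;
   3. let the tunable override, using the same bounds check as the
      quoted hunk (strictly above the minimum, at most the maximum).
   Assumption: tunable_value == 0 models an unset tunable.  */
static uint64_t
pick_memset_nt_threshold (enum cpu_vendor vendor,
                          uint64_t non_temporal_threshold,
                          uint64_t tunable_value)
{
  uint64_t threshold = UINT64_MAX;

  if (vendor == VENDOR_INTEL)
    threshold = non_temporal_threshold;

  if (tunable_value > MIN_NT_THRESHOLD
      && tunable_value <= MAX_NT_THRESHOLD)
    threshold = tunable_value;

  return threshold;
}
```

With this ordering an out-of-range tunable value is ignored rather than
clamped, and a non-Intel system can still opt in by setting the tunable
explicitly, for example through the GLIBC_TUNABLES environment variable:
GLIBC_TUNABLES=glibc.cpu.x86_memset_non_temporal_threshold=0x100000 ./app
(where ./app is a placeholder program name).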