From: "H.J. Lu"
Date: Mon, 12 Feb 2024 07:56:09 -0800
Subject: Re: [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
To: Adhemerval Zanella
Cc: libc-alpha@sourceware.org, Noah Goldstein, Sajan Karumanchi, bmerry@sarao.ac.za, pmallapp@amd.com
References: <20240208130840.533348-1-adhemerval.zanella@linaro.org> <20240208130840.533348-2-adhemerval.zanella@linaro.org>
In-Reply-To: <20240208130840.533348-2-adhemerval.zanella@linaro.org>

On Thu, Feb 8, 2024 at 5:08 AM Adhemerval Zanella wrote:
>
> The REP MOVSB usage on memcpy/memmove does not show much performance
> improvement on Zen3/Zen4 cores compared to the vectorized loops. Also,
> as from BZ 30994, if the source is aligned and the destination is not
> the performance can be 20x slower.
>
> The performance difference is noticeable with small buffer sizes, closer
> to the lower bounds limits when memcpy/memmove starts to use ERMS. The
> performance of REP MOVSB is similar to vectorized instruction on the
> size limit (the L2 cache). Also, there is no drawback to multiple cores
> sharing the cache.
>
> Checked on x86_64-linux-gnu on Zen3.
> ---
>  sysdeps/x86/dl-cacheinfo.h | 38 ++++++++++++++++++--------------------
>  1 file changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index d5101615e3..f34d12846c 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    long int data = -1;
>    long int shared = -1;
>    long int shared_per_thread = -1;
> -  long int core = -1;
>    unsigned int threads = 0;
>    unsigned long int level1_icache_size = -1;
>    unsigned long int level1_icache_linesize = -1;
> @@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    if (cpu_features->basic.kind == arch_kind_intel)
>      {
>        data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
> -      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
>        shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
>        shared_per_thread = shared;
>
> @@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>          = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
>        level1_dcache_linesize
>          = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
> -      level2_cache_size = core;
> +      level2_cache_size
> +        = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
>        level2_cache_assoc
>          = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
>        level2_cache_linesize
> @@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        level4_cache_size
>          = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
>
> -      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> +      get_common_cache_info (&shared, &shared_per_thread, &threads,
> +                             level2_cache_size);
>      }
>    else if (cpu_features->basic.kind == arch_kind_zhaoxin)
>      {
>        data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
> -      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
>        shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
>        shared_per_thread = shared;
>
> @@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        level1_dcache_size = data;
>        level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
>        level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
> -      level2_cache_size = core;
> +      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
>        level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
>        level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
>        level3_cache_size = shared;
>        level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
>        level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
>
> -      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> +      get_common_cache_info (&shared, &shared_per_thread, &threads,
> +                             level2_cache_size);
>      }
>    else if (cpu_features->basic.kind == arch_kind_amd)
>      {
>        data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
> -      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
>        shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
>
>        level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
> @@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        level1_dcache_size = data;
>        level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
>        level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
> -      level2_cache_size = core;
> +      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);;
>        level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
>        level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
>        level3_cache_size = shared;
> @@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        if (shared <= 0)
>          {
>            /* No shared L3 cache.  All we have is the L2 cache.  */
> -          shared = core;
> +          shared = level2_cache_size;
>          }
>        else if (cpu_features->basic.family < 0x17)
>          {
>            /* Account for exclusive L2 and L3 caches.  */
> -          shared += core;
> +          shared += level2_cache_size;
>          }
>
>        shared_per_thread = shared;
> @@ -987,6 +986,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
>      rep_movsb_threshold = 2112;
>
> +  /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
> +     cases slower than the vectorized path (and for some alignments,
> +     it is really slow, check BZ #30994).  */
> +  if (cpu_features->basic.kind == arch_kind_amd)
> +    rep_movsb_threshold = non_temporal_threshold;
> +
>    /* The default threshold to use Enhanced REP STOSB.  */
>    unsigned long int rep_stosb_threshold = 2048;
>
> @@ -1028,16 +1033,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>                             SIZE_MAX);
>
>    unsigned long int rep_movsb_stop_threshold;
> -  /* ERMS feature is implemented from AMD Zen3 architecture and it is
> -     performing poorly for data above L2 cache size. Henceforth, adding
> -     an upper bound threshold parameter to limit the usage of Enhanced
> -     REP MOVSB operations and setting its value to L2 cache size.  */
> -  if (cpu_features->basic.kind == arch_kind_amd)
> -    rep_movsb_stop_threshold = core;
>    /* Setting the upper bound of ERMS to the computed value of
> -     non-temporal threshold for architectures other than AMD.  */
> -  else
> -    rep_movsb_stop_threshold = non_temporal_threshold;
> +     non-temporal threshold for all architectures.  */
> +  rep_movsb_stop_threshold = non_temporal_threshold;
>
>    cpu_features->data_cache_size = data;
>    cpu_features->shared_cache_size = shared;
> --
> 2.34.1
>

LGTM.

Reviewed-by: H.J. Lu

Thanks.

--
H.J.
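
For readers skimming the thread, the net effect of the threshold changes
in the quoted patch can be sketched in standalone C. This is only an
illustration, not glibc code: the names cpu_kind, thresholds,
select_thresholds and uses_rep_movsb are hypothetical, the default lower
bound is passed in as a parameter because the hunk does not show how it
is computed, and the half-open range check is an assumed simplification
of how memcpy/memmove consult the two thresholds. Only the value 2112
for FSRM and the assignments from non_temporal_threshold come from the
patch itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for cpu_features->basic.kind.  */
enum cpu_kind { KIND_INTEL, KIND_ZHAOXIN, KIND_AMD };

/* Hypothetical container for the two tunables touched by the patch.  */
struct thresholds
{
  unsigned long rep_movsb_threshold;      /* Lower bound for REP MOVSB.  */
  unsigned long rep_movsb_stop_threshold; /* Upper bound for REP MOVSB.  */
};

/* Mirror the order of assignments in the patched dl_init_cacheinfo.
   DEFAULT_THRESHOLD is an assumption: the real default is computed
   elsewhere and is not part of the quoted hunk.  */
static struct thresholds
select_thresholds (enum cpu_kind kind, bool has_fsrm,
                   unsigned long default_threshold,
                   unsigned long non_temporal_threshold)
{
  struct thresholds t;

  t.rep_movsb_threshold = default_threshold;
  if (has_fsrm)
    t.rep_movsb_threshold = 2112;

  /* New in the patch: on AMD the lower bound is raised to the
     non-temporal threshold.  */
  if (kind == KIND_AMD)
    t.rep_movsb_threshold = non_temporal_threshold;

  /* New in the patch: the upper bound is the same for all vendors.  */
  t.rep_movsb_stop_threshold = non_temporal_threshold;
  return t;
}

/* Assumed simplification: REP MOVSB is considered only for sizes in
   [rep_movsb_threshold, rep_movsb_stop_threshold).  */
static bool
uses_rep_movsb (const struct thresholds *t, unsigned long size)
{
  return size >= t->rep_movsb_threshold
         && size < t->rep_movsb_stop_threshold;
}
```

Under these assumptions the AMD range is empty, because both bounds are
equal, so the vectorized path is always chosen there, while other
vendors keep a non-empty REP MOVSB window below the non-temporal
threshold.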