From: "Karumanchi, Sajan" <Sajan.Karumanchi@amd.com>
To: Noah Goldstein <goldstein.w.n@gmail.com>,
Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: "libc-alpha@sourceware.org" <libc-alpha@sourceware.org>,
"H . J . Lu" <hjl.tools@gmail.com>,
Sajan Karumanchi <sajan.karumanchi@gmail.com>,
"bmerry@sarao.ac.za" <bmerry@sarao.ac.za>,
"Mallappa, Premachandra" <Premachandra.Mallappa@amd.com>,
"Karumanchi, Sajan" <Sajan.Karumanchi@amd.com>
Subject: RE: [PATCH v2 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
Date: Fri, 9 Feb 2024 10:59:17 +0000 [thread overview]
Message-ID: <SJ0PR12MB5633AC6D0242AE97C3EDE4B2894B2@SJ0PR12MB5633.namprd12.prod.outlook.com> (raw)
In-Reply-To: <CAFUsyfKtR6qyumD29fn23V84TMe0Ro83EMfqDX-qv7fB6Zsb8w@mail.gmail.com>
> On 08/02/24 3:37, Noah Goldstein wrote:
> On Wed, Feb 7, 2024 at 6:10 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> >
> >
> >
> > On 07/02/24 15:06, Adhemerval Zanella Netto wrote:
> > >
> > >
> > > On 07/02/24 14:39, Noah Goldstein wrote:
> > >> On Wed, Feb 7, 2024 at 12:10 PM Adhemerval Zanella Netto
> > >> <adhemerval.zanella@linaro.org> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On 06/02/24 15:36, Noah Goldstein wrote:
> > >>>> On Tue, Feb 6, 2024 at 5:43 PM Adhemerval Zanella
> > >>>> <adhemerval.zanella@linaro.org> wrote:
> > >>>>>
> > >>>>> The REP MOVSB usage on memcpy/memmove does not show much
> > >>>>> performance improvement on Zen3/Zen4 cores compared to the
> > >>>>> vectorized loops. Also, as reported in BZ 30994, if the source is
> > >>>>> aligned and the destination is not, performance can be 20x slower.
> > >>>>>
> > >>>>> The performance difference is noticeable with small buffer
> > >>>>> sizes, close to the lower bound at which memcpy/memmove
> > >>>>> starts to use ERMS. The performance of REP MOVSB is similar to
> > >>>>> that of vectorized instructions at the upper size limit (the L2
> > >>>>> cache size), and there is no drawback when multiple cores share
> > >>>>> the cache.
> > >>>>>
> > >>>>> A new tunable, glibc.cpu.x86_rep_movsb_stop_threshold, allows
> > >>>>> the user to set the upper bound size for using 'rep movsb'.
> > >>>>>
> > >>>>> Checked on x86_64-linux-gnu on Zen3.
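[Editorial note: the tunable above is exercised through the GLIBC_TUNABLES environment variable. A minimal sketch, assuming a glibc build that includes this patch; the 1 MiB value and the launched command are only illustrative, and older glibc silently ignores unknown tunables, so this is safe to try:]

```shell
# Cap 'rep movsb' usage at 1 MiB for the launched process tree.
# The inner command is a stand-in for a real application.
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_stop_threshold=1048576 \
    sh -c 'echo "tunable exported"'
```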
> > >>>>> ---
> > >>>>> manual/tunables.texi | 9 +++++++
> > >>>>>  sysdeps/x86/dl-cacheinfo.h   | 50 +++++++++++++++++++++---------------
> > >>>>> sysdeps/x86/dl-tunables.list | 10 ++++++++
> > >>>>> 3 files changed, 48 insertions(+), 21 deletions(-)
> > >>>>>
> > >>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> > >>>>> index be97190d67..ee5d90b91b 100644
> > >>>>> --- a/manual/tunables.texi
> > >>>>> +++ b/manual/tunables.texi
> > >>>>> @@ -569,6 +569,15 @@ greater than zero, and currently defaults to 2048 bytes.
> > >>>>> This tunable is specific to i386 and x86-64.
> > >>>>> @end deftp
> > >>>>>
> > >>>>> +@deftp Tunable glibc.cpu.x86_rep_movsb_stop_threshold
> > >>>>> +The @code{glibc.cpu.x86_rep_movsb_stop_threshold} tunable allows the
> > >>>>> +user to set the threshold in bytes to stop using "rep movsb".
> > >>>>> +The value must be greater than zero, and currently, the default
> > >>>>> +depends on the CPU and the cache size.
> > >>>>> +
> > >>>>> +This tunable is specific to i386 and x86-64.
> > >>>>> +@end deftp
> > >>>>> +
> > >>>>> @deftp Tunable glibc.cpu.x86_rep_stosb_threshold
> > >>>>> The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
> > >>>>> to set threshold in bytes to start using "rep stosb". The value must be
> > >>>>>
> > >>>>> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > >>>>> index d5101615e3..74b804c5e6 100644
> > >>>>> --- a/sysdeps/x86/dl-cacheinfo.h
> > >>>>> +++ b/sysdeps/x86/dl-cacheinfo.h
> > >>>>> @@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> long int data = -1;
> > >>>>> long int shared = -1;
> > >>>>> long int shared_per_thread = -1;
> > >>>>> - long int core = -1;
> > >>>>> unsigned int threads = 0;
> > >>>>> unsigned long int level1_icache_size = -1;
> > >>>>>    unsigned long int level1_icache_linesize = -1;
> > >>>>> @@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> if (cpu_features->basic.kind == arch_kind_intel)
> > >>>>> {
> > >>>>> data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
> > >>>>> - core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
> > >>>>> shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
> > >>>>> shared_per_thread = shared;
> > >>>>>
> > >>>>> @@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
> > >>>>> level1_dcache_linesize
> > >>>>> = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
> > >>>>> - level2_cache_size = core;
> > >>>>> + level2_cache_size
> > >>>>> + = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
> > >>>>> level2_cache_assoc
> > >>>>> = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
> > >>>>> level2_cache_linesize
> > >>>>> @@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> level4_cache_size
> > >>>>> = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
> > >>>>>
> > >>>>> -      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> > >>>>> +      get_common_cache_info (&shared, &shared_per_thread, &threads,
> > >>>>> +                             level2_cache_size);
> > >>>>> }
> > >>>>> else if (cpu_features->basic.kind == arch_kind_zhaoxin)
> > >>>>> {
> > >>>>> data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
> > >>>>> - core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
> > >>>>> shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
> > >>>>> shared_per_thread = shared;
> > >>>>>
> > >>>>> @@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> level1_dcache_size = data;
> > >>>>>        level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
> > >>>>>        level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
> > >>>>> -      level2_cache_size = core;
> > >>>>> +      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
> > >>>>>        level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
> > >>>>>        level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
> > >>>>>        level3_cache_size = shared;
> > >>>>>        level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
> > >>>>>        level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
> > >>>>>
> > >>>>> -      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> > >>>>> +      get_common_cache_info (&shared, &shared_per_thread, &threads,
> > >>>>> +                             level2_cache_size);
> > >>>>> }
> > >>>>> else if (cpu_features->basic.kind == arch_kind_amd)
> > >>>>> {
> > >>>>> data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
> > >>>>> - core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> > >>>>> shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> > >>>>>
> > >>>>> level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
> > >>>>> @@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>>        level1_dcache_size = data;
> > >>>>>        level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
> > >>>>>        level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
> > >>>>> -      level2_cache_size = core;
> > >>>>> +      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> > >>>>>        level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
> > >>>>>        level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
> > >>>>>        level3_cache_size = shared;
> > >>>>> @@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> if (shared <= 0)
> > >>>>> {
> > >>>>> /* No shared L3 cache. All we have is the L2 cache. */
> > >>>>> - shared = core;
> > >>>>> + shared = level2_cache_size;
> > >>>>> }
> > >>>>> else if (cpu_features->basic.family < 0x17)
> > >>>>> {
> > >>>>> /* Account for exclusive L2 and L3 caches. */
> > >>>>> - shared += core;
> > >>>>> + shared += level2_cache_size;
> > >>>>> }
> > >>>>>
> > >>>>>    shared_per_thread = shared;
> > >>>>> @@ -1028,16 +1027,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> SIZE_MAX);
> > >>>>>
> > >>>>> unsigned long int rep_movsb_stop_threshold;
> > >>>>> -  /* ERMS feature is implemented from AMD Zen3 architecture and it is
> > >>>>> -     performing poorly for data above L2 cache size. Henceforth, adding
> > >>>>> -     an upper bound threshold parameter to limit the usage of Enhanced
> > >>>>> -     REP MOVSB operations and setting its value to L2 cache size. */
> > >>>>> -  if (cpu_features->basic.kind == arch_kind_amd)
> > >>>>> -    rep_movsb_stop_threshold = core;
> > >>>>> -  /* Setting the upper bound of ERMS to the computed value of
> > >>>>> -     non-temporal threshold for architectures other than AMD. */
> > >>>>> -  else
> > >>>>> -    rep_movsb_stop_threshold = non_temporal_threshold;
> > >>>>> +  /* If the tunable is set and with a valid value (larger than the minimal
> > >>>>> +     threshold to use ERMS) use it instead of default values.  */
> > >>>>> +  rep_movsb_stop_threshold = TUNABLE_GET (x86_rep_movsb_stop_threshold,
> > >>>>> +                                          long int, NULL);
> > >>>>> +  if (!TUNABLE_IS_INITIALIZED (x86_rep_movsb_stop_threshold)
> > >>>>> +      || rep_movsb_stop_threshold <= rep_movsb_threshold)
> > >>>>> +    {
> > >>>>> +      /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
> > >>>>> +         cases slower than the vectorized path (and for some alignments,
> > >>>>> +         it is really slow, check BZ #30994).  */
> > >>>>> +      if (cpu_features->basic.kind == arch_kind_amd)
> > >>>>> +        rep_movsb_stop_threshold = 0;
> > >>>> note that if `size >= rep_movsb_threshold && size >=
> > >>>> rep_movsb_stop_threshold` we will use NT stores, not temporal stores.
> > >>>>
> > >>>> Id think you would want this to be setting the
> > >>>> `rep_movsb_threshold` -> `non_temporal_threshold` which would
> > >>>> essentially disable `rep movsb` but continue to use the other
> > >>>> tunables for temporal/non-temporal decisions.
> > >>>
> > >>> My understanding is it will keep using non temporal stores, for
> > >>> instance with a size equal to 25165824
> > >>> (x86.cpu_features.non_temporal_threshold)
> > >>> the code will:
> > >>>
> > >>> sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > >>>
> > >>> 384 #if defined USE_MULTIARCH && IS_IN (libc)
> > >>> 385 L(movsb_more_2x_vec):
> > >>> 386 cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> > >>> 387 ja L(movsb)
> > >>>
> > >>>
> > >>> And then:
> > >>>
> > >>> 613 /* If above __x86_rep_movsb_stop_threshold most likely is
> > >>> 614 candidate for NT moves as well. */
> > >>> 615 cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP
> > >>> 616 jae L(large_memcpy_2x_check)
> > >>>
> > >>> And then skipping 'rep movsb' altogether. And it will check
> > >>> whether to use temporal stores:
> > >>>
> > >>> 683 L(large_memcpy_2x):
> > >>> 684 mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > >>> 685 cmp %R11_LP, %RDX_LP
> > >>> 686 jb L(more_8x_vec_check)
> > >>>
> > >>
> > >> Ah you're right, forgot about that code!
> > >>
> > >>
> > >>>
> > >>> Maybe one option would be to set rep_movsb_stop_threshold to
> > >>> rep_movsb_threshold; it makes it slightly clearer that the range to
> > >>> use ERMS is a zero-size interval.
> > >> So never use `rep movsb`?
> > >> If that where the case, I would just set `rep_movsb_threshold` to
> > >> `non_temporal_threshold` as the default.
> > >
> > > It works as well, it is essentially the same as setting
> > > rep_movsb_threshold to rep_movsb_stop_threshold.
> > >
> >
> > And I think by just setting rep_movsb_threshold, it makes the
> > rep_movsb_stop_threshold tunable less appealing (it would be a matter
> > to adjust the existing x86_rep_movsb_threshold tunable to enable ERMS).
I agree with you. LGTM
>
> I would think that is a good thing? Fewer configs to juggle.
Thread overview: 13+ messages
2024-02-06 17:43 [PATCH v2 0/3] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella
2024-02-06 17:43 ` [PATCH v2 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella
2024-02-06 18:36 ` Noah Goldstein
2024-02-07 12:10 ` Adhemerval Zanella Netto
2024-02-07 17:39 ` Noah Goldstein
2024-02-07 18:06 ` Adhemerval Zanella Netto
2024-02-07 18:10 ` Adhemerval Zanella Netto
2024-02-07 22:07 ` Noah Goldstein
2024-02-09 10:59 ` Karumanchi, Sajan [this message]
2024-02-06 17:43 ` [PATCH v2 2/3] x86: Do not prefer ERMS for memset on Zen3+ Adhemerval Zanella
2024-02-06 19:01 ` Noah Goldstein
2024-02-07 12:14 ` Adhemerval Zanella Netto
2024-02-06 17:43 ` [PATCH v2 3/3] x86: Expand the comment on when REP STOSB is used on memset Adhemerval Zanella