From: "Karumanchi, Sajan" <Sajan.Karumanchi@amd.com>
To: Noah Goldstein <goldstein.w.n@gmail.com>,
Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: "libc-alpha@sourceware.org" <libc-alpha@sourceware.org>,
"H . J . Lu" <hjl.tools@gmail.com>,
Sajan Karumanchi <sajan.karumanchi@gmail.com>,
"bmerry@sarao.ac.za" <bmerry@sarao.ac.za>,
"Mallappa, Premachandra" <Premachandra.Mallappa@amd.com>,
"Karumanchi, Sajan" <Sajan.Karumanchi@amd.com>
Subject: RE: [PATCH v2 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
Date: Fri, 9 Feb 2024 10:59:17 +0000 [thread overview]
Message-ID: <SJ0PR12MB5633AC6D0242AE97C3EDE4B2894B2@SJ0PR12MB5633.namprd12.prod.outlook.com> (raw)
In-Reply-To: <CAFUsyfKtR6qyumD29fn23V84TMe0Ro83EMfqDX-qv7fB6Zsb8w@mail.gmail.com>
> On 08/02/24 3:37, Noah Goldstein wrote:
> On Wed, Feb 7, 2024 at 6:10 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> >
> >
> >
> > On 07/02/24 15:06, Adhemerval Zanella Netto wrote:
> > >
> > >
> > > On 07/02/24 14:39, Noah Goldstein wrote:
> > >> On Wed, Feb 7, 2024 at 12:10 PM Adhemerval Zanella Netto
> > >> <adhemerval.zanella@linaro.org> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On 06/02/24 15:36, Noah Goldstein wrote:
> > >>>> On Tue, Feb 6, 2024 at 5:43 PM Adhemerval Zanella
> > >>>> <adhemerval.zanella@linaro.org> wrote:
> > >>>>>
> > >>>>> The REP MOVSB usage on memcpy/memmove does not show much
> > >>>>> performance improvement on Zen3/Zen4 cores compared to the
> > >>>>> vectorized loops. Also, as reported in BZ 30994, if the source is
> > >>>>> aligned and the destination is not, performance can be 20x slower.
> > >>>>>
> > >>>>> The performance difference is noticeable with small buffer
> > >>>>> sizes, close to the lower bound at which memcpy/memmove
> > >>>>> starts to use ERMS. The performance of REP MOVSB is similar to
> > >>>>> that of vectorized instructions at the upper size limit (the L2
> > >>>>> cache size), and there is no drawback when multiple cores share
> > >>>>> the cache.
> > >>>>>
> > >>>>> A new tunable, glibc.cpu.x86_rep_movsb_stop_threshold, allows
> > >>>>> the user to set the upper bound size for using 'rep movsb'.
> > >>>>>
> > >>>>> Checked on x86_64-linux-gnu on Zen3.
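[Editorial note: the tunable above is exercised through the GLIBC_TUNABLES environment variable. A minimal sketch, assuming a glibc build that includes this patch; the 1 MiB value and the launched command are only illustrative, and older glibc silently ignores unknown tunables, so this is safe to try:]

```shell
# Cap 'rep movsb' usage at 1 MiB for the launched process tree.
# The inner command is a stand-in for a real application.
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_stop_threshold=1048576 \
    sh -c 'echo "tunable exported"'
```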
> > >>>>> ---
> > >>>>> manual/tunables.texi | 9 +++++++
> > >>>>>  sysdeps/x86/dl-cacheinfo.h   | 50 +++++++++++++++++++++---------------
> > >>>>> sysdeps/x86/dl-tunables.list | 10 ++++++++
> > >>>>> 3 files changed, 48 insertions(+), 21 deletions(-)
> > >>>>>
> > >>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> > >>>>> index be97190d67..ee5d90b91b 100644
> > >>>>> --- a/manual/tunables.texi
> > >>>>> +++ b/manual/tunables.texi
> > >>>>> @@ -569,6 +569,15 @@ greater than zero, and currently defaults to 2048 bytes.
> > >>>>> This tunable is specific to i386 and x86-64.
> > >>>>> @end deftp
> > >>>>>
> > >>>>> +@deftp Tunable glibc.cpu.x86_rep_movsb_stop_threshold
> > >>>>> +The @code{glibc.cpu.x86_rep_movsb_stop_threshold} tunable allows the
> > >>>>> +user to set the threshold in bytes to stop using "rep movsb".
> > >>>>> +The value must be greater than zero, and currently, the default
> > >>>>> +depends on the CPU and the cache size.
> > >>>>> +
> > >>>>> +This tunable is specific to i386 and x86-64.
> > >>>>> +@end deftp
> > >>>>> +
> > >>>>> @deftp Tunable glibc.cpu.x86_rep_stosb_threshold
> > >>>>> The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
> > >>>>> to set threshold in bytes to start using "rep stosb". The value must be
> > >>>>>
> > >>>>> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > >>>>> index d5101615e3..74b804c5e6 100644
> > >>>>> --- a/sysdeps/x86/dl-cacheinfo.h
> > >>>>> +++ b/sysdeps/x86/dl-cacheinfo.h
> > >>>>> @@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> long int data = -1;
> > >>>>> long int shared = -1;
> > >>>>> long int shared_per_thread = -1;
> > >>>>> - long int core = -1;
> > >>>>> unsigned int threads = 0;
> > >>>>> unsigned long int level1_icache_size = -1;
> > >>>>>    unsigned long int level1_icache_linesize = -1;
> > >>>>> @@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> if (cpu_features->basic.kind == arch_kind_intel)
> > >>>>> {
> > >>>>> data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
> > >>>>> - core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
> > >>>>> shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
> > >>>>> shared_per_thread = shared;
> > >>>>>
> > >>>>> @@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
> > >>>>> level1_dcache_linesize
> > >>>>> = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
> > >>>>> - level2_cache_size = core;
> > >>>>> + level2_cache_size
> > >>>>> + = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
> > >>>>> level2_cache_assoc
> > >>>>> = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
> > >>>>> level2_cache_linesize
> > >>>>> @@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> level4_cache_size
> > >>>>> = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
> > >>>>>
> > >>>>> -      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> > >>>>> +      get_common_cache_info (&shared, &shared_per_thread, &threads,
> > >>>>> +                             level2_cache_size);
> > >>>>> }
> > >>>>> else if (cpu_features->basic.kind == arch_kind_zhaoxin)
> > >>>>> {
> > >>>>> data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
> > >>>>> - core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
> > >>>>> shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
> > >>>>> shared_per_thread = shared;
> > >>>>>
> > >>>>> @@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> level1_dcache_size = data;
> > >>>>>        level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
> > >>>>>        level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
> > >>>>> -      level2_cache_size = core;
> > >>>>> +      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
> > >>>>>        level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
> > >>>>>        level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
> > >>>>>        level3_cache_size = shared;
> > >>>>>        level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
> > >>>>>        level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
> > >>>>>
> > >>>>> -      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> > >>>>> +      get_common_cache_info (&shared, &shared_per_thread, &threads,
> > >>>>> +                             level2_cache_size);
> > >>>>> }
> > >>>>> else if (cpu_features->basic.kind == arch_kind_amd)
> > >>>>> {
> > >>>>> data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
> > >>>>> - core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> > >>>>> shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> > >>>>>
> > >>>>> level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
> > >>>>> @@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>>        level1_dcache_size = data;
> > >>>>>        level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
> > >>>>>        level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
> > >>>>> -      level2_cache_size = core;
> > >>>>> +      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> > >>>>>        level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
> > >>>>>        level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
> > >>>>>        level3_cache_size = shared;
> > >>>>> @@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> if (shared <= 0)
> > >>>>> {
> > >>>>> /* No shared L3 cache. All we have is the L2 cache. */
> > >>>>> - shared = core;
> > >>>>> + shared = level2_cache_size;
> > >>>>> }
> > >>>>> else if (cpu_features->basic.family < 0x17)
> > >>>>> {
> > >>>>> /* Account for exclusive L2 and L3 caches. */
> > >>>>> - shared += core;
> > >>>>> + shared += level2_cache_size;
> > >>>>> }
> > >>>>>
> > >>>>>    shared_per_thread = shared;
> > >>>>> @@ -1028,16 +1027,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >>>>> SIZE_MAX);
> > >>>>>
> > >>>>> unsigned long int rep_movsb_stop_threshold;
> > >>>>> -  /* ERMS feature is implemented from AMD Zen3 architecture and it is
> > >>>>> -     performing poorly for data above L2 cache size. Henceforth, adding
> > >>>>> -     an upper bound threshold parameter to limit the usage of Enhanced
> > >>>>> -     REP MOVSB operations and setting its value to L2 cache size. */
> > >>>>> -  if (cpu_features->basic.kind == arch_kind_amd)
> > >>>>> -    rep_movsb_stop_threshold = core;
> > >>>>> -  /* Setting the upper bound of ERMS to the computed value of
> > >>>>> -     non-temporal threshold for architectures other than AMD. */
> > >>>>> -  else
> > >>>>> -    rep_movsb_stop_threshold = non_temporal_threshold;
> > >>>>> +  /* If the tunable is set and with a valid value (larger than the minimal
> > >>>>> +     threshold to use ERMS) use it instead of default values.  */
> > >>>>> +  rep_movsb_stop_threshold = TUNABLE_GET (x86_rep_movsb_stop_threshold,
> > >>>>> +                                          long int, NULL);
> > >>>>> +  if (!TUNABLE_IS_INITIALIZED (x86_rep_movsb_stop_threshold)
> > >>>>> +      || rep_movsb_stop_threshold <= rep_movsb_threshold)
> > >>>>> +    {
> > >>>>> +      /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
> > >>>>> +         cases slower than the vectorized path (and for some alignments,
> > >>>>> +         it is really slow, check BZ #30994).  */
> > >>>>> +      if (cpu_features->basic.kind == arch_kind_amd)
> > >>>>> +        rep_movsb_stop_threshold = 0;
> > >>>> note that if `size >= rep_movsb_threshold && size >=
> > >>>> rep_movsb_stop_threshold` we will use NT stores, not temporal stores.
> > >>>>
> > >>>> Id think you would want this to be setting the
> > >>>> `rep_movsb_threshold` -> `non_temporal_threshold` which would
> > >>>> essentially disable `rep movsb` but continue to use the other
> > >>>> tunables for temporal/non-temporal decisions.
> > >>>
> > >>> My understanding is it will keep using non temporal stores, for
> > >>> instance with a size equal to 25165824
> > >>> (x86.cpu_features.non_temporal_threshold)
> > >>> the code will:
> > >>>
> > >>> sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > >>>
> > >>> 384 #if defined USE_MULTIARCH && IS_IN (libc)
> > >>> 385 L(movsb_more_2x_vec):
> > >>> 386 cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> > >>> 387 ja L(movsb)
> > >>>
> > >>>
> > >>> And then:
> > >>>
> > >>> 613 /* If above __x86_rep_movsb_stop_threshold most likely is
> > >>> 614 candidate for NT moves as well. */
> > >>> 615 cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP
> > >>> 616 jae L(large_memcpy_2x_check)
> > >>>
> > >>> And then skipping 'rep movsb' altogether. And it will check
> > >>> whether to use temporal stores:
> > >>>
> > >>> 683 L(large_memcpy_2x):
> > >>> 684 mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > >>> 685 cmp %R11_LP, %RDX_LP
> > >>> 686 jb L(more_8x_vec_check)
> > >>>
> > >>
> > >> Ah you're right, forgot about that code!
> > >>
> > >>
> > >>>
> > >>> Maybe one option would be to set rep_movsb_stop_threshold to
> > >>> rep_movsb_threshold; it makes it slightly clearer that the range to
> > >>> use ERMS is a zero-size interval.
> > >> So never use `rep movsb`?
> > >> If that where the case, I would just set `rep_movsb_threshold` to
> > >> `non_temporal_threshold` as the default.
> > >
> > > It works as well, it is essentially the same as setting
> > > rep_movsb_threshold to rep_movsb_stop_threshold.
> > >
> >
> > And I think by just setting rep_movsb_threshold, it makes the
> > rep_movsb_stop_threshold tunable less appealing (it would be a matter
> > to adjust the existing x86_rep_movsb_threshold tunable to enable ERMS).
I agree with you. LGTM
>
> I would think that is a good thing? Fewer configs to juggle.
Thread overview: 13+ messages
2024-02-06 17:43 [PATCH v2 0/3] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella
2024-02-06 17:43 ` [PATCH v2 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella
2024-02-06 18:36 ` Noah Goldstein
2024-02-07 12:10 ` Adhemerval Zanella Netto
2024-02-07 17:39 ` Noah Goldstein
2024-02-07 18:06 ` Adhemerval Zanella Netto
2024-02-07 18:10 ` Adhemerval Zanella Netto
2024-02-07 22:07 ` Noah Goldstein
2024-02-09 10:59 ` Karumanchi, Sajan [this message]
2024-02-06 17:43 ` [PATCH v2 2/3] x86: Do not prefer ERMS for memset on Zen3+ Adhemerval Zanella
2024-02-06 19:01 ` Noah Goldstein
2024-02-07 12:14 ` Adhemerval Zanella Netto
2024-02-06 17:43 ` [PATCH v2 3/3] x86: Expand the comment on when REP STOSB is used on memset Adhemerval Zanella