public inbox for libc-alpha@sourceware.org
From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>
Subject: Re: [PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold
Date: Wed, 23 Sep 2020 15:57:47 -0500	[thread overview]
Message-ID: <9bdaaf47-3a20-6921-7d4b-6d428a06d4fc@oracle.com> (raw)
In-Reply-To: <CAMe9rOqpcKUgQihB2xvtyR-wDj9-zOyLWcdvfTakW0vPOg7BcQ@mail.gmail.com>



On 9/23/2020 3:23 PM, H.J. Lu wrote:
> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>> uses non_temporal stores to avoid pushing other data out of the last
>> level cache.
>>
>> This patch proposes to revert the calculation change made by H.J. Lu's
>> patch of June 2, 2017.
>>
>> H.J. Lu's patch selected a threshold suitable for a single thread
>> getting maximum performance. It was tuned using the single-threaded
>> large memcpy micro benchmark on an 8-core processor. That change
>> moved the threshold from 3/4 of one thread's share of the cache to
>> 3/4 of the entire cache of a multi-threaded system before switching
>> to non-temporal stores. Multi-threaded systems with
>> more than a few threads are server-class and typically have many
>> active threads. If one thread consumes 3/4 of the available cache for
>> all threads, it will cause other active threads to have data removed
>> from the cache. Two examples show the range of the effect. John
>> McCalpin's widely parallel Stream benchmark, which runs in parallel
>> and fetches data sequentially, saw a 20% slowdown with this patch on
>> an internal system test of 128 threads. This regression was discovered
>> when comparing OL8 performance to OL7.  An example that compares
>> normal stores to non-temporal stores may be found at
>> https://vgatherps.github.io/2018-09-02-nontemporal/ .  A simple test
>> shows a performance loss of 400% to 500% due to a failure to use
>> non-temporal stores. These performance losses are most likely to occur
>> when the system load is heaviest and good performance is critical.
>>
>> The tunable x86_non_temporal_threshold can be used to override the
>> default for the knowledgeable user who really wants maximum cache
>> allocation to a single thread in a multi-threaded system.
>> The manual entry for the tunable has been expanded to provide
>> more information about its purpose.
>>
>>          modified: sysdeps/x86/cacheinfo.c
>>          modified: manual/tunables.texi
>> ---
>>   manual/tunables.texi    |  6 +++++-
>>   sysdeps/x86/cacheinfo.c | 12 +++++++-----
>>   2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>> index b6bb54d..94d4fbd 100644
>> --- a/manual/tunables.texi
>> +++ b/manual/tunables.texi
>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>>
>>   @deftp Tunable glibc.tune.x86_non_temporal_threshold
>>   The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
>> -to set threshold in bytes for non temporal store.
>> +to set threshold in bytes for non temporal store. Non-temporal stores
>> +give a hint to the hardware to move data directly to memory without
>> +displacing other data from the cache. This tunable is used by some
>> +platforms to determine when to use non-temporal stores in operations
>> +like memmove and memcpy.
>>
>>   This tunable is specific to i386 and x86-64.
>>   @end deftp
>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
>> index b9444dd..c6767d9 100644
>> --- a/sysdeps/x86/cacheinfo.c
>> +++ b/sysdeps/x86/cacheinfo.c
>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>>         __x86_shared_cache_size = shared;
>>       }
>>
>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
>> -     shared cache size is the approximate value above which non-temporal
>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
>> -     total shared cache size.  */
>> +  /* The default setting for the non_temporal threshold is 3/4
>> +     of one thread's share of the chip's cache. While higher
>> +     single thread performance may be observed with a higher
>> +     threshold, having a single thread use more than its share
>> +     of the cache will negatively impact the performance of
>> +     other threads running on the chip. */
>>     __x86_shared_non_temporal_threshold
>>       = (cpu_features->non_temporal_threshold != 0
>>          ? cpu_features->non_temporal_threshold
>> -       : __x86_shared_cache_size * threads * 3 / 4);
>> +       : __x86_shared_cache_size * 3 / 4);
>>   }
>>
> Can we tune it with the number of threads and/or total cache
> size?
>

When you say "total cache size", is that different from 
shared_cache_size * threads?

I see a fundamental conflict of optimization goals:
1) Provide the best single-thread performance (current code)
2) Provide the best overall system performance under full load (proposed patch)
I don't know of any way for the default behavior to meet both goals
without knowledge of the system's size, usage, and requirements.

Consider a hypothetical single-chip system with 64 threads and 128 MB of
total cache on the chip. That won't be uncommon in the coming years on
server-class systems, especially in large database or HPC environments
(think vision processing or weather modeling, for example). Suppose a
single app owns the whole chip, runs a multi-threaded workload, and
needs to memcpy a really large block of data when one phase of
computation finishes before moving to the next phase. A common practice
would be to make 64 parallel calls to memcpy, one per thread. The Stream
benchmark demonstrates with OpenMP that current compilers handle that
with no trouble.
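
A minimal sketch of that pattern (illustrative only, not from the patch;
the thread count and buffer size are assumptions taken from the example
above), where each OpenMP thread issues its own memcpy over a slice of
one large buffer:

    /* Compile with -fopenmp; without it, this degrades to a plain loop.  */
    #include <string.h>
    #include <stddef.h>

    #define NTHREADS 64
    #define TOTAL    (96UL << 20)   /* 96 MB total copy, for example */

    void parallel_copy (char *dst, const char *src)
    {
      size_t chunk = TOTAL / NTHREADS;
    #pragma omp parallel for
      for (int t = 0; t < NTHREADS; t++)
        memcpy (dst + (size_t) t * chunk, src + (size_t) t * chunk, chunk);
    }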

In the example, the per-thread share of the cache is 2 MB and the
proposed formula will set the threshold at 1.5 Mbytes. If the total copy
size is 96 Mbytes or less, all threads comfortably fit in cache. If the
total copy size is over that, then non-temporal stores are used and all
is well there too.

The current formula would set the threshold at 96 Mbytes for each
thread. Only when the total copy size reached 64 * 96 Mbytes = 6 GBytes
would non-temporal stores be used. We'd like to switch to non-temporal
stores much sooner, as otherwise we will be thrashing all the threads'
caches.
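
Spelling that arithmetic out as a sketch (the values are the
hypothetical 64-thread / 128 MB chip from above, not measurements; the
tunable name is the one documented in the patch):

    /* Hypothetical chip: 64 threads, 128 MB shared cache, so the
       per-thread share (__x86_shared_cache_size) is 2 MB.  */
    unsigned long shared  = 2UL << 20;                 /* 2 MB per thread  */
    unsigned long threads = 64;

    unsigned long proposed = shared * 3 / 4;           /* 1.5 MB threshold */
    unsigned long current  = shared * threads * 3 / 4; /* 96 MB threshold  */

    /* With 64 threads copying at once, the current formula only reaches
       non-temporal stores past 64 * 96 MB = 6 GB of total copy.  A user
       who wants the larger threshold back can still set it explicitly,
       e.g. GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=100663296  */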

In practical terms, I've had access to typical memcpy copy lengths for a
variety of commercial applications while studying memcpy on Solaris over
the years. The vast majority of copies are for 64 Kbytes or less. Most
modern chips have much more than 64 Kbytes of cache per thread, allowing
in-cache copies for the common case, even without borrowing cache from
other threads. The occasional really large copies tend to happen when an
application is passing a block of data to prepare for a new phase of
computation or as a shared-memory communication to another thread. In
these cases, having the data remain in cache is usually not relevant,
and using non-temporal stores even when they are not strictly required
does not have a negative effect on performance.
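
For reference, here is a minimal illustration of what a non-temporal
(streaming) copy looks like, since that is the mechanism the threshold
selects. This is only an SSE2-intrinsics sketch, not glibc's actual
implementation, and it assumes 16-byte-aligned pointers and a length
that is a multiple of 16:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Copy with streaming stores: the data goes to memory without
       displacing other data from the cache.  */
    void copy_stream (void *dst, const void *src, size_t n)
    {
      __m128i *d = (__m128i *) dst;
      const __m128i *s = (const __m128i *) src;
      for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
      _mm_sfence ();   /* order the streaming stores before later use */
    }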

A downside of tuning for a single thread shows up in cloud computing
environments, where having neighboring threads act as cache hogs, even
if they are relatively isolated in virtual machines, is a "bad thing"
for stable system performance. Whatever we can do to provide consistent,
reasonable performance regardless of what the neighboring threads might
be doing is a "good thing".

- patrick



Thread overview: 12+ messages
2020-09-23 20:09 Patrick McGehearty
2020-09-23 20:23 ` H.J. Lu
2020-09-23 20:57   ` Patrick McGehearty [this message]
2020-09-23 21:37     ` H.J. Lu
2020-09-23 22:39       ` Patrick McGehearty
2020-09-23 23:13         ` H.J. Lu
2020-09-24 21:47           ` Patrick McGehearty
2020-09-24 21:54             ` H.J. Lu
2020-09-24 23:22               ` Patrick McGehearty
2020-09-24 23:57                 ` H.J. Lu
2020-09-25 20:53                   ` Patrick McGehearty
2020-09-25 21:04                     ` H.J. Lu
