From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>
Subject: Re: [PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold
Date: Wed, 23 Sep 2020 15:57:47 -0500 [thread overview]
Message-ID: <9bdaaf47-3a20-6921-7d4b-6d428a06d4fc@oracle.com> (raw)
In-Reply-To: <CAMe9rOqpcKUgQihB2xvtyR-wDj9-zOyLWcdvfTakW0vPOg7BcQ@mail.gmail.com>
On 9/23/2020 3:23 PM, H.J. Lu wrote:
> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>> uses non_temporal stores to avoid pushing other data out of the last
>> level cache.
>>
>> This patch proposes to revert the calculation change made by H.J. Lu's
>> patch of June 2, 2017.
>>
>> H.J. Lu's patch selected a threshold suitable for a single thread
>> getting maximum performance. It was tuned using the single threaded
>> large memcpy micro benchmark on an 8 core processor. The last change
>> changes the threshold from using 3/4 of one thread's share of the
>> cache to using 3/4 of the entire cache of a multi-threaded system
>> before switching to non-temporal stores. Multi-threaded systems with
>> more than a few threads are server-class and typically have many
>> active threads. If one thread consumes 3/4 of the available cache for
>> all threads, it will cause other active threads to have data removed
>> from the cache. Two examples show the range of the effect. John
>> McCalpin's widely parallel Stream benchmark, which runs in parallel
>> and fetches data sequentially, saw a 20% slowdown with this patch on
>> an internal system test of 128 threads. This regression was discovered
>> when comparing OL8 performance to OL7. An example that compares
>> normal stores to non-temporal stores may be found at
>> https://vgatherps.github.io/2018-09-02-nontemporal/ . A simple test
>> shows performance loss of 400 to 500% due to a failure to use
>> nontemporal stores. These performance losses are most likely to occur
>> when the system load is heaviest and good performance is critical.
>>
>> The tunable x86_non_temporal_threshold can be used to override the
>> default for the knowledgable user who really wants maximum cache
>> allocation to a single thread in a multi-threaded system.
>> The manual entry for the tunable has been expanded to provide
>> more information about its purpose.
>>
>> modified: sysdeps/x86/cacheinfo.c
>> modified: manual/tunables.texi
>> ---
>> manual/tunables.texi | 6 +++++-
>> sysdeps/x86/cacheinfo.c | 12 +++++++-----
>> 2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>> index b6bb54d..94d4fbd 100644
>> --- a/manual/tunables.texi
>> +++ b/manual/tunables.texi
>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>>
>> @deftp Tunable glibc.tune.x86_non_temporal_threshold
>> The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
>> -to set threshold in bytes for non temporal store.
>> +to set threshold in bytes for non temporal store. Non temporal stores
>> +give a hint to the hardware to move data directly to memory without
>> +displacing other data from the cache. This tunable is used by some
>> +platforms to determine when to use non temporal stores in operations
>> +like memmove and memcpy.
>>
>> This tunable is specific to i386 and x86-64.
>> @end deftp
>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
>> index b9444dd..c6767d9 100644
>> --- a/sysdeps/x86/cacheinfo.c
>> +++ b/sysdeps/x86/cacheinfo.c
>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>> __x86_shared_cache_size = shared;
>> }
>>
>> - /* The large memcpy micro benchmark in glibc shows that 6 times of
>> - shared cache size is the approximate value above which non-temporal
>> - store becomes faster on a 8-core processor. This is the 3/4 of the
>> - total shared cache size. */
>> + /* The default setting for the non_temporal threshold is 3/4
>> + of one thread's share of the chip's cache. While higher
>> + single thread performance may be observed with a higher
>> + threshold, having a single thread use more than its share
>> + of the cache will negatively impact the performance of
>> + other threads running on the chip. */
>> __x86_shared_non_temporal_threshold
>> = (cpu_features->non_temporal_threshold != 0
>> ? cpu_features->non_temporal_threshold
>> - : __x86_shared_cache_size * threads * 3 / 4);
>> + : __x86_shared_cache_size * 3 / 4);
>> }
>>
> Can we tune it with the number of threads and/or total cache
> size?
>
When you say "total cache size", is that different from
shared_cache_size * threads?

I see a fundamental conflict of optimization goals:
1) Provide best single thread performance (current code)
2) Provide best overall system performance under full load (proposed patch)

I don't know of any way to have the default behavior meet both goals
without knowledge of the system's size, usage, and requirements.
Consider a hypothetical single-chip system with 64 threads and 128 MB
of total cache on the chip. That won't be uncommon in the coming years
on server-class systems, especially in large databases or HPC
environments (think vision processing or weather modeling, for
example).

Suppose a single multi-threaded application owns the whole chip and
needs to memcpy a really large block of data when one phase of
computation finishes, before moving on to the next phase. A common
practice would be to have 64 parallel calls to memcpy. The Stream
benchmark demonstrates with OpenMP that current compilers handle that
with no trouble.
In this example, the per-thread share of the cache is 2 MB, and the
proposed formula sets the threshold at 1.5 MB. If the total copy size
is 96 MB or less, all threads fit comfortably in cache. If the total
copy size is over that, then non-temporal stores are used and all is
well there too.

The current formula would set the threshold at 96 MB for each thread.
Only when the total copy size exceeded 64 * 96 MB = 6 GB would
non-temporal stores be used. We'd like to switch to non-temporal
stores much sooner, since we will be thrashing all the threads'
caches.
In practical terms, I've had access to typical memcpy copy lengths for
a variety of commercial applications while studying memcpy on Solaris
over the years. The vast majority of copies are 64 KB or less. Most
modern chips have much more than 64 KB of cache per thread, allowing
in-cache copies for the common case, even without borrowing cache from
other threads. The occasional really large copies tend to occur when
an application is passing a block of data to prepare for a new phase
of computation, or as a shared-memory communication to another thread.
In those cases, having the data remain in cache is usually not
relevant, and using non-temporal stores even when they are not
strictly required does not have a negative effect on performance.
A downside of tuning for a single thread comes in cloud computing
environments, where having neighboring threads be cache hogs, even if
they are relatively isolated in virtual machines, is a "bad thing" for
stable system performance. Whatever we can do to provide consistent,
reasonable performance regardless of what the neighboring threads
might be doing is a "good thing".
- patrick
Thread overview: 12+ messages
2020-09-23 20:09 Patrick McGehearty
2020-09-23 20:23 ` H.J. Lu
2020-09-23 20:57 ` Patrick McGehearty [this message]
2020-09-23 21:37 ` H.J. Lu
2020-09-23 22:39 ` Patrick McGehearty
2020-09-23 23:13 ` H.J. Lu
2020-09-24 21:47 ` Patrick McGehearty
2020-09-24 21:54 ` H.J. Lu
2020-09-24 23:22 ` Patrick McGehearty
2020-09-24 23:57 ` H.J. Lu
2020-09-25 20:53 ` Patrick McGehearty
2020-09-25 21:04 ` H.J. Lu