From: "H.J. Lu"
Date: Fri, 25 Sep 2020 14:04:12 -0700
Subject: Re: [PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold
To: Patrick McGehearty
Cc: GNU C Library

On Fri, Sep 25, 2020 at 1:53 PM Patrick McGehearty wrote:
>
>
>
> On 9/24/2020 6:57 PM, H.J. Lu wrote:
> > On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty
> > wrote:
> >>
> >>
> >> On 9/24/2020 4:54 PM, H.J. Lu wrote:
> >>> On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty
> >>> wrote:
> >>>>
> >>>> On 9/23/2020 6:13 PM, H.J. Lu wrote:
> >>>>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty
> >>>>> wrote:
> >>>>>> On 9/23/2020 4:37 PM, H.J. Lu wrote:
> >>>>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
> >>>>>>> wrote:
> >>>>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
> >>>>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> >>>>>>>>> wrote:
> >>>>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> >>>>>>>>>> uses non-temporal stores to avoid pushing other data out of the last
> >>>>>>>>>> level cache.
> >>>>>>>>>>
> >>>>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
> >>>>>>>>>> patch of June 2, 2017.
> >>>>>>>>>>
> >>>>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
> >>>>>>>>>> getting maximum performance.  It was tuned using the single-threaded
> >>>>>>>>>> large-memcpy micro-benchmark on an 8-core processor.  That patch
> >>>>>>>>>> changed the threshold from using 3/4 of one thread's share of the
> >>>>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system
> >>>>>>>>>> before switching to non-temporal stores.  Multi-threaded systems with
> >>>>>>>>>> more than a few threads are server-class and typically have many
> >>>>>>>>>> active threads.  If one thread consumes 3/4 of the available cache
> >>>>>>>>>> for all threads, it will cause other active threads to have data
> >>>>>>>>>> removed from the cache.  Two examples show the range of the effect.
> >>>>>>>>>> John McCalpin's widely parallel Stream benchmark, which runs in
> >>>>>>>>>> parallel and fetches data sequentially, saw a 20% slowdown with the
> >>>>>>>>>> 2017 patch on an internal system test of 128 threads.  This
> >>>>>>>>>> regression was discovered when comparing OL8 performance to OL7.
> >>>>>>>>>> An example that compares normal stores to non-temporal stores may
> >>>>>>>>>> be found at https://vgatherps.github.io/2018-09-02-nontemporal/.
> >>>>>>>>>> A simple test shows a performance loss of 400 to 500% due to a
> >>>>>>>>>> failure to use non-temporal stores.  These performance losses are
> >>>>>>>>>> most likely to occur when the system load is heaviest and good
> >>>>>>>>>> performance is critical.
> >>>>>>>>>>
> >>>>>>>>>> The tunable x86_non_temporal_threshold can be used to override the
> >>>>>>>>>> default for the knowledgeable user who really wants maximum cache
> >>>>>>>>>> allocation to a single thread in a multi-threaded system.
> >>>>>>>>>> The manual entry for the tunable has been expanded to provide
> >>>>>>>>>> more information about its purpose.
> >>>>>>>>>>
> >>>>>>>>>> modified: sysdeps/x86/cacheinfo.c
> >>>>>>>>>> modified: manual/tunables.texi
> >>>>>>>>>> ---
> >>>>>>>>>>  manual/tunables.texi    |  6 +++++-
> >>>>>>>>>>  sysdeps/x86/cacheinfo.c | 12 +++++++-----
> >>>>>>>>>>  2 files changed, 12 insertions(+), 6 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> >>>>>>>>>> index b6bb54d..94d4fbd 100644
> >>>>>>>>>> --- a/manual/tunables.texi
> >>>>>>>>>> +++ b/manual/tunables.texi
> >>>>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
> >>>>>>>>>>
> >>>>>>>>>>  @deftp Tunable glibc.tune.x86_non_temporal_threshold
> >>>>>>>>>>  The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> >>>>>>>>>> -to set threshold in bytes for non temporal store.
> >>>>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores
> >>>>>>>>>> +give a hint to the hardware to move data directly to memory without
> >>>>>>>>>> +displacing other data from the cache. This tunable is used by some
> >>>>>>>>>> +platforms to determine when to use non temporal stores in operations
> >>>>>>>>>> +like memmove and memcpy.
> >>>>>>>>>>
> >>>>>>>>>>  This tunable is specific to i386 and x86-64.
> >>>>>>>>>>  @end deftp
> >>>>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> index b9444dd..c6767d9 100644
> >>>>>>>>>> --- a/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> +++ b/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
> >>>>>>>>>>        __x86_shared_cache_size = shared;
> >>>>>>>>>>      }
> >>>>>>>>>>
> >>>>>>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> >>>>>>>>>> -     shared cache size is the approximate value above which non-temporal
> >>>>>>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
> >>>>>>>>>> -     total shared cache size.  */
> >>>>>>>>>> +  /* The default setting for the non_temporal threshold is 3/4
> >>>>>>>>>> +     of one thread's share of the chip's cache. While higher
> >>>>>>>>>> +     single thread performance may be observed with a higher
> >>>>>>>>>> +     threshold, having a single thread use more than it's share
> >>>>>>>>>> +     of the cache will negatively impact the performance of
> >>>>>>>>>> +     other threads running on the chip. */
> >>>>>>>>>>    __x86_shared_non_temporal_threshold
> >>>>>>>>>>      = (cpu_features->non_temporal_threshold != 0
> >>>>>>>>>>         ? cpu_features->non_temporal_threshold
> >>>>>>>>>> -       : __x86_shared_cache_size * threads * 3 / 4);
> >>>>>>>>>> +       : __x86_shared_cache_size * 3 / 4);
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>> Can we tune it with the number of threads and/or total cache
> >>>>>>>>> size?
> >>>>>>>>>
> >>>>>>>> When you say "total cache size", is that different from
> >>>>>>>> shared_cache_size * threads?
> >>>>>>>>
> >>>>>>>> I see a fundamental conflict of optimization goals:
> >>>>>>>> 1) Provide the best single-thread performance (current code)
> >>>>>>>> 2) Provide the best overall system performance under full load (proposed patch)
> >>>>>>>> I don't know of any way to have the default behavior meet both goals
> >>>>>>>> without knowledge of the system size/usage/requirements.
> >>>>>>>>
> >>>>>>>> Consider a hypothetical single-chip system with 64 threads and 128 MB
> >>>>>>>> of total cache on the chip.  That won't be uncommon in the coming
> >>>>>>>> years on server-class systems, especially in large databases or HPC
> >>>>>>>> environments (think vision processing or weather modeling, for
> >>>>>>>> example).  Suppose a single app owns the whole chip, runs a
> >>>>>>>> multi-threaded workload, and needs to memcpy a really large block of
> >>>>>>>> data when one phase of computation finishes before moving to the next
> >>>>>>>> phase.  A common practice would be to have 64 parallel calls to
> >>>>>>>> memcpy.  The Stream benchmark demonstrates with OpenMP that current
> >>>>>>>> compilers handle that with no trouble.
> >>>>>>>>
> >>>>>>>> In the example, the per-thread share of the cache is 2 MB and the
> >>>>>>>> proposed formula will set the threshold at 1.5 MBytes.  If the total
> >>>>>>>> copy size is 96 MBytes or less, all threads comfortably fit in cache.
> >>>>>>>> If the total copy size is over that, then non-temporal stores are
> >>>>>>>> used and all is well there too.
> >>>>>>>>
> >>>>>>>> The current formula would set the threshold at 96 MBytes for each
> >>>>>>>> thread.  Only when the total copy size was 64*96 MBytes = 6 GBytes
> >>>>>>>> would non-temporal stores be used.  We'd like to switch to
> >>>>>>>> non-temporal stores much sooner, as we will be thrashing all the
> >>>>>>>> threads' caches.
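
To make the arithmetic in the example above concrete, here is a minimal
sketch in C that computes both thresholds for the hypothetical 64-thread,
128 MB chip; the variable names are illustrative and are not glibc's
internal names.

    #include <stdio.h>

    int
    main (void)
    {
      unsigned long total_cache = 128UL * 1024 * 1024; /* 128 MB on chip.  */
      unsigned long threads = 64;
      unsigned long share = total_cache / threads;     /* 2 MB per thread.  */

      /* Proposed formula: 3/4 of one thread's share of the cache.  */
      unsigned long proposed = share * 3 / 4;          /* 1.5 MB.  */

      /* 2017 formula: 3/4 of the entire cache (share * threads).  */
      unsigned long current = share * threads * 3 / 4; /* 96 MB.  */

      printf ("proposed threshold: %lu bytes\n", proposed);
      printf ("2017 threshold:     %lu bytes\n", current);
      return 0;
    }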
> >>>>>>>>
> >>>>>>>> In practical terms, I've had access to typical memcpy copy lengths
> >>>>>>>> for a variety of commercial applications while studying memcpy on
> >>>>>>>> Solaris over the years.  The vast majority of copies are for
> >>>>>>>> 64 KBytes or less.  Most modern chips have much more than 64 KBytes
> >>>>>>>> of cache per thread, allowing in-cache copies for the common case,
> >>>>>>>> even without borrowing cache from other threads.  The occasional
> >>>>>>>> really large copies tend to happen when an application is passing a
> >>>>>>>> block of data to prepare for a new phase of computation or as a
> >>>>>>>> shared-memory communication to another thread.  In these cases,
> >>>>>>>> having the data remain in cache is usually not relevant, and using
> >>>>>>>> non-temporal stores even when they are not strictly required does
> >>>>>>>> not have a negative effect on performance.
> >>>>>>>>
> >>>>>>>> A downside of tuning for a single thread comes in cloud computing
> >>>>>>>> environments, where having neighboring threads be cache hogs, even
> >>>>>>>> if relatively isolated in virtual machines, is a "bad thing" for
> >>>>>>>> stable system performance.  Whatever we can do to provide
> >>>>>>>> consistent, reasonable performance regardless of what the
> >>>>>>>> neighboring threads might be doing is a "good thing".
> >>>>>>>>
> >>>>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
> >>>>>>>
> >>>>>> I have not tested larger thresholds.  I'd be more comfortable with a
> >>>>>> smaller one.  We could construct specific tests to show either an
> >>>>>> advantage or a disadvantage to shifting from 3/4 to all of the cache,
> >>>>>> depending on what data access was used between memcpy operations.
> >>>>>>
> >>>>>> I consider pushing the limit on cache usage to be a risky approach.
> >>>>>> Few applications only work on a single block of data.  If all threads
> >>>>>> are doing a shared copy and they use all the available cache, then
> >>>>>> after the memcpy returns, any other active data will have been pushed
> >>>>>> out of the cache.  That's likely to cause severe performance loss in
> >>>>>> more cases than the modest performance gains in the few cases where
> >>>>>> the application is only concerned with using the data that was just
> >>>>>> copied.
> >>>>>>
> >>>>>> To give a more detailed example where large copies are not followed
> >>>>>> by use of the data, consider garbage collection followed by
> >>>>>> compaction.  With a multi-age garbage collector, stable data that is
> >>>>>> active and has survived several garbage collections is in an 'old'
> >>>>>> region.  It does not need to be copied.  The current 'new' region is
> >>>>>> full but has both referenced and unreferenced data.  After the
> >>>>>> marking phase, the individual elements of the referenced data are
> >>>>>> copied to the base of the 'new' region.  When complete, the rest of
> >>>>>> the 'new' region becomes the new free pool.  The total amount copied
> >>>>>> may far exceed the processor cache.  Then the application exits
> >>>>>> garbage collection and resumes active use of mostly the stable data,
> >>>>>> with some accesses to the just-moved new data and fresh allocations.
> >>>>>> If we under-use non-temporal stores, we clear the cache and the whole
> >>>>>> application runs slower than otherwise.
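
For readers unfamiliar with the mechanism under discussion, here is a
minimal sketch of a copy loop built on the SSE2 streaming-store intrinsics.
It illustrates the technique only and is not the code glibc actually uses;
it assumes 16-byte-aligned buffers and a length that is a multiple of 16.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Copy N bytes from SRC to DST with non-temporal (streaming) stores.
       Assumes 16-byte-aligned pointers and N a multiple of 16.  */
    static void
    stream_copy (char *dst, const char *src, size_t n)
    {
      for (size_t i = 0; i < n; i += 16)
        {
          __m128i v = _mm_load_si128 ((const __m128i *) (src + i));
          /* movntdq: write directly to memory, bypassing the caches.  */
          _mm_stream_si128 ((__m128i *) (dst + i), v);
        }
      /* Order the streaming stores before any later loads and stores.  */
      _mm_sfence ();
    }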
> >>>>>>
> >>>>>> Individual memcpy benchmarks are useful for isolation testing and for
> >>>>>> comparing code patterns, but they can mislead about overall
> >>>>>> application performance in the context of potential cache abuse.  I
> >>>>>> fell into that tarpit once while tuning memcpy for Solaris: my new,
> >>>>>> wonderfully fast copy code (OK, maybe 5% faster for in-cache data)
> >>>>>> caused a major customer application to run slower because the new
> >>>>>> code abused the cache.  I modified my code to only use the new
> >>>>>> "in-cache fast copy" for copies less than a threshold (64 KBytes or
> >>>>>> 128 KBytes, if I remember right) and all was well.
> >>>>>>
> >>>>> The new threshold can be substantially smaller with a large core
> >>>>> count.  Are you saying that even 3 / 4 may be too big?  Is there a
> >>>>> reasonable fixed threshold?
> >>>>>
> >>>> I don't have any evidence to say 3/4 is too big for typical
> >>>> applications and environments.  In 2012, the default for memcpy was
> >>>> set to 1/2 of shared_cache_size, which is the current default for
> >>>> Oracle el7 and Red Hat el7.
> >>>>
> >>>> Given the typically larger caches per thread today than 8 years ago,
> >>>> 3/4 may work out well, since the remaining 1/4 of today's larger cache
> >>>> is often greater than 1/2 of yesteryear's smaller cache.
> >>>>
> >>> Please update the comment with your rationale for 3/4.  Don't use
> >>> "today" or "current".  Use 2020 instead.
> >>>
> >>> Thanks.
> >>>
> >> I'm unsure about what needs to change in the comment, which does not
> >> mention any dates currently.  I'm assuming you are referring to the
> >> following comment in cacheinfo.c:
> >>
> >>   /* The default setting for the non_temporal threshold is 3/4
> >>      of one thread's share of the chip's cache. While higher
> >>      single thread performance may be observed with a higher
> >>      threshold, having a single thread use more than it's share
> >>      of the cache will negatively impact the performance of
> >>      other threads running on the chip. */
> >>
> >> While I could add a comment on why 3/4 vs 1/2 is the best choice, I
> >> don't have hard data to back it up.  I'd be comfortable with either 3/4
> >> or 1/2.  I selected 3/4 as it was closer to the formula you chose in
> >> 2017 rather than the formula you chose in 2012.
> > The comment is for readers 5 years from now who may be wondering where
> > 3/4 came from.  Just add something close to what you have said above.
>
> Before I redo the commit and resubmit the whole patch, I thought I'd
> present a revised comment for review.  The value of 500 KB to 2 MB per
> thread is based on a quick review of the Wikipedia entries for Intel and
> AMD processors released since 2017.  There may be a few outliers, but
> the vast majority fit that range for L3 per thread.  I tried to balance
> giving a sense of the situation without diving too deeply into
> application-specific details.
>
> Comment in v2:
>   /* The default setting for the non_temporal threshold is 3/4
>      of one thread's share of the chip's cache. While higher
>      single thread performance may be observed with a higher
>      threshold, having a single thread use more than it's share
>      of the cache will negatively impact the performance of
>      other threads running on the chip. */
>
> Proposed comment for v3:
>   /* The default setting for the non_temporal threshold is 3/4 of one
>      thread's share of the chip's cache.
>      For most Intel and AMD processors with an initial release date
>      between 2017 and 2020, a thread's typical share of the cache is
>      from 500 KBytes to 2 MBytes.  Using the 3/4 threshold leaves
>      125 KBytes to 500 KBytes of the thread's data in cache after a
>      maximum temporal copy, which will maintain in cache a reasonable
>      portion of the thread's stack and other active data.  If the
>      threshold is set higher than one thread's share of the cache, it
>      has a substantial risk of negatively impacting the performance of
>      other threads running on the chip. */
>

Comments look good.  Please submit the patch with the updated comment.

Thanks.

-- 
H.J.
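
As a usage note for the override discussed in this thread: tunables are
read from the GLIBC_TUNABLES environment variable at process startup, so
the override must be in the environment before the program starts, e.g.
GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=16777216 ./app from a
shell.  Below is a minimal C sketch of a launcher that does the same; the
program name "./app" and the 16 MB value are placeholders, not
recommendations.

    #include <stdio.h>
    #include <unistd.h>

    int
    main (void)
    {
      /* Tunables are applied when the target process starts, so the
         override goes into its environment.  16 MB is an example value.  */
      char *const argv[] = { "./app", NULL };
      char *const envp[] =
        { "GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=16777216",
          NULL };
      execve ("./app", argv, envp);
      perror ("execve");  /* Only reached if the exec fails.  */
      return 1;
    }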