From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold
From: Patrick McGehearty
To: "H.J. Lu"
Cc: GNU C Library
Date: Thu, 24 Sep 2020 18:22:34 -0500
Message-ID: <9f218fdc-dfc1-1458-f486-20af915017b9@oracle.com>
References: <1600891781-9272-1-git-send-email-patrick.mcgehearty@oracle.com>
 <9bdaaf47-3a20-6921-7d4b-6d428a06d4fc@oracle.com>
 <3f5e95c7-8601-ef86-d4c1-8e16005614d0@oracle.com>
 <4e422308-f935-e151-e1ce-175fd199f84f@oracle.com>
List-Id: Libc-alpha mailing list

On 9/24/2020 4:54
PM, H.J. Lu wrote:
> On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty wrote:
>>
>> On 9/23/2020 6:13 PM, H.J. Lu wrote:
>>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty wrote:
>>>>
>>>> On 9/23/2020 4:37 PM, H.J. Lu wrote:
>>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty wrote:
>>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
>>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha wrote:
>>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>>>>>>>> uses non_temporal stores to avoid pushing other data out of the last
>>>>>>>> level cache.
>>>>>>>>
>>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
>>>>>>>> patch of June 2, 2017.
>>>>>>>>
>>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
>>>>>>>> getting maximum performance.  It was tuned using the single threaded
>>>>>>>> large memcpy micro benchmark on an 8 core processor.  That change
>>>>>>>> moved the threshold from using 3/4 of one thread's share of the
>>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system
>>>>>>>> before switching to non-temporal stores.  Multi-threaded systems with
>>>>>>>> more than a few threads are server-class and typically have many
>>>>>>>> active threads.  If one thread consumes 3/4 of the available cache for
>>>>>>>> all threads, it will cause other active threads to have data removed
>>>>>>>> from the cache.  Two examples show the range of the effect.  John
>>>>>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel
>>>>>>>> and fetches data sequentially, saw a 20% slowdown with this patch on
>>>>>>>> an internal system test of 128 threads.  This regression was discovered
>>>>>>>> when comparing OL8 performance to OL7.
An example that compares
>>>>>>>> normal stores to non-temporal stores may be found at
>>>>>>>> https://vgatherps.github.io/2018-09-02-nontemporal/ .  A simple test
>>>>>>>> shows performance loss of 400 to 500% due to a failure to use
>>>>>>>> nontemporal stores.  These performance losses are most likely to occur
>>>>>>>> when the system load is heaviest and good performance is critical.
>>>>>>>>
>>>>>>>> The tunable x86_non_temporal_threshold can be used to override the
>>>>>>>> default for the knowledgeable user who really wants maximum cache
>>>>>>>> allocation to a single thread in a multi-threaded system.
>>>>>>>> The manual entry for the tunable has been expanded to provide
>>>>>>>> more information about its purpose.
>>>>>>>>
>>>>>>>> modified: sysdeps/x86/cacheinfo.c
>>>>>>>> modified: manual/tunables.texi
>>>>>>>> ---
>>>>>>>>  manual/tunables.texi    |  6 +++++-
>>>>>>>>  sysdeps/x86/cacheinfo.c | 12 +++++++-----
>>>>>>>>  2 files changed, 12 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>>>>>>>> index b6bb54d..94d4fbd 100644
>>>>>>>> --- a/manual/tunables.texi
>>>>>>>> +++ b/manual/tunables.texi
>>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>>>>>>>>
>>>>>>>>  @deftp Tunable glibc.tune.x86_non_temporal_threshold
>>>>>>>>  The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
>>>>>>>> -to set threshold in bytes for non temporal store.
>>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores
>>>>>>>> +give a hint to the hardware to move data directly to memory without
>>>>>>>> +displacing other data from the cache. This tunable is used by some
>>>>>>>> +platforms to determine when to use non temporal stores in operations
>>>>>>>> +like memmove and memcpy.
>>>>>>>>
>>>>>>>>  This tunable is specific to i386 and x86-64.
>>>>>>>>  @end deftp
>>>>>>>>
>>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
>>>>>>>> index b9444dd..c6767d9 100644
>>>>>>>> --- a/sysdeps/x86/cacheinfo.c
>>>>>>>> +++ b/sysdeps/x86/cacheinfo.c
>>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>>>>>>>>        __x86_shared_cache_size = shared;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
>>>>>>>> -     shared cache size is the approximate value above which non-temporal
>>>>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
>>>>>>>> -     total shared cache size.  */
>>>>>>>> +  /* The default setting for the non_temporal threshold is 3/4
>>>>>>>> +     of one thread's share of the chip's cache.  While higher
>>>>>>>> +     single thread performance may be observed with a higher
>>>>>>>> +     threshold, having a single thread use more than its share
>>>>>>>> +     of the cache will negatively impact the performance of
>>>>>>>> +     other threads running on the chip.  */
>>>>>>>>    __x86_shared_non_temporal_threshold
>>>>>>>>      = (cpu_features->non_temporal_threshold != 0
>>>>>>>>        ? cpu_features->non_temporal_threshold
>>>>>>>> -      : __x86_shared_cache_size * threads * 3 / 4);
>>>>>>>> +      : __x86_shared_cache_size * 3 / 4);
>>>>>>>>  }
>>>>>>>>
>>>>>>> Can we tune it with the number of threads and/or total cache size?
>>>>>>>
>>>>>> When you say "total cache size", is that different from shared_cache_size * threads?
>>>>>>
>>>>>> I see a fundamental conflict of optimization goals:
>>>>>> 1) Provide best single thread performance (current code)
>>>>>> 2) Provide best overall system performance under full load (proposed patch)
>>>>>> I don't know of any way to have default behavior meet both goals without knowledge of the system size/usage/requirements.
>>>>>>
>>>>>> Consider a hypothetical single chip system with 64 threads and 128 MB of total cache on the chip.
>>>>>> That won't be uncommon in the coming years on server class systems, especially in large databases or HPC environments (think vision processing or weather modeling for example).  Suppose a single app owns the whole chip, runs a multi-threaded application, and needs to memcpy a really large block of data when one phase of computation finishes before moving to the next phase.  A common practice would be to have 64 parallel calls to memcpy.  The Stream benchmark demonstrates with OpenMP that current compilers handle that with no trouble.
>>>>>>
>>>>>> In the example, the per thread share of the cache is 2 MB and the proposed formula will set the threshold at 1.5 Mbytes.  If the total copy size is 96 Mbytes or less, all threads comfortably fit in cache.  If the total copy size is over that, then non-temporal stores are used and all is well there too.
>>>>>>
>>>>>> The current formula would set the threshold at 96 Mbytes for each thread.  Only when the total copy size was 64*96 Mbytes = 6 GBytes would non-temporal stores be used.  We'd like to switch to non-temporal stores much sooner, as we will be thrashing all the threads' caches.
>>>>>>
>>>>>> In practical terms, I've had access to typical memcpy copy lengths for a variety of commercial applications while studying memcpy on Solaris over the years.  The vast majority of copies are for 64 Kbytes or less.  Most modern chips have much more than 64 Kbytes of cache per thread, allowing in-cache copies for the common case, even without borrowing cache from other threads.  The occasional really large copies tend to be when an application is passing a block of data to prepare for a new phase of computation or as a shared memory communication to another thread.
>>>>>> In these cases, having the data remain in cache is usually not relevant, and using non-temporal stores even when they are not strictly required does not have a negative effect on performance.
>>>>>>
>>>>>> A downside of tuning for a single thread comes in cloud computing environments, where having neighboring threads being cache hogs, even if relatively isolated in virtual machines, is a "bad thing" for stable system performance.  Whatever we can do to provide consistent, reasonable performance whatever the neighboring threads might be doing is a "good thing".
>>>>>>
>>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
>>>>>
>>>> I have not tested larger thresholds.  I'd be more comfortable with a smaller one.  We could construct specific tests to show either advantage or disadvantage to shifting from 3/4 to all of cache, depending on what data access was used between memcpy operations.
>>>>
>>>> I consider pushing the limit on cache usage to be a risky approach.  Few applications only work on a single block of data.  If all threads are doing a shared copy and they use all the available cache, then after the memcpy returns, any other active data would have been pushed out of the cache.  That's likely to cost severe performance loss in more cases than the modest performance gains in the few cases where the application is only concerned with using the data that was just copied.
>>>>
>>>> Just to give a more detailed example where large copies are not followed by use of the data, consider garbage collection followed by compaction.  With a multi-age garbage collector, stable data that is active and has survived several garbage collections is in an 'old' region.  It does not need to be copied.  The current 'new' region is full but has both referenced and unreferenced data.
>>>> After the marking phase, the individual elements of the referenced data are copied to the base of the 'new' region.  When complete, the rest of the 'new' region becomes the new free pool.  The total amount copied may far exceed the processor cache.  Then the application exits garbage collection and resumes active use of mostly the stable data, with some accesses to the just moved new data and fresh allocations.  If we under-use non-temporal stores, we clear the cache and the whole application runs slower than otherwise.
>>>>
>>>> Individual memcpy benchmarks are useful in isolation testing and comparing code patterns but can mislead about overall application performance in the context of potential for cache abuse.  I fell into that tarpit once while tuning memcpy for Solaris: my new, wonderfully fast copy code (ok, maybe 5% faster for in-cache data) caused a major customer application to run slower because it abused the cache.  I modified my code to only use the new "in-cache fast copy" for copies less than a threshold (64 Kbytes or 128 Kbytes if I remember right) and all was well.
>>>>
>>> The new threshold can be substantially smaller with large core count.  Are you saying that even 3 / 4 may be too big?  Is there a reasonable fixed threshold?
>>>
>> I don't have any evidence to say 3/4 is too big for typical applications and environments.  In 2012, the default for memcpy was set to 1/2 the shared_cache_size, which is still the default for Oracle el7 and Red Hat el7.
>>
>> Given the typically larger caches per thread today than 8 years ago, 3/4 may work out well, since the remaining 1/4 of today's larger cache is often greater than 1/2 of yesteryear's smaller cache.
>>
> Please update the comment with your rationale for 3/4.  Don't use today or current.  Use 2020 instead.
>
> Thanks.
>
I'm unsure about what needs to change in the comment, which does not mention any dates currently. I'm assuming you are referring to the following comment in cacheinfo.c:

  /* The default setting for the non_temporal threshold is 3/4
     of one thread's share of the chip's cache.  While higher
     single thread performance may be observed with a higher
     threshold, having a single thread use more than its share
     of the cache will negatively impact the performance of
     other threads running on the chip.  */

While I could add a comment on why 3/4 vs 1/2 is the best choice, I don't have hard data to back it up. I'd be comfortable with either 3/4 or 1/2. I selected 3/4 as it was closer to the formula you chose in 2017 than to the formula you chose in 2012.

- patrick