From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold
From: Patrick McGehearty
To: "H.J. Lu"
Cc: GNU C Library
Date: Thu, 24 Sep 2020 18:22:34 -0500
Message-ID: <9f218fdc-dfc1-1458-f486-20af915017b9@oracle.com>
References: <1600891781-9272-1-git-send-email-patrick.mcgehearty@oracle.com>
 <9bdaaf47-3a20-6921-7d4b-6d428a06d4fc@oracle.com>
 <3f5e95c7-8601-ef86-d4c1-8e16005614d0@oracle.com>
 <4e422308-f935-e151-e1ce-175fd199f84f@oracle.com>
List-Id: Libc-alpha mailing list

On 9/24/2020 4:54
PM, H.J. Lu wrote:
> On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty wrote:
>>
>> On 9/23/2020 6:13 PM, H.J. Lu wrote:
>>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty wrote:
>>>>
>>>> On 9/23/2020 4:37 PM, H.J. Lu wrote:
>>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty wrote:
>>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
>>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha wrote:
>>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
>>>>>>>> uses non_temporal stores to avoid pushing other data out of the last
>>>>>>>> level cache.
>>>>>>>>
>>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
>>>>>>>> patch of June 2, 2017.
>>>>>>>>
>>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
>>>>>>>> getting maximum performance.  It was tuned using the single threaded
>>>>>>>> large memcpy micro benchmark on an 8 core processor.  That change
>>>>>>>> moved the threshold from using 3/4 of one thread's share of the
>>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system
>>>>>>>> before switching to non-temporal stores.  Multi-threaded systems with
>>>>>>>> more than a few threads are server-class and typically have many
>>>>>>>> active threads.  If one thread consumes 3/4 of the available cache for
>>>>>>>> all threads, it will cause other active threads to have data removed
>>>>>>>> from the cache.  Two examples show the range of the effect.  John
>>>>>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel
>>>>>>>> and fetches data sequentially, saw a 20% slowdown with this patch on
>>>>>>>> an internal system test of 128 threads.  This regression was discovered
>>>>>>>> when comparing OL8 performance to OL7.
An example that compares
>>>>>>>> normal stores to non-temporal stores may be found at
>>>>>>>> https://vgatherps.github.io/2018-09-02-nontemporal/ .  A simple test
>>>>>>>> shows performance loss of 400 to 500% due to a failure to use
>>>>>>>> nontemporal stores.  These performance losses are most likely to occur
>>>>>>>> when the system load is heaviest and good performance is critical.
>>>>>>>>
>>>>>>>> The tunable x86_non_temporal_threshold can be used to override the
>>>>>>>> default for the knowledgeable user who really wants maximum cache
>>>>>>>> allocation to a single thread in a multi-threaded system.
>>>>>>>> The manual entry for the tunable has been expanded to provide
>>>>>>>> more information about its purpose.
>>>>>>>>
>>>>>>>> modified: sysdeps/x86/cacheinfo.c
>>>>>>>> modified: manual/tunables.texi
>>>>>>>> ---
>>>>>>>>  manual/tunables.texi    |  6 +++++-
>>>>>>>>  sysdeps/x86/cacheinfo.c | 12 +++++++-----
>>>>>>>>  2 files changed, 12 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>>>>>>>> index b6bb54d..94d4fbd 100644
>>>>>>>> --- a/manual/tunables.texi
>>>>>>>> +++ b/manual/tunables.texi
>>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>>>>>>>>
>>>>>>>>  @deftp Tunable glibc.tune.x86_non_temporal_threshold
>>>>>>>>  The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
>>>>>>>> -to set threshold in bytes for non temporal store.
>>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores
>>>>>>>> +give a hint to the hardware to move data directly to memory without
>>>>>>>> +displacing other data from the cache. This tunable is used by some
>>>>>>>> +platforms to determine when to use non temporal stores in operations
>>>>>>>> +like memmove and memcpy.
>>>>>>>>
>>>>>>>>  This tunable is specific to i386 and x86-64.
>>>>>>>>  @end deftp
>>>>>>>>
>>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
>>>>>>>> index b9444dd..c6767d9 100644
>>>>>>>> --- a/sysdeps/x86/cacheinfo.c
>>>>>>>> +++ b/sysdeps/x86/cacheinfo.c
>>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>>>>>>>>        __x86_shared_cache_size = shared;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
>>>>>>>> -     shared cache size is the approximate value above which non-temporal
>>>>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
>>>>>>>> -     total shared cache size.  */
>>>>>>>> +  /* The default setting for the non_temporal threshold is 3/4
>>>>>>>> +     of one thread's share of the chip's cache.  While higher
>>>>>>>> +     single thread performance may be observed with a higher
>>>>>>>> +     threshold, having a single thread use more than its share
>>>>>>>> +     of the cache will negatively impact the performance of
>>>>>>>> +     other threads running on the chip.  */
>>>>>>>>    __x86_shared_non_temporal_threshold
>>>>>>>>      = (cpu_features->non_temporal_threshold != 0
>>>>>>>>        ? cpu_features->non_temporal_threshold
>>>>>>>> -      : __x86_shared_cache_size * threads * 3 / 4);
>>>>>>>> +      : __x86_shared_cache_size * 3 / 4);
>>>>>>>>  }
>>>>>>>>
>>>>>>> Can we tune it with the number of threads and/or total cache size?
>>>>>>>
>>>>>> When you say "total cache size", is that different from shared_cache_size * threads?
>>>>>>
>>>>>> I see a fundamental conflict of optimization goals:
>>>>>> 1) Provide best single thread performance (current code)
>>>>>> 2) Provide best overall system performance under full load (proposed patch)
>>>>>> I don't know of any way to have default behavior meet both goals without knowledge of the system size/usage/requirements.
>>>>>>
>>>>>> Consider a hypothetical single chip system with 64 threads and 128 MB of total cache on the chip.
>>>>>> That won't be uncommon in the coming years on server class systems, especially in large databases or HPC environments (think vision processing or weather modeling for example).  Suppose a single app owns the whole chip, runs a multi-threaded application, and needs to memcpy a really large block of data when one phase of computation finishes before moving to the next phase.  A common practice would be to have 64 parallel calls to memcpy.  The Stream benchmark demonstrates with OpenMP that current compilers handle that with no trouble.
>>>>>>
>>>>>> In the example, the per thread share of the cache is 2 MB and the proposed formula will set the threshold at 1.5 Mbytes.  If the total copy size is 96 Mbytes or less, all threads comfortably fit in cache.  If the total copy size is over that, then non-temporal stores are used and all is well there too.
>>>>>>
>>>>>> The current formula would set the threshold at 96 Mbytes for each thread.  Only when the total copy size was 64*96 Mbytes = 6 GBytes would non-temporal stores be used.  We'd like to switch to non-temporal stores much sooner, as we will be thrashing all the threads' caches.
>>>>>>
>>>>>> In practical terms, I've had access to typical memcpy copy lengths for a variety of commercial applications while studying memcpy on Solaris over the years.  The vast majority of copies are for 64 Kbytes or less.  Most modern chips have much more than 64 Kbytes of cache per thread, allowing in-cache copies for the common case, even without borrowing cache from other threads.  The occasional really large copies tend to be when an application is passing a block of data to prepare for a new phase of computation or as a shared memory communication to another thread.
>>>>>> In these cases, having the data remain in cache is usually not relevant, and using non-temporal stores even when they are not strictly required does not have a negative effect on performance.
>>>>>>
>>>>>> A downside of tuning for a single thread comes in cloud computing environments, where having neighboring threads being cache hogs, even if relatively isolated in virtual machines, is a "bad thing" for stable system performance.  Whatever we can do to provide consistent, reasonable performance whatever the neighboring threads might be doing is a "good thing".
>>>>>>
>>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
>>>>>
>>>> I have not tested larger thresholds.  I'd be more comfortable with a smaller one.  We could construct specific tests to show either advantage or disadvantage to shifting from 3/4 to all of cache, depending on what data access was used between memcpy operations.
>>>>
>>>> I consider pushing the limit on cache usage to be a risky approach.  Few applications only work on a single block of data.  If all threads are doing a shared copy and they use all the available cache, then after the memcpy returns, any other active data would have been pushed out of the cache.  That's likely to cost severe performance loss in more cases than the modest performance gains in the few cases where the application is only concerned with using the data that was just copied.
>>>>
>>>> Just to give a more detailed example where large copies are not followed by use of the data, consider garbage collection followed by compaction.  With a multi-age garbage collector, stable data that is active and has survived several garbage collections is in an 'old' region.  It does not need to be copied.  The current 'new' region is full but has both referenced and unreferenced data.
>>>> After the marking phase, the individual elements of the referenced data are copied to the base of the 'new' region.  When complete, the rest of the 'new' region becomes the new free pool.  The total amount copied may far exceed the processor cache.  Then the application exits garbage collection and resumes active use of mostly the stable data, with some accesses to the just moved new data and fresh allocations.  If we under-use non-temporal stores, we clear the cache and the whole application runs slower than otherwise.
>>>>
>>>> Individual memcpy benchmarks are useful in isolation testing and comparing code patterns but can mislead about overall application performance in the context of potential for cache abuse.  I fell into that tarpit once while tuning memcpy for Solaris: my new, wonderfully fast copy code (ok, maybe 5% faster for in-cache data) caused a major customer application to run slower because it abused the cache.  I modified my code to only use the new "in-cache fast copy" for copies less than a threshold (64 Kbytes or 128 Kbytes if I remember right) and all was well.
>>>>
>>> The new threshold can be substantially smaller with large core count.  Are you saying that even 3 / 4 may be too big?  Is there a reasonable fixed threshold?
>>>
>> I don't have any evidence to say 3/4 is too big for typical applications and environments.  In 2012, the default for memcpy was set to 1/2 the shared_cache_size, which is still the default for Oracle el7 and Red Hat el7.
>>
>> Given the typically larger caches per thread today than 8 years ago, 3/4 may work out well, since the remaining 1/4 of today's larger cache is often greater than 1/2 of yesteryear's smaller cache.
>>
> Please update the comment with your rationale for 3/4.  Don't use today or current.  Use 2020 instead.
>
> Thanks.
>
I'm unsure about what needs to change in the comment, which does not mention any dates currently. I'm assuming you are referring to the following comment in cacheinfo.c:

  /* The default setting for the non_temporal threshold is 3/4
     of one thread's share of the chip's cache.  While higher
     single thread performance may be observed with a higher
     threshold, having a single thread use more than its share
     of the cache will negatively impact the performance of
     other threads running on the chip.  */

While I could add a comment on why 3/4 vs 1/2 is the best choice, I don't have hard data to back it up. I'd be comfortable with either 3/4 or 1/2. I selected 3/4 as it was closer to the formula you chose in 2017 than to the formula you chose in 2012.

- patrick