From: "H.J. Lu"
Date: Fri, 25 Sep 2020 14:04:12 -0700
Subject: Re: [PATCH v2] Reversing calculation of __x86_shared_non_temporal_threshold
To: Patrick McGehearty
Cc: GNU C Library

On Fri, Sep 25, 2020 at 1:53 PM Patrick McGehearty wrote:
>
>
>
> On 9/24/2020 6:57 PM, H.J. Lu wrote:
> > On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty
> > wrote:
> >>
> >>
> >> On 9/24/2020 4:54 PM, H.J. Lu wrote:
> >>> On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty
> >>> wrote:
> >>>>
> >>>> On 9/23/2020 6:13 PM, H.J. Lu wrote:
> >>>>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty
> >>>>> wrote:
> >>>>>> On 9/23/2020 4:37 PM, H.J. Lu wrote:
> >>>>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty
> >>>>>>> wrote:
> >>>>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote:
> >>>>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha
> >>>>>>>>> wrote:
> >>>>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> >>>>>>>>>> uses non-temporal stores to avoid pushing other data out of the last
> >>>>>>>>>> level cache.
> >>>>>>>>>>
> >>>>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's
> >>>>>>>>>> patch of June 2, 2017.
> >>>>>>>>>>
> >>>>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread
> >>>>>>>>>> getting maximum performance.  It was tuned using the single-threaded
> >>>>>>>>>> large-memcpy micro-benchmark on an 8-core processor.  That patch
> >>>>>>>>>> changed the threshold from using 3/4 of one thread's share of the
> >>>>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system
> >>>>>>>>>> before switching to non-temporal stores.  Multi-threaded systems with
> >>>>>>>>>> more than a few threads are server-class and typically have many
> >>>>>>>>>> active threads.  If one thread consumes 3/4 of the available cache
> >>>>>>>>>> for all threads, it will cause other active threads to have data
> >>>>>>>>>> removed from the cache.  Two examples show the range of the effect.
> >>>>>>>>>> John McCalpin's widely parallel Stream benchmark, which runs in
> >>>>>>>>>> parallel and fetches data sequentially, saw a 20% slowdown with the
> >>>>>>>>>> 2017 patch on an internal system test of 128 threads.  This
> >>>>>>>>>> regression was discovered when comparing OL8 performance to OL7.
> >>>>>>>>>> An example that compares normal stores to non-temporal stores may
> >>>>>>>>>> be found at https://vgatherps.github.io/2018-09-02-nontemporal/.
> >>>>>>>>>> A simple test shows a performance loss of 400 to 500% due to a
> >>>>>>>>>> failure to use non-temporal stores.  These performance losses are
> >>>>>>>>>> most likely to occur when the system load is heaviest and good
> >>>>>>>>>> performance is critical.
> >>>>>>>>>>
> >>>>>>>>>> The tunable x86_non_temporal_threshold can be used to override the
> >>>>>>>>>> default for the knowledgeable user who really wants maximum cache
> >>>>>>>>>> allocation to a single thread in a multi-threaded system.
> >>>>>>>>>> The manual entry for the tunable has been expanded to provide
> >>>>>>>>>> more information about its purpose.
> >>>>>>>>>>
> >>>>>>>>>> modified: sysdeps/x86/cacheinfo.c
> >>>>>>>>>> modified: manual/tunables.texi
> >>>>>>>>>> ---
> >>>>>>>>>>  manual/tunables.texi    |  6 +++++-
> >>>>>>>>>>  sysdeps/x86/cacheinfo.c | 12 +++++++-----
> >>>>>>>>>>  2 files changed, 12 insertions(+), 6 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> >>>>>>>>>> index b6bb54d..94d4fbd 100644
> >>>>>>>>>> --- a/manual/tunables.texi
> >>>>>>>>>> +++ b/manual/tunables.texi
> >>>>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
> >>>>>>>>>>
> >>>>>>>>>>  @deftp Tunable glibc.tune.x86_non_temporal_threshold
> >>>>>>>>>>  The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> >>>>>>>>>> -to set threshold in bytes for non temporal store.
> >>>>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores
> >>>>>>>>>> +give a hint to the hardware to move data directly to memory without
> >>>>>>>>>> +displacing other data from the cache. This tunable is used by some
> >>>>>>>>>> +platforms to determine when to use non temporal stores in operations
> >>>>>>>>>> +like memmove and memcpy.
> >>>>>>>>>>
> >>>>>>>>>>  This tunable is specific to i386 and x86-64.
> >>>>>>>>>>  @end deftp
> >>>>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> index b9444dd..c6767d9 100644
> >>>>>>>>>> --- a/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> +++ b/sysdeps/x86/cacheinfo.c
> >>>>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
> >>>>>>>>>>        __x86_shared_cache_size = shared;
> >>>>>>>>>>      }
> >>>>>>>>>>
> >>>>>>>>>> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> >>>>>>>>>> -     shared cache size is the approximate value above which non-temporal
> >>>>>>>>>> -     store becomes faster on a 8-core processor.  This is the 3/4 of the
> >>>>>>>>>> -     total shared cache size.  */
> >>>>>>>>>> +  /* The default setting for the non_temporal threshold is 3/4
> >>>>>>>>>> +     of one thread's share of the chip's cache. While higher
> >>>>>>>>>> +     single thread performance may be observed with a higher
> >>>>>>>>>> +     threshold, having a single thread use more than it's share
> >>>>>>>>>> +     of the cache will negatively impact the performance of
> >>>>>>>>>> +     other threads running on the chip. */
> >>>>>>>>>>    __x86_shared_non_temporal_threshold
> >>>>>>>>>>      = (cpu_features->non_temporal_threshold != 0
> >>>>>>>>>>         ? cpu_features->non_temporal_threshold
> >>>>>>>>>> -       : __x86_shared_cache_size * threads * 3 / 4);
> >>>>>>>>>> +       : __x86_shared_cache_size * 3 / 4);
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>> Can we tune it with the number of threads and/or total cache
> >>>>>>>>> size?
> >>>>>>>>>
> >>>>>>>> When you say "total cache size", is that different from
> >>>>>>>> shared_cache_size * threads?
> >>>>>>>>
> >>>>>>>> I see a fundamental conflict of optimization goals:
> >>>>>>>> 1) Provide the best single-thread performance (current code)
> >>>>>>>> 2) Provide the best overall system performance under full load (proposed patch)
> >>>>>>>> I don't know of any way to have the default behavior meet both goals
> >>>>>>>> without knowledge of the system size/usage/requirements.
> >>>>>>>>
> >>>>>>>> Consider a hypothetical single-chip system with 64 threads and 128 MB
> >>>>>>>> of total cache on the chip.  That won't be uncommon in the coming
> >>>>>>>> years on server-class systems, especially in large databases or HPC
> >>>>>>>> environments (think vision processing or weather modeling, for
> >>>>>>>> example).  Suppose a single app owns the whole chip, runs a
> >>>>>>>> multi-threaded workload, and needs to memcpy a really large block of
> >>>>>>>> data when one phase of computation finishes before moving to the next
> >>>>>>>> phase.  A common practice would be to have 64 parallel calls to
> >>>>>>>> memcpy.  The Stream benchmark demonstrates with OpenMP that current
> >>>>>>>> compilers handle that with no trouble.
> >>>>>>>>
> >>>>>>>> In the example, the per-thread share of the cache is 2 MB and the
> >>>>>>>> proposed formula will set the threshold at 1.5 MBytes.  If the total
> >>>>>>>> copy size is 96 MBytes or less, all threads comfortably fit in cache.
> >>>>>>>> If the total copy size is over that, then non-temporal stores are
> >>>>>>>> used and all is well there too.
> >>>>>>>>
> >>>>>>>> The current formula would set the threshold at 96 MBytes for each
> >>>>>>>> thread.  Only when the total copy size was 64*96 MBytes = 6 GBytes
> >>>>>>>> would non-temporal stores be used.  We'd like to switch to
> >>>>>>>> non-temporal stores much sooner, as we will be thrashing all the
> >>>>>>>> threads' caches.
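
To make the arithmetic in the example above concrete, here is a minimal
sketch in C that computes both thresholds for the hypothetical 64-thread,
128 MB chip; the variable names are illustrative and are not glibc's
internal names.

    #include <stdio.h>

    int
    main (void)
    {
      unsigned long total_cache = 128UL * 1024 * 1024; /* 128 MB on chip.  */
      unsigned long threads = 64;
      unsigned long share = total_cache / threads;     /* 2 MB per thread.  */

      /* Proposed formula: 3/4 of one thread's share of the cache.  */
      unsigned long proposed = share * 3 / 4;          /* 1.5 MB.  */

      /* 2017 formula: 3/4 of the entire cache (share * threads).  */
      unsigned long current = share * threads * 3 / 4; /* 96 MB.  */

      printf ("proposed threshold: %lu bytes\n", proposed);
      printf ("2017 threshold:     %lu bytes\n", current);
      return 0;
    }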
> >>>>>>>>
> >>>>>>>> In practical terms, I've had access to typical memcpy copy lengths
> >>>>>>>> for a variety of commercial applications while studying memcpy on
> >>>>>>>> Solaris over the years.  The vast majority of copies are for
> >>>>>>>> 64 KBytes or less.  Most modern chips have much more than 64 KBytes
> >>>>>>>> of cache per thread, allowing in-cache copies for the common case,
> >>>>>>>> even without borrowing cache from other threads.  The occasional
> >>>>>>>> really large copies tend to happen when an application is passing a
> >>>>>>>> block of data to prepare for a new phase of computation or as a
> >>>>>>>> shared-memory communication to another thread.  In these cases,
> >>>>>>>> having the data remain in cache is usually not relevant, and using
> >>>>>>>> non-temporal stores even when they are not strictly required does
> >>>>>>>> not have a negative effect on performance.
> >>>>>>>>
> >>>>>>>> A downside of tuning for a single thread comes in cloud computing
> >>>>>>>> environments, where having neighboring threads be cache hogs, even
> >>>>>>>> if relatively isolated in virtual machines, is a "bad thing" for
> >>>>>>>> stable system performance.  Whatever we can do to provide
> >>>>>>>> consistent, reasonable performance regardless of what the
> >>>>>>>> neighboring threads might be doing is a "good thing".
> >>>>>>>>
> >>>>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4?
> >>>>>>>
> >>>>>> I have not tested larger thresholds.  I'd be more comfortable with a
> >>>>>> smaller one.  We could construct specific tests to show either an
> >>>>>> advantage or a disadvantage to shifting from 3/4 to all of the cache,
> >>>>>> depending on what data access was used between memcpy operations.
> >>>>>>
> >>>>>> I consider pushing the limit on cache usage to be a risky approach.
> >>>>>> Few applications only work on a single block of data.  If all threads
> >>>>>> are doing a shared copy and they use all the available cache, then
> >>>>>> after the memcpy returns, any other active data will have been pushed
> >>>>>> out of the cache.  That's likely to cause severe performance loss in
> >>>>>> more cases than the modest performance gains in the few cases where
> >>>>>> the application is only concerned with using the data that was just
> >>>>>> copied.
> >>>>>>
> >>>>>> To give a more detailed example where large copies are not followed
> >>>>>> by use of the data, consider garbage collection followed by
> >>>>>> compaction.  With a multi-age garbage collector, stable data that is
> >>>>>> active and has survived several garbage collections is in an 'old'
> >>>>>> region.  It does not need to be copied.  The current 'new' region is
> >>>>>> full but has both referenced and unreferenced data.  After the
> >>>>>> marking phase, the individual elements of the referenced data are
> >>>>>> copied to the base of the 'new' region.  When complete, the rest of
> >>>>>> the 'new' region becomes the new free pool.  The total amount copied
> >>>>>> may far exceed the processor cache.  Then the application exits
> >>>>>> garbage collection and resumes active use of mostly the stable data,
> >>>>>> with some accesses to the just-moved new data and fresh allocations.
> >>>>>> If we under-use non-temporal stores, we clear the cache and the whole
> >>>>>> application runs slower than otherwise.
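
For readers unfamiliar with the mechanism under discussion, here is a
minimal sketch of a copy loop built on the SSE2 streaming-store intrinsics.
It illustrates the technique only and is not the code glibc actually uses;
it assumes 16-byte-aligned buffers and a length that is a multiple of 16.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Copy N bytes from SRC to DST with non-temporal (streaming) stores.
       Assumes 16-byte-aligned pointers and N a multiple of 16.  */
    static void
    stream_copy (char *dst, const char *src, size_t n)
    {
      for (size_t i = 0; i < n; i += 16)
        {
          __m128i v = _mm_load_si128 ((const __m128i *) (src + i));
          /* movntdq: write directly to memory, bypassing the caches.  */
          _mm_stream_si128 ((__m128i *) (dst + i), v);
        }
      /* Order the streaming stores before any later loads and stores.  */
      _mm_sfence ();
    }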
> >>>>>>
> >>>>>> Individual memcpy benchmarks are useful for isolation testing and for
> >>>>>> comparing code patterns, but they can mislead about overall
> >>>>>> application performance in the context of potential cache abuse.  I
> >>>>>> fell into that tarpit once while tuning memcpy for Solaris: my new,
> >>>>>> wonderfully fast copy code (OK, maybe 5% faster for in-cache data)
> >>>>>> caused a major customer application to run slower because the new
> >>>>>> code abused the cache.  I modified my code to only use the new
> >>>>>> "in-cache fast copy" for copies less than a threshold (64 KBytes or
> >>>>>> 128 KBytes, if I remember right) and all was well.
> >>>>>>
> >>>>> The new threshold can be substantially smaller with a large core
> >>>>> count.  Are you saying that even 3 / 4 may be too big?  Is there a
> >>>>> reasonable fixed threshold?
> >>>>>
> >>>> I don't have any evidence to say 3/4 is too big for typical
> >>>> applications and environments.  In 2012, the default for memcpy was
> >>>> set to 1/2 of shared_cache_size, which is the current default for
> >>>> Oracle el7 and Red Hat el7.
> >>>>
> >>>> Given the typically larger caches per thread today than 8 years ago,
> >>>> 3/4 may work out well, since the remaining 1/4 of today's larger cache
> >>>> is often greater than 1/2 of yesteryear's smaller cache.
> >>>>
> >>> Please update the comment with your rationale for 3/4.  Don't use
> >>> "today" or "current".  Use 2020 instead.
> >>>
> >>> Thanks.
> >>>
> >> I'm unsure about what needs to change in the comment, which does not
> >> mention any dates currently.  I'm assuming you are referring to the
> >> following comment in cacheinfo.c:
> >>
> >>   /* The default setting for the non_temporal threshold is 3/4
> >>      of one thread's share of the chip's cache. While higher
> >>      single thread performance may be observed with a higher
> >>      threshold, having a single thread use more than it's share
> >>      of the cache will negatively impact the performance of
> >>      other threads running on the chip. */
> >>
> >> While I could add a comment on why 3/4 vs 1/2 is the best choice, I
> >> don't have hard data to back it up.  I'd be comfortable with either 3/4
> >> or 1/2.  I selected 3/4 as it was closer to the formula you chose in
> >> 2017 rather than the formula you chose in 2012.
> > The comment is for readers 5 years from now who may be wondering where
> > 3/4 came from.  Just add something close to what you have said above.
>
> Before I redo the commit and resubmit the whole patch, I thought I'd
> present a revised comment for review.  The value of 500 KB to 2 MB per
> thread is based on a quick review of the Wikipedia entries for Intel and
> AMD processors released since 2017.  There may be a few outliers, but
> the vast majority fit that range for L3 per thread.  I tried to balance
> giving a sense of the situation without diving too deeply into
> application-specific details.
>
> Comment in v2:
>   /* The default setting for the non_temporal threshold is 3/4
>      of one thread's share of the chip's cache. While higher
>      single thread performance may be observed with a higher
>      threshold, having a single thread use more than it's share
>      of the cache will negatively impact the performance of
>      other threads running on the chip. */
>
> Proposed comment for v3:
>   /* The default setting for the non_temporal threshold is 3/4 of one
>      thread's share of the chip's cache.
>      For most Intel and AMD processors with an initial release date
>      between 2017 and 2020, a thread's typical share of the cache is
>      from 500 KBytes to 2 MBytes.  Using the 3/4 threshold leaves
>      125 KBytes to 500 KBytes of the thread's data in cache after a
>      maximum temporal copy, which will maintain in cache a reasonable
>      portion of the thread's stack and other active data.  If the
>      threshold is set higher than one thread's share of the cache, it
>      has a substantial risk of negatively impacting the performance of
>      other threads running on the chip. */
>

Comments look good.  Please submit the patch with the updated comment.

Thanks.

-- 
H.J.
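
As a usage note for the override discussed in this thread: tunables are
read from the GLIBC_TUNABLES environment variable at process startup, so
the override must be in the environment before the program starts, e.g.
GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=16777216 ./app from a
shell.  Below is a minimal C sketch of a launcher that does the same; the
program name "./app" and the 16 MB value are placeholders, not
recommendations.

    #include <stdio.h>
    #include <unistd.h>

    int
    main (void)
    {
      /* Tunables are applied when the target process starts, so the
         override goes into its environment.  16 MB is an example value.  */
      char *const argv[] = { "./app", NULL };
      char *const envp[] =
        { "GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=16777216",
          NULL };
      execve ("./app", argv, envp);
      perror ("execve");  /* Only reached if the exec fails.  */
      return 1;
    }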