From: "H.J. Lu"
Date: Wed, 10 May 2017 17:33:00 -0000
Subject: Re: memcpy performance regressions 2.19 -> 2.24(5)
To: Erich Elsen
Cc: "Carlos O'Donell", GNU C Library

On Tue, May 9, 2017 at 4:48 PM, Erich Elsen wrote:
> I've created a shareable benchmark, available here:
> https://gist.github.com/ekelsen/b66cc085eb39f0495b57679cdb1874fa .
> This is not the one the numbers on the spreadsheet are generated from,
> but the results are similar.

I will take a look.

> I think libc 2.19 chooses sse2_unaligned for all the cpus on the
> spreadsheet.
>
> You can use this to see the difference on Haswell between
> avx_unaligned and avx_unaligned_erms on the readcache and nocache
> benchmarks.  It's true that for readwritecache, which corresponds to
> the libc benchmarks, avx_unaligned_erms is always at least as fast.

I created the hjl/x86/optimize branch with memcpy-sse2-unaligned.S from
glibc 2.19 so that we can compare its performance against the others
with the glibc benchmark.

> You can also use it to see the regression on IvyBridge from 2.19 to
> 2.24.

That is expected, since memcpy-sse2-unaligned.S doesn't use
non-temporal stores.

> Are there standard benchmarks showing that using the non-temporal
> store is a net win even though it causes a 2-3x decrease in single
> threaded performance for some processors?  Or how else is the decision
> about the threshold made?

How responsive is your glibc 2.19 machine while your memcpy benchmark
is running?  I would expect the glibc 2.24 machine to be more
responsive.

There is no perfect number that makes everyone happy.  I am open to
suggestions for improving the compromise.

H.J.
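
The non-temporal path under discussion uses streaming stores, which
write around the cache rather than through it, so a multi-megabyte
copy does not evict other threads' working sets.  Below is a minimal
illustrative sketch using SSE2 intrinsics, assuming a 16-byte aligned
destination and a size that is a multiple of 16; glibc's real
implementation is hand-written assembly and handles the general case.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Copy n bytes with non-temporal (streaming) stores.  Assumes dst is
   16-byte aligned and n is a multiple of 16.  */
static void
copy_nontemporal (void *dst, const void *src, size_t n)
{
  __m128i *d = (__m128i *) dst;
  const __m128i *s = (const __m128i *) src;
  for (size_t i = 0; i < n / 16; i++)
    /* movntdq: the store bypasses the cache hierarchy.  */
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  /* Streaming stores are weakly ordered; fence before the data is
     read by another thread.  */
  _mm_sfence ();
}

This is also why forcing __x86_shared_non_temporal_threshold to
LONG_MAX, as described later in the thread, disables the path
entirely: the size check that selects it never triggers.
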
> Thanks,
> Erich
>
> On Sat, May 6, 2017 at 8:41 AM, H.J. Lu wrote:
>> On Fri, May 5, 2017 at 5:57 PM, Erich Elsen wrote:
>>> Hi Carlos,
>>>
>>> a/b) The number of runs is dependent on the time taken; the number
>>> of iterations was such that each size took at least 500 ms for all
>>> iterations.  For many of the smaller sizes this means 10-100 million
>>> iterations; for the largest size, 64 MB, it was ~60.  10 runs were
>>> launched separately, and the difference between the maximum and the
>>> minimum average was never more than 6% for any size; all of the
>>> regressions are larger than this difference (usually much larger).
>>> The times on the spreadsheet are from a randomly chosen run - it
>>> would be possible to use a median or average, but given the large
>>> size of the effect, it didn't seem necessary.
>>>
>>> b) The machines were idle (background processes only) except for the
>>> test being run.  Boost was disabled.  The benchmark is single
>>> threaded.  I did not explicitly pin the process, but given that the
>>> machine was otherwise idle, it would be surprising if it was
>>> migrated.  I can add this to see if the results change.
>>>
>>> c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy),
>>> and E5-2689 (Sandy); I don't have motherboard or memory info.  The
>>> kernel on the benchmark machines is 3.11.10.
>>>
>>> d) Only bench-memcpy-large would expose the problem, at the largest
>>> sizes.  2.19 did not have bench-memcpy-large.  The current
>>> benchmarks will not reveal the regressions on Ivy and Haswell in the
>>> intermediate size range because they only correspond to the
>>> readwritecache case on the spreadsheet.  That is, they loop over the
>>> same src and dst buffers in the timing loop.
>>>
>>> nocache means that both the src and dst buffers go through memory
>>> with strides such that nothing will be cached.
>>> readcache means that the src buffer is fixed, but the dst buffer
>>> strides through memory.
>>>
>>> To see the difference at the largest sizes with bench-memcpy-large,
>>> you can run it twice, once forcing
>>> __x86_shared_non_temporal_threshold to LONG_MAX so the non-temporal
>>> path is never taken.
>>
>> The purpose of using non-temporal stores is to avoid cache pollution
>> so that the cache is also available to other threads.  We can improve
>> the heuristic for the non-temporal threshold, but we can't give all
>> of the cache to a single thread by default.
>>
>> As for Haswell, there are some cases where the SSSE3 memcpy in
>> glibc 2.19 is faster than the new AVX memcpy.  But the new AVX
>> memcpy is faster than the SSSE3 memcpy in the majority of cases.  The
>> new AVX memcpy in glibc 2.24 replaces the old AVX memcpy in glibc
>> 2.23, so there is no regression from 2.23 to 2.24.
>>
>> I also checked my glibc performance data.  For data > 32K,
>> __memcpy_avx_unaligned is slower than __memcpy_avx_unaligned_erms.
>> We have
>>
>> /* Threshold to use Enhanced REP MOVSB.  Since there is overhead to
>>    set up REP MOVSB operation, REP MOVSB isn't faster on short data.
>>    The memcpy micro benchmark in glibc shows that 2KB is the
>>    approximate value above which REP MOVSB becomes faster than SSE2
>>    optimization on processors with Enhanced REP MOVSB.  Since larger
>>    register size can move more data with a single load and store, the
>>    threshold is higher with larger register size.  */
>> #ifndef REP_MOVSB_THRESHOLD
>> # define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
>> #endif
>>
>> We can change it if there is improvement in glibc benchmarks.
>>
>>
>> H.J.
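
For concreteness, the REP_MOVSB_THRESHOLD formula quoted above scales
the 2 KB SSE2 crossover point with vector width.  A throwaway program
to print the resulting thresholds (16, 32, and 64 bytes being the
SSE2, AVX2, and AVX-512 vector sizes):

#include <stdio.h>

/* Mirrors the macro quoted above.  */
#define REP_MOVSB_THRESHOLD(vec_size) (2048 * ((vec_size) / 16))

int
main (void)
{
  for (int vec_size = 16; vec_size <= 64; vec_size *= 2)
    printf ("VEC_SIZE %2d: REP MOVSB above %d bytes\n",
            vec_size, REP_MOVSB_THRESHOLD (vec_size));
  return 0;
}

So a machine dispatching the AVX variants switches to REP MOVSB only
above 4 KB, and an AVX-512 machine only above 8 KB.
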
>>> e) Yes, I can do this.  It needs to go through approval to share
>>> publicly; it will take a few days.
>>>
>>> Thanks,
>>> Erich
>>>
>>> On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell wrote:
>>>> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>>>>> I had a couple of questions:
>>>>>
>>>>> 1) Are the large regressions at large sizes for IvyBridge and
>>>>> SandyBridge expected?  Is avoiding non-temporal stores a
>>>>> reasonable solution?
>>>>
>>>> No large regressions are expected.
>>>>
>>>>> 2) Is it possible to fix the IvyBridge regressions by using model
>>>>> information to force a specific implementation?  I'm not sure how
>>>>> other cpus (AMD) would be affected if the selection logic were
>>>>> modified based on feature flags.
>>>>
>>>> A different memcpy can be used for any detectable difference in
>>>> hardware.  What you can't do is select a different memcpy for a
>>>> different range of inputs.  You have to make the choice up front
>>>> with only the knowledge of the hardware as your input.  Though
>>>> today we could augment that choice with a glibc tunable set by the
>>>> shell starting the process.
>>>>
>>>> I have questions of my own:
>>>>
>>>> (a) How statistically relevant were your results?
>>>>     - What are your confidence intervals?
>>>>     - What is your standard deviation?
>>>>     - How many runs did you average?
>>>>
>>>> (b) Was your machine hardware stable?
>>>>     - See:
>>>>       https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
>>>>     - What methodology did you use to carry out your tests, e.g.
>>>>       CPU pinning?
>>>>
>>>> (c) Exactly what hardware did you use?
>>>>     - You mention IvyBridge and SandyBridge, but what exact
>>>>       hardware did you use for the tests, and what exact kernel
>>>>       version?
>>>>
>>>> (d) If you run glibc's own microbenchmarks, do you see the same
>>>>     performance problems?  E.g. make bench, and look at the
>>>>     detailed bench-memcpy, bench-memcpy-large, and
>>>>     bench-memcpy-random results.
>>>>
>>>>     https://sourceware.org/glibc/wiki/Testing/Builds
>>>>
>>>> (e) Are you willing to publish your microbenchmark sources for
>>>>     others to confirm the results?
>>>>
>>>> --
>>>> Cheers,
>>>> Carlos.
>>
>>
>> --
>> H.J.

--
H.J.
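
A closing note on the nocache / readcache / readwritecache terminology
Erich defines earlier in the thread: the sketch below shows the
readcache pattern (fixed src, destination striding through a region
larger than the last-level cache, so stores land in cold memory).
Buffer sizes and iteration counts are invented for illustration, and
timing is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define COPY_SIZE   (256 * 1024)          /* bytes per memcpy call */
#define REGION_SIZE (64UL * 1024 * 1024)  /* much larger than the LLC */
#define ITERS       1000

int
main (void)
{
  char *src = malloc (COPY_SIZE);
  char *dst = malloc (REGION_SIZE);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 1, COPY_SIZE);

  size_t offset = 0;
  for (int i = 0; i < ITERS; i++)
    {
      /* readwritecache would keep offset fixed at 0; nocache would
         stride the source in the same way as the destination.  */
      memcpy (dst + offset, src, COPY_SIZE);
      offset += COPY_SIZE;
      if (offset + COPY_SIZE > REGION_SIZE)
        offset = 0;
    }

  printf ("done: %d copies of %d bytes\n", ITERS, COPY_SIZE);
  free (src);
  free (dst);
  return 0;
}
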