From: "H.J. Lu"
Date: Wed, 10 May 2017 17:33:00 -0000
Subject: Re: memcpy performance regressions 2.19 -> 2.24(5)
To: Erich Elsen
Cc: "Carlos O'Donell", GNU C Library

On Tue, May 9, 2017 at 4:48 PM, Erich Elsen wrote:
> I've created a shareable benchmark, available here:
> https://gist.github.com/ekelsen/b66cc085eb39f0495b57679cdb1874fa .
> This is not the one the numbers on the spreadsheet are generated from,
> but the results are similar.

I will take a look.

> I think libc 2.19 chooses sse2_unaligned for all the cpus on the
> spreadsheet.
>
> You can use this to see the difference on Haswell between
> avx_unaligned and avx_unaligned_erms on the readcache and nocache
> benchmarks.  It's true that for readwritecache, which corresponds to
> the libc benchmarks, avx_unaligned_erms is always at least as fast.

I created the hjl/x86/optimize branch with memcpy-sse2-unaligned.S from
glibc 2.19 so that we can compare its performance against the others
with the glibc benchmark.

> You can also use it to see the regression on IvyBridge from 2.19 to
> 2.24.

That is expected, since memcpy-sse2-unaligned.S doesn't use
non-temporal stores.

> Are there standard benchmarks showing that using the non-temporal
> store is a net win even though it causes a 2-3x decrease in single
> threaded performance for some processors?  Or how else is the decision
> about the threshold made?

How responsive is your glibc 2.19 machine while your memcpy benchmark
is running?  I would expect the glibc 2.24 machine to be more
responsive.

There is no perfect number that makes everyone happy.  I am open to
suggestions for improving the compromise.

H.J.
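
The non-temporal path under discussion uses streaming stores, which
write around the cache rather than through it, so a multi-megabyte
copy does not evict other threads' working sets.  Below is a minimal
illustrative sketch using SSE2 intrinsics, assuming a 16-byte aligned
destination and a size that is a multiple of 16; glibc's real
implementation is hand-written assembly and handles the general case.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Copy n bytes with non-temporal (streaming) stores.  Assumes dst is
   16-byte aligned and n is a multiple of 16.  */
static void
copy_nontemporal (void *dst, const void *src, size_t n)
{
  __m128i *d = (__m128i *) dst;
  const __m128i *s = (const __m128i *) src;
  for (size_t i = 0; i < n / 16; i++)
    /* movntdq: the store bypasses the cache hierarchy.  */
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  /* Streaming stores are weakly ordered; fence before the data is
     read by another thread.  */
  _mm_sfence ();
}

This is also why forcing __x86_shared_non_temporal_threshold to
LONG_MAX, as described later in the thread, disables the path
entirely: the size check that selects it never triggers.
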
> Thanks,
> Erich
>
> On Sat, May 6, 2017 at 8:41 AM, H.J. Lu wrote:
>> On Fri, May 5, 2017 at 5:57 PM, Erich Elsen wrote:
>>> Hi Carlos,
>>>
>>> a/b) The number of runs is dependent on the time taken; the number
>>> of iterations was such that each size took at least 500 ms for all
>>> iterations.  For many of the smaller sizes this means 10-100 million
>>> iterations; for the largest size, 64 MB, it was ~60.  10 runs were
>>> launched separately, and the difference between the maximum and the
>>> minimum average was never more than 6% for any size; all of the
>>> regressions are larger than this difference (usually much larger).
>>> The times on the spreadsheet are from a randomly chosen run - it
>>> would be possible to use a median or average, but given the large
>>> size of the effect, it didn't seem necessary.
>>>
>>> b) The machines were idle (background processes only) except for the
>>> test being run.  Boost was disabled.  The benchmark is single
>>> threaded.  I did not explicitly pin the process, but given that the
>>> machine was otherwise idle, it would be surprising if it was
>>> migrated.  I can add this to see if the results change.
>>>
>>> c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy),
>>> and E5-2689 (Sandy); I don't have motherboard or memory info.  The
>>> kernel on the benchmark machines is 3.11.10.
>>>
>>> d) Only bench-memcpy-large would expose the problem, at the largest
>>> sizes.  2.19 did not have bench-memcpy-large.  The current
>>> benchmarks will not reveal the regressions on Ivy and Haswell in the
>>> intermediate size range because they only correspond to the
>>> readwritecache case on the spreadsheet.  That is, they loop over the
>>> same src and dst buffers in the timing loop.
>>>
>>> nocache means that both the src and dst buffers go through memory
>>> with strides such that nothing will be cached.
>>> readcache means that the src buffer is fixed, but the dst buffer
>>> strides through memory.
>>>
>>> To see the difference at the largest sizes with bench-memcpy-large,
>>> you can run it twice, once forcing
>>> __x86_shared_non_temporal_threshold to LONG_MAX so the non-temporal
>>> path is never taken.
>>
>> The purpose of using non-temporal stores is to avoid cache pollution
>> so that the cache is also available to other threads.  We can improve
>> the heuristic for the non-temporal threshold, but we can't give all
>> of the cache to a single thread by default.
>>
>> As for Haswell, there are some cases where the SSSE3 memcpy in
>> glibc 2.19 is faster than the new AVX memcpy.  But the new AVX
>> memcpy is faster than the SSSE3 memcpy in the majority of cases.  The
>> new AVX memcpy in glibc 2.24 replaces the old AVX memcpy in glibc
>> 2.23, so there is no regression from 2.23 to 2.24.
>>
>> I also checked my glibc performance data.  For data > 32K,
>> __memcpy_avx_unaligned is slower than __memcpy_avx_unaligned_erms.
>> We have
>>
>> /* Threshold to use Enhanced REP MOVSB.  Since there is overhead to
>>    set up REP MOVSB operation, REP MOVSB isn't faster on short data.
>>    The memcpy micro benchmark in glibc shows that 2KB is the
>>    approximate value above which REP MOVSB becomes faster than SSE2
>>    optimization on processors with Enhanced REP MOVSB.  Since larger
>>    register size can move more data with a single load and store, the
>>    threshold is higher with larger register size.  */
>> #ifndef REP_MOVSB_THRESHOLD
>> # define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
>> #endif
>>
>> We can change it if there is improvement in glibc benchmarks.
>>
>>
>> H.J.
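
For concreteness, the REP_MOVSB_THRESHOLD formula quoted above scales
the 2 KB SSE2 crossover point with vector width.  A throwaway program
to print the resulting thresholds (16, 32, and 64 bytes being the
SSE2, AVX2, and AVX-512 vector sizes):

#include <stdio.h>

/* Mirrors the macro quoted above.  */
#define REP_MOVSB_THRESHOLD(vec_size) (2048 * ((vec_size) / 16))

int
main (void)
{
  for (int vec_size = 16; vec_size <= 64; vec_size *= 2)
    printf ("VEC_SIZE %2d: REP MOVSB above %d bytes\n",
            vec_size, REP_MOVSB_THRESHOLD (vec_size));
  return 0;
}

So a machine dispatching the AVX variants switches to REP MOVSB only
above 4 KB, and an AVX-512 machine only above 8 KB.
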
>>> e) Yes, I can do this.  It needs to go through approval to share
>>> publicly; it will take a few days.
>>>
>>> Thanks,
>>> Erich
>>>
>>> On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell wrote:
>>>> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>>>>> I had a couple of questions:
>>>>>
>>>>> 1) Are the large regressions at large sizes for IvyBridge and
>>>>> SandyBridge expected?  Is avoiding non-temporal stores a
>>>>> reasonable solution?
>>>>
>>>> No large regressions are expected.
>>>>
>>>>> 2) Is it possible to fix the IvyBridge regressions by using model
>>>>> information to force a specific implementation?  I'm not sure how
>>>>> other cpus (AMD) would be affected if the selection logic were
>>>>> modified based on feature flags.
>>>>
>>>> A different memcpy can be used for any detectable difference in
>>>> hardware.  What you can't do is select a different memcpy for a
>>>> different range of inputs.  You have to make the choice up front
>>>> with only the knowledge of the hardware as your input.  Though
>>>> today we could augment that choice with a glibc tunable set by the
>>>> shell starting the process.
>>>>
>>>> I have questions of my own:
>>>>
>>>> (a) How statistically relevant were your results?
>>>>     - What are your confidence intervals?
>>>>     - What is your standard deviation?
>>>>     - How many runs did you average?
>>>>
>>>> (b) Was your machine hardware stable?
>>>>     - See:
>>>>       https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
>>>>     - What methodology did you use to carry out your tests, e.g.
>>>>       CPU pinning?
>>>>
>>>> (c) Exactly what hardware did you use?
>>>>     - You mention IvyBridge and SandyBridge, but what exact
>>>>       hardware did you use for the tests, and what exact kernel
>>>>       version?
>>>>
>>>> (d) If you run glibc's own microbenchmarks, do you see the same
>>>>     performance problems?  E.g. make bench, and look at the
>>>>     detailed bench-memcpy, bench-memcpy-large, and
>>>>     bench-memcpy-random results.
>>>>
>>>>     https://sourceware.org/glibc/wiki/Testing/Builds
>>>>
>>>> (e) Are you willing to publish your microbenchmark sources for
>>>>     others to confirm the results?
>>>>
>>>> --
>>>> Cheers,
>>>> Carlos.
>>
>>
>> --
>> H.J.

--
H.J.
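
A closing note on the nocache / readcache / readwritecache terminology
Erich defines earlier in the thread: the sketch below shows the
readcache pattern (fixed src, destination striding through a region
larger than the last-level cache, so stores land in cold memory).
Buffer sizes and iteration counts are invented for illustration, and
timing is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define COPY_SIZE   (256 * 1024)          /* bytes per memcpy call */
#define REGION_SIZE (64UL * 1024 * 1024)  /* much larger than the LLC */
#define ITERS       1000

int
main (void)
{
  char *src = malloc (COPY_SIZE);
  char *dst = malloc (REGION_SIZE);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 1, COPY_SIZE);

  size_t offset = 0;
  for (int i = 0; i < ITERS; i++)
    {
      /* readwritecache would keep offset fixed at 0; nocache would
         stride the source in the same way as the destination.  */
      memcpy (dst + offset, src, COPY_SIZE);
      offset += COPY_SIZE;
      if (offset + COPY_SIZE > REGION_SIZE)
        offset = 0;
    }

  printf ("done: %d copies of %d bytes\n", ITERS, COPY_SIZE);
  free (src);
  free (dst);
  return 0;
}
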