From: Erich Elsen
Date: Sat, 06 May 2017 00:57:00 -0000
Subject: Re: memcpy performance regressions 2.19 -> 2.24(5)
To: "Carlos O'Donell"
Cc: libc-alpha@sourceware.org, "H.J. Lu"
X-SW-Source: 2017-05/txt/msg00115.txt.bz2

Hi Carlos,

a/b) The number of runs depends on the time taken: the number of
iterations was chosen so that each size took at least 500 ms across all
iterations. For many of the smaller sizes this means 10-100 million
iterations; for the largest size, 64MB, it was ~60. 10 runs were
launched separately, and the difference between the maximum and minimum
average was never more than 6% for any size; all of the regressions are
larger than this difference (usually much larger). The times on the
spreadsheet are from a randomly chosen run. It would be possible to use
a median or average, but given the large size of the effect, it didn't
seem necessary.

b) The machines were idle (background processes only) except for the
test being run. Turbo Boost was disabled. The benchmark is single
threaded. I did not explicitly pin the process, but given that the
machine was otherwise idle it would be surprising if it was migrated. I
can add pinning to see if the results change.

c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy
Bridge), and E5-2689 (Sandy Bridge); I don't have motherboard or memory
info. The kernel on the benchmark machines is 3.11.10.

d) Only bench-memcpy-large would expose the problem at the largest
sizes; 2.19 did not have bench-memcpy-large. The current benchmarks
will not reveal the regressions on Ivy Bridge and Haswell in the
intermediate size range because they only correspond to the
readwritecache case on the spreadsheet; that is, they loop over the
same src and dst buffers in the timing loop. nocache means that both
the src and dst buffers stride through memory such that nothing stays
cached. readcache means that the src buffer is fixed, but the dst
buffer strides through memory. To see the difference at the largest
sizes with bench-memcpy-large you can run it twice, once forcing
__x86_shared_non_temporal_threshold to LONG_MAX so the non-temporal
path is never taken.
e) Yes, I can do this. It needs to go through approval to share
publicly; it will take a few days.

Thanks,
Erich

On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell wrote:
> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>> I had a couple of questions:
>>
>> 1) Are the large regressions at large sizes for IvyBridge and
>> SandyBridge expected? Is avoiding non-temporal stores a reasonable
>> solution?
>
> No large regressions are expected.
>
>> 2) Is it possible to fix the IvyBridge regressions by using model
>> information to force a specific implementation? I'm not sure how
>> other cpus (AMD) would be affected if the selection logic was
>> modified based on feature flags.
>
> A different memcpy can be used for any detectable difference in
> hardware. What you can't do is select a different memcpy for a
> different range of inputs. You have to make the choice up front with
> only the knowledge of the hardware as your input. Though today we
> could augment that choice with a glibc tunable set by the shell
> starting the process.
>
> I have questions of my own:
>
> (a) How statistically relevant were your results?
>     - What are your confidence intervals?
>     - What is your standard deviation?
>     - How many runs did you average?
>
> (b) Was your machine hardware stable?
>     - See:
>       https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
>     - What methodology did you use to carry out your tests? Like CPU
>       pinning.
>
> (c) Exactly what hardware did you use?
>     - You mention IvyBridge and SandyBridge, but what exact hardware
>       did you use for the tests, and what exact kernel version?
>
> (d) If you run glibc's own microbenchmarks do you see the same
>     performance problems? e.g. make bench, and look at the detailed
>     bench-memcpy, bench-memcpy-large, and bench-memcpy-random results.
>
>     https://sourceware.org/glibc/wiki/Testing/Builds
>
> (e) Are you willing to publish your microbenchmark sources for others
>     to confirm the results?
>
> --
> Cheers,
> Carlos.