From: Erich Elsen
Date: Sat, 06 May 2017 00:57:00 -0000
Subject: Re: memcpy performance regressions 2.19 -> 2.24(5)
To: "Carlos O'Donell"
Cc: libc-alpha@sourceware.org, "H.J. Lu"
X-SW-Source: 2017-05/txt/msg00115.txt.bz2

Hi Carlos,

a/b) The number of runs depends on the time taken: the number of
iterations was chosen so that each size took at least 500 ms across all
iterations. For many of the smaller sizes this means 10-100 million
iterations; for the largest size, 64MB, it was ~60. 10 runs were
launched separately, and the difference between the maximum and minimum
average was never more than 6% for any size; all of the regressions are
larger than this difference (usually much larger). The times on the
spreadsheet are from a randomly chosen run. It would be possible to use
a median or average, but given the large size of the effect, it didn't
seem necessary.

b) The machines were idle (background processes only) except for the
test being run. Turbo Boost was disabled. The benchmark is single
threaded. I did not explicitly pin the process, but given that the
machine was otherwise idle it would be surprising if it was migrated. I
can add pinning to see if the results change.

c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy
Bridge), and E5-2689 (Sandy Bridge); I don't have motherboard or memory
info. The kernel on the benchmark machines is 3.11.10.

d) Only bench-memcpy-large would expose the problem at the largest
sizes; 2.19 did not have bench-memcpy-large. The current benchmarks
will not reveal the regressions on Ivy Bridge and Haswell in the
intermediate size range because they only correspond to the
readwritecache case on the spreadsheet; that is, they loop over the
same src and dst buffers in the timing loop. nocache means that both
the src and dst buffers stride through memory such that nothing stays
cached. readcache means that the src buffer is fixed, but the dst
buffer strides through memory. To see the difference at the largest
sizes with bench-memcpy-large you can run it twice, once forcing
__x86_shared_non_temporal_threshold to LONG_MAX so the non-temporal
path is never taken.
e) Yes, I can do this. It needs to go through approval to share
publicly; it will take a few days.

Thanks,
Erich

On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell wrote:
> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>> I had a couple of questions:
>>
>> 1) Are the large regressions at large sizes for IvyBridge and
>> SandyBridge expected? Is avoiding non-temporal stores a reasonable
>> solution?
>
> No large regressions are expected.
>
>> 2) Is it possible to fix the IvyBridge regressions by using model
>> information to force a specific implementation? I'm not sure how
>> other cpus (AMD) would be affected if the selection logic was
>> modified based on feature flags.
>
> A different memcpy can be used for any detectable difference in
> hardware. What you can't do is select a different memcpy for a
> different range of inputs. You have to make the choice up front with
> only the knowledge of the hardware as your input. Though today we
> could augment that choice with a glibc tunable set by the shell
> starting the process.
>
> I have questions of my own:
>
> (a) How statistically relevant were your results?
>     - What are your confidence intervals?
>     - What is your standard deviation?
>     - How many runs did you average?
>
> (b) Was your machine hardware stable?
>     - See:
>       https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
>     - What methodology did you use to carry out your tests? Like CPU
>       pinning.
>
> (c) Exactly what hardware did you use?
>     - You mention IvyBridge and SandyBridge, but what exact hardware
>       did you use for the tests, and what exact kernel version?
>
> (d) If you run glibc's own microbenchmarks do you see the same
>     performance problems? e.g. make bench, and look at the detailed
>     bench-memcpy, bench-memcpy-large, and bench-memcpy-random results.
>
>     https://sourceware.org/glibc/wiki/Testing/Builds
>
> (e) Are you willing to publish your microbenchmark sources for others
>     to confirm the results?
>
> --
> Cheers,
> Carlos.