Subject: Re: [PATCH v2] Single threaded stdio optimization
To: Szabolcs Nagy, Siddhesh Poyarekar, Joseph Myers
Cc: nd@arm.com, GNU C Library, "triegel@redhat.com"
From: Carlos O'Donell
Date: Fri, 30 Jun 2017 16:07:00 -0000
In-Reply-To: <595662AC.7080607@arm.com>

On 06/30/2017 10:39 AM, Szabolcs Nagy wrote:
> On 30/06/17 14:16, Carlos O'Donell wrote:
>> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>>> i didn't dig into the root cause of the regression (or why static
>>> linking is slower); i would not be too worried about it, since the
>>> common case for hot stdio loops is single-threaded processes, where
>>> even on x86 the patch gives a >2x speedup.
>>
>> Regardless of the cause, the 15% regression in x86 MT performance
>> is serious, and I see no reason to push this into glibc 2.26.
>> We can add it any time in 2.27, or the distros can pick it up as
>> a backport.
>>
>> I would like to see a better characterization of the regression
>> before accepting this patch.
>>
>> While I agree that the common case for hot stdio loops is non-MT,
>> there are still MT cases, and 15% is a large double-digit loss.
>>
>> Have you looked at the assembly differences? What is the compiler
>> doing differently?
>>
>> When a user asks "Why is my MT stdio 15% slower?" we owe them an
>> answer that is clear and concise.
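(Context for the exchange below: the patch under discussion lets the
getc family skip the stream lock when the process is single-threaded.
A minimal sketch of that technique follows; the need_stdio_lock
variable is hypothetical, standing in for the per-FILE state the real
patch maintains, and this is not the actual glibc code:)

    /* Sketch only: take the unlocked fast path while the process is
       single-threaded; fall back to the locked path once a second
       thread exists.  getc_unlocked/flockfile/funlockfile are POSIX.  */
    #include <stdio.h>

    static int need_stdio_lock;  /* hypothetical: set when a thread is created */

    int
    sketch_getc (FILE *fp)
    {
      if (!need_stdio_lock)         /* single-threaded: no lock, no atomics */
        return getc_unlocked (fp);
      flockfile (fp);               /* multi-threaded: lock the stream */
      int c = getc_unlocked (fp);
      funlockfile (fp);
      return c;
    }

The MT-side cost of such a scheme is one extra, well-predicted branch
per call, which matches the 15-vs-16 branch and 43-vs-45 instruction
counts reported below.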
> sorry, the x86 measurement was bogus: only the high-level code
> thought the process was multi-threaded; the lowlevellock code
> thought it was single-threaded, so no atomic ops were executed in
> the stdio_mt case.

OK.

> with atomics, the original performance is significantly slower, so
> the regression relative to that is small in percentage terms.
>
> if i create a dummy thread (to measure true MT behaviour, same
> loop count):
>
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
> 20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
> 0inputs+0outputs (0major+180minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
> 20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
> 0inputs+0outputs (0major+179minor)pagefaults 0swaps
>
> the relative diff is 2% now, and note that the absolute diff went
> down too (which points to a uarch issue in the previous
> measurement).

OK. This is much better.

> perf stat indicates 15 vs 16 branches in the loop (so my patch
> indeed adds one branch, but there are plenty of branches already);
> the instruction count goes from 43 to 45 per loop iteration
> (flag check + branch).
>
> how, in my previous measurements, +1 branch could decrease
> performance by >10% when there are already >10 branches (and
> several other insns) is something the x86 uarchitects would have
> to explain.
>
> in summary, the patch trades 2% MT performance for 2x non-MT
> performance on this x86 cpu.

Excellent, this is exactly the analysis I was looking for, and this
kind of result is something that can make sense to our users.

I'm OK with the patch for 2.26.

--
Cheers,
Carlos.
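(The getchar_mt benchmark itself is not shown in the thread; the
following is a plausible reconstruction under stated assumptions: a
dummy thread is created solely to force glibc onto its multi-threaded
locking paths, and the hot loop is a plain getchar over a large
input. The file name and structure are guesses:)

    /* Hypothetical getchar_mt-style benchmark.
       Build: gcc -O2 -pthread getchar_mt.c -o getchar_mt  */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *
    idle (void *arg)
    {
      pause ();   /* keep the thread alive but inactive */
      return arg;
    }

    int
    main (void)
    {
      pthread_t t;
      /* The mere existence of a second thread makes stdio lock.  */
      if (pthread_create (&t, NULL, idle, NULL) != 0)
        return 1;
      long n = 0;
      while (getchar () != EOF)   /* hot stdio loop under test */
        n++;
      fprintf (stderr, "%ld bytes\n", n);
      return 0;
    }

Run against each build as in the transcript above, e.g.
time $build/lib64/ld-2.25.90.so --library-path $build/lib64
./getchar_mt < bigfile, and per-iteration branch and instruction
counts can be read from perf stat -e instructions,branches.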