Subject: Re: [PATCH v2] Single threaded stdio optimization
To: Szabolcs Nagy, Siddhesh Poyarekar, Joseph Myers
Cc: nd@arm.com, GNU C Library, "triegel@redhat.com"
From: Carlos O'Donell
Date: Fri, 30 Jun 2017 16:07:00 -0000
In-Reply-To: <595662AC.7080607@arm.com>

On 06/30/2017 10:39 AM, Szabolcs Nagy wrote:
> On 30/06/17 14:16, Carlos O'Donell wrote:
>> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>>> i didn't dig into the root cause of the regression (or why static
>>> linking is slower); i would not be too worried about it, since the
>>> common case for hot stdio loops is single-threaded processes, where
>>> even on x86 the patch gives a >2x speedup.
>>
>> Regardless of the cause, the 15% regression in x86 MT performance
>> is serious, and I see no reason to push this into glibc 2.26.
>> We can add it any time in 2.27, or the distros can pick it up as
>> a backport.
>>
>> I would like to see a better characterization of the regression
>> before accepting this patch.
>>
>> While I agree that the common case for hot stdio loops is non-MT,
>> there are still MT cases, and 15% is a large double-digit loss.
>>
>> Have you looked at the assembly differences? What is the compiler
>> doing differently?
>>
>> When a user asks "Why is my MT stdio 15% slower?" we owe them an
>> answer that is clear and concise.
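(Context for the exchange below: the patch under discussion lets the
getc family skip the stream lock when the process is single-threaded.
A minimal sketch of that technique follows; the need_stdio_lock
variable is hypothetical, standing in for the per-FILE state the real
patch maintains, and this is not the actual glibc code:)

    /* Sketch only: take the unlocked fast path while the process is
       single-threaded; fall back to the locked path once a second
       thread exists.  getc_unlocked/flockfile/funlockfile are POSIX.  */
    #include <stdio.h>

    static int need_stdio_lock;  /* hypothetical: set when a thread is created */

    int
    sketch_getc (FILE *fp)
    {
      if (!need_stdio_lock)         /* single-threaded: no lock, no atomics */
        return getc_unlocked (fp);
      flockfile (fp);               /* multi-threaded: lock the stream */
      int c = getc_unlocked (fp);
      funlockfile (fp);
      return c;
    }

The MT-side cost of such a scheme is one extra, well-predicted branch
per call, which matches the 15-vs-16 branch and 43-vs-45 instruction
counts reported below.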
> sorry, the x86 measurement was bogus: only the high-level code
> thought the process was multi-threaded; the lowlevellock code
> thought it was single-threaded, so no atomic ops were executed in
> the stdio_mt case.

OK.

> with atomics, the original performance is significantly slower, so
> the regression relative to that is small in percentage terms.
>
> if i create a dummy thread (to measure true MT behaviour, same
> loop count):
>
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
> 20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
> 0inputs+0outputs (0major+180minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
> 20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
> 0inputs+0outputs (0major+179minor)pagefaults 0swaps
>
> the relative diff is 2% now, and note that the absolute diff went
> down too (which points to a uarch issue in the previous
> measurement).

OK. This is much better.

> perf stat indicates 15 vs 16 branches in the loop (so my patch
> indeed adds one branch, but there are plenty of branches already);
> the instruction count goes from 43 to 45 per loop iteration
> (flag check + branch).
>
> how, in my previous measurements, +1 branch could decrease
> performance by >10% when there are already >10 branches (and
> several other insns) is something the x86 uarchitects would have
> to explain.
>
> in summary, the patch trades 2% MT performance for 2x non-MT
> performance on this x86 cpu.

Excellent, this is exactly the analysis I was looking for, and this
kind of result is something that can make sense to our users.

I'm OK with the patch for 2.26.

--
Cheers,
Carlos.
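(The getchar_mt benchmark itself is not shown in the thread; the
following is a plausible reconstruction under stated assumptions: a
dummy thread is created solely to force glibc onto its multi-threaded
locking paths, and the hot loop is a plain getchar over a large
input. The file name and structure are guesses:)

    /* Hypothetical getchar_mt-style benchmark.
       Build: gcc -O2 -pthread getchar_mt.c -o getchar_mt  */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *
    idle (void *arg)
    {
      pause ();   /* keep the thread alive but inactive */
      return arg;
    }

    int
    main (void)
    {
      pthread_t t;
      /* The mere existence of a second thread makes stdio lock.  */
      if (pthread_create (&t, NULL, idle, NULL) != 0)
        return 1;
      long n = 0;
      while (getchar () != EOF)   /* hot stdio loop under test */
        n++;
      fprintf (stderr, "%ld bytes\n", n);
      return 0;
    }

Run against each build as in the transcript above, e.g.
time $build/lib64/ld-2.25.90.so --library-path $build/lib64
./getchar_mt < bigfile, and per-iteration branch and instruction
counts can be read from perf stat -e instructions,branches.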