Subject: Re: [PATCH] Unify pthread_once (bug 15215)
From: Torvald Riegel
To: Will Newton
Cc: "Joseph S. Myers", "Carlos O'Donell", GLIBC Devel, libc-ports
Date: Mon, 31 Mar 2014 20:09:00 -0000
Message-ID: <1396267124.19076.5659.camel@triegel.csb>
References: <1368024237.7774.794.camel@triegel.csb>
	 <519D97E4.4030808@redhat.com>
	 <1381018836.8757.3598.camel@triegel.csb>
	 <1381182784.18547.138.camel@triegel.csb>

On Mon, 2014-03-31 at 12:44 +0100, Will Newton wrote:
> On 7 October 2013 22:53, Torvald Riegel wrote:
> > On Mon, 2013-10-07 at 16:04 +0000, Joseph S. Myers wrote:
> >> I have no comments on the substance of this patch, but note that ports/
> >> has a separate ChangeLog file for each architecture.
> >
> > Sorry. The attached patch now has separate ChangeLog entries for
> > each of the affected archs.
>
> There seems to be a significant performance delta on aarch64:
>
> Old code:
>
> "pthread_once": {
>  "": {
>   "duration": 9.29471e+09, "iterations": 1.10667e+09, "max": 24.54,
>   "min": 8.38, "mean": 8.39882
>
> New code:
>
> "pthread_once": {
>  "": {
>   "duration": 9.72366e+09, "iterations": 4.33843e+08, "max": 30.86,
>   "min": 22.38, "mean": 22.4128
>
> And also ARM:
>
> Old code:
>
> "pthread_once": {
>  "": {
>   "duration": 8.38662e+09, "iterations": 6.6695e+08, "max": 35.292,
>   "min": 12.416, "mean": 12.5746
>
> New code:
>
> "pthread_once": {
>  "": {
>   "duration": 9.26424e+09, "iterations": 3.07574e+08, "max": 86.125,
>   "min": 28.875, "mean": 30.1204
>
> It would be nice to understand the source of this variation. I can put
> it on my todo list but I can't promise I will be able to look at it
> any time soon.

The ARM code (or rather, the code in general) was lacking a memory
barrier. Here's what I wrote in the email that first sent the patch:

> > Both I1 and I2 were missing acquire MO on the very first load of
> > once_control. This needs to synchronize with the release MO on setting
> > the state to init-finished, so without it it's not guaranteed to work
> > either.
> > Note that this will make a call to pthread_once that doesn't need to
> > actually run the init routine slightly slower due to the additional
> > acquire barrier. If you're really concerned about this overhead, speak
> > up. There are ways to avoid it, but it comes with additional complexity
> > and bookkeeping.
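To make that concrete, here is a rough sketch of the fast path in C11
atomics. This is just an illustration, not the actual nptl code; the
state constants are made up and the slow-path arbitration (futex-based
waiting, cancellation handling) is elided:

#include <stdatomic.h>

#define ONCE_NOT_DONE 0
#define ONCE_DONE     1

static void
my_once (atomic_int *once_control, void (*init_routine) (void))
{
  /* Fast path: acquire MO here synchronizes with the release store
     below, so a caller that observes ONCE_DONE also observes all
     writes made by the init routine.  With a relaxed load instead,
     the caller could see ONCE_DONE yet read stale data.  */
  if (atomic_load_explicit (once_control, memory_order_acquire)
      == ONCE_DONE)
    return;

  /* Slow path: arbitration among concurrent callers elided.  */
  init_routine ();

  /* Release MO: publishes the init routine's side effects.  */
  atomic_store_explicit (once_control, ONCE_DONE, memory_order_release);
}

The acquire barrier on that first load is exactly the overhead the
quoted paragraph is talking about.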