From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-ports-return-4122-listarch-libc-ports=sources.redhat.com@sourceware.org>
Received: (qmail 21822 invoked by alias); 9 May 2013 08:39:32 -0000
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-ports.sourceware.org>
List-Subscribe: <mailto:libc-ports-subscribe@sourceware.org>
List-Post: <mailto:libc-ports@sourceware.org>
List-Help: <mailto:libc-ports-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-ports-owner@sourceware.org
Received: (qmail 21801 invoked by uid 89); 9 May 2013 08:39:32 -0000
X-Spam-SWARE-Status: No, score=-8.0 required=5.0 tests=AWL,BAYES_00,KHOP_THREADED,RCVD_IN_HOSTKARMA_W,RCVD_IN_HOSTKARMA_WL,RP_MATCHES_RCVD,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.1
X-Spam-User: qpsmtpd, 2 recipients
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28)    by sourceware.org (qpsmtpd/0.84/v0.84-167-ge50287c) with ESMTP; Thu, 09 May 2013 08:39:30 +0000
Received: from int-mx12.intmail.prod.int.phx2.redhat.com (int-mx12.intmail.prod.int.phx2.redhat.com [10.5.11.25])	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r498dRKN000386	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);	Thu, 9 May 2013 04:39:27 -0400
Received: from [10.36.4.231] (vpn1-4-231.ams2.redhat.com [10.36.4.231])	by int-mx12.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id r498dP8f031201;	Thu, 9 May 2013 04:39:26 -0400
Subject: Re: [PATCH] Unify pthread_once (bug 15215)
From: Torvald Riegel <triegel@redhat.com>
To: Rich Felker <dalias@aerifal.cx>
Cc: GLIBC Devel <libc-alpha@sourceware.org>,        libc-ports <libc-ports@sourceware.org>
In-Reply-To: <20130508212502.GF20323@brightrain.aerifal.cx>
References: <1368024237.7774.794.camel@triegel.csb>	 <20130508175132.GB20323@brightrain.aerifal.cx>	 <1368046046.7774.1441.camel@triegel.csb>	 <20130508212502.GF20323@brightrain.aerifal.cx>
Content-Type: text/plain; charset="UTF-8"
Date: Thu, 09 May 2013 08:39:00 -0000
Message-ID: <1368088765.7774.1571.camel@triegel.csb>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
X-SW-Source: 2013-05/txt/msg00042.txt.bz2

On Wed, 2013-05-08 at 17:25 -0400, Rich Felker wrote:
> On Wed, May 08, 2013 at 10:47:26PM +0200, Torvald Riegel wrote:
> > On Wed, 2013-05-08 at 13:51 -0400, Rich Felker wrote:
> > > On Wed, May 08, 2013 at 04:43:57PM +0200, Torvald Riegel wrote:
> > > > Note that this will make a call to pthread_once that doesn't need to
> > > > actually run the init routine slightly slower due to the additional
> > > > acquire barrier.  If you're really concerned about this overhead, speak
> > > > up.  There are ways to avoid it, but it comes with additional complexity
> > > > and bookkeeping.
> > > 
> > > On the one hand, I think it should be avoided if at all possible.
> > > pthread_once is the correct, canonical way to do initialization (as
> > > opposed to hacks like library init functions or global ctors), and the
> > > main doubt lots of people have about doing it the correct way is that
> > > they're going to kill performance if they call pthread_once from every
> > > point where initialization needs to have been completed. If every call
> > > imposes memory synchronization, performance might become a real issue
> > > discouraging people from following best practices for library
> > > initialization.
> > 
> > Well, what we precisely need is that the initialization happens-before
> > (ie, the relation from the, say, C11 memory model) every call that does
> > not in fact initialize.  If initialization happened on another thread,
> > you need to synchronize.  But from there on, you are essentially free to
> > establish this in any way you want.  And there are ways, because
> > happens-before is more-or-less transitive.
> > 
> > > On the other hand, I don't think it's conforming to elide the barrier.
> > > POSIX states (XSH 4.11 Memory Synchronization):
> > > 
> > > "The pthread_once() function shall synchronize memory for the first
> > > call in each thread for a given pthread_once_t object."
> > 
> > No, it's not.  You could see just parts of the effects of the
> > initialization; potentially reading garbage can't be the intended
> > semantics :)
> 
> The work of synchronizing memory should take place at the end of the
> pthread_once call that actually does the initialization, rather than
> in the other threads which synchronize.

This isn't how the (hardware) memory models work.  And it makes sense;
if one CPU could prevent reordering in other CPUs (which would be
required for what you have in mind), this would be an unconditional big
hammer.  Instead, CPUs can opt in by issuing barriers when needed, which
then prevent reordering wrt. what happens globally to memory.

> This is the way the x86 memory
> model naturally works, but perhaps it's prohibitive to achieve on
> other architectures.

The x86 memory model is just stronger than others, so certain
reorderings don't appear or aren't visible to programs.  IOW, you don't
need to do certain things explicitly for the hardware.  You still do
need the appropriate compiler barriers though; for example, if the
compiler reorders the once_control release store to before the
initialization stores, you still have an incorrectly synchronized
program, even on x86.
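
To make the release/acquire pairing concrete, here is a minimal sketch
in C11 atomics.  It is not the glibc implementation and not the patch;
all names (sketch_once_state, sketch_publish, sketch_try_read) are made
up for illustration, and the sketch leaves out the waiting and
in-progress states that a real pthread_once needs.  The release store
keeps both the compiler and the CPU from moving the initialization
stores past the publication of the flag, and the acquire load is the
reader-side opt-in.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for once_control: 0 = not done, 1 = done. */
static atomic_int sketch_once_state;

/* Plain (non-atomic) data written by the init routine. */
static int sketch_data;

/* Initializer side: do the plain stores first, then publish with a
   release store so they cannot be reordered after the flag update,
   neither by the compiler nor by the hardware. */
static void sketch_publish(int value)
{
    sketch_data = value;
    atomic_store_explicit(&sketch_once_state, 1, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above.
   Returns 1 and fills *out if initialization is known to be complete,
   0 otherwise (a real implementation would run or wait for init). */
static int sketch_try_read(int *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&sketch_once_state, memory_order_acquire) != 1)
        return 0;
    *out = sketch_data;  /* ordered after the initializer's stores */
    return 1;
}
```

On x86 the acquire load and release store compile to ordinary loads and
stores, but they still constrain the compiler; on Power and ARM they
also emit the hardware barriers discussed in this thread.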

More background on this can be found in the C11 and C++11 memory models
and in the Batty et al. paper formalizing C++11's.  This list of mappings
from these language-level models to HW could also be interesting (note
that it doesn't cover the compiler side of this explicitly):
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

> However, the idea is that pthread_once only runs
> init routines a small finite number of times, so even if you had to do
> some horrible hack that makes the synchronization on return 1000x
> slower (e.g. a syscall), it would still be better than incurring the
> cost of a full acquire barrier in each subsequent call, which ideally
> should have the same cost as a call to an empty function.

That would be true if non-first calls appear
1000*(syscall_overhead/acquire_mbar_overhead) times.  But do they?

I think the way forward here is to:
1) Fix the implementation (ie, add the mbars).
2) Let the arch maintainers of the affected archs with weak memory models
(or people interested in this) look at this and come up with some
measurements for how much overhead the mbars actually present in real
code.
3) Decide whether this overhead justifies adding optimizations.

This patch is step 1.  I don't think we need to merge this with step 3.

> > > Since it's impossible to track whether a call is the first call in a
> > > given thread
> > 
> > Are you sure about this? :)
> 
> It's impossible with bounded memory requirements, and thus impossible
> in general (allocating memory for the tracking might fail).

I believe you are assuming that you need to track more than you actually
do.  All you need to know is whether a thread has established a
happens-before with whoever initialized the once_control in the past.
So you do need per-thread state, and per-once_control state, but not
necessarily more.  If in doubt, you can still do the acquire barrier.
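
As a rough illustration of the per-thread plus per-once_control idea,
here is a sketch for one single, statically known once object.  It is
only an assumption about what such an optimization could look like, not
something taken from the patch; the names (demo_once_state, demo_synced,
demo_acquire_loads) are hypothetical, and the counter exists purely so
the effect can be observed.  It does not address the general case of
arbitrarily many once_control objects, and whether a thread-local access
is actually cheaper than an acquire barrier is exactly the kind of thing
that would have to be measured.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Per-once_control state (hypothetical): 0 = not done, 1 = done. */
static atomic_int demo_once_state;

/* Per-thread state: this thread already observed "done" through an
   acquire load, so happens-before with the initializer is established. */
static _Thread_local bool demo_synced;

/* Instrumentation only: acquire loads issued by the current thread. */
static _Thread_local int demo_acquire_loads;

/* Called by whoever ran the init routine, after it has finished. */
static void demo_mark_done(void)
{
    atomic_store_explicit(&demo_once_state, 1, memory_order_release);
}

/* Returns true if initialization is complete and ordered before the
   caller.  Only calls made before this thread has synchronized pay for
   the acquire load; if in doubt, we still do the acquire. */
static bool demo_is_initialized(void)
{
    if (demo_synced)
        return true;  /* fast path: no atomic access at all */

    demo_acquire_loads++;
    if (atomic_load_explicit(&demo_once_state, memory_order_acquire) == 1) {
        demo_synced = true;
        return true;
    }
    return false;  /* a real implementation would run or wait for init */
}
```

Once a thread has seen the done state, later calls from that thread skip
the atomic load entirely; other threads still perform their own first
acquire.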

> > > this means every call to pthread_once() is required to
> > > be a full memory barrier.
> > 
> > Note that we do not need a full memory barrier, just an acquire memory
> > barrier.  So this only matters on architectures with memory models that
> > give weaker per-default ordering guarantees.  For example, this doesn't
> > add any hardware barrier instructions on x86 or Sparc TSO.  But for
> > Power and ARM it does.
> 
> Yes, I see that.
> 
> > > I suspect this is unintended, and we should
> > > perhaps file a bug report with the Austin Group and see if the
> > > requirement can be relaxed.
> > 
> > I don't think that other semantics are intended.  If you return from
> > pthread_once(), initialization should have happened before that.  If it
> > doesn't, you don't really know whether initialization happened once, so
> > programs would be forced to do their own synchronization.
> 
> I think my confusion is merely that POSIX does not define the phrase
> "synchronize memory", and in the absence of a definition, "full memory
> barrier" (both release and acquire semantics) is the only reasonable
> interpretation I can find. In other words, it seems like a
> pathological conforming program could attempt to use the language in
> the specification to use pthread_once as a release barrier. I'm not
> sure if there are ways this could be meaningfully arranged (i.e. with
> well-defined ordering; off-hand, I would think tricks with cancelling
> an in-progress invocation of pthread_once might make it possible).

I agree that the absence of a proper memory model makes reasoning about
some of this hard.  I guess it would be best if POSIX would just endorse
C11's memory model, and specify the intended semantics in relation to
this model where needed.

For example, the C11 variant of pthread_once has the following
requirement:
"Completion of an effective call to the call_once function synchronizes
with all subsequent calls to the call_once function with the same value
of flag."

This makes intuitive sense, and is what's enforced by the patch I sent.
("synchronizes with" is a well-defined relationship in the model, and
contributes to happens-before.)
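
In pthread_once terms, this guarantee is what lets callers use the usual
lazy-initialization pattern without any synchronization of their own.
A small sketch follows; the table, table_init and table_lookup names
are invented for the example, and the counter is only there to show
that the init routine runs once.

```c
#include <pthread.h>

static pthread_once_t table_once = PTHREAD_ONCE_INIT;
static int table[4];          /* written only by table_init */
static int table_init_calls;  /* illustration: counts init runs */

/* Init routine: runs at most once per table_once object. */
static void table_init(void)
{
    for (int i = 0; i < 4; i++)
        table[i] = i * i;
    table_init_calls++;
}

/* Every access goes through pthread_once.  When it returns, the stores
   made by table_init are guaranteed to be visible to the caller, no
   matter which thread actually ran the init routine. */
static int table_lookup(int i)
{
    pthread_once(&table_once, table_init);
    return table[i & 3];
}
```

Every call to table_lookup after the first takes the non-initializing
path through pthread_once, which is the path whose cost is being
discussed in this thread.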

Torvald