On Thu, 2013-05-23 at 00:15 -0400, Carlos O'Donell wrote:
> On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> > See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background.
> 
> You've already hashed out the details of these changes with Rich
> and he has no objection with this first phase of the patch which
> is to unify the implementations.
>  
> > I1 and I2 follow essentially the same algorithm, and we can replace it
> > with a unified variant, as the bug suggests.  See the attached patch for
> > a modified version of the sparc instance.  The differences between both
> > are either cosmetic, or are unnecessary changes (ie, how the
> > init-finished state is set (atomic_inc vs. store), or how the fork
> > generations are compared).
> > 
> > Both I1 and I2 were missing a release memory order (MO) when marking
> > once_control as finished initialization.  If the particular arch doesn't
> > need a HW barrier for release, we at least need a compiler barrier; if
> > it's needed, the original I1 and I2 are not guaranteed to work.
> > 
> > Both I1 and I2 were missing acquire MO on the very first load of
> > once_control.  This needs to synchronize with the release MO on setting
> > the state to init-finished, so without it it's not guaranteed to work
> > either.
> > Note that this will make a call to pthread_once that doesn't need to
> > actually run the init routine slightly slower due to the additional
> > acquire barrier.  If you're really concerned about this overhead, speak
> > up.  There are ways to avoid it, but it comes with additional complexity
> > and bookkeeping.
> 
> We want correctness. This is a place where correctness is infinitely
> more important than speed. We should be correct first and then we
> should argue about how to make it fast.
> 
> > I'm currently also using the existing atomic_{read/write}_barrier
> > functions instead of not-yet-existing load_acq or store_rel functions.
> > I'm not sure whether the latter can have somewhat more efficient
> > implementations on Power and ARM; if so, and if you're concerned about
> > the overhead, we can add load_acq and store_rel to atomic.h and start
> > using it.  This would be in line with C11, where we should eventually be
> > heading to anyways, IMO.
> 
> Agreed.
> 
> > Both I1 and I2 have an ABA issue on __fork_generation, as explained in
> > the comments that the patch adds.  How do you all feel about this?
> > I can't present a simple fix right now, but I believe it could be fixed
> > with additional bookkeeping.
> > 
> > If there's no objection to the essence of this patch, I'll post another
> > patch that actually replaces I1 and I2 with the modified variant in the
> > attached patch.
> 
> Please repost.

See attached patch.  This has been tested on ppc64 but not on the other
archs that are affected.  Nonetheless, ppc has a weak memory model, so,
for example, having an acquire barrier on a load or not having it does
make a difference.

OK?