On Sun, 2013-10-06 at 02:20 +0200, Torvald Riegel wrote: > On Thu, 2013-05-23 at 00:15 -0400, Carlos O'Donell wrote: > > On 05/08/2013 10:43 AM, Torvald Riegel wrote: > > > See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background. > > > > You've already hashed out the details of these changes with Rich > > and he has no objection with this first phase of the patch which > > is to unify the implementations. > > > > > I1 and I2 follow essentially the same algorithm, and we can replace it > > > with a unified variant, as the bug suggests. See the attached patch for > > > a modified version of the sparc instance. The differences between both > > > are either cosmetic, or are unnecessary changes (ie, how the > > > init-finished state is set (atomic_inc vs. store), or how the fork > > > generations are compared). > > > > > > Both I1 and I2 were missing a release memory order (MO) when marking > > > once_control as finished initialization. If the particular arch doesn't > > > need a HW barrier for release, we at least need a compiler barrier; if > > > it's needed, the original I1 and I2 are not guaranteed to work. > > > > > > Both I1 and I2 were missing acquire MO on the very first load of > > > once_control. This needs to synchronize with the release MO on setting > > > the state to init-finished, so without it it's not guaranteed to work > > > either. > > > Note that this will make a call to pthread_once that doesn't need to > > > actually run the init routine slightly slower due to the additional > > > acquire barrier. If you're really concerned about this overhead, speak > > > up. There are ways to avoid it, but it comes with additional complexity > > > and bookkeeping. > > > > We want correctness. This is a place where correctness is infinitely > > more important than speed. We should be correct first and then we > > should argue about how to make it fast. > > > > > I'm currently also using the existing atomic_{read/write}_barrier > > > functions instead of not-yet-existing load_acq or store_rel functions. > > > I'm not sure whether the latter can have somewhat more efficient > > > implementations on Power and ARM; if so, and if you're concerned about > > > the overhead, we can add load_acq and store_rel to atomic.h and start > > > using it. This would be in line with C11, where we should eventually be > > > heading to anyways, IMO. > > > > Agreed. > > > > > Both I1 and I2 have an ABA issue on __fork_generation, as explained in > > > the comments that the patch adds. How do you all feel about this? > > > I can't present a simple fix right now, but I believe it could be fixed > > > with additional bookkeeping. > > > > > > If there's no objection to the essence of this patch, I'll post another > > > patch that actually replaces I1 and I2 with the modified variant in the > > > attached patch. > > > > Please repost. > > See attached patch. This has been tested on ppc64 but not on the other > archs that are affected. Nonetheless, ppc has a weak memory model, so, > for example, having an acquire barrier on a load or not having it does > make a difference. > > OK? Attached is a slightly updated version; the only difference is that the changelog chunks now incorporate earlier feedback that I had forgotten about.