See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background.

I1 and I2 follow essentially the same algorithm, and we can replace them with a unified variant, as the bug suggests. See the attached patch for a modified version of the sparc instance. The differences between the two are either cosmetic or unnecessary (i.e., how the init-finished state is set (atomic_inc vs. store), or how the fork generations are compared).

Both I1 and I2 were missing a release memory order (MO) when marking once_control as having finished initialization. If the particular arch doesn't need a HW barrier for release, we at least need a compiler barrier; if a HW barrier is needed, the original I1 and I2 are not guaranteed to work.

Both I1 and I2 were missing acquire MO on the very first load of once_control. This load needs to synchronize with the release MO on setting the state to init-finished, so without it the code is not guaranteed to work either. Note that this will make a call to pthread_once that doesn't actually need to run the init routine slightly slower due to the additional acquire barrier. If you're really concerned about this overhead, speak up. There are ways to avoid it, but they come with additional complexity and bookkeeping.

I'm currently also using the existing atomic_{read/write}_barrier functions instead of not-yet-existing load_acq or store_rel functions. I'm not sure whether the latter can have somewhat more efficient implementations on Power and ARM; if so, and if you're concerned about the overhead, we can add load_acq and store_rel to atomic.h and start using them. This would be in line with C11, which is where we should eventually be heading anyway, IMO.

Both I1 and I2 have an ABA issue on __fork_generation, as explained in the comments that the patch adds. How do you all feel about this? I can't present a simple fix right now, but I believe it could be fixed with additional bookkeeping.

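
To illustrate the acquire/release pairing discussed above, here is a minimal sketch. It is not the attached patch: it uses C11 <stdatomic.h> instead of the barrier functions from glibc's atomic.h, it spins with sched_yield instead of blocking on a futex, and it ignores __fork_generation entirely (so it says nothing about the ABA issue). The names sketch_once, sketch_once_t, the ONCE_* state values, sample_init and init_count are all made up for this example.

```c
#include <stdatomic.h>
#include <sched.h>

/* Hypothetical state values for this sketch only; the real code uses
   different magic numbers and also encodes the fork generation.  */
enum { ONCE_NEW = 0, ONCE_IN_PROGRESS = 1, ONCE_DONE = 2 };

typedef atomic_int sketch_once_t;

static int
sketch_once (sketch_once_t *once_control, void (*init_routine) (void))
{
  /* Fast path.  The acquire load synchronizes with the release store
     below, so a caller that observes ONCE_DONE also observes everything
     init_routine wrote.  */
  if (atomic_load_explicit (once_control, memory_order_acquire) == ONCE_DONE)
    return 0;

  /* Try to become the initializing thread.  */
  int expected = ONCE_NEW;
  if (atomic_compare_exchange_strong_explicit (once_control, &expected,
                                               ONCE_IN_PROGRESS,
                                               memory_order_acquire,
                                               memory_order_acquire))
    {
      init_routine ();
      /* Release store: publishes the effects of init_routine before
         the state becomes visible as finished.  */
      atomic_store_explicit (once_control, ONCE_DONE, memory_order_release);
      return 0;
    }

  /* Another thread is (or was) initializing.  Wait until it is done;
     a real implementation would block on a futex instead of yielding.  */
  while (atomic_load_explicit (once_control, memory_order_acquire)
         != ONCE_DONE)
    sched_yield ();
  return 0;
}

/* Hypothetical init routine used only to exercise the sketch.  */
static int init_count;

static void
sample_init (void)
{
  init_count++;
}
```

Calling sketch_once twice on the same control variable with sample_init should run the routine only once; the second call returns through the acquire-load fast path, which is the path that gets slightly slower with the added barrier.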
If there's no objection to the essence of this patch, I'll post another patch that actually replaces I1 and I2 with the modified variant in the attached patch.

Cleaning up the magic numbers, perhaps fixing the ABA issue, and comparing to the custom asm versions would be next. I had a brief look at the latter, and at least x86 doesn't seem to do anything logically different.

Torvald