public inbox for gcc@gcc.gnu.org
* -mcx16 vs. not using CAS for atomic loads
@ 2017-01-19 18:23 Torvald Riegel
  2017-01-20 17:55 ` Richard Henderson
  0 siblings, 1 reply; 6+ messages in thread
From: Torvald Riegel @ 2017-01-19 18:23 UTC
  To: GCC; +Cc: Bin Fan, Richard Henderson

When -mcx16 is in effect today (directly or indirectly), cmpxchg16b is
also used to implement atomic loads.  I consider this a bug because it
can result in a store being issued (e.g., when loading from a read-only
page, or when trying to do a volatile atomic load), and because it can
increase contention (which makes atomic loads perform very differently
from HW load instructions).  See the thread "GCC libatomic ABI
specification draft" for more background.
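
A minimal sketch of why a CAS-based load behaves like a store.  It uses
an 8-byte value as a stand-in for the 16-byte case so that it builds
without -mcx16 or libatomic; the function names load_via_cas and
load_plain are hypothetical and are not taken from GCC or libatomic.

```c
#include <stdint.h>

/* Load a 64-bit value using only compare-and-swap, analogous to how a
   16-byte load is built from cmpxchg16b.  The pointer cannot be const:
   the CAS needs write access to *p, which is why this approach cannot
   be used on read-only memory. */
static uint64_t load_via_cas(uint64_t *p)
{
    uint64_t expected = 0;
    /* If *p == 0, the CAS succeeds and stores 0 again (value unchanged).
       Otherwise it fails and copies the current *p into expected.
       Either way, expected ends up holding the value that was in *p. */
    __atomic_compare_exchange_n(p, &expected, 0, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

/* A true atomic load for comparison: it accepts a pointer to const and
   never writes to the location. */
static uint64_t load_plain(const uint64_t *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}
```

Both functions return the same value; the difference is that
load_via_cas needs a writable location and competes with other writers
for exclusive ownership of the cache line.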

It's not quite obvious how to fix this.  We can't just ignore -mcx16;
only using it if enabled directly also doesn't seem like a useful
approach.

We could try to deprecate -mcx16, but that doesn't really give us a
solution for options that imply -mcx16 (we can't deprecate all of them).

* Option 1:
We could never issue cmpxchg16b and instead call out to libatomic, for
both __sync and __atomic builtins.  However, that would create an
additional dependency on libatomic, and would be incompatible when
combined with old code that uses cmpxchg16b directly.

* Option 2:
We could try to alter the meaning of -mcx16 to *also* assert that SSE
loads can be used for 16-byte atomic loads.  In this scenario, we could
then use the SSE loads to implement loads in __atomic*, and keep __sync*
unchanged.  This would not break combining new code with old code, and
would keep __sync compatible with __atomic.
However, I do not have enough information about when SSE loads are atomic;
this information would have to come from the x86 vendors, because we
would depend on it (and the ABI we set up would depend on it).
In the best case, all processors that support cmpxchg16b would also have
atomic SSE loads.  However, if this implication does not hold for a
significant set of processors, then this change would alter when -mcx16
is safe to be used; it may also be a problem for other options implying
-mcx16.

* Option 3a:
-mcx16 continues to only mean that cmpxchg16b is available, and we keep
__sync builtins unchanged.  This doesn't break valid uses of __sync*
(eg, if they didn't need atomic loads at all).
We change __atomic for 16-byte to not use cmpxchg16b but to instead call
out to libatomic.  libatomic would continue to use cmpxchg16b
internally.  We retain compatibility between __atomic and __sync.  We do
not change __atomic_*_lock_free.
This does not fix the load-via-cmpxchg bug, but makes sure that we
reroute through libatomic early for the __atomic builtins, so that it
becomes easier in the future to either do something like Option 2 or
Option 3c.  Until then, nothing would really change.

* Option 3b:
Like Option 3a, except that __atomic_*_lock_free return false for 16
bytes.  The benefit over 3a is that this stops advertising "fast"
atomics when that is arguably not the case because the loads are slowed
down by contention (I assume a lot more users read "lock-free" as "fast"
instead of thinking about progress conditions).  The potential downside
is that programs may exist that assert(__atomic_always_lock_free(16,0));
these assertions would fail, although the rest of the program would
continue to work.

* Option 3c:
Like Option 3b, but libatomic would not use cmpxchg16b internally but
fall back to locks for 16-byte atomics.  This fixes the load-via-cmpxchg
bug, but breaks compatibility between old __atomic-using code and new
__atomic-using code, and between __sync and new __atomic.

* Option 4:
Introduce a -mload16atomic option or similar that asserts that true
16-byte atomic loads are supported by the hardware (eg, through SSE).
Enable this option for all processors where we know that it is true.
Don't change __sync.  Change __atomic to use the 16-byte atomic loads if
available, and otherwise continue to use cmpxchg16b.  Return false from
__atomic_*_lock_free(16, ...) if 16-byte atomic loads are not available.


I think I prefer Option 3b as the short-term solution.  It does not
break programs (except the __atomic_always_lock_free assertion scenario,
but that's likely to not work anyway given that the atomics will be
lock-free but not "fast").  It makes programs aware that the atomics
will not be fast when they are not fast indeed (ie, when getting loads
through cmpxchg).
It also enables us to fix more programs earlier than under Option 4, for
example (in the sense that Option 3b requires only a change to
libatomic, but no recompilation).   The usage of __atomic should be less
frequent today than in a year, I believe, and so this can make a
difference.  It introduces the function call overhead, but that isn't
much of a problem compared to atomic loads suffering from contention.
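
A minimal sketch of how a program could avoid the failing-assertion
scenario under Option 3b: instead of asserting that 16-byte atomics are
always lock-free, it queries the builtin and selects a code path.  The
type pair16 and the functions fast_16byte_atomics and choose_strategy
are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 16-byte payload that a program might want to access
   atomically. */
typedef struct { uint64_t lo, hi; } pair16;

/* __atomic_always_lock_free folds to a compile-time constant.  Under
   Option 3b as proposed, it would be false for 16 bytes. */
static bool fast_16byte_atomics(void)
{
    return __atomic_always_lock_free(sizeof(pair16), 0);
}

/* Select an implementation instead of asserting, so the program keeps
   working whichever answer the compiler gives. */
static const char *choose_strategy(void)
{
    return fast_16byte_atomics() ? "native-16-byte" : "fallback";
}
```

A caller would branch on choose_strategy() (or directly on the builtin)
once at startup, for example to fall back to a design that does not rely
on 16-byte atomic loads being fast.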

I'm worried that Option 4 would not be possible until some time in the
future when we have actually gotten confirmation from the HW vendors
about 16-byte atomic loads.  The additional risk is that we may never
get such a confirmation (eg, because they do not want to constrain
future HW), or that this actually holds just for a few processors.

Option 3b does not preclude us from applying Option 4 selectively in the
future (ie, for processors that do have true 16-byte atomic loads).

Thoughts?


We're awfully close to the end of stage 3, but I'd prefer if we could do
something about this for this release.  I worry that dragging this out
to the next release in a year will just make things worse, given that
the __atomic builtins are still a relatively recent feature.

To what extent could we still implement some of these options (early) in
stage 4?


* Re: -mcx16 vs. not using CAS for atomic loads
  2017-01-19 18:23 -mcx16 vs. not using CAS for atomic loads Torvald Riegel
@ 2017-01-20 17:55 ` Richard Henderson
  2017-01-24  9:08   ` Torvald Riegel
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Henderson @ 2017-01-20 17:55 UTC
  To: Torvald Riegel, GCC; +Cc: Bin Fan

On 01/19/2017 10:23 AM, Torvald Riegel wrote:
> * Option 3a:
> -mcx16 continues to only mean that cmpxchg16b is available, and we keep
> __sync builtins unchanged.  This doesn't break valid uses of __sync*
> (eg, if they didn't need atomic loads at all).
> We change __atomic for 16-byte to not use cmpxchg16b but to instead call
> out to libatomic.  libatomic would continue to use cmpxchg16b
> internally.  We retain compatibility between __atomic and __sync.  We do
> not change __atomic_*_lock_free.
> This does not fix the load-via-cmpxchg bug, but makes sure that we
> reroute through libatomic early for the __atomic builtins, so that it
> becomes easier in the future to either do something like Option 2 or
> Option 3c.  Until then, nothing would really change.
>
> * Option 3b:
> Like Option 3a, except that __atomic_*_lock_free return false for 16
> bytes.  The benefit over 3a is that this stops advertising "fast"
> atomics when that is arguably not the case because the loads are slowed
> down by contention (I assume a lot more users read "lock-free" as "fast"
> instead of thinking about progress conditions).  The potential downside
> is that programs may exist that assert(__atomic_always_lock_free(16,0));
> these assertions would fail, although the rest of the program would
> continue to work.
>
> * Option 3c:
> Like Option 3b, but libatomic would not use cmpxchg16b internally but
> fall back to locks for 16-byte atomics.  This fixes the load-via-cmpxchg
> bug, but breaks compatibility between old __atomic-using code and new
> __atomic-using code, and between __sync and new __atomic.
>
> * Option 4:
> Introduce a -mload16atomic option or similar that asserts that true
> 16-byte atomic loads are supported by the hardware (eg, through SSE).
> Enable this option for all processors where we know that it is true.
> Don't change __sync.  Change __atomic to use the 16-byte atomic loads if
> available, and otherwise continue to use cmpxchg16b.  Return false from
> __atomic_*_lock_free(16, ...) if 16-byte atomic loads are not available.
>
>
> I think I prefer Option 3b as the short-term solution.  It does not
> break programs (except the __atomic_always_lock_free assertion scenario,
> but that's likely to not work anyway given that the atomics will be
> lock-free but not "fast").  It makes programs aware that the atomics
> will not be fast when they are not fast indeed (ie, when getting loads
> through cmpxchg).

I agree.  Let's go through the library for the loads, giving us a hook to fix 
this in the future.

> I'm worried that Option 4 would not be possible until some time in the
> future when we have actually gotten confirmation from the HW vendors
> about 16-byte atomic loads.  The additional risk is that we may never
> get such a confirmation (eg, because they do not want to constrain
> future HW), or that this actually holds just for a few processors.

Indeed, I don't think we'll get any proper confirmation from the hw vendors any 
time soon.  Or possibly ever.

The only light on the horizon that I can see is that HTM is now working in 
newly shipping Intel processors, and we could write a pure load path through 
libatomic that uses that.  Over time the lack of guaranteed SSE atomicity 
becomes less relevant.


r~


* Re: -mcx16 vs. not using CAS for atomic loads
  2017-01-20 17:55 ` Richard Henderson
@ 2017-01-24  9:08   ` Torvald Riegel
  2017-01-24 21:06     ` Richard Henderson
  0 siblings, 1 reply; 6+ messages in thread
From: Torvald Riegel @ 2017-01-24  9:08 UTC
  To: Richard Henderson; +Cc: GCC, Bin Fan

On Fri, 2017-01-20 at 09:55 -0800, Richard Henderson wrote:
> On 01/19/2017 10:23 AM, Torvald Riegel wrote:
> > I think I prefer Option 3b as the short-term solution.  It does not
> > break programs (except the __atomic_always_lock_free assertion scenario,
> > but that's likely to not work anyway given that the atomics will be
> > lock-free but not "fast").  It makes programs aware that the atomics
> > will not be fast when they are not fast indeed (ie, when getting loads
> > through cmpxchg).
> 
> I agree.  Let's go through the library for the loads, giving us a hook to fix 
> this in the future.

I'm working on a patch for this.

> > I'm worried that Option 4 would not be possible until some time in the
> > future when we have actually gotten confirmation from the HW vendors
> > about 16-byte atomic loads.  The additional risk is that we may never
> > get such a confirmation (eg, because they do not want to constrain
> > future HW), or that this actually holds just for a few processors.
> 
> Indeed, I don't think we'll get any proper confirmation from the hw vendors any 
> time soon.  Or possibly ever.
> 
> The only light on the horizon that I can see is that HTM is now working in 
> newly shipping Intel processors, and we could write a pure load path through 
> libatomic that uses that.  Over time the lack of guaranteed SSE atomicity 
> becomes less relevant.

Unless HW transactions are guaranteed to succeed for scenarios that are
sufficient for the atomics, HTM won't help because we'd have to consider
the worst-case, which would mean some non-HTM fallback.
Intel's current HTM does not make guarantees; IIRC, either Power or s390
has an HTM mode in which there are guarantees, provided that the user
follows a few rules.
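
A minimal sketch of why a best-effort fast path still needs a non-HTM
fallback.  The function try_transactional_load is a hypothetical
stand-in for an HTM attempt and simply reports an abort every time, as a
simulator that forces HTM failure would; no real HTM instructions are
used.  The names pair16, fallback_lock and load16 are likewise
assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } pair16;   /* hypothetical 16-byte value */

/* Simple test-and-set spinlock protecting the fallback path (0 = free).
   For this to be correct, writers would have to take the same lock, and
   a real transactional path would have to observe it as well. */
static char fallback_lock;

/* Hypothetical stand-in for a hardware transaction.  It always aborts,
   modelling hardware that gives no guarantee of success. */
static bool try_transactional_load(const pair16 *src, pair16 *out)
{
    (void)src;
    (void)out;
    return false;
}

static pair16 load16(const pair16 *src)
{
    pair16 out;

    /* Best-effort fast path: retry a bounded number of times. */
    for (int attempt = 0; attempt < 3; attempt++)
        if (try_transactional_load(src, &out))
            return out;

    /* Worst case: every attempt aborted, so take the lock. */
    while (__atomic_test_and_set(&fallback_lock, __ATOMIC_ACQUIRE))
        ;   /* spin until the lock is free */
    out = *src;
    __atomic_clear(&fallback_lock, __ATOMIC_RELEASE);
    return out;
}
```

Because the fast path may never succeed, the overall progress guarantee
of load16 is that of the lock-based fallback.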


* Re: -mcx16 vs. not using CAS for atomic loads
  2017-01-24  9:08   ` Torvald Riegel
@ 2017-01-24 21:06     ` Richard Henderson
  2017-01-24 21:30       ` Peter Bergner
  2017-01-25 11:10       ` Torvald Riegel
  0 siblings, 2 replies; 6+ messages in thread
From: Richard Henderson @ 2017-01-24 21:06 UTC
  To: Torvald Riegel; +Cc: GCC, Bin Fan

On 01/24/2017 01:08 AM, Torvald Riegel wrote:
> Unless HW transactions are guaranteed to succeed for scenarios that are
> sufficient for the atomics, HTM won't help because we'd have to consider
> the worst-case, which would mean some non-HTM fallback.

We're talking about a 16 byte aligned load here -- one cacheline, probably 3-4
instructions.  If an HTM cannot succeed with that, I'm happy to call it useless.

The only possible concern I see might be with simulators that force HTM
failure, for the purpose of forcibly testing fallback paths.  I guess we'd have
to continue to fall back to the lock path for that case.


r~


* Re: -mcx16 vs. not using CAS for atomic loads
  2017-01-24 21:06     ` Richard Henderson
@ 2017-01-24 21:30       ` Peter Bergner
  2017-01-25 11:10       ` Torvald Riegel
  1 sibling, 0 replies; 6+ messages in thread
From: Peter Bergner @ 2017-01-24 21:30 UTC
  To: Richard Henderson, Torvald Riegel; +Cc: GCC, Bin Fan

On 1/24/17 3:06 PM, Richard Henderson wrote:
> The only possible concern I see might be with simulators that force HTM
> failure, for the purpose of forcibly testing fallback paths.  I guess we'd have
> to continue to fall back to the lock path for that case.

IIRC, this was the path that valgrind was going to use all of the time,
because actually implementing the HTM instructions was too hard.

Peter



* Re: -mcx16 vs. not using CAS for atomic loads
  2017-01-24 21:06     ` Richard Henderson
  2017-01-24 21:30       ` Peter Bergner
@ 2017-01-25 11:10       ` Torvald Riegel
  1 sibling, 0 replies; 6+ messages in thread
From: Torvald Riegel @ 2017-01-25 11:10 UTC
  To: Richard Henderson; +Cc: GCC, Bin Fan

On Tue, 2017-01-24 at 13:06 -0800, Richard Henderson wrote:
> On 01/24/2017 01:08 AM, Torvald Riegel wrote:
> > Unless HW transactions are guaranteed to succeed for scenarios that are
> > sufficient for the atomics, HTM won't help because we'd have to consider
> > the worst-case, which would mean some non-HTM fallback.
> 
> We're talking about a 16 byte aligned load here -- one cacheline, probably 3-4
> instructions.  If an HTM cannot succeed with that, I'm happy to call it useless.

I would not call it useless.  I'm not a hardware engineer, but what I've
heard from hardware people over the years is that it can be quite
complicated (and thus costly) to guarantee progress.  We just need
obstruction-freedom, strictly speaking, which makes this somewhat
easier; but I guess there still are various corner cases for which it's
much easier for the hardware to just abort.

I'd say that lock elision is still the primary use case for HTM
currently; for that use case, there's no need for a guarantee to be able
to execute certain transactions.

Irrespective of whether we consider it useless or not, we can only work
with the guarantees that we get from the hardware vendors.  If we don't
get the guarantees, we can't use it.
I would guess that it's easier for hardware to guarantee atomicity of
aligned 16-byte loads (because the use case is more constrained), and
we're not even getting this as a guarantee on Intel.

