Re: GCC interpretation of C11 atomics (DR 459)

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: GCC interpretation of C11 atomics (DR 459)
       [not found] <886227277.5611063.1519759959364.ref@mail.yahoo.com>
@ 2018-02-27 20:20 ` Ruslan Nikolaev via gcc
  2018-02-27 22:19   ` Torvald Riegel
  0 siblings, 1 reply; 23+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-27 20:20 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Simon Wright, Alexander Monakov, Florian Weimer, Szabolcs Nagy,
	GCC Patches



> But we're not talking about that special case of 128b types here.  The
> majority of synchronization doesn't need more than machine word size.
Then why do you worry about read-only access for 128b types? (it is a special case anyway).

> No, such a program would have a bug anyway.  It wouldn't even
> synchronize properly. 

Therefore, it was not a valid example / use case (for 128-bit) in the first place. It was a *valid* example for smaller atomics, though. But that is exactly my point -- your current solution for 128 bit does not add any practical value except when you want to use lock-based solution (but see my explanation below).


> The lock would need to be shared between processes in the example I
> gave.  You have to build your own lock for that currently, because C/C++
> don't give you any process-shared locks.
At least in Linux, you can simply use eventfd(2) to reliably do it (without relying on "array of locks"). Given that it is not a very common use case, does not seem to need to have special C standard for this. And whatever C11 provides can not be relied upon anyway since you do not have strict guarantees that read-only memory is supported for larger types. For example, clang (and possibly other compilers) will break this assumption. At least, I would prefer to use eventfd in my application (if ever needed at all) since it has reliable and well-defined behavior in Linux.


   

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 20:20 ` GCC interpretation of C11 atomics (DR 459) Ruslan Nikolaev via gcc
@ 2018-02-27 22:19   ` Torvald Riegel
  2018-02-28  1:46     ` Ruslan Nikolaev via gcc
  0 siblings, 1 reply; 23+ messages in thread
From: Torvald Riegel @ 2018-02-27 22:19 UTC (permalink / raw)
  To: Ruslan Nikolaev
  Cc: Simon Wright, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

On Tue, 2018-02-27 at 19:32 +0000, Ruslan Nikolaev wrote:
> 
> > But we're not talking about that special case of 128b types here.  The
> > majority of synchronization doesn't need more than machine word size.
> Then why do you worry about read-only access for 128b types? (it is a special case anyway).

The types that are machine-word-sized and less determine what the
default is.  The 128b types are the outlier, and they should conform to
what we want for all other types.  The context of my statement was which
the producer-consumer example I brought up.

> > No, such a program would have a bug anyway.  It wouldn't even
> > synchronize properly. 
> 
> Therefore, it was not a valid example / use case (for 128-bit) in the first place.

You brought up the 128b case, and claimed that there was no proper
solution for it.  Now you are saying it's an invalid example.  If it is,
then I don't see what you are trying to argue for.

> It was a *valid* example for smaller atomics, though. But that is exactly my point -- your current solution for 128 bit does not add any practical value except when you want to use lock-based solution (but see my explanation below).
> 
> 
> > The lock would need to be shared between processes in the example I
> > gave.  You have to build your own lock for that currently, because C/C++
> > don't give you any process-shared locks.
> At least in Linux, you can simply use eventfd(2) to reliably do it (without relying on "array of locks").

You wanted to have something that's portable, and focused on the C
standard.  Now you are saying this doesn't matter anymore.  This seems
inconsistent.
My statement was in the context of portable C/C++ programs.

> Given that it is not a very common use case, does not seem to need to have special C standard for this.

Did you mean to say that the fact that there's no support in ISO C for
it means that it's not a use case it matters?  If so, that's not the
case.  The standards simply don't support all use cases that have
relevance to their users.

> And whatever C11 provides can not be relied upon anyway since you do not have strict guarantees that read-only memory is supported for larger types.

We discussed this topic already.  I'm not going to repeat myself.

> For example, clang (and possibly other compilers) will break this assumption. At least, I would prefer to use eventfd in my application (if ever needed at all) since it has reliable and well-defined behavior in Linux.

I don't have any opinion on what you personally use or don't use.

I care about GCC users as a whole, of course.  But it seems to me we
discussed all aspects of this, so I'd prefer this discussion to
eventually conclude.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 22:19   ` Torvald Riegel
@ 2018-02-28  1:46     ` Ruslan Nikolaev via gcc
  0 siblings, 0 replies; 23+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-28  1:46 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Simon Wright, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

Torvald, I think this discussion, indeed, gets pointless. Some of your responses clearly take my comments out of larger picture and context of the discussion.
One thing is clear that either implementation is fine with the standard (formally speaking) simply because the standard allows too much leeway on how you implement atomics. In fact, as I mentioned clang/llvm implements it differently. I find it as a weakness of the standard, actually, because for portable (across different compilers), the only thing you can more or less safely assume are single-width types.
Thank you for your output and discussion.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
@ 2018-02-27 19:32                       ` Torvald Riegel
  0 siblings, 0 replies; 23+ messages in thread
From: Torvald Riegel @ 2018-02-27 19:32 UTC (permalink / raw)
  To: Ruslan Nikolaev
  Cc: Simon Wright, Alexander Monakov, Florian Weimer, Szabolcs Nagy,
	GCC Patches

On Tue, 2018-02-27 at 17:29 +0000, Ruslan Nikolaev wrote:
> 
> 
> > Consider a producer-consumer relationship between two processes where
> > the producer doesn't want to wait for the consumer.  For example, the
> > producer could be an application that's being traced, and the consumer
> > is a trace aggregation tool.  The producer can provide a read-only
> > mapping to the consumer, and put a nonblocking ring buffer or something
> > similar in there.  That allows the consumer to read, but it still needs
> > atomic access because the consumer is modifying the ring buffer
> > concurrently.
> Sorry for getting into someone's else conversation... And what good solution gcc offers right now? It forces producer and consumer to use lock-based (BTW: global lock!)

It's not one global lock, but a lock from an array of locks (global per
process, though).

> approach for *both* producer and consumer if we are talking about 128-bit types.

But we're not talking about that special case of 128b types here.  The
majority of synchronization doesn't need more than machine word size.

> Therefore, sometimes producers *will* wait (by, effectively, blocking). Basically, it becomes useless.

No, such a program would have a bug anyway.  It wouldn't even
synchronize properly.  Please make yourself familiar with what the
standard means by "address-free".  This use case needs address-free, so
that's what the program has to ensure (and it can test that portably).
Only lock-free gives you address-free.

> In this case, I would rather use a lock-based approach which at least does not use a global lock.

The lock would need to be shared between processes in the example I
gave.  You have to build your own lock for that currently, because C/C++
don't give you any process-shared locks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 17:30                   ` Torvald Riegel
  2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
@ 2018-02-27 17:59                     ` Simon Wright
  1 sibling, 0 replies; 23+ messages in thread
From: Simon Wright @ 2018-02-27 17:59 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer,
	Szabolcs Nagy, GCC Patches

On 27 Feb 2018, at 17:07, Torvald Riegel <triegel@redhat.com> wrote:
> 
> On Tue, 2018-02-27 at 16:40 +0000, Simon Wright wrote:
>> On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote:
>>> 
>>> And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway
>> 
>> Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)
> 
> Consider a producer-consumer relationship between two processes where
> the producer doesn't want to wait for the consumer.  For example, the
> producer could be an application that's being traced, and the consumer
> is a trace aggregation tool.  The producer can provide a read-only
> mapping to the consumer, and put a nonblocking ring buffer or something
> similar in there.  That allows the consumer to read, but it still needs
> atomic access because the consumer is modifying the ring buffer
> concurrently.

OK, got that, thanks (this is what I meant by "just a constant view of the object", btw).

Misled by "read-only memory" since in the embedded world ROM (usually actually in Flash) is effectively read-only to all.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 17:30                   ` Torvald Riegel
@ 2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
  2018-02-27 19:32                       ` Torvald Riegel
  2018-02-27 17:59                     ` Simon Wright
  1 sibling, 1 reply; 23+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-27 17:33 UTC (permalink / raw)
  To: Torvald Riegel, Simon Wright
  Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, GCC Patches




> Consider a producer-consumer relationship between two processes where
> the producer doesn't want to wait for the consumer.  For example, the
> producer could be an application that's being traced, and the consumer
> is a trace aggregation tool.  The producer can provide a read-only
> mapping to the consumer, and put a nonblocking ring buffer or something
> similar in there.  That allows the consumer to read, but it still needs
> atomic access because the consumer is modifying the ring buffer
> concurrently.
Sorry for getting into someone's else conversation... And what good solution gcc offers right now? It forces producer and consumer to use lock-based (BTW: global lock!) approach for *both* producer and consumer if we are talking about 128-bit types.  Therefore, sometimes producers *will* wait (by, effectively, blocking). Basically, it becomes useless. In this case, I would rather use a lock-based approach which at least does not use a global lock. On the contrary, the alternative implementation would have been at least useful when both producers and consumers have full (RW) access.

Anyway, I already said that I personally will go with assembly inlines for right now. I just wanted to raise this concern since other people may find it useful in their projects.


   

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 16:46                 ` Simon Wright
  2018-02-27 16:52                   ` Florian Weimer
@ 2018-02-27 17:30                   ` Torvald Riegel
  2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
  2018-02-27 17:59                     ` Simon Wright
  1 sibling, 2 replies; 23+ messages in thread
From: Torvald Riegel @ 2018-02-27 17:30 UTC (permalink / raw)
  To: Simon Wright
  Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer,
	Szabolcs Nagy, GCC Patches

On Tue, 2018-02-27 at 16:40 +0000, Simon Wright wrote:
> On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote:
> > 
> > And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway
> 
> Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)

Consider a producer-consumer relationship between two processes where
the producer doesn't want to wait for the consumer.  For example, the
producer could be an application that's being traced, and the consumer
is a trace aggregation tool.  The producer can provide a read-only
mapping to the consumer, and put a nonblocking ring buffer or something
similar in there.  That allows the consumer to read, but it still needs
atomic access because the consumer is modifying the ring buffer
concurrently.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 16:46                 ` Simon Wright
@ 2018-02-27 16:52                   ` Florian Weimer
  2018-02-27 17:30                   ` Torvald Riegel
  1 sibling, 0 replies; 23+ messages in thread
From: Florian Weimer @ 2018-02-27 16:52 UTC (permalink / raw)
  To: Simon Wright, Ruslan Nikolaev
  Cc: Torvald Riegel, Alexander Monakov, Szabolcs Nagy, gcc

On 02/27/2018 05:40 PM, Simon Wright wrote:
> Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)

On many systems, the read-only nature of a memory region is a 
thread-local or process-local attribute.  Other parts of the system 
might have a different view on the same memory region.

Some CPUs support memory protection keys, which provide a cheap way to 
switch memory from read-only to read-write and back, and those switching 
operations deliberately do not involve a memory barrier.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 13:04               ` Ruslan Nikolaev via gcc
@ 2018-02-27 16:46                 ` Simon Wright
  2018-02-27 16:52                   ` Florian Weimer
  2018-02-27 17:30                   ` Torvald Riegel
  0 siblings, 2 replies; 23+ messages in thread
From: Simon Wright @ 2018-02-27 16:46 UTC (permalink / raw)
  To: Ruslan Nikolaev
  Cc: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote:
> 
> And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway

Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:36     ` Janne Blomqvist
@ 2018-02-27 10:22       ` Florian Weimer
  0 siblings, 0 replies; 23+ messages in thread
From: Florian Weimer @ 2018-02-27 10:22 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: nruslan_devel, gcc mailing list

On 02/26/2018 07:36 PM, Janne Blomqvist wrote:
> There is no such architectural guarantee. At least on some
> micro-architecture (AMD Opteron "Istanbul") it's possible to construct
> a test which fails, proving that at least on that micro-arch SSE2
> load/store isn't guaranteed to be atomic.

Looks like I was wrong.  Ugh.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:59           ` Ruslan Nikolaev via gcc
@ 2018-02-26 19:20             ` Torvald Riegel
  0 siblings, 0 replies; 23+ messages in thread
From: Torvald Riegel @ 2018-02-26 19:20 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd

On Mon, 2018-02-26 at 18:55 +0000, Ruslan Nikolaev via gcc wrote:
> Torvald, thank you for your output. See my response below. 
> 
>     On Monday, February 26, 2018 1:35 PM, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > ... does not imply this latter statement.  The statement you cited is
> > about what the standard itself requires, not what makes sense for a
> > particular implementation. 
> 
> True but makes sense to provide true atomics when they are available.

What do you mean by "true atomics"?  For me, that includes an atomic
load that is not emulated through an RMW.

> Since the standard seem to allow atomic_load implementation using RMW, does not seem to be a problem.

I believe that in the C++ committee, we have consensus that the intent
for lock-free atomics is that they should have an atomic load available
behaves like a typical natively-supported atomic load.  I can't speak
for the C committee, but at least the memory models are supposed to be
the same.

This is a decision that implementations ultimately make, however.

> In fact, lock_free flag for this type can return true only if mcx16 is specified; otherwise -- it returns false (since it can only be determined during runtime, assuming worst case scenario)

But then -mcx16 is a different ABI effectively, and it also changes what
(portable) synchronization code can expect when it sees an atomic type
declared as lock-free.

> > So, in such a case, using the wide CAS for
> > atomic loads breaks a reasonable assumption.  Moreover, it's also a
> > special case, in that 32b atomics do work as intended.
> 
> But in this case a programmer already makes an assumption that atomic_load does not use RMW which C11 does not seem to guarantee.

It makes sense for GCC as an implementation to guarantee that.

> Of course, for single-width operations, the programmer may in most practical cases assume it (even though there is no guarantee).

Requiring programs to consider what is "single-width" for a particular
platform, instead of just being able to test the lock-free property,
decreases portability.

> Anyway, there is no good solution here for double-width operations, and the programmer should not assume it is possible when writing portable code.

That's an argument in favor of splitting wide CAS out into a separate
interface -- C11 atomics are portable from the perspective of the major
use cases, and they should stay that way.

> In fact, lock-based solution is even more confusing and potentially error-prone (e.g., cannot be safely used inside signal handlers since it is not lock-free, etc)
> 
> > The behavior you favor would violate that, and
> > there's no portable way to distinguish one from the other. 
> 
> There is already a similar problem with IFFUNC (when used with Linux and glibc). In fact, I do not see any difference here. Redirection to libatomic when mcx16 is specified just adds extra cost + less predictable behavior. Moreover, it seems counterintuitive -- I specify a flag that mcx16 is supported but gcc still does not use it (at least directly). It is possible to make a change to libatomic to always use cmpxchg16b when available (even on systems without IFFUNC), this way it is totally consistent and binary compatible for code compiled with and without mcx16.

I've commented on that elsewhere in the thread.

> > I see your point in wanting to have a builtin or such for the 64b atomic
> > CAS.  However, IMO, this doesn't fit into the world of C11/C++11
> > atomics, and thus rather should be accessible through a separate
> > interface.
> Why not? If atomic_load is not really an issue, then it may be good to use standardized interface.

See above.  The atomic builtins are a package that, at least on GCC's
implementation, gives you a set of properties you can rely on in a
portable way (in particular when used through the C11/C++11 atomic ops).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:35         ` Torvald Riegel
@ 2018-02-26 18:59           ` Ruslan Nikolaev via gcc
  2018-02-26 19:20             ` Torvald Riegel
  0 siblings, 1 reply; 23+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 18:59 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd

Torvald, thank you for your output. See my response below. 

    On Monday, February 26, 2018 1:35 PM, Torvald Riegel <triegel@redhat.com> wrote:

> ... does not imply this latter statement.  The statement you cited is
> about what the standard itself requires, not what makes sense for a
> particular implementation. 

True but makes sense to provide true atomics when they are available. Since the standard seem to allow atomic_load implementation using RMW, does not seem to be a problem.
In fact, lock_free flag for this type can return true only if mcx16 is specified; otherwise -- it returns false (since it can only be determined during runtime, assuming worst case scenario)

> So, in such a case, using the wide CAS for
> atomic loads breaks a reasonable assumption.  Moreover, it's also a
> special case, in that 32b atomics do work as intended.

But in this case a programmer already makes an assumption that atomic_load does not use RMW which C11 does not seem to guarantee.Of course, for single-width operations, the programmer may in most practical cases assume it (even though there is no guarantee).
Anyway, there is no good solution here for double-width operations, and the programmer should not assume it is possible when writing portable code.In fact, lock-based solution is even more confusing and potentially error-prone (e.g., cannot be safely used inside signal handlers since it is not lock-free, etc)

> The behavior you favor would violate that, and
> there's no portable way to distinguish one from the other. 

There is already a similar problem with IFFUNC (when used with Linux and glibc). In fact, I do not see any difference here. Redirection to libatomic when mcx16 is specified just adds extra cost + less predictable behavior. Moreover, it seems counterintuitive -- I specify a flag that mcx16 is supported but gcc still does not use it (at least directly). It is possible to make a change to libatomic to always use cmpxchg16b when available (even on systems without IFFUNC), this way it is totally consistent and binary compatible for code compiled with and without mcx16.

> I see your point in wanting to have a builtin or such for the 64b atomic
> CAS.  However, IMO, this doesn't fit into the world of C11/C++11
> atomics, and thus rather should be accessible through a separate
> interface.
Why not? If atomic_load is not really an issue, then it may be good to use standardized interface.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  5:50   ` Alexander Monakov
  2018-02-26  7:24     ` Fw: " Ruslan Nikolaev via gcc
@ 2018-02-26 18:56     ` Torvald Riegel
  1 sibling, 0 replies; 23+ messages in thread
From: Torvald Riegel @ 2018-02-26 18:56 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Ruslan Nikolaev, gcc

On Mon, 2018-02-26 at 08:50 +0300, Alexander Monakov wrote:
> > For these reasons, it may be a good idea if GCC folks reconsider past
> > decision. And just to clarify: if mcx16 (x86-64) is not specified during
> > compilation, it is totally OK to redirect to libatomic, and there make the
> > final decision if target CPU supports a given instruction or not. But if it is
> > specified, it makes sense for performance reasons and lock-freedom guarantees
> > to always generate it directly. 
> 
> You don't mention it directly, so just to make it clear for readers: on systems
> where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do
> exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to
> lock-free RMW implementations if so.  (I don't like this solution)

I thought we had fixed that to not use the wide CAS?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:16   ` Florian Weimer
  2018-02-26 18:34     ` Ruslan Nikolaev via gcc
@ 2018-02-26 18:36     ` Janne Blomqvist
  2018-02-27 10:22       ` Florian Weimer
  1 sibling, 1 reply; 23+ messages in thread
From: Janne Blomqvist @ 2018-02-26 18:36 UTC (permalink / raw)
  To: Florian Weimer; +Cc: nruslan_devel, gcc mailing list

On Mon, Feb 26, 2018 at 8:15 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 02/26/2018 05:00 AM, Ruslan Nikolaev via gcc wrote:
>>
>> If I understand correctly, the redirection to libatomic was made for 2
>> reasons:
>> 1. cmpxchg16b is not available on early amd64 processors. (However, mcx16
>> flag already specifies that you use CPUs that have this instruction, so it
>> should not be a concern when the flag is specified.)
>> 2. atomic_load on read-only memory.
>
>
> I think x86-64 should be able to do atomic load and store via SSE2
> registers,

There is no such architectural guarantee. At least on some
micro-architecture (AMD Opteron "Istanbul") it's possible to construct
a test which fails, proving that at least on that micro-arch SSE2
load/store isn't guaranteed to be atomic. See
https://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations/7647825
for more discussion and a testcase.




-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 14:53       ` Ruslan Nikolaev via gcc
@ 2018-02-26 18:35         ` Torvald Riegel
  2018-02-26 18:59           ` Ruslan Nikolaev via gcc
  0 siblings, 1 reply; 23+ messages in thread
From: Torvald Riegel @ 2018-02-26 18:35 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd

On Mon, 2018-02-26 at 14:53 +0000, Ruslan Nikolaev via gcc wrote:
> Thank you for more comments, my response is below.
> 
> 
> 
> On Mon, 26 Feb 2018, Szabolcs Nagy wrote:> 
> > rmw load is only valid if the implementation can
> > guarantee that atomic objects are never read-only.
> But per response from WG14 regarding DR 459 which I quoted, the standard does not seem to define behavior for read-only memory

This ...

> (and const qualifier should not suggest that). RMW, according to them, is fine for atomic_load.

... does not imply this latter statement.  The statement you cited is
about what the standard itself requires, not what makes sense for a
particular implementation.  For example, one could build an
implementation that does not have any read-only memory and doesn't
distinguish between loads and atomic RMW operations; in such as case, it
wouldn't make sense for the standard to require it.  OTOH though, if
read-only memory exists, it makes sense for an implementation to try to
respect it.

Consider trying to use atomics for memory mapped read-only from another
process, for example to observe output from that other process.  You
don't want to make it read-write for security reasons, for example.
Atomic operations designated as lock-free by the implementation are
supposed to be address-free too, which targets the use case of mapping
memory from somewhere else.  So, in such a case, using the wide CAS for
atomic loads breaks a reasonable assumption.  Moreover, it's also a
special case, in that 32b atomics do work as intended.

Also, I believe the vast majority of synchronization code makes implicit
assumptions about the performance of atomic load operations, notably
that concurrent loads don't create contention, or at least much less
than concurrent writes.  The behavior you favor would violate that, and
there's no portable way to distinguish one from the other. 

Thus, GCC only declares operations as lock-free if atomic loads of the
particular size/alignment are natively supported, and with the
performance properties one would associate with just a load on the
particular arch.  If an atomic load and an atomic CAS are supported,
that's fine; if there's just a CAS, that's not enough.

I see your point in wanting to have a builtin or such for the 64b atomic
CAS.  However, IMO, this doesn't fit into the world of C11/C++11
atomics, and thus rather should be accessible through a separate
interface.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:16   ` Florian Weimer
@ 2018-02-26 18:34     ` Ruslan Nikolaev via gcc
  2018-02-26 18:36     ` Janne Blomqvist
  1 sibling, 0 replies; 23+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 18:34 UTC (permalink / raw)
  To: Florian Weimer; +Cc: gcc

On Monday, February 26, 2018 1:15 PM, Florian Weimer <fweimer@redhat.com> wrote:

> I think x86-64 should be able to do atomic load and store via SSE2 
> registers, but perhaps if the memory is suitably aligned (which is the 
> other problem—the libatomic code will work irrespective of alignment, as 
> far as I understand it).

IIRC, it is not always guaranteed to be atomic, so RMW is probably the only safe option for x86-64. And for ARM64, too, as far as I understand.

Just to summarize what can be done if the proposed change is accepted (from the discussion so far):

1. _Atomic on objects larger than 8 bytes should not be placed in .rodata even if declared as const. It can also be specified that atomic_load should not be used on read-only memory with double-width operations.

2. libatomic can be modified to redirect to functions that use cmpxchg16b (whenever available on target CPU) through regular functions pointers even if IFFUNC is not available. This will provide consistent behavior everywhere, and binary compatibility for mcx16 and mno-cx16

3. never redirect to libatomic for arm64 (since ldaxp/staxp are available), redirect for x86-64 only if mcx16 is not specified. For ARM64, there is no mcx16 option at all.

-- Ruslan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  4:01 ` Ruslan Nikolaev via gcc
  2018-02-26  5:50   ` Alexander Monakov
  2018-02-26 12:30   ` Szabolcs Nagy
@ 2018-02-26 18:16   ` Florian Weimer
  2018-02-26 18:34     ` Ruslan Nikolaev via gcc
  2018-02-26 18:36     ` Janne Blomqvist
  2 siblings, 2 replies; 23+ messages in thread
From: Florian Weimer @ 2018-02-26 18:16 UTC (permalink / raw)
  To: nruslan_devel; +Cc: gcc

On 02/26/2018 05:00 AM, Ruslan Nikolaev via gcc wrote:
> If I understand correctly, the redirection to libatomic was made for 2 reasons:
> 1. cmpxchg16b is not available on early amd64 processors. (However, mcx16 flag already specifies that you use CPUs that have this instruction, so it should not be a concern when the flag is specified.)
> 2. atomic_load on read-only memory.

I think x86-64 should be able to do atomic load and store via SSE2 
registers, but perhaps if the memory is suitably aligned (which is the 
other problemâ€”the libatomic code will work irrespective of alignment, as 
far as I understand it).

Thanks,
Florian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 13:57     ` Alexander Monakov
  2018-02-26 14:51       ` Szabolcs Nagy
@ 2018-02-26 14:53       ` Ruslan Nikolaev via gcc
  2018-02-26 18:35         ` Torvald Riegel
  1 sibling, 1 reply; 23+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 14:53 UTC (permalink / raw)
  To: Alexander Monakov, Szabolcs Nagy; +Cc: gcc, nd

Thank you for more comments, my response is below.



On Mon, 26 Feb 2018, Szabolcs Nagy wrote:> 
> rmw load is only valid if the implementation can
> guarantee that atomic objects are never read-only.
But per response from WG14 regarding DR 459 which I quoted, the standard does not seem to define behavior for read-only memory (and const qualifier should not suggest that). RMW, according to them, is fine for atomic_load.

> current implementations on linux (including clang)
> don't do that, so an rmw load can observably break
> conforming c code: a static global const object is
> placed in .rodata section and thus rmw on it is a
> crash at runtime contrary to c standard requirements.
I have just tried to compile the code using clang. Latest stable version of clang seems to emit cmpxchg16b for the code you mentioned if I specify mcx16. If I do not, it redirects to libatomic.  (I have not tried the version from the trunk, though.)


  On Monday, February 26, 2018 8:57 AM, Alexander Monakov wrote:
> OK, but that sounds like a matter of not emitting atomic
> objects into .rodata, which shouldn't be a big problem,
> if not for backwards compatibility concern?

I agree, sounds like a good idea. Certainly for _Atomic objects > 8 bytes.

> and then with new enough libatomic on Glibc this segfaults
> with GCC on x86_64 too due to IFUNC redirection mentioned
> in the other subthread.

Seems like it is a problem anyway. Another reason to never emit _Atomic inside .rodata




   

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 13:57     ` Alexander Monakov
@ 2018-02-26 14:51       ` Szabolcs Nagy
  2018-02-26 14:53       ` Ruslan Nikolaev via gcc
  1 sibling, 0 replies; 23+ messages in thread
From: Szabolcs Nagy @ 2018-02-26 14:51 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: nd, Ruslan Nikolaev, gcc

On 26/02/18 13:56, Alexander Monakov wrote:
> On Mon, 26 Feb 2018, Szabolcs Nagy wrote:
>>
>> rmw load is only valid if the implementation can
>> guarantee that atomic objects are never read-only.
> 
> OK, but that sounds like a matter of not emitting atomic
> objects into .rodata, which shouldn't be a big problem,
> if not for backwards compatibility concern?
> 

well gcc wants to allow atomic access on non-atomic
objects too, otherwise public interfaces may need to
change to use the _Atomic qualifier (which is not even
valid in c++ so it would cause all sorts of breakage).

i think it would be valid to put _Atomic stuff in
writable section and then say atomic load is only
supported on const objects if it is declared with
_Atomic, this would make all strictly conforming
c code work as well as most code that ppl write in
practice (they probably don't use atomics on global
consts).

>> current implementations on linux (including clang)
>> don't do that, so an rmw load can observably break
>> conforming c code: a static global const object is
>> placed in .rodata section and thus rmw on it is a
>> crash at runtime contrary to c standard requirements.
> 
> Note that in your example GCC emits 'x' as a common symbol,
> you need '... x = { 0 };' for it to appear in .rodata,
> 

i see.

static ... x = {0}; and static ... x; are
equivalent in c, so if gcc treats them differently
that's a gcc weirdness, but does not change the
issue that there is no guarantee about readonlyness.

>> on an aarch64 machine clang miscompiles this code:
> [...]
> 
> and then with new enough libatomic on Glibc this segfaults
> with GCC on x86_64 too due to IFUNC redirection mentioned
> in the other subthread.
> 

that's yet another issue, that this is not fully
fixed in x86 gcc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 12:30   ` Szabolcs Nagy
@ 2018-02-26 13:57     ` Alexander Monakov
  2018-02-26 14:51       ` Szabolcs Nagy
  2018-02-26 14:53       ` Ruslan Nikolaev via gcc
  0 siblings, 2 replies; 23+ messages in thread
From: Alexander Monakov @ 2018-02-26 13:57 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Ruslan Nikolaev, gcc, nd

On Mon, 26 Feb 2018, Szabolcs Nagy wrote:
> 
> rmw load is only valid if the implementation can
> guarantee that atomic objects are never read-only.

OK, but that sounds like a matter of not emitting atomic
objects into .rodata, which shouldn't be a big problem,
if not for backwards compatibility concern?

> current implementations on linux (including clang)
> don't do that, so an rmw load can observably break
> conforming c code: a static global const object is
> placed in .rodata section and thus rmw on it is a
> crash at runtime contrary to c standard requirements.

Note that in your example GCC emits 'x' as a common symbol,
you need '... x = { 0 };' for it to appear in .rodata,

> on an aarch64 machine clang miscompiles this code:
[...]

and then with new enough libatomic on Glibc this segfaults
with GCC on x86_64 too due to IFUNC redirection mentioned
in the other subthread.

Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  4:01 ` Ruslan Nikolaev via gcc
  2018-02-26  5:50   ` Alexander Monakov
@ 2018-02-26 12:30   ` Szabolcs Nagy
  2018-02-26 13:57     ` Alexander Monakov
  2018-02-26 18:16   ` Florian Weimer
  2 siblings, 1 reply; 23+ messages in thread
From: Szabolcs Nagy @ 2018-02-26 12:30 UTC (permalink / raw)
  To: Ruslan Nikolaev, gcc; +Cc: nd

On 26/02/18 04:00, Ruslan Nikolaev via gcc wrote:
> 1. Not consistent with clang/llvm which completely supports double-width atomics for arm32, arm64, x86 and x86-64 making it possible to write portable code (w/o specific extensions or assembly code) across all these architectures (which is finally possible with C11!)
this should be reported as a bug against clang.

there is no abi guarantee that double-width atomics
will be able to synchronize with code in other modules,
you have to introduce a new abi to do this whatever
that takes (new elf flag, new dynamic linker name,..).

> 4. atomic_load can be implemented using read-modify-write as it is the only option for x86-64 and arm64 (see below).
> 

no, it can't be.

>  Â  Â  Â [..]Â  The actual nature of read-only memory and how it can be used are outside the scope of the standard, so there is nothing to prevent atomic_load from being implemented as a read-modify-write operation.
> 

rmw load is only valid if the implementation can
guarantee that atomic objects are never read-only.

current implementations on linux (including clang)
don't do that, so an rmw load can observably break
conforming c code: a static global const object is
placed in .rodata section and thus rmw on it is a
crash at runtime contrary to c standard requirements.

on an aarch64 machine clang miscompiles this code:

$ cat a.c
#include <stdatomic.h>

static const _Atomic struct S {long i,j;} x;

int f(const _Atomic struct S *p)
{
	struct S y = *p;
	return y.i;
}

int main()
{
	return f(&x);
}
$ gcc a.c -latomic
$ ./a.out
$ clang a.c -latomic
$ ./a.out
Segmentation fault (core dumped)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  4:01 ` Ruslan Nikolaev via gcc
@ 2018-02-26  5:50   ` Alexander Monakov
  2018-02-26  7:24     ` Fw: " Ruslan Nikolaev via gcc
  2018-02-26 18:56     ` Torvald Riegel
  2018-02-26 12:30   ` Szabolcs Nagy
  2018-02-26 18:16   ` Florian Weimer
  2 siblings, 2 replies; 23+ messages in thread
From: Alexander Monakov @ 2018-02-26  5:50 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: gcc

Hello,

Although I wouldn't like to fight defending GCC's design change here, let me
offer a couple of corrections/additions so everyone is on the same page:

On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote:
> 
> 1. Not consistent with clang/llvm which completely supports double-width
> atomics for arm32, arm64, x86 and x86-64 making it possible to write portable
> code (w/o specific extensions or assembly code) across all these architectures
> (which is finally possible with C11!).The behavior of clang: if mxc16 is
> specified, cmpxchg16b is generated for x86-64 (without any calls to
> libatomic), otherwise -- redirection to libatomic. For arm64, ldaxp/staxp are
> always generated. In my opinion, this is very logical and non-confusing.

Note that there's more issues to that than just behavior on readonly memory:
you need to ensure that the whole program, including all static and shared
libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level
support to ensure that), or you'd need to be sure that it's safe to mix code
compiled with different -mcx16 settings because it never happens to interop
on wide atomic objects.

(if you mix -mcx16 and -mno-cx16 code operating on the same 128-bit object,
you get wrong code that will appear to work >99% of the time)

> 3. The behavior is inconsistent even within GCC. Older (and more limited, less
> portable, etc) __sync builtins still use cmpxchg16b directly, newer __atomic
> and C11 -- do not. Moreover, __sync builtins are probably less suitable for
> arm/arm64.

Note that there's no "load" function in the __sync family, so the original
concern about operations on readonly memory does not apply.

> For these reasons, it may be a good idea if GCC folks reconsider past
> decision. And just to clarify: if mcx16 (x86-64) is not specified during
> compilation, it is totally OK to redirect to libatomic, and there make the
> final decision if target CPU supports a given instruction or not. But if it is
> specified, it makes sense for performance reasons and lock-freedom guarantees
> to always generate it directly. 

You don't mention it directly, so just to make it clear for readers: on systems
where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do
exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to
lock-free RMW implementations if so.  (I don't like this solution)

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 23+ messages in thread

* GCC interpretation of C11 atomics (DR 459)
       [not found] <1615980330.4453149.1519617655582.ref@mail.yahoo.com>
@ 2018-02-26  4:01 ` Ruslan Nikolaev via gcc
  2018-02-26  5:50   ` Alexander Monakov
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26  4:01 UTC (permalink / raw)
  To: gcc

Hi
I have read multiple bug reports (84522, 80878, 70490), and the past decision regarding GCC change to redirect double-width (128-bit) atomics for x86-64 and arm64 to libatomic. Below I mention major concerns as well as the response from C11 (WG14) regarding DR 459 which, most likely, triggered this change in more recent GCC releases in the first place. 
If I understand correctly, the redirection to libatomic was made for 2 reasons:
1. cmpxchg16b is not available on early amd64 processors. (However, mcx16 flag already specifies that you use CPUs that have this instruction, so it should not be a concern when the flag is specified.)
2. atomic_load on read-only memory. DR 459 now requires to have 'const' qualifiers for atomic_load which probably resulted in the interpretation that read-only memory must be supported. However, per response from C11/WG14 (see below), it does not seem to be the case at all. Therefore, previously filed bug 70490 does not seem to be valid.
There are several concerns with current GCC behavior:

1. Not consistent with clang/llvm which completely supports double-width atomics for arm32, arm64, x86 and x86-64 making it possible to write portable code (w/o specific extensions or assembly code) across all these architectures (which is finally possible with C11!).The behavior of clang: if mxc16 is specified, cmpxchg16b is generated for x86-64 (without any calls to libatomic), otherwise -- redirection to libatomic. For arm64, ldaxp/staxp are always generated. In my opinion, this is very logical and non-confusing.

2. Oftentimes you want to have strict guarantees (by specifying mcx16 flag for x86-64) that the generated code is lock-free, otherwise it is useless. Double-width atomics are often used in lock-free algorithms that use tags (stamps) for pointers to resolve the ABA problem. So, it is very useful to have corresponding support in the compiler.

3. The behavior is inconsistent even within GCC. Older (and more limited, less portable, etc) __sync builtins still use cmpxchg16b directly, newer __atomic and C11 -- do not. Moreover, __sync builtins are probably less suitable for arm/arm64.

4. atomic_load can be implemented using read-modify-write as it is the only option for x86-64 and arm64 (see below).

For these reasons, it may be a good idea if GCC folks reconsider past decision. And just to clarify: if mcx16 (x86-64) is not specified during compilation, it is totally OK to redirect to libatomic, and there make the final decision if target CPU supports a given instruction or not. But if it is specified, it makes sense for performance reasons and lock-freedom guarantees to always generate it directly. 

-- Ruslan

Response from the WG14 (C11) Convener regarding DR 459: (I asked for a permission to publish this response here.)
Ruslan,

     Thank you for your comments.  There is no normative requirement that const objects be suitable for read-only memory.  An example and a footnote refer to read-only memory as a way to illustrate a point, but examples and footnotes are not normative.  The actual nature of read-only memory and how it can be used are outside the scope of the standard, so there is nothing to prevent atomic_load from being implemented as a read-modify-write operation.

                                        David
My original email:

Dear David Keaton,
After reviewing the proposed change DR 459 for C11: http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_459 ,I identified that adding const qualifier to atomic_load (C11 implements its without it) may actually be harmful in some cases.
Particularly, for double-width (128-bit) atomics found in x86-64 (cmpxchg16b instruction), arm64 (ldaxp/staxp instructions), it is currently only possible to implement atomic_load for 128 bit using corresponding read-modify-write instructions (i.e., potentially rewriting memory with the same value, but, in essence, not changing it). But these implementations will not work on read-only memory. Similar concerns apply to some extent to x86 and arm32 for double-width (64-bit) atomics. Otherwise, there is no obstacle to implement all C11 atomics for corresponding types in these architectures. Moreover, a well-known clang/llvm compiler already implements all double-width operations for x86, x86-64, arm32 and arm64 (atomic_load is implemented using corresponding read-modify-write instructions). Double-width atomics are often used in data structures that need tagging for pointers to avoid the ABA problem (e.g., in lock-free stacks and queues).
It is my understanding that C11 aimed to make atomics more or less portable across different microarchitectures, while at the same time provide an ability for a compiler to optimize code well and utilize all potential of the corresponding microarchitecture.
If now it is required to support read-only memory (i.e., const qualifier) for atomic_load, 128-bit atomics are likely be impossible to implement in any meaningful and portable way. Thus, anyone who wants to use them will have to go with assembly fallbacks (or compiler extensions), thus, partially defeating the purpose of C11 atomics. One way to address this concern would be to state that atomic_load on read-only memory is implementation-defined and may not be supported for all types. That would also mean to go with the previous C11 definition (i.e., without the const qualifier) to implement atomic_load rather than what was proposed in the DR 459 change.
I am ready to submit a more formal proposal if this is something that can be considered by the committee.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-02-27 22:19 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <886227277.5611063.1519759959364.ref@mail.yahoo.com>
2018-02-27 20:20 ` GCC interpretation of C11 atomics (DR 459) Ruslan Nikolaev via gcc
2018-02-27 22:19   ` Torvald Riegel
2018-02-28  1:46     ` Ruslan Nikolaev via gcc
     [not found] <1615980330.4453149.1519617655582.ref@mail.yahoo.com>
2018-02-26  4:01 ` Ruslan Nikolaev via gcc
2018-02-26  5:50   ` Alexander Monakov
2018-02-26  7:24     ` Fw: " Ruslan Nikolaev via gcc
2018-02-26 19:07       ` Torvald Riegel
2018-02-26 19:43         ` Ruslan Nikolaev via gcc
2018-02-26 22:49           ` Ruslan Nikolaev via gcc
2018-02-27 12:39             ` Torvald Riegel
2018-02-27 13:04               ` Ruslan Nikolaev via gcc
2018-02-27 16:46                 ` Simon Wright
2018-02-27 16:52                   ` Florian Weimer
2018-02-27 17:30                   ` Torvald Riegel
2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
2018-02-27 19:32                       ` Torvald Riegel
2018-02-27 17:59                     ` Simon Wright
2018-02-26 18:56     ` Torvald Riegel
2018-02-26 12:30   ` Szabolcs Nagy
2018-02-26 13:57     ` Alexander Monakov
2018-02-26 14:51       ` Szabolcs Nagy
2018-02-26 14:53       ` Ruslan Nikolaev via gcc
2018-02-26 18:35         ` Torvald Riegel
2018-02-26 18:59           ` Ruslan Nikolaev via gcc
2018-02-26 19:20             ` Torvald Riegel
2018-02-26 18:16   ` Florian Weimer
2018-02-26 18:34     ` Ruslan Nikolaev via gcc
2018-02-26 18:36     ` Janne Blomqvist
2018-02-27 10:22       ` Florian Weimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).