GCC interpretation of C11 atomics (DR 459)

public inbox for gcc@gcc.gnu.org

* GCC interpretation of C11 atomics (DR 459)
       [not found] <1615980330.4453149.1519617655582.ref@mail.yahoo.com>
@ 2018-02-26  4:01 ` Ruslan Nikolaev via gcc
  2018-02-26  5:50   ` Alexander Monakov
                     ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26  4:01 UTC (permalink / raw)
  To: gcc

Hi
I have read multiple bug reports (84522, 80878, 70490) and the past decision to redirect double-width (128-bit) atomics for x86-64 and arm64 to libatomic in GCC. Below I describe my major concerns as well as the response from WG14 (C11) regarding DR 459, which most likely triggered this change in more recent GCC releases in the first place.
If I understand correctly, the redirection to libatomic was made for 2 reasons:
1. cmpxchg16b is not available on early amd64 processors. (However, the -mcx16 flag already states that the target CPUs have this instruction, so it should not be a concern when the flag is specified.)
2. atomic_load on read-only memory. DR 459 now requires a 'const' qualifier for atomic_load, which probably resulted in the interpretation that read-only memory must be supported. However, per the response from WG14 (see below), this does not seem to be the case at all. Therefore, the previously filed bug 70490 does not seem to be valid.
There are several concerns with the current GCC behavior:

1. Not consistent with clang/llvm, which completely supports double-width atomics for arm32, arm64, x86 and x86-64, making it possible to write portable code (w/o specific extensions or assembly code) across all these architectures (which is finally possible with C11!). The behavior of clang: if -mcx16 is specified, cmpxchg16b is generated for x86-64 (without any calls to libatomic); otherwise, calls are redirected to libatomic. For arm64, ldaxp/stlxp are always generated. In my opinion, this is very logical and non-confusing.

2. Oftentimes you want to have strict guarantees (by specifying the -mcx16 flag for x86-64) that the generated code is lock-free, otherwise it is useless. Double-width atomics are often used in lock-free algorithms that use tags (stamps) for pointers to resolve the ABA problem. So, it is very useful to have corresponding support in the compiler.

3. The behavior is inconsistent even within GCC. The older (and more limited, less portable, etc.) __sync builtins still use cmpxchg16b directly, while the newer __atomic builtins and C11 atomics do not. Moreover, __sync builtins are probably less suitable for arm/arm64.

4. atomic_load can be implemented using read-modify-write, which is the only option for 128-bit loads on x86-64 and arm64 (see below).

For these reasons, it may be a good idea if GCC folks reconsider the past decision. And just to clarify: if -mcx16 (x86-64) is not specified during compilation, it is totally OK to redirect to libatomic and make the final decision there on whether the target CPU supports a given instruction or not. But if it is specified, it makes sense for performance reasons and lock-freedom guarantees to always generate it directly.
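
[Editorial illustration, not part of the original message.] The ABA tagging mentioned in point 2 can be sketched roughly as follows. To keep the sketch buildable without -mcx16 or libatomic, it packs a 32-bit node index and a 32-bit tag into one 64-bit atomic word; the 128-bit case discussed in this thread has the same shape with a full pointer and a pointer-sized tag. All names (pool, stack_push, stack_pop, NIL) are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 8u
#define NIL UINT32_MAX                 /* "null" node index */

typedef struct {
    _Atomic uint32_t next;             /* index of the next node, or NIL */
    int value;
} node_t;

static node_t pool[POOL_SIZE];

/* Head word: tag in the upper 32 bits, node index in the lower 32 bits.
   Initial state is tag 0, index NIL (empty stack). */
static _Atomic uint64_t head = UINT32_MAX;

static uint64_t pack(uint32_t index, uint32_t tag)
{
    return ((uint64_t)tag << 32) | index;
}

static uint32_t index_of(uint64_t word) { return (uint32_t)word; }
static uint32_t tag_of(uint64_t word)   { return (uint32_t)(word >> 32); }

/* Push node `index`; the tag is bumped on every successful update. */
void stack_push(uint32_t index)
{
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        atomic_store(&pool[index].next, index_of(old));
        desired = pack(index, tag_of(old) + 1u);
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
}

/* Pop the top node and return its index, or NIL if the stack is empty.
   If the head was changed and changed back in the meantime (ABA), the
   tag differs, so the compare-exchange fails and the loop retries. */
uint32_t stack_pop(void)
{
    uint64_t old = atomic_load(&head);
    for (;;) {
        uint32_t index = index_of(old);
        if (index == NIL)
            return NIL;
        uint32_t next = atomic_load(&pool[index].next);
        uint64_t desired = pack(next, tag_of(old) + 1u);
        if (atomic_compare_exchange_weak(&head, &old, desired))
            return index;
    }
}
```

Pushing indices 0 and 1 and then popping returns them in LIFO order. With real pointers the head word doubles in size, which is why this pattern needs a double-width compare-and-swap on 64-bit targets.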

-- Ruslan

Response from the WG14 (C11) Convener regarding DR 459: (I asked for permission to publish this response here.)
Ruslan,

     Thank you for your comments.  There is no normative requirement that const objects be suitable for read-only memory.  An example and a footnote refer to read-only memory as a way to illustrate a point, but examples and footnotes are not normative.  The actual nature of read-only memory and how it can be used are outside the scope of the standard, so there is nothing to prevent atomic_load from being implemented as a read-modify-write operation.

                                        David

My original email:

Dear David Keaton,
After reviewing the proposed change DR 459 for C11: http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_459 , I identified that adding the const qualifier to atomic_load (C11 implements it without the qualifier) may actually be harmful in some cases.
Particularly, for double-width (128-bit) atomics found in x86-64 (cmpxchg16b instruction) and arm64 (ldaxp/stlxp instructions), it is currently only possible to implement a 128-bit atomic_load using the corresponding read-modify-write instructions (i.e., potentially rewriting memory with the same value, but, in essence, not changing it). But these implementations will not work on read-only memory. Similar concerns apply to some extent to x86 and arm32 for double-width (64-bit) atomics. Otherwise, there is no obstacle to implementing all C11 atomics for the corresponding types on these architectures. Moreover, the well-known clang/llvm compiler already implements all double-width operations for x86, x86-64, arm32 and arm64 (atomic_load is implemented using the corresponding read-modify-write instructions). Double-width atomics are often used in data structures that need tagging for pointers to avoid the ABA problem (e.g., in lock-free stacks and queues).
It is my understanding that C11 aimed to make atomics more or less portable across different microarchitectures, while at the same time providing an ability for a compiler to optimize code well and utilize the full potential of the corresponding microarchitecture.
If it is now required to support read-only memory (i.e., the const qualifier) for atomic_load, 128-bit atomics are likely to be impossible to implement in any meaningful and portable way. Thus, anyone who wants to use them will have to go with assembly fallbacks (or compiler extensions), partially defeating the purpose of C11 atomics. One way to address this concern would be to state that atomic_load on read-only memory is implementation-defined and may not be supported for all types. That would also mean going with the previous C11 definition (i.e., without the const qualifier) to implement atomic_load rather than what was proposed in the DR 459 change.
I am ready to submit a more formal proposal if this is something that can be considered by the committee.
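
[Editorial illustration, not part of the original message.] The read-modify-write load described in the email above can be sketched as below. The sketch uses a 64-bit object so that it builds without -mcx16 or libatomic; the assumption is that a 128-bit load built on cmpxchg16b follows the same pattern. The function name load_via_cas is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Load *obj using only a compare-exchange.
   - If *obj == 0, the exchange succeeds and stores 0 over 0.
   - Otherwise it fails and copies the current value into `expected`.
   Either way `expected` ends up holding the value of *obj.
   Note the parameter cannot be const-qualified, and on x86 a locked
   cmpxchg performs a write cycle even when the comparison fails, so
   the object must live in writable memory. */
uint64_t load_via_cas(_Atomic uint64_t *obj)
{
    uint64_t expected = 0;
    atomic_compare_exchange_strong(obj, &expected, (uint64_t)0);
    return expected;
}
```

The stored value never changes, which is the sense in which the operation is "not visible" to the abstract machine, yet the memory is still written.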


* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  4:01 ` GCC interpretation of C11 atomics (DR 459) Ruslan Nikolaev via gcc
@ 2018-02-26  5:50   ` Alexander Monakov
  2018-02-26  7:24     ` Fw: " Ruslan Nikolaev via gcc
  2018-02-26 18:56     ` Torvald Riegel
  2018-02-26 12:30   ` Szabolcs Nagy
  2018-02-26 18:16   ` Florian Weimer
  2 siblings, 2 replies; 38+ messages in thread
From: Alexander Monakov @ 2018-02-26  5:50 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: gcc

Hello,

Although I wouldn't like to fight defending GCC's design change here, let me
offer a couple of corrections/additions so everyone is on the same page:

On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote:
> 
> 1. Not consistent with clang/llvm which completely supports double-width
> atomics for arm32, arm64, x86 and x86-64 making it possible to write portable
> code (w/o specific extensions or assembly code) across all these architectures
> (which is finally possible with C11!). The behavior of clang: if -mcx16 is
> specified, cmpxchg16b is generated for x86-64 (without any calls to
> libatomic), otherwise -- redirection to libatomic. For arm64, ldaxp/stlxp are
> always generated. In my opinion, this is very logical and non-confusing.

Note that there's more issues to that than just behavior on readonly memory:
you need to ensure that the whole program, including all static and shared
libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level
support to ensure that), or you'd need to be sure that it's safe to mix code
compiled with different -mcx16 settings because it never happens to interop
on wide atomic objects.

(if you mix -mcx16 and -mno-cx16 code operating on the same 128-bit object,
you get wrong code that will appear to work >99% of the time)
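
[Editorial illustration, not part of the original message.] A minimal sketch of why mixing the two strategies is unsafe, assuming the -mno-cx16 side ends up in a lock-based fallback while the -mcx16 side accesses the object directly and never takes that lock. The interleaving is replayed deterministically in a single thread; in a real program it would only occur under an unlucky schedule. All names are hypothetical, and the plain struct copy merely stands in for an inline cmpxchg16b-based load.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } wide_t;

/* Lock used only by the lock-based side. */
static atomic_flag wide_lock = ATOMIC_FLAG_INIT;

/* Lock-based store, split in two so the demo can pause in the middle. */
void locked_store_begin(wide_t *obj, wide_t v)
{
    while (atomic_flag_test_and_set(&wide_lock)) { /* spin */ }
    obj->lo = v.lo;                    /* first half written */
}

void locked_store_end(wide_t *obj, wide_t v)
{
    obj->hi = v.hi;                    /* second half written */
    atomic_flag_clear(&wide_lock);
}

/* Stand-in for the inline path: it reads the object without ever
   looking at wide_lock, so the lock cannot protect it. */
wide_t unlocked_load(const wide_t *obj)
{
    return *obj;
}
```

Calling locked_store_begin() with {1, 1} on an object holding {0, 0} and then unlocked_load() observes {1, 0}, a value that was never stored as a whole.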

> 3. The behavior is inconsistent even within GCC. Older (and more limited, less
> portable, etc) __sync builtins still use cmpxchg16b directly, newer __atomic
> and C11 -- do not. Moreover, __sync builtins are probably less suitable for
> arm/arm64.

Note that there's no "load" function in the __sync family, so the original
concern about operations on readonly memory does not apply.

> For these reasons, it may be a good idea if GCC folks reconsider past
> decision. And just to clarify: if mcx16 (x86-64) is not specified during
> compilation, it is totally OK to redirect to libatomic, and there make the
> final decision if target CPU supports a given instruction or not. But if it is
> specified, it makes sense for performance reasons and lock-freedom guarantees
> to always generate it directly. 

You don't mention it directly, so just to make it clear for readers: on systems
where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do
exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to
lock-free RMW implementations if so.  (I don't like this solution)

Thanks.
Alexander


* Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  5:50   ` Alexander Monakov
@ 2018-02-26  7:24     ` Ruslan Nikolaev via gcc
  2018-02-26  8:20       ` Alexander Monakov
  2018-02-26 19:07       ` Torvald Riegel
  2018-02-26 18:56     ` Torvald Riegel
  1 sibling, 2 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26  7:24 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc

Alexander,
Thank you for your comments. Please see my response below. I definitely do not want to fight for or against this change in gcc, but there are definitely legitimate concerns to consider.  I think, it would really be good to consider this change to make things more compatible (i.e., at least between clang/llvm and gcc which can be both used within the same ecosystem). There are real practical benefits of having true lock-free double-width operations when implementing algorithms that rely on ABA tagging for pointers, and C11 at last gives an opportunity to do that without resorting to assembly or platform-specific implementations.

> Note that there's more issues to that than just behavior on readonly memory:
> you need to ensure that the whole program, including all static and shared
> libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level
> support to ensure that), or you'd need to be sure that it's safe to mix code
> compiled with different -mcx16 settings because it never happens to interop
> on wide atomic objects.

Well, if libatomic is already doing it when the corresponding CPU feature is available (i.e., effectively implementing operations using cmpxchg16b), I do not see any problem here. -mcx16 implies that you *have* cmpxchg16b, therefore other code compiled without the -mcx16 flag will go to libatomic. Inside libatomic, it will detect that cmpxchg16b *is* available, thus making code compiled with and without the -mcx16 flag completely compatible on a given system. Or am I missing something here?

If you do not have cmpxchg16b, but the program is compiled with the flag, it will simply not run (as expected).

So, in other words, libatomic should still decide whether you have cmpxchg16b or not for cases when -mcx16 is not specified. But if it is specified, cmpxchg16b can be generated unconditionally. If you want better compatibility, you will not specify the flag. A mix of -mcx16 and -mno-cx16 code will thus be binary compatible.

> Note that there's no "load" function in the __sync family, so the original
> concern about operations on readonly memory does not apply.

Yes, but per clarification from WG14/C11, read-only memory should not be a concern at all, as this behavior is not specified anyway (regardless of the const specifier). Read-modify-write is allowed for atomic_load as long as there is no 'visible' change on the value being loaded. In this sense, the bug that was filed previously regarding read-only memory accesses and const specifier does not seem to be valid.
Additionally, it is really odd and counterintuitive to still provide support for (almost) deprecated macros while not giving such an opportunity for newer and more advanced functions.

> You don't mention it directly, so just to make it clear for readers: on systems
> where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do
> exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to
> lock-free RMW implementations if so.  (I don't like this solution)

Yes, but libatomic makes things slower due to indirection. Also, it is much harder to track what is going on, as there is no guarantee of lock-freedom in this case. BTW -- The fact that it currently uses cmpxchg16b if available may actually be helpful to switch to the suggested behavior without breaking binary compatibility (if I understand everything correctly).

-- Ruslan


* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  7:24     ` Fw: " Ruslan Nikolaev via gcc
@ 2018-02-26  8:20       ` Alexander Monakov
  2018-02-26  8:43         ` Ruslan Nikolaev via gcc
  2018-02-26 19:07       ` Torvald Riegel
  1 sibling, 1 reply; 38+ messages in thread
From: Alexander Monakov @ 2018-02-26  8:20 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: gcc

On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote:
> Well, if libatomic is already doing it when corresponding CPU feature is
> available (i.e., effectively implementing operations using cmpxchg16b), I do
> not see any problem here. mcx16 implies that you *have* cmpxchg16b, therefore
> other code compiled without -mcx16 flag will go to libatomic. Inside
> libatomic, it will detect that cmpxchg16b *is* available, thus making code
> compiled with and without -mcx16 flag completely compatible on a given system.
> Or do I miss something here?

I'd say the main issue is that libatomic is not guaranteed to work like that.
Today it relies on IFUNC for redirection, so you may (and not "will") get the
desired behavior on Glibc (implying Linux), not on other OSes, and neither on
Linux with non-GNU libc (nor on bare metal, for that matter).

Alexander


* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  8:20       ` Alexander Monakov
@ 2018-02-26  8:43         ` Ruslan Nikolaev via gcc
  0 siblings, 0 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26  8:43 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc

> I'd say the main issue is that libatomic is not guaranteed to work like that.
> Today it relies on IFUNC for redirection, so you may (and not "will") get the
> desired behavior on Glibc (implying Linux), not on other OSes, and neither on
> Linux with non-GNU libc (nor on bare metal, for that matter).

I think that if IFUNC is not available (i.e., outside glibc), redirection is still possible by introducing a regular function pointer there. Yes, it is an extra cost, but it is better than nothing (and gives consistent behavior on all platforms), and it probably will not add too much anyway because there is already a performance hit from going to libatomic.
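
[Editorial illustration, not part of the original message.] A rough sketch of the function-pointer redirection suggested above, assuming GCC or clang on a 64-bit target (for `unsigned __int128`). It is not libatomic's actual code, and every name in it is hypothetical. The native back end is only compiled when the compiler defines `__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16` (e.g. with -mcx16); a real library would more likely build that back end in a separate translation unit so the rest stays runnable on CPUs without the instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>
#if defined(__x86_64__)
#include <cpuid.h>
#endif

typedef unsigned __int128 u128;
typedef bool (*cas16_fn)(u128 *obj, u128 *expected, u128 desired);

/* CPUID leaf 1, ECX bit 13 reports CMPXCHG16B on x86-64.
   Other targets are treated as "unavailable" in this sketch. */
bool cpu_has_cx16(void)
{
#if defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 13)) != 0;
#else
    return false;
#endif
}

/* Fallback back end: a single global spinlock guards every object. */
static atomic_flag fallback_lock = ATOMIC_FLAG_INIT;

static bool cas16_locked(u128 *obj, u128 *expected, u128 desired)
{
    while (atomic_flag_test_and_set(&fallback_lock)) { /* spin */ }
    bool ok = (*obj == *expected);
    if (ok)
        *obj = desired;
    else
        *expected = *obj;
    atomic_flag_clear(&fallback_lock);
    return ok;
}

#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
/* Native back end: the compiler expands the builtin inline. */
static bool cas16_native(u128 *obj, u128 *expected, u128 desired)
{
    u128 old = __sync_val_compare_and_swap(obj, *expected, desired);
    bool ok = (old == *expected);
    *expected = old;
    return ok;
}
#endif

static bool cas16_resolve(u128 *obj, u128 *expected, u128 desired);

/* Starts at the resolver, which replaces itself on the first call. */
static _Atomic(cas16_fn) cas16_impl = cas16_resolve;

static bool cas16_resolve(u128 *obj, u128 *expected, u128 desired)
{
    cas16_fn chosen = cas16_locked;
#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
    if (cpu_has_cx16())
        chosen = cas16_native;
#endif
    atomic_store_explicit(&cas16_impl, chosen, memory_order_relaxed);
    return chosen(obj, expected, desired);
}

/* Public entry point: one indirect call per operation. */
bool cas16(u128 *obj, u128 *expected, u128 desired)
{
    cas16_fn fn = atomic_load_explicit(&cas16_impl, memory_order_relaxed);
    return fn(obj, expected, desired);
}
```

The indirect call is the extra cost mentioned above. Note that the sketch still has the mixing hazard discussed earlier in the thread: the locked and native back ends must never operate on the same object within one process.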

   


* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  7:24     ` Fw: " Ruslan Nikolaev via gcc
  2018-02-26  8:20       ` Alexander Monakov
@ 2018-02-26 19:07       ` Torvald Riegel
  2018-02-26 19:43         ` Ruslan Nikolaev via gcc
  1 sibling, 1 reply; 38+ messages in thread
From: Torvald Riegel @ 2018-02-26 19:07 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, GCC Patches

On Mon, 2018-02-26 at 07:24 +0000, Ruslan Nikolaev via gcc wrote:
> Alexander,
> Thank you for your comments. Please see my response below. I definitely do not want to fight for or against this change in gcc, but there are definitely legitimate concerns to consider.  I think, it would really be good to consider this change to make things more compatible (i.e., at least between clang/llvm and gcc which can be both used within the same ecosystem). There are real practical benefits of having true lock-free double-width operations when implementing algorithms that rely on ABA tagging for pointers, and C11 at last gives an opportunity to do that without resorting to assembly or platform-specific implementations.

I agree a wide CAS can be useful, but that has to be weighed against all
other use cases and how they'd be affected by the change you propose.
Not getting the performance usually associated with atomic loads can be
a big problem for code that tries to be portable.

> > Note that there's more issues to that than just behavior on readonly memory:
> > you need to ensure that the whole program, including all static and shared
> > libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level
> > support to ensure that), or you'd need to be sure that it's safe to mix code
> > compiled with different -mcx16 settings because it never happens to interop
> > on wide atomic objects.
> 
> Well, if libatomic is already doing it when corresponding CPU feature is available (i.e., effectively implementing operations using cmpxchg16b), I do not see any problem here. mcx16 implies that you *have* cmpxchg16b, therefore other code compiled without -mcx16 flag will go to libatomic. Inside libatomic, it will detect that cmpxchg16b *is* available, thus making code compiled with and without -mcx16 flag completely compatible on a given system. Or do I miss something here?

I think I now remember why we "didn't fix" libatomic: There might be
compiled code out there that does use the wide CAS, so changing
libatomic from the status quo to using its internal locks could break
programs.  In contrast, only redirecting to libatomic and not promising
lock-free anymore doesn't break these programs, but it gives us the
opportunity to fix this in the future; because we don't advertise
those operations as lock-free anymore, we also make new programs aware
that they won't get the default set of native atomic operations, and
thus prevent new programs from running into this problem.

> If you do not have cmpxchg16b, but the program is compiled with the flag, it will simply not run (as expected).
>  
> So, in other words, libatomic should still decide whether you have cmpxchg16b or not for cases when -mcx16 is not specified. But if it is specified, cmpxchg16b can be generated unconditionally. If you want better compatibility, you will not specify the flag. Mix of -mcx16 and mno-cx16 will be, thus, binary compatible.
> 
> > Note that there's no "load" function in the __sync family, so the original
> > concern about operations on readonly memory does not apply.
> Yes, but per clarification from WG14/C11, read-only memory should not be a concern at all,

No, they only said that it doesn't need to be a concern for the
standard.  Implementations have to pay attention to more things, so it
is a concern for implementation.

> as this behavior is not specified anyway (regardless of the const specifier). Read-modify-write is allowed for atomic_load as long as there is no 'visible' change on the value being loaded.

It's not "visible" in the abstract machine under some setting of the
as-if rule.  But it is definitely visible in an implementation in which
the effects of read-only memory are visible (see my example of mapping
memory from another process read-only so as to read data from that
process).

> In this sense, the bug that was filed previously regarding read-only memory accesses and const specifier does not seem to be valid.
> Additionally, it is really odd and counterintuitive to still provide support for (almost) deprecated macros while not giving such an opportunity for newer and more advanced functions.
> 
> > You don't mention it directly, so just to make it clear for readers: on systems
> > where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do
> > exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to
> > lock-free RMW implementations if so.  (I don't like this solution)
> 
> Yes, but libatomic makes things slower due to indirection. Also, it is much harder to track what is going on, as there is no guarantee of lock-freedom in this case. BTW -- The fact that it currently uses cmpxchg16b if available may actually be helpful to switch to the suggested behavior without breaking binary compatibility (if I understand everything correctly).

It's rather done that way to switch away from the previous behavior but
in a manner that's less likely to break existing programs.



* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 19:07       ` Torvald Riegel
@ 2018-02-26 19:43         ` Ruslan Nikolaev via gcc
  2018-02-26 22:49           ` Ruslan Nikolaev via gcc
  2018-02-27 10:40           ` Fw: " Torvald Riegel
  0 siblings, 2 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 19:43 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Alexander Monakov, GCC Patches

Torvald, I definitely do not want to insist on this design choice, but it makes sense to at least seriously consider it given the concerns I described, especially because IFUNC in libatomic already redirects to cmpxchg16b, so it just adds extra cost and indirection. Quite frankly, I do not even see any serious problem here with respect to binary compatibility. Even if cmpxchg16b was not used on some platforms outside Linux, old binaries will go to libatomic, which can now be updated to simply use cmpxchg16b. (Even statically linked binaries should not be an issue: they will not have any direct interaction with newer binaries.)

> Not getting the performance usually associated with atomic loads can be
> a big problem for code that tries to be portable.

I do not think it is a common use case anyway. How often is atomic_load used on double-width objects? If a programmer needs some guarantees and does not care about lock-freedom, why not use a regular lock here? This way nothing magical happens. Otherwise, they may hit unexpected issues in places like signal handlers (which are hard to debug since the program will hang only once in a while). With cmpxchg16b, it is at least more or less reproducible: if you try to use it on read-only memory, you will immediately get a segfault.

> I think I now remember why we "didn't fix" libatomic: There might be
> compiled code out there that does use the wide CAS, so changing
> libatomic from the status quo to using its internal locks could break
> programs.

Well, this already happens on Linux with glibc, and nothing breaks there. For other platforms, it would be good to implement the same, so that consistent behavior is observed everywhere.

> No, they only said that it doesn't need to be a concern for the
> standard.  Implementations have to pay attention to more things, so it
> is a concern for implementation.

Yes, but the only problem I see is that const objects are currently placed in .rodata. It is easy to resolve: just do not place _Atomic objects larger than 8 bytes there. Then also clarify that a programmer cannot safely cast some arbitrary object that may be placed in .rodata for use with atomic_load.
This needs to be addressed anyway, as there is already a segfault for the provided example on x86-64 Linux even with redirection to libatomic.

> It's not "visible" in the abstract machine under some setting of the
> as-if rule.  But it is definitely visible in an implementation in which
> the effects of read-only memory are visible (see my example of mapping
> memory from another process read-only so as to read data from that
> process).

True, but it is not defined for read-only memory anyway, and no assumptions can be made in portable code.

-- Ruslan


* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 19:43         ` Ruslan Nikolaev via gcc
@ 2018-02-26 22:49           ` Ruslan Nikolaev via gcc
  2018-02-27  3:33             ` Ruslan Nikolaev via gcc
                               ` (2 more replies)
  2018-02-27 10:40           ` Fw: " Torvald Riegel
  1 sibling, 3 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 22:49 UTC (permalink / raw)
  To: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc
  Cc: GCC Patches

Thanks, everyone, for the input, it is very useful. I am just proposing to consider the change unless there are clear roadblocks. (Either design choice is probably OK with respect to the standard, formally speaking, but the change has some clear advantages.) I wrote a summary of pros & cons (which, of course, is slightly biased towards the change :) )
I also opened Bug 84563 with the rationale.

Pros of the proposed approach:

1. Ability to use guaranteed lock-free double-width atomics (when -mcx16 is specified for x86-64, and always for arm64) in a more or less portable manner across different supported architectures (without resorting to non-standard extensions or writing separate assembly code for each architecture). Hopefully, the behavior may also be made more or less consistent across different compilers over time. It is already the case for clang/llvm. As mentioned, double-width lock-free atomics have real practical use (ABA tags for pointers).

2. More likely to find a bug immediately if a programmer tries to do something that is not guaranteed by the standard (i.e., getting a segfault on read-only memory when using double-width atomic_load). This is true even if -mcx16 is not used, as most CPUs have cmpxchg16b, and libatomic will use it. On the other hand, atomic_load implemented through locks may have hard-to-find and hard-to-debug issues in signal handlers, interrupt contexts, etc. when a programmer erroneously assumes that atomic_load is non-blocking.

3. For arm64, the corresponding instructions are always available; there is no need for the -mcx16 flag or redirection to libatomic at all (libatomic may still keep the old implementation for backward compatibility).

4. Faster and easier-to-analyze code when -mcx16 is specified.

5. Ability to tell for sure if the implementation is lock-free by checking the corresponding C11 flag when -mcx16 is specified. When unspecified, the flag will be false to accommodate the worst-case scenario.

6. Consistent behavior everywhere on all platforms regardless of IFUNC, the -mcx16 flag, etc. If cmpxchg16b is available, it is always used (platforms that do not support IFUNC will use function pointers for redirection). The only thing the -mcx16 flag changes is removing the indirection to libatomic and giving a guaranteed lock_free flag for the corresponding types. (BTW, in practice, if you use the flag, you should know what you are doing already.)

7. Ability to finally deprecate the old __sync builtins, and use the new and more advanced __atomic builtins everywhere.

Cons of the proposed approach:

1. The compiler may place const atomic objects in .rodata. (Avoided by making sure _Atomic objects with size > 8 are not placed in .rodata, and by clarifying that casting arbitrary .rodata objects for use with double-width atomics is undefined and not allowed.)

2. Backward compatibility concerns if used outside glibc/IFUNC. Most likely, even in this case, not an issue since all calls there are already redirected to libatomic anyway, and statically linked binaries will not interact with new binaries directly.

3. Read-only memory for atomic_load will not be supported for double-width types. But it is actually better than sweeping the problem under the carpet (current behavior is actually even worse because it is inconsistent across different platforms, i.e. different for x86-64 Linux and arm64). Anyway, it is better to use a lock-based approach explicitly if for whatever reason it is preferable (read-only memory, performance (?), etc.).
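
[Editorial illustration, not part of the original message.] Pro 5 above is about being able to ask whether a double-width type is guaranteed lock-free. One way to ask that at compile time with GCC or clang is the `__atomic_always_lock_free` builtin, sketched below; the standard C11 query, `atomic_is_lock_free`, may instead be answered at run time through libatomic. The struct and function names are hypothetical, and what the first function returns depends on the compiler, its version, and flags such as -mcx16, so no particular result is claimed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* A pointer plus an ABA tag: 16 bytes on typical 64-bit targets. */
typedef struct {
    void     *ptr;
    uintptr_t tag;
} tagged_ptr_t;

/* True only if the compiler guarantees inline lock-free atomics for
   every suitably aligned object of this size (second argument 0 means
   "assume typical alignment"). */
bool tagged_ptr_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(tagged_ptr_t), 0);
}

/* Same query for a single pointer, for comparison. */
bool pointer_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(void *), 0);
}
```

A program that depends on lock-freedom (for example in a signal handler) could check tagged_ptr_always_lock_free() in a start-up assertion and refuse to run when it is false.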
-- Ruslan


* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 22:49           ` Ruslan Nikolaev via gcc
@ 2018-02-27  3:33             ` Ruslan Nikolaev via gcc
  2018-02-27 10:34             ` Ramana Radhakrishnan
  2018-02-27 12:39             ` Torvald Riegel
  2 siblings, 0 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-27  3:33 UTC (permalink / raw)
  To: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc
  Cc: GCC Patches


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 22:49           ` Ruslan Nikolaev via gcc
  2018-02-27  3:33             ` Ruslan Nikolaev via gcc
@ 2018-02-27 10:34             ` Ramana Radhakrishnan
  2018-02-27 11:14               ` Torvald Riegel
  2018-02-27 12:39             ` Torvald Riegel
  2 siblings, 1 reply; 38+ messages in thread
From: Ramana Radhakrishnan @ 2018-02-27 10:34 UTC (permalink / raw)
  To: Ruslan Nikolaev
  Cc: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

On Mon, Feb 26, 2018 at 10:45 PM, Ruslan Nikolaev via gcc
<gcc@gcc.gnu.org> wrote:
> Thanks, everyone, for the input; it is very useful. I am just proposing to consider the change unless there are clear roadblocks. (Either design choice is probably OK with respect to the standard formally speaking, but there are some clear advantages also.) I wrote a summary of pros & cons (which, of course, is slightly biased towards the change :) )
> I also opened Bug 84563 with the rationale.
>
>
> Pros of the proposed approach:
> 1. Ability to use guaranteed lock-free double-width atomics (when mcx16 is specified for x86-64, and always for arm64) in more or less portable manner across different supported architectures (without resorting to non-standard extensions or writing separate assembly code for each architecture). Hopefully, the behavior may also be made more or less consistent across different compilers over time. It is already the case for clang/llvm. As mentioned, double-width lock-free atomics have real practical use (ABA tags for pointers).
>
> 2. More likely to find a bug immediately if a programmer tries to do something that is not guaranteed by the standard (e.g., getting a segfault on read-only memory when using a double-width atomic_load). This is true even if mcx16 is not used, as most CPUs have cmpxchg16b, and libatomic will use it. On the other hand, atomic_load implemented through locks may have hard-to-find and hard-to-debug issues in signal handlers, interrupt contexts, etc. when a programmer erroneously assumes that atomic_load is non-blocking.
>
> 3. For arm64 the corresponding instructions are always available, no need for mcx16 flag or redirection to libatomic at all (libatomic may still keep old implementation for backward compatibility).

That is going to create an ABI break on AArch64. Think about binaries
produced by old releases of GCC that use locks in libatomic and those
produced by new GCC. The way to fix this in AArch64 if there is a
guarantee from the standard that there are no problems with read-only
locations is to implement the change in libatomic. You cannot have the
same region of memory protected by locks in older binaries and the
appropriate load / store instructions in new binaries.

Ramana


> 4. Faster & easy to analyze code when mcx16 is specified.
>
> 5. Ability to tell for sure if the implementation is lock-free by checking the corresponding C11 flag when mcx16 is specified. When unspecified, the flag will be false to accommodate the worst-case scenario.
>
> 6. Consistent behavior on all platforms regardless of IFUNC, the mcx16 flag, etc. If cmpxchg16b is available, it is always used (platforms that do not support IFUNC will use function pointers for redirection). The only thing the mcx16 flag changes is removing the indirection through libatomic and giving a guaranteed lock_free flag for the corresponding types. (BTW, in practice, if you use the flag, you should already know what you are doing.)
>
> 7. Ability to finally deprecate old __sync builtins, and use new and more advanced __atomic everywhere.
>
>
> Cons of the proposed approach:
>
> 1. Compiler may place const atomic objects in .rodata. (Avoided by making sure _Atomic objects with size > 8 are not placed in .rodata, and by clarifying that casting arbitrary .rodata objects to double-width atomics is undefined and not allowed.)
>
> 2. Backward compatibility concerns if used outside glibc/IFUNC. Most likely, even in this case, not an issue since all calls there are already redirected to libatomic anyway, and statically-linked binaries will not interact with new binaries directly.
> 3. Read-only memory for atomic_load will not be supported for double-width types. But that is actually better than sweeping the problem under the carpet (current behavior is actually even worse because it is inconsistent across platforms, i.e. different for x86-64 on Linux and arm64). Anyway, it is better to use a lock-based approach explicitly if for whatever reason it is preferable (read-only memory, performance (?), etc).
> -- Ruslan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 10:34             ` Ramana Radhakrishnan
@ 2018-02-27 11:14               ` Torvald Riegel
  0 siblings, 0 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 11:14 UTC (permalink / raw)
  To: Ramana Radhakrishnan
  Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer,
	Szabolcs Nagy, GCC Patches

On Tue, 2018-02-27 at 10:22 +0000, Ramana Radhakrishnan wrote:
> The way to fix this in AArch64 if there is a
> guarantee from the standard that there are no  problems with read-only
> locations is to implement the change in libatomic.

Even though the standard doesn't specify read-only memory, I think that
consensus in ISO C++ SG1 (ie, the concurrency study group) exists that
it makes sense for implementations to not declare something lock-free if
the hardware doesn't provide a true atomic load for the particular
size/alignment.  It is an implementation-level decision though (given
that the details of the as-if rule depend on what's doable on the
particular implementation), and I do not see a reason to change GCC's
stance on this.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 22:49           ` Ruslan Nikolaev via gcc
  2018-02-27  3:33             ` Ruslan Nikolaev via gcc
  2018-02-27 10:34             ` Ramana Radhakrishnan
@ 2018-02-27 12:39             ` Torvald Riegel
  2018-02-27 13:04               ` Ruslan Nikolaev via gcc
  2 siblings, 1 reply; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 12:39 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

On Mon, 2018-02-26 at 22:45 +0000, Ruslan Nikolaev via gcc wrote:
> Thanks, everyone, for the input; it is very useful. I am just proposing to consider the change unless there are clear roadblocks. (Either design choice is probably OK with respect to the standard formally speaking, but there are some clear advantages also.) I wrote a summary of pros & cons (which, of course, is slightly biased towards the change :) )
> I also opened Bug 84563 with the rationale.

This bug summarizes your perspective on the matter.  I'd call that not
just slightly biased :)

I do not see a reason to change GCC's position regarding this topic.  We
should update the docs though to clarify the intent and guarantees GCC's
implementation gives, I suppose.

The reasons have been discussed elsewhere in this thread already, so I'm
not going to repeat them here.

> 1. Ability to use guaranteed lock-free double-width atomics (when mcx16 is specified for x86-64, and always for arm64) in more or less portable manner across different supported architectures (without resorting to non-standard extensions or writing separate assembly code for each architecture).

That's a valid goal, but it does not imply that we should mess with how
atomics are implemented by default, nor should we mess with the default
use cases.  This goal wants something special, and that is exposing the
fact that *only* a CAS is available to synchronize atomically on a
particular type.  That is an extension of the existing atomics design.

There are different ways to expose such an extension, with one being to
simply provide a __atomic_special_cas builtin or something like that.
It would have the same synchronization semantics as the normal CAS, but
concurrent access between the special CAS and the normal atomics would
be considered a data race (so making sure that there's no guaranteed
atomicity between them).  It could have a fallback to libatomic for ease
of use, and it could be defined for smaller types too.  This would be
portable, and would allow us to separate the different use cases.
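
A rough sketch of what such a CAS-only type might look like when modelled with ordinary C11 atomics. Everything here is an assumption for illustration: cas_only_u64, cas_only_cas and cas_only_load are hypothetical names, no __atomic_special_cas builtin exists, and a 64-bit payload is used only so the example builds without libatomic. The point is that a "load" built from CAS alone is a read-modify-write, so it needs write access to the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CAS-only type; not an existing GCC interface. */
typedef struct { _Atomic uint64_t v; } cas_only_u64;

/* The only primitive: compare-and-swap. On failure, *expected is
 * updated to the value currently stored in the object. */
static bool cas_only_cas(cas_only_u64 *obj, uint64_t *expected, uint64_t desired)
{
    assert(obj != NULL && expected != NULL);
    return atomic_compare_exchange_strong(&obj->v, expected, desired);
}

/* A "load" expressed through CAS: try to replace 0 with 0. If the
 * object holds 0 the CAS succeeds and changes nothing; otherwise it
 * fails and reports the current value. Either way this is an RMW, so
 * it cannot be used on memory mapped read-only. */
static uint64_t cas_only_load(cas_only_u64 *obj)
{
    uint64_t seen = 0;
    cas_only_cas(obj, &seen, seen);
    return seen;
}
```

Note that cas_only_load takes a non-const pointer, which reflects the restriction discussed in this thread.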

> 7. Ability to finally deprecate old __sync builtins, and use new and more advanced __atomic everywhere.

The topic we're currently discussing does not significantly affect when
we can remove __sync builtins, IMO.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 12:39             ` Torvald Riegel
@ 2018-02-27 13:04               ` Ruslan Nikolaev via gcc
  2018-02-27 13:08                 ` Szabolcs Nagy
                                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-27 13:04 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

Formally speaking, either implementation satisfies C11 because the standard allows much leeway in the interpretation here. But, of course, it is kind of annoying that double-width types need special handling and compiler extensions. (That potentially includes 64-bit types on some 32-bit processors; e.g., i586 also has cmpxchg8b and no official way to read atomically otherwise.) It basically means that in a number of cases I cannot write portable code: I need to put in a bunch of architecture-dependent ifdefs, even for, say, 64-bit atomics. (And all this mess is to accommodate the almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, for which no good solution exists anyway.)

In particular, imagine that someone writes lock-free code for different types (in templates, macros, etc). It basically uses the same C11 atomic primitives but for various integer sizes. Now I need special handling for larger types, which I would otherwise not need, because whatever libatomic provides does not guarantee lock-freedom (i.e., it is useless for this purpose). True, wider types may not be available across all architectures, but I would prefer to have generic and standard-conformant code at least for those that have them.

> That's a valid goal, but it does not imply that we should mess with how
> atomics are implemented by default, nor should we mess with the default
> use cases.  This goal wants something special, and that is exposing the
> fact that *only* a CAS is available to synchronize atomically on a
>particular type.  That is an extension of the existing atomics design.

See above

> The standard doesn't specify read-only memory, so it also doesn't forbid
> the concept.  The implementation takes it into account though, and thus
> it's defined in that context.
But my point is that a programmer cannot rely on this feature anyway unless she/he wants to write code which compiles only with gcc. It is unspecified by the standard, and implementations that use read-modify-write for atomic_load are perfectly valid. The whole point of having this standard in the first place is to allow code to be compiled by different compilers; otherwise people can just rely on gcc-specific extensions.

> The topic we're currently discussing does not significantly affect when
> we can remove __sync builtins, IMO.

They are the only builtins that directly expose double-width operations. Short of using assembly fall-backs, they are the only option right now.

> They do care about whether atomic operations are natively supported on
> that particular type -- and that should include a load.
I think the whole point of having atomic operations is the ability to provide lock-free operations whenever possible. Even though the standard does not guarantee it, that is almost the only sane use case. Otherwise, there is no point: you can always use locks. If they do not care about lock-freedom, they should just use locks.

> Nobody is proposing to mark things as lock-free if they aren't.  Thus, I
> don't see any change to what's usable in signal handlers.
It is not obvious to anyone that atomic_load will block. It will *not* for single-width types. So, again we see differences for single- and double-width types. Even though you do not have problems with read-only memory, you have another problem for double-width types which may be even more subtle and much harder to debug in a number of cases. Of course, no one can make an assumption that it will not block, but the same can be said about read-only memory.
Anyway, I do not have a horse in the race... I just proposed to consider this change for a number of legitimate use cases, but it is eventually up to the gcc developers to decide.
-- Ruslan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 13:04               ` Ruslan Nikolaev via gcc
@ 2018-02-27 13:08                 ` Szabolcs Nagy
  2018-02-27 13:17                   ` Ruslan Nikolaev via gcc
  2018-02-27 16:21                   ` Torvald Riegel
  2018-02-27 16:16                 ` Torvald Riegel
  2018-02-27 16:46                 ` Simon Wright
  2 siblings, 2 replies; 38+ messages in thread
From: Szabolcs Nagy @ 2018-02-27 13:08 UTC (permalink / raw)
  To: Ruslan Nikolaev, Torvald Riegel
  Cc: nd, Alexander Monakov, Florian Weimer, gcc

On 27/02/18 12:56, Ruslan Nikolaev wrote:
> Formally speaking, either implementation satisfies C11 because the standard allows much leeway in the interpretation here.

no,

1) your proposal would make gcc non-conforming to iso c unless it changes how static const objects are emitted.
2) the two implementations are not abi compatible, the choice is already made, changing it is an abi break.
3) Torvald pointed out further considerations such as users expecting lock-free atomic loads to be faster than stores.

the solution is to add a language extension, but that requires careful design.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 13:08                 ` Szabolcs Nagy
@ 2018-02-27 13:17                   ` Ruslan Nikolaev via gcc
  2018-02-27 16:40                     ` Torvald Riegel
  2018-02-27 16:21                   ` Torvald Riegel
  1 sibling, 1 reply; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-27 13:17 UTC (permalink / raw)
  To: Szabolcs Nagy, Torvald Riegel; +Cc: nd, Alexander Monakov, Florian Weimer, gcc




> 1) your proposal would make gcc non-conforming to iso c unless it changes how static const objects are emitted.
I do not think ISO C requires const objects to be placed in .rodata. And it is easily solved by not placing _Atomic objects there if they cannot be safely loaded from read-only memory.

> 2) the two implementations are not abi compatible, the choice is already made, changing it is an abi break.
Since the current implementation redirects to libatomic anyway, almost nothing should break. The only case where it will break is if somebody erroneously used atomic_load for a 128-bit type on read-only memory (which is, again, not guaranteed by the standard). In practice, this case is almost non-existent. The worst that may happen is that you will get a segfault right away.

> 3) Torvald pointed out further considerations such as users expecting lock-free atomic loads to be faster than stores.

Is it even true? Is it faster to use some global lock (implemented through RMW) than a single RMW operation? If you use this global lock, you will not get loads faster than stores.

   

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 13:17                   ` Ruslan Nikolaev via gcc
@ 2018-02-27 16:40                     ` Torvald Riegel
  2018-02-27 17:07                       ` Ruslan Nikolaev via gcc
  0 siblings, 1 reply; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 16:40 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Szabolcs Nagy, nd, Alexander Monakov, Florian Weimer, gcc

On Tue, 2018-02-27 at 13:16 +0000, Ruslan Nikolaev via gcc wrote:
> > 3) Torvald pointed out further considerations such as users expecting lock-free atomic loads to be faster than stores.
> 
> Is it even true? Is it faster to use some global lock (implemented through RMW) than a single RMW operation? If you use this global lock, you will not get loads faster than stores.

If GCC declares a type as lock-free, atomic loads on this type will be
natively supported through some sort of load instruction.  That means
they are faster than stores under concurrent accesses, in particular
when there are concurrent atomic loads (for all major HW we care about).

If there is no natively supported atomic load, GCC will not declare the
type to be lock-free.

Nobody made a statement about performance of locks vs. RMWs.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 16:40                     ` Torvald Riegel
@ 2018-02-27 17:07                       ` Ruslan Nikolaev via gcc
  0 siblings, 0 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-27 17:07 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Szabolcs Nagy, nd, Alexander Monakov, Florian Weimer, gcc

Torvald, thank you for your input, but I think this discussion is getting a little pointless. There is nothing else I can add since the gcc folks are reluctant to make this change anyway. In my opinion, there is no compelling reason against such an implementation (it is perfectly fine with the standard; read-only memory is not guaranteed for atomic_load anyway). Even the binary compatibility issue that was mentioned is unlikely to be a problem if implemented as I described. And finally, this is something that can actually be useful in practice (at least as far as I can judge from my experience). By the way, this issue was already raised multiple times during the last couple of years by different people who actually use it for various real projects (the bugs were eventually closed as 'INVALID').
All described challenges are purely technical and can easily be resolved. Moreover, clang/llvm chose this implementation, and it seems very logical and non-confusing to me. It certainly makes sense to expose hardware capabilities through standard interfaces whenever possible.

For my projects, I will simply fall back to my own implementation using inline assembly (at least for now) because, unfortunately, it is the only thing that is guaranteed to work outside of clang/llvm in the foreseeable future (__sync functions have some limitations and do not look like an attractive option either, by the way).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 13:08                 ` Szabolcs Nagy
  2018-02-27 13:17                   ` Ruslan Nikolaev via gcc
@ 2018-02-27 16:21                   ` Torvald Riegel
  1 sibling, 0 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 16:21 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Ruslan Nikolaev, nd, Alexander Monakov, Florian Weimer, gcc

On Tue, 2018-02-27 at 13:04 +0000, Szabolcs Nagy wrote:
> the solution is to add a language extension

I think this only needs a library interface, at least when we're just
considering the __atomic builtins.  On the C/C++ level, it might amount
to just another atomic type, which only has a CAS however; this could be
probably modelled entirely through default atomics (not implemented
though), and so wouldn't need a language or memory model extension. 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 13:04               ` Ruslan Nikolaev via gcc
  2018-02-27 13:08                 ` Szabolcs Nagy
@ 2018-02-27 16:16                 ` Torvald Riegel
  2018-02-27 16:46                 ` Simon Wright
  2 siblings, 0 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 16:16 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

On Tue, 2018-02-27 at 12:56 +0000, Ruslan Nikolaev via gcc wrote:
> But, of course, it is kind of annoying that double-width types (and that also includes potentially 64-bit on some 32-bit processors, e.g. i586 also has cmpxchg8b and no official way to read atomically otherwise) need special handling and compiler extensions which basically means that in a number of cases I cannot write portable code, I need to put a bunch of architecture-dependent ifdefs, for say, 64 bit atomics even.

The extension I outlined gives you a portable way to use wide CAS, provided
that the particular implementation supports this extension.

Whether the standard covers this extension is a different matter.  You
can certainly propose it to the C and/or C++ committees, but that gets
easier if you can show existing practice.

> (And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway.)

You keep repeating that claim.  You also keep ignoring the point about
what kind of performance programs can expect atomic loads to have, if
they are declared lock-free.

Also note that the use case is *not* about wider-than-machine-word
accesses on read-only memory, but whether a portable use of atomics
(which doesn't have to consider machine word size) can use atomic loads
on read-only memory. 

> Particularly, imagine when someones writes some lock-free code for different types (in templates, macros, etc). It basically uses same C11 atomic primitives but for various integer sizes. Now I need special handling for larger types because whatever libatomic provides does not guarantee lock-freedom (i.e., useless) which otherwise I do not need.

If you use C11 atomics, you're bound to the usage intended by the
standard.  Which means that if you need lock freedom, you must check or
ensure that the implementation provides it to you.  If you make an
implicit assumption there beyond what the standard promises you, you are
not writing portable C11 concurrent code -- you are writing
architecture/platform-specific code.

Now, imagine someone writes atomic code using C11, and expects atomic
loads to actually perform like loads and not like RMWs -- which is what
most concurrent code does, in particular lots of nonblocking  code ...

> True that wider types may not be available across all architectures, but I would prefer to have generic and standard-conformant code at least for those that have them.

It is conforming to the standard.

> > The standard doesn't specify read-only memory, so it also doesn't forbid
> > the concept.  The implementation takes it into account though, and thus
> > it's defined in that context.
> But my point is that a programmer cannot rely on this feature anyway unless she/he wants to write code which compiles only with gcc.

She/he wants to rely on implementation-specific behavior, so that's not
a problem.

> It is unspecified by the standard and implementations that use read-modify-write for atomic_load are perfectly valid. The whole point to have this standard in the first place is to allow code be compiled by different compilers, otherwise people can just rely on gcc-specific extensions.

And they can, because GCC's behavior conforms to the standard.  It
doesn't have the implementation-specific properties you prefer, but
that's not about the standard but about your personal preferences.

> > The topic we're currently discussing does not significantly affect when
> > we can remove __sync builtins, IMO.
> 
> They are the only builtins that directly expose double-width operations. Short of using assembly fall-backs, they are the only option right now.

We can still have an extension such as the one I outlined.

> > They do care about whether atomic operations are natively supported on
> > that particular type -- and that should include a load.
> I think, the whole point to have atomic operations is ability to provide lock-free operations whenever possible. Even though standard does not guarantee it, that is almost the only sane use case. Otherwise, there is no point -- you can always use locks. If they do not care about lock-freedom, they should just use locks.

The standards actually just promise you obstruction freedom.  Forward
progress guarantees are a part of the intention behind the lock-free
class, but not all of it.  There's address-freedom too, and an implicit
assumption about what rough class of performance a particular operation
is in.

The majority of synchronization code will care much more about
performance than about the operation being actually lock-free or not
(things like signal handlers or the C++ unsequenced execution policy are the
exceptions).  Case in point: Lots of concurrent code built out of
lock-free atomics is actually not lock-free but has blocking parts -- so
it couldn't be used in things like signal handlers anyway.

> 
> > Nobody is proposing to mark things as lock-free if they aren't.  Thus, I
> > don't see any change to what's usable in signal handlers.
> It is not obvious to anyone that atomic_load will block.

What's your point?  Iff atomic_load will never block, it will be marked
lock-free.  Programs can check for this, so it will be obvious to them
whether a certain operation might block.

> It will *not* for single-width types. So, again we see differences for single- and double-width types.

So? Lock-freedom is a per-type property.
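
The per-type check can be sketched with the standard C11 facilities. This is an illustration only: signal_count, on_signal, llong_always_lock_free and signal_count_is_lock_free are hypothetical names, and the static_assert encodes this sketch's own requirement, so it deliberately refuses to compile on a target where int atomics are not always lock-free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* ATOMIC_*_LOCK_FREE is 0 (never), 1 (sometimes) or 2 (always lock-free). */
static_assert(ATOMIC_INT_LOCK_FREE == 2,
              "this sketch needs lock-free int atomics for its signal handler");

/* Hypothetical counter bumped from a signal handler. */
static _Atomic int signal_count;

/* Usable as a signal handler only because int is lock-free here. */
static void on_signal(int signo)
{
    (void)signo;
    atomic_fetch_add_explicit(&signal_count, 1, memory_order_relaxed);
}

/* Compile-time answer for a different type; it may differ from int. */
static bool llong_always_lock_free(void)
{
    return ATOMIC_LLONG_LOCK_FREE == 2;
}

/* Run-time answer for one specific object. */
static bool signal_count_is_lock_free(void)
{
    return atomic_is_lock_free(&signal_count);
}
```

Each type gets its own answer, so code that is generic over sizes has to query every type it intends to use in a context that must not block.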

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 13:04               ` Ruslan Nikolaev via gcc
  2018-02-27 13:08                 ` Szabolcs Nagy
  2018-02-27 16:16                 ` Torvald Riegel
@ 2018-02-27 16:46                 ` Simon Wright
  2018-02-27 16:52                   ` Florian Weimer
  2018-02-27 17:30                   ` Torvald Riegel
  2 siblings, 2 replies; 38+ messages in thread
From: Simon Wright @ 2018-02-27 16:46 UTC (permalink / raw)
  To: Ruslan Nikolaev
  Cc: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc

On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote:
> 
> And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway

Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 16:46                 ` Simon Wright
@ 2018-02-27 16:52                   ` Florian Weimer
  2018-02-27 17:30                   ` Torvald Riegel
  1 sibling, 0 replies; 38+ messages in thread
From: Florian Weimer @ 2018-02-27 16:52 UTC (permalink / raw)
  To: Simon Wright, Ruslan Nikolaev
  Cc: Torvald Riegel, Alexander Monakov, Szabolcs Nagy, gcc

On 02/27/2018 05:40 PM, Simon Wright wrote:
> Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)

On many systems, the read-only nature of a memory region is a 
thread-local or process-local attribute.  Other parts of the system 
might have a different view on the same memory region.

Some CPUs support memory protection keys, which provide a cheap way to 
switch memory from read-only to read-write and back, and those switching 
operations deliberately do not involve a memory barrier.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 16:46                 ` Simon Wright
  2018-02-27 16:52                   ` Florian Weimer
@ 2018-02-27 17:30                   ` Torvald Riegel
  2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
  2018-02-27 17:59                     ` Simon Wright
  1 sibling, 2 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 17:30 UTC (permalink / raw)
  To: Simon Wright
  Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer,
	Szabolcs Nagy, GCC Patches

On Tue, 2018-02-27 at 16:40 +0000, Simon Wright wrote:
> On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote:
> > 
> > And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway
> 
> Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)

Consider a producer-consumer relationship between two processes where
the producer doesn't want to wait for the consumer.  For example, the
producer could be an application that's being traced, and the consumer
is a trace aggregation tool.  The producer can provide a read-only
mapping to the consumer, and put a nonblocking ring buffer or something
similar in there.  That allows the consumer to read, but it still needs
atomic access because the producer is modifying the ring buffer
concurrently.
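
A minimal sketch of that arrangement, under several assumptions: struct shared_view, shared_view_open, producer_publish and consumer_read are hypothetical names; a single 64-bit word stands in for the ring buffer; and both mappings live in one process for brevity, whereas the real use case has them in two processes. Error handling is reduced to a return code and does not unmap on partial failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: two views of one shared page. */
struct shared_view {
    _Atomic uint64_t *rw;  /* producer side: read-write mapping */
    _Atomic uint64_t *ro;  /* consumer side: read-only mapping  */
};

/* Returns 0 on success, -1 on failure. */
static int shared_view_open(struct shared_view *v)
{
    FILE *f = tmpfile();                 /* unnamed backing file */
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) != 0) {
        fclose(f);
        return -1;
    }
    void *rw = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *ro = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    fclose(f);                           /* mappings outlive the descriptor */
    if (rw == MAP_FAILED || ro == MAP_FAILED)
        return -1;
    assert(rw != ro);                    /* same memory, two virtual addresses */
    v->rw = rw;
    v->ro = ro;
    return 0;
}

/* Producer writes through its read-write view. */
static void producer_publish(struct shared_view *v, uint64_t value)
{
    atomic_store_explicit(v->rw, value, memory_order_release);
}

/* Consumer only ever loads, through the read-only view. */
static uint64_t consumer_read(struct shared_view *v)
{
    return atomic_load_explicit(v->ro, memory_order_acquire);
}
```

This only works if the atomic load is a true load: a load implemented as a read-modify-write would fault on the PROT_READ mapping, and one implemented with a per-process lock would not synchronize with a producer in another process.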

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 17:30                   ` Torvald Riegel
@ 2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
  2018-02-27 19:32                       ` Torvald Riegel
  2018-02-27 17:59                     ` Simon Wright
  1 sibling, 1 reply; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-27 17:33 UTC (permalink / raw)
  To: Torvald Riegel, Simon Wright
  Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, GCC Patches




> Consider a producer-consumer relationship between two processes where
> the producer doesn't want to wait for the consumer.  For example, the
> producer could be an application that's being traced, and the consumer
> is a trace aggregation tool.  The producer can provide a read-only
> mapping to the consumer, and put a nonblocking ring buffer or something
> similar in there.  That allows the consumer to read, but it still needs
> atomic access because the producer is modifying the ring buffer
> concurrently.
Sorry for getting into someone else's conversation... And what good solution does gcc offer right now? It forces a lock-based (BTW: global lock!) approach on *both* producer and consumer if we are talking about 128-bit types.  Therefore, sometimes producers *will* wait (by, effectively, blocking). Basically, it becomes useless. In this case, I would rather use a lock-based approach which at least does not use a global lock. On the contrary, the alternative implementation would at least have been useful when both producers and consumers have full (RW) access.

Anyway, I already said that I personally will go with inline assembly for now. I just wanted to raise this concern since other people may find it useful in their projects.

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
@ 2018-02-27 19:32                       ` Torvald Riegel
  0 siblings, 0 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 19:32 UTC (permalink / raw)
  To: Ruslan Nikolaev
  Cc: Simon Wright, Alexander Monakov, Florian Weimer, Szabolcs Nagy,
	GCC Patches

On Tue, 2018-02-27 at 17:29 +0000, Ruslan Nikolaev wrote:
> 
> 
> > Consider a producer-consumer relationship between two processes where
> > the producer doesn't want to wait for the consumer.  For example, the
> > producer could be an application that's being traced, and the consumer
> > is a trace aggregation tool.  The producer can provide a read-only
> > mapping to the consumer, and put a nonblocking ring buffer or something
> > similar in there.  That allows the consumer to read, but it still needs
> > atomic access because the producer is modifying the ring buffer
> > concurrently.
> Sorry for jumping into someone else's conversation... But what good solution does GCC offer right now? It forces a lock-based (BTW: global lock!)

It's not one global lock, but a lock from an array of locks (global per
process, though).
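
To illustrate why a per-process lock table is not address-free, here is a minimal sketch of an address-hashed lock array. It is not libatomic's actual code: `NLOCKS`, `lock_index` and `locked_load16` are hypothetical names, and the hash is an arbitrary choice made for the example. Because `lock_table` lives in each process's own data, two processes that map the same shared page would take different locks and would not exclude each other.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NLOCKS 64  /* hypothetical table size */

/* One table per process: static storage, so every entry starts out false. */
static atomic_bool lock_table[NLOCKS];

/* Map an object address to a lock slot; dropping the low 4 bits lets all
   bytes of one aligned 16-byte object share a slot. */
static size_t lock_index(const void *addr)
{
    return (size_t)(((uintptr_t)addr >> 4) % NLOCKS);
}

/* Copy 16 bytes from src to dst while holding the slot's spinlock. */
static void locked_load16(const void *src, void *dst)
{
    atomic_bool *lock = &lock_table[lock_index(src)];

    while (atomic_exchange_explicit(lock, true, memory_order_acquire))
        ;  /* spin until the previous holder releases the slot */
    memcpy(dst, src, 16);
    atomic_store_explicit(lock, false, memory_order_release);
}
```

A writer would have to take the same slot before storing, which only works if reader and writer share the table, i.e. live in the same process.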

> approach for *both* producer and consumer if we are talking about 128-bit types.

But we're not talking about that special case of 128b types here.  The
majority of synchronization doesn't need more than machine word size.

> Therefore, sometimes producers *will* wait (by, effectively, blocking). Basically, it becomes useless.

No, such a program would have a bug anyway.  It wouldn't even
synchronize properly.  Please familiarize yourself with what the
standard means by "address-free".  This use case needs address-free, so
that's what the program has to ensure (and it can test that portably).
Only lock-free gives you address-free.

> In this case, I would rather use a lock-based approach which at least does not use a global lock.

The lock would need to be shared between processes in the example I
gave.  You have to build your own lock for that currently, because C/C++
don't give you any process-shared locks.

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-27 17:30                   ` Torvald Riegel
  2018-02-27 17:33                     ` Ruslan Nikolaev via gcc
@ 2018-02-27 17:59                     ` Simon Wright
  1 sibling, 0 replies; 38+ messages in thread
From: Simon Wright @ 2018-02-27 17:59 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer,
	Szabolcs Nagy, GCC Patches

On 27 Feb 2018, at 17:07, Torvald Riegel <triegel@redhat.com> wrote:
> 
> On Tue, 2018-02-27 at 16:40 +0000, Simon Wright wrote:
>> On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote:
>>> 
>>> And all this mess to accommodate the almost non-existent case where someone wants to use atomic_load on read-only memory for wide types, for which no good solution exists anyway
>> 
>> Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious)
> 
> Consider a producer-consumer relationship between two processes where
> the producer doesn't want to wait for the consumer.  For example, the
> producer could be an application that's being traced, and the consumer
> is a trace aggregation tool.  The producer can provide a read-only
> mapping to the consumer, and put a nonblocking ring buffer or something
> similar in there.  That allows the consumer to read, but it still needs
> atomic access because the producer is modifying the ring buffer
> concurrently.

OK, got that, thanks (this is what I meant by "just a constant view of the object", btw).

Misled by "read-only memory" since in the embedded world ROM (usually actually in Flash) is effectively read-only to all.

* Re: Fw: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 19:43         ` Ruslan Nikolaev via gcc
  2018-02-26 22:49           ` Ruslan Nikolaev via gcc
@ 2018-02-27 10:40           ` Torvald Riegel
  1 sibling, 0 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-27 10:40 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, GCC Patches

On Mon, 2018-02-26 at 19:39 +0000, Ruslan Nikolaev via gcc wrote:
> Torvald, I definitely do not want to insist on this design choice, but it makes sense to at least seriously consider it given the concerns I described. And especially because IFUNC in libatomic already redirects to cmpxchg16b,

That's because we want to keep the old (but wrong) behavior unchanged,
at least for existing code, until there's a time when we switch it.
Given that we also don't declare those types lock-free anymore, new code
will know that it's not safe to use them for cases such as inter-process
communication (because if not lock-free, it's also not address-free
anymore).

> > Not getting the performance usually associated with atomic loads can be
> > a big problem for code that tries to be portable.
> 
> I do not think it is a common use case anyway. How often is atomic_load used on double-width types?

A portable program doesn't have to think about things like double-width,
or whether the platform is something like x32 vs. x86-64.  What a
portable program cares about is whether atomic ops are lock-free on a
particular 64b integer type or not.  If they are, you want to use them
to synchronize (e.g., counters), and then it can matter a lot whether a
load is actually a load or just creates lots of contention.  If they
aren't available, the program knows that it has to find a different way
to synchronize (e.g., build the 64b counter out of 32b operations). 
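
As a rough illustration of that check, the following sketch uses only the standard C11 interface. The counter `event_count` and the helper names are hypothetical; a real program would pick its fallback strategy where the comment indicates.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical event counter shared between threads. */
static _Atomic unsigned long long event_count;

/* ATOMIC_LLONG_LOCK_FREE is 2 if (unsigned) long long atomics are always
   lock-free, 1 if sometimes, and 0 if never. */
static bool counter_always_lock_free(void)
{
    return ATOMIC_LLONG_LOCK_FREE == 2;
}

/* Run-time answer for this particular object. */
static bool counter_is_lock_free(void)
{
    return atomic_is_lock_free(&event_count);
}

/* Increment and return the new value; callers would take this path only
   after one of the checks above succeeded, and fall back to another
   scheme otherwise. */
static unsigned long long count_event(void)
{
    return atomic_fetch_add_explicit(&event_count, 1, memory_order_relaxed) + 1;
}
```

The point of the sketch is that the program never asks what "double-width" means on the platform; it only asks whether the type it wants is lock-free.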

> If a programmer needs some guarantees and does not care about lock-freedom, why not use a regular lock here?

They do care about whether atomic operations are natively supported on
that particular type -- and that should include a load.

> This way nothing magical happens. Otherwise, they may hit unexpected issues in places like signal handlers (which are hard to debug since the program will hang only once in a while). With cmpxchg16b, it is at least more or less reproducible: if you try to use it on read-only memory, you will immediately get a segfault.

Nobody is proposing to mark things as lock-free if they aren't.  Thus, I
don't see any change to what's usable in signal handlers.

> > I think I now remember why we "didn't fix" libatomic: There might be
> > compiled code out there that does use the wide CAS, so changing
> > libatomic from the status quo to using its internal locks could break
> > programs.
> Well, it already happens on Linux with glibc. Nothing will break there. For other architectures, it would be good to implement the same, so that consistent behavior is observed everywhere.

It's not about consistency across archs, but consistency for existing
code.  New code or new implementations should just do the right thing,
which is requiring a natively supported atomic load of the particular
size/alignment.

> 
> > No, they only said that it doesn't need to be a concern for the
> > standard.  Implementations have to pay attention to more things, so it
> > is a concern for implementation.
> Yes, but the only problem I see is that it is currently placed to .rodata when const is used.

I and others are of a different opinion:  Load performance matters,
inter-process communication on read-only memory matters, and it's useful
to have the builtins work on not just _Atomic types but general integer
types with proper alignment (e.g., look at how glibc uses the builtins
in a code base that is not C11 or more recent).

> It is easy to resolve: just do not place it there for _Atomic objects > 8 bytes. Then also clarify that a programmer cannot safely cast some arbitrary object that can be placed in .rodata to use with atomic_load.

That doesn't help with the use cases I listed previously.

> It needs to be addressed anyway, as there is already a segfault for the provided example on x86-64 Linux even with redirection to libatomic.
> 
> > It's not "visible" in the abstract machine under some setting of the
> > as-if rule.  But it is definitely visible in an implementation in which
> > the effects of read-only memory are visible (see my example of mapping
> > memory from another process read-only so as to read data from that
> > process).
> True but it is not defined for read-only memory anyway,

The standard doesn't specify read-only memory, so it also doesn't forbid
the concept.  The implementation takes it into account though, and thus
it's defined in that context.

> and no assumptions can be made in portable code. 

No, you can make assumptions, given what we want the implementation to
do.  We might need to explain that better (or at all) in the docs, but
the idea is that *new* code can expect lock-free atomics to both have a
true atomic load (i.e., performance-wise) and have loads work on
read-only-mapped memory.

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  5:50   ` Alexander Monakov
  2018-02-26  7:24     ` Fw: " Ruslan Nikolaev via gcc
@ 2018-02-26 18:56     ` Torvald Riegel
  1 sibling, 0 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-26 18:56 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Ruslan Nikolaev, gcc

On Mon, 2018-02-26 at 08:50 +0300, Alexander Monakov wrote:
> > For these reasons, it may be a good idea if GCC folks reconsider past
> > decision. And just to clarify: if mcx16 (x86-64) is not specified during
> > compilation, it is totally OK to redirect to libatomic, and there make the
> > final decision if target CPU supports a given instruction or not. But if it is
> > specified, it makes sense for performance reasons and lock-freedom guarantees
> > to always generate it directly. 
> 
> You don't mention it directly, so just to make it clear for readers: on systems
> where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do
> exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to
> lock-free RMW implementations if so.  (I don't like this solution)

I thought we had fixed that to not use the wide CAS?

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  4:01 ` GCC interpretation of C11 atomics (DR 459) Ruslan Nikolaev via gcc
  2018-02-26  5:50   ` Alexander Monakov
@ 2018-02-26 12:30   ` Szabolcs Nagy
  2018-02-26 13:57     ` Alexander Monakov
  2018-02-26 18:16   ` Florian Weimer
  2 siblings, 1 reply; 38+ messages in thread
From: Szabolcs Nagy @ 2018-02-26 12:30 UTC (permalink / raw)
  To: Ruslan Nikolaev, gcc; +Cc: nd

On 26/02/18 04:00, Ruslan Nikolaev via gcc wrote:
> 1. Not consistent with clang/llvm which completely supports double-width atomics for arm32, arm64, x86 and x86-64 making it possible to write portable code (w/o specific extensions or assembly code) across all these architectures (which is finally possible with C11!)
this should be reported as a bug against clang.

there is no abi guarantee that double-width atomics
will be able to synchronize with code in other modules,
you have to introduce a new abi to do this whatever
that takes (new elf flag, new dynamic linker name,..).

> 4. atomic_load can be implemented using read-modify-write as it is the only option for x86-64 and arm64 (see below).
> 

no, it can't be.

> [..] The actual nature of read-only memory and how it can be used are outside the scope of the standard, so there is nothing to prevent atomic_load from being implemented as a read-modify-write operation.
> 

rmw load is only valid if the implementation can
guarantee that atomic objects are never read-only.

current implementations on linux (including clang)
don't do that, so an rmw load can observably break
conforming c code: a static global const object is
placed in .rodata section and thus rmw on it is a
crash at runtime contrary to c standard requirements.
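
for illustration, an rmw load can be sketched as below. this is only a sketch of the technique on a 64-bit object so that it builds without libatomic; the thread is about 16-byte objects, and `load_via_cas` is a hypothetical name, not what any compiler emits verbatim.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Emulate a load with a compare-and-swap that tries to replace 0 with 0.
   If the object holds 0, the CAS succeeds and stores 0 back; otherwise it
   fails and copies the current value into `expected`. Either way
   `expected` ends up holding the value that was in the object.
   The parameter cannot be pointer-to-const, unlike atomic_load after
   DR 459, because the CAS is formally a write; on x86 a locked cmpxchg
   needs write permission to the page even when the comparison fails. */
static uint64_t load_via_cas(_Atomic uint64_t *obj)
{
    uint64_t expected = 0;

    atomic_compare_exchange_strong(obj, &expected, 0);
    return expected;
}
```

called on an object that really lives in a read-only page, such a load would fault, which is the crash described above.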

on an aarch64 machine clang miscompiles this code:

$ cat a.c
#include <stdatomic.h>

static const _Atomic struct S {long i,j;} x;

int f(const _Atomic struct S *p)
{
	struct S y = *p;
	return y.i;
}

int main()
{
	return f(&x);
}
$ gcc a.c -latomic
$ ./a.out
$ clang a.c -latomic
$ ./a.out
Segmentation fault (core dumped)

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 12:30   ` Szabolcs Nagy
@ 2018-02-26 13:57     ` Alexander Monakov
  2018-02-26 14:51       ` Szabolcs Nagy
  2018-02-26 14:53       ` Ruslan Nikolaev via gcc
  0 siblings, 2 replies; 38+ messages in thread
From: Alexander Monakov @ 2018-02-26 13:57 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Ruslan Nikolaev, gcc, nd

On Mon, 26 Feb 2018, Szabolcs Nagy wrote:
> 
> rmw load is only valid if the implementation can
> guarantee that atomic objects are never read-only.

OK, but that sounds like a matter of not emitting atomic
objects into .rodata, which shouldn't be a big problem,
if not for backwards compatibility concern?

> current implementations on linux (including clang)
> don't do that, so an rmw load can observably break
> conforming c code: a static global const object is
> placed in .rodata section and thus rmw on it is a
> crash at runtime contrary to c standard requirements.

Note that in your example GCC emits 'x' as a common symbol,
you need '... x = { 0 };' for it to appear in .rodata,

> on an aarch64 machine clang miscompiles this code:
[...]

and then with new enough libatomic on Glibc this segfaults
with GCC on x86_64 too due to IFUNC redirection mentioned
in the other subthread.

Alexander

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 13:57     ` Alexander Monakov
@ 2018-02-26 14:51       ` Szabolcs Nagy
  2018-02-26 14:53       ` Ruslan Nikolaev via gcc
  1 sibling, 0 replies; 38+ messages in thread
From: Szabolcs Nagy @ 2018-02-26 14:51 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: nd, Ruslan Nikolaev, gcc

On 26/02/18 13:56, Alexander Monakov wrote:
> On Mon, 26 Feb 2018, Szabolcs Nagy wrote:
>>
>> rmw load is only valid if the implementation can
>> guarantee that atomic objects are never read-only.
> 
> OK, but that sounds like a matter of not emitting atomic
> objects into .rodata, which shouldn't be a big problem,
> if not for backwards compatibility concern?
> 

well gcc wants to allow atomic access on non-atomic
objects too, otherwise public interfaces may need to
change to use the _Atomic qualifier (which is not even
valid in c++ so it would cause all sorts of breakage).

i think it would be valid to put _Atomic stuff in
writable section and then say atomic load is only
supported on const objects if it is declared with
_Atomic, this would make all strictly conforming
c code work as well as most code that ppl write in
practice (they probably don't use atomics on global
consts).

>> current implementations on linux (including clang)
>> don't do that, so an rmw load can observably break
>> conforming c code: a static global const object is
>> placed in .rodata section and thus rmw on it is a
>> crash at runtime contrary to c standard requirements.
> 
> Note that in your example GCC emits 'x' as a common symbol,
> you need '... x = { 0 };' for it to appear in .rodata,
> 

i see.

static ... x = {0}; and static ... x; are
equivalent in c, so if gcc treats them differently
that's a gcc weirdness, but does not change the
issue that there is no guarantee about readonlyness.

>> on an aarch64 machine clang miscompiles this code:
> [...]
> 
> and then with new enough libatomic on Glibc this segfaults
> with GCC on x86_64 too due to IFUNC redirection mentioned
> in the other subthread.
> 

that's yet another issue, that this is not fully
fixed in x86 gcc.

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 13:57     ` Alexander Monakov
  2018-02-26 14:51       ` Szabolcs Nagy
@ 2018-02-26 14:53       ` Ruslan Nikolaev via gcc
  2018-02-26 18:35         ` Torvald Riegel
  1 sibling, 1 reply; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 14:53 UTC (permalink / raw)
  To: Alexander Monakov, Szabolcs Nagy; +Cc: gcc, nd

Thank you for more comments, my response is below.



On Mon, 26 Feb 2018, Szabolcs Nagy wrote:
> rmw load is only valid if the implementation can
> guarantee that atomic objects are never read-only.
But per the response from WG14 regarding DR 459, which I quoted, the standard does not seem to define behavior for read-only memory (and the const qualifier should not suggest that). RMW, according to them, is fine for atomic_load.

> current implementations on linux (including clang)
> don't do that, so an rmw load can observably break
> conforming c code: a static global const object is
> placed in .rodata section and thus rmw on it is a
> crash at runtime contrary to c standard requirements.
I have just tried to compile the code using clang. The latest stable version of clang seems to emit cmpxchg16b for the code you mentioned if I specify mcx16. If I do not, it redirects to libatomic.  (I have not tried the version from the trunk, though.)


  On Monday, February 26, 2018 8:57 AM, Alexander Monakov wrote:
> OK, but that sounds like a matter of not emitting atomic
> objects into .rodata, which shouldn't be a big problem,
> if not for backwards compatibility concern?

I agree, sounds like a good idea. Certainly for _Atomic objects > 8 bytes.

> and then with new enough libatomic on Glibc this segfaults
> with GCC on x86_64 too due to IFUNC redirection mentioned
> in the other subthread.

Seems like it is a problem anyway. Another reason to never emit _Atomic objects into .rodata.

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 14:53       ` Ruslan Nikolaev via gcc
@ 2018-02-26 18:35         ` Torvald Riegel
  2018-02-26 18:59           ` Ruslan Nikolaev via gcc
  0 siblings, 1 reply; 38+ messages in thread
From: Torvald Riegel @ 2018-02-26 18:35 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd

On Mon, 2018-02-26 at 14:53 +0000, Ruslan Nikolaev via gcc wrote:
> Thank you for more comments, my response is below.
> 
> 
> 
> On Mon, 26 Feb 2018, Szabolcs Nagy wrote:
> > rmw load is only valid if the implementation can
> > guarantee that atomic objects are never read-only.
> But per the response from WG14 regarding DR 459, which I quoted, the standard does not seem to define behavior for read-only memory

This ...

> (and const qualifier should not suggest that). RMW, according to them, is fine for atomic_load.

... does not imply this latter statement.  The statement you cited is
about what the standard itself requires, not what makes sense for a
particular implementation.  For example, one could build an
implementation that does not have any read-only memory and doesn't
distinguish between loads and atomic RMW operations; in such a case, it
wouldn't make sense for the standard to require it.  OTOH though, if
read-only memory exists, it makes sense for an implementation to try to
respect it.

Consider trying to use atomics for memory mapped read-only from another
process, for example to observe output from that other process.  You
don't want to make it read-write for security reasons, for example.
Atomic operations designated as lock-free by the implementation are
supposed to be address-free too, which targets the use case of mapping
memory from somewhere else.  So, in such a case, using the wide CAS for
atomic loads breaks a reasonable assumption.  Moreover, it's also a
special case, in that 32b atomics do work as intended.

Also, I believe the vast majority of synchronization code makes implicit
assumptions about the performance of atomic load operations, notably
that concurrent loads don't create contention, or at least much less
than concurrent writes.  The behavior you favor would violate that, and
there's no portable way to distinguish one from the other. 

Thus, GCC only declares operations as lock-free if atomic loads of the
particular size/alignment are natively supported, and with the
performance properties one would associate with just a load on the
particular arch.  If an atomic load and an atomic CAS are supported,
that's fine; if there's just a CAS, that's not enough.
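
A program can ask the compiler which sizes it treats this way. The sketch below uses the `__atomic_always_lock_free` builtin, which GCC documents and clang also accepts; `struct pair` and the helper names are hypothetical. The answer for the 16-byte case depends on the compiler, target and flags, so no particular result is assumed here.

```c
#include <stdbool.h>

/* Hypothetical double-width payload: two machine words. */
struct pair { long lo, hi; };

/* True if atomics of machine-word size are always lock-free. */
static bool word_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(long), 0);
}

/* True only if the compiler promises lock-free atomics for an object of
   this size with typical alignment (second argument 0). */
static bool pair_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(struct pair), 0);
}
```

Both calls fold to constants at compile time, so they can guard the choice between a lock-free algorithm and a fallback without any library call.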

I see your point in wanting to have a builtin or such for the 64b atomic
CAS.  However, IMO, this doesn't fit into the world of C11/C++11
atomics, and thus rather should be accessible through a separate
interface.

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:35         ` Torvald Riegel
@ 2018-02-26 18:59           ` Ruslan Nikolaev via gcc
  2018-02-26 19:20             ` Torvald Riegel
  0 siblings, 1 reply; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 18:59 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd

Torvald, thank you for your input. See my response below. 

    On Monday, February 26, 2018 1:35 PM, Torvald Riegel <triegel@redhat.com> wrote:

> ... does not imply this latter statement.  The statement you cited is
> about what the standard itself requires, not what makes sense for a
> particular implementation. 

True, but it makes sense to provide true atomics when they are available. Since the standard seems to allow an atomic_load implementation using RMW, this does not seem to be a problem.
In fact, the lock_free flag for this type can return true only if mcx16 is specified; otherwise, it returns false (since availability can only be determined at runtime, we assume the worst-case scenario).

> So, in such a case, using the wide CAS for
> atomic loads breaks a reasonable assumption.  Moreover, it's also a
> special case, in that 32b atomics do work as intended.

But in this case a programmer already makes an assumption that atomic_load does not use RMW, which C11 does not seem to guarantee. Of course, for single-width operations, the programmer may in most practical cases assume it (even though there is no guarantee).
Anyway, there is no good solution here for double-width operations, and the programmer should not assume it is possible when writing portable code. In fact, a lock-based solution is even more confusing and potentially error-prone (e.g., it cannot be safely used inside signal handlers since it is not lock-free).

> The behavior you favor would violate that, and
> there's no portable way to distinguish one from the other. 

There is already a similar problem with IFUNC (when used with Linux and glibc). In fact, I do not see any difference here. Redirection to libatomic when mcx16 is specified just adds extra cost and less predictable behavior. Moreover, it seems counterintuitive: I specify a flag saying that cmpxchg16b is supported, but gcc still does not use it (at least directly). It is possible to change libatomic to always use cmpxchg16b when available (even on systems without IFUNC); this way it is totally consistent and binary compatible for code compiled with and without mcx16.

> I see your point in wanting to have a builtin or such for the 64b atomic
> CAS.  However, IMO, this doesn't fit into the world of C11/C++11
> atomics, and thus rather should be accessible through a separate
> interface.
Why not? If atomic_load is not really an issue, then it may be good to use a standardized interface.

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:59           ` Ruslan Nikolaev via gcc
@ 2018-02-26 19:20             ` Torvald Riegel
  0 siblings, 0 replies; 38+ messages in thread
From: Torvald Riegel @ 2018-02-26 19:20 UTC (permalink / raw)
  To: Ruslan Nikolaev; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd

On Mon, 2018-02-26 at 18:55 +0000, Ruslan Nikolaev via gcc wrote:
> Torvald, thank you for your input. See my response below. 
> 
>     On Monday, February 26, 2018 1:35 PM, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > ... does not imply this latter statement.  The statement you cited is
> > about what the standard itself requires, not what makes sense for a
> > particular implementation. 
> 
> True, but it makes sense to provide true atomics when they are available.

What do you mean by "true atomics"?  For me, that includes an atomic
load that is not emulated through an RMW.

> Since the standard seems to allow an atomic_load implementation using RMW, this does not seem to be a problem.

I believe that in the C++ committee, we have consensus that the intent
for lock-free atomics is that they should have an atomic load available
that behaves like a typical natively-supported atomic load.  I can't speak
for the C committee, but at least the memory models are supposed to be
the same.

This is a decision that implementations ultimately make, however.

> In fact, the lock_free flag for this type can return true only if mcx16 is specified; otherwise, it returns false (since availability can only be determined at runtime, we assume the worst-case scenario)

But then -mcx16 is a different ABI effectively, and it also changes what
(portable) synchronization code can expect when it sees an atomic type
declared as lock-free.

> > So, in such a case, using the wide CAS for
> > atomic loads breaks a reasonable assumption.  Moreover, it's also a
> > special case, in that 32b atomics do work as intended.
> 
> But in this case a programmer already makes an assumption that atomic_load does not use RMW, which C11 does not seem to guarantee.

It makes sense for GCC as an implementation to guarantee that.

> Of course, for single-width operations, the programmer may in most practical cases assume it (even though there is no guarantee).

Requiring programs to consider what is "single-width" for a particular
platform, instead of just being able to test the lock-free property,
decreases portability.

> Anyway, there is no good solution here for double-width operations, and the programmer should not assume it is possible when writing portable code.

That's an argument in favor of splitting wide CAS out into a separate
interface -- C11 atomics are portable from the perspective of the major
use cases, and they should stay that way.

> In fact, lock-based solution is even more confusing and potentially error-prone (e.g., cannot be safely used inside signal handlers since it is not lock-free, etc)
> 
> > The behavior you favor would violate that, and
> > there's no portable way to distinguish one from the other. 
> 
> There is already a similar problem with IFUNC (when used with Linux and glibc). In fact, I do not see any difference here. Redirection to libatomic when mcx16 is specified just adds extra cost and less predictable behavior. Moreover, it seems counterintuitive: I specify a flag saying that cmpxchg16b is supported, but gcc still does not use it (at least directly). It is possible to change libatomic to always use cmpxchg16b when available (even on systems without IFUNC); this way it is totally consistent and binary compatible for code compiled with and without mcx16.

I've commented on that elsewhere in the thread.

> > I see your point in wanting to have a builtin or such for the 64b atomic
> > CAS.  However, IMO, this doesn't fit into the world of C11/C++11
> > atomics, and thus rather should be accessible through a separate
> > interface.
> Why not? If atomic_load is not really an issue, then it may be good to use a standardized interface.

See above.  The atomic builtins are a package that, at least on GCC's
implementation, gives you a set of properties you can rely on in a
portable way (in particular when used through the C11/C++11 atomic ops).

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26  4:01 ` GCC interpretation of C11 atomics (DR 459) Ruslan Nikolaev via gcc
  2018-02-26  5:50   ` Alexander Monakov
  2018-02-26 12:30   ` Szabolcs Nagy
@ 2018-02-26 18:16   ` Florian Weimer
  2018-02-26 18:34     ` Ruslan Nikolaev via gcc
  2018-02-26 18:36     ` Janne Blomqvist
  2 siblings, 2 replies; 38+ messages in thread
From: Florian Weimer @ 2018-02-26 18:16 UTC (permalink / raw)
  To: nruslan_devel; +Cc: gcc

On 02/26/2018 05:00 AM, Ruslan Nikolaev via gcc wrote:
> If I understand correctly, the redirection to libatomic was made for 2 reasons:
> 1. cmpxchg16b is not available on early amd64 processors. (However, mcx16 flag already specifies that you use CPUs that have this instruction, so it should not be a concern when the flag is specified.)
> 2. atomic_load on read-only memory.

I think x86-64 should be able to do atomic load and store via SSE2 
registers, but perhaps if the memory is suitably aligned (which is the 
other problem—the libatomic code will work irrespective of alignment, as 
far as I understand it).

Thanks,
Florian

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:16   ` Florian Weimer
@ 2018-02-26 18:34     ` Ruslan Nikolaev via gcc
  2018-02-26 18:36     ` Janne Blomqvist
  1 sibling, 0 replies; 38+ messages in thread
From: Ruslan Nikolaev via gcc @ 2018-02-26 18:34 UTC (permalink / raw)
  To: Florian Weimer; +Cc: gcc

On Monday, February 26, 2018 1:15 PM, Florian Weimer <fweimer@redhat.com> wrote:

> I think x86-64 should be able to do atomic load and store via SSE2 
> registers, but perhaps if the memory is suitably aligned (which is the 
> other problem—the libatomic code will work irrespective of alignment, as 
> far as I understand it).

IIRC, it is not always guaranteed to be atomic, so RMW is probably the only safe option for x86-64. And for ARM64, too, as far as I understand.

Just to summarize what can be done if the proposed change is accepted (from the discussion so far):

1. _Atomic objects larger than 8 bytes should not be placed in .rodata even if declared as const. It can also be specified that atomic_load should not be used on read-only memory with double-width operations.

2. libatomic can be modified to redirect to functions that use cmpxchg16b (whenever available on the target CPU) through regular function pointers even if IFUNC is not available. This will provide consistent behavior everywhere, and binary compatibility between mcx16 and mno-cx16.

3. Never redirect to libatomic for arm64 (since ldaxp/stlxp are available); redirect for x86-64 only if mcx16 is not specified. For ARM64, there is no mcx16 option at all.

-- Ruslan

* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:16   ` Florian Weimer
  2018-02-26 18:34     ` Ruslan Nikolaev via gcc
@ 2018-02-26 18:36     ` Janne Blomqvist
  2018-02-27 10:22       ` Florian Weimer
  1 sibling, 1 reply; 38+ messages in thread
From: Janne Blomqvist @ 2018-02-26 18:36 UTC (permalink / raw)
  To: Florian Weimer; +Cc: nruslan_devel, gcc mailing list

On Mon, Feb 26, 2018 at 8:15 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 02/26/2018 05:00 AM, Ruslan Nikolaev via gcc wrote:
>>
>> If I understand correctly, the redirection to libatomic was made for 2
>> reasons:
>> 1. cmpxchg16b is not available on early amd64 processors. (However, mcx16
>> flag already specifies that you use CPUs that have this instruction, so it
>> should not be a concern when the flag is specified.)
>> 2. atomic_load on read-only memory.
>
>
> I think x86-64 should be able to do atomic load and store via SSE2
> registers,

There is no such architectural guarantee. At least on some
micro-architecture (AMD Opteron "Istanbul") it's possible to construct
a test which fails, proving that at least on that micro-arch SSE2
load/store isn't guaranteed to be atomic. See
https://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations/7647825
for more discussion and a testcase.
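
The kind of test referred to above can be sketched as follows. This is not the testcase from the linked answer, only a minimal illustration under stated assumptions: the names slot, store16, load16, writer and count_torn_reads are hypothetical, the concurrent plain accesses are formally a data race in C, and the sketch relies on the compiler emitting a single 16-byte move for each intrinsic. One thread keeps storing a value whose two 64-bit halves are equal, and the other counts loads whose halves disagree.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Compiler-only barrier (GNU extension): keeps the accesses inside the loops. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

#if defined(__SSE2__)
#include <emmintrin.h>
static __m128i slot;                        /* 16-byte aligned by its type */

static void store16(uint64_t v)             /* expected: one 16-byte store */
{
    _mm_store_si128(&slot, _mm_set1_epi64x((long long)v));
    COMPILER_BARRIER();
}

static void load16(uint64_t out[2])         /* expected: one 16-byte load */
{
    __m128i x = _mm_load_si128(&slot);
    COMPILER_BARRIER();
    memcpy(out, &x, sizeof x);
}
#else
/* Assumption: no SSE2, so use two separate 64-bit accesses (these can tear). */
static volatile uint64_t slot[2];
static void store16(uint64_t v) { slot[0] = v; slot[1] = v; }
static void load16(uint64_t out[2]) { out[0] = slot[0]; out[1] = slot[1]; }
#endif

static atomic_bool stop_flag;

/* Writer: alternates between all-zero and all-one bits in both halves. */
static void *writer(void *arg)
{
    uint64_t v = 0;
    (void)arg;
    while (!atomic_load(&stop_flag)) {
        v = ~v;
        store16(v);
    }
    return NULL;
}

/* Returns the number of reads whose halves disagree, or -1 on thread error. */
static long count_torn_reads(long reads)
{
    pthread_t tid;
    long torn = 0;

    assert(reads >= 0);
    atomic_store(&stop_flag, false);
    if (pthread_create(&tid, NULL, writer, NULL) != 0)
        return -1;
    for (long i = 0; i < reads; i++) {
        uint64_t h[2];
        load16(h);
        if (h[0] != h[1])
            torn++;
    }
    atomic_store(&stop_flag, true);
    pthread_join(tid, NULL);
    return torn;
}
```

A non-zero result from count_torn_reads() on some CPU shows that 16-byte SSE2 accesses are not atomic there. A zero result on one machine proves nothing about the architecture in general. Depending on the C library, the program may need to be built with -pthread.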




-- 
Janne Blomqvist


* Re: GCC interpretation of C11 atomics (DR 459)
  2018-02-26 18:36     ` Janne Blomqvist
@ 2018-02-27 10:22       ` Florian Weimer
  0 siblings, 0 replies; 38+ messages in thread
From: Florian Weimer @ 2018-02-27 10:22 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: nruslan_devel, gcc mailing list

On 02/26/2018 07:36 PM, Janne Blomqvist wrote:
> There is no such architectural guarantee. At least on some
> micro-architecture (AMD Opteron "Istanbul") it's possible to construct
> a test which fails, proving that at least on that micro-arch SSE2
> load/store isn't guaranteed to be atomic.

Looks like I was wrong.  Ugh.

Thanks,
Florian

