* GCC interpretation of C11 atomics (DR 459)
From: Ruslan Nikolaev via gcc @ 2018-02-26 4:01 UTC
To: gcc

Hi

I have read multiple bug reports (84522, 80878, 70490) and the past decision to change GCC so that double-width (128-bit) atomics for x86-64 and arm64 are redirected to libatomic. Below I mention major concerns as well as the response from C11 (WG14) regarding DR 459 which, most likely, triggered this change in more recent GCC releases in the first place.

If I understand correctly, the redirection to libatomic was made for 2 reasons:

1. cmpxchg16b is not available on early amd64 processors. (However, the mcx16 flag already specifies that you use CPUs that have this instruction, so it should not be a concern when the flag is specified.)

2. atomic_load on read-only memory. DR 459 now requires 'const' qualifiers for atomic_load, which probably resulted in the interpretation that read-only memory must be supported. However, per the response from C11/WG14 (see below), this does not seem to be the case at all. Therefore, previously filed bug 70490 does not seem to be valid.

There are several concerns with current GCC behavior:

1. Not consistent with clang/llvm, which completely supports double-width atomics for arm32, arm64, x86 and x86-64, making it possible to write portable code (w/o specific extensions or assembly code) across all these architectures (which is finally possible with C11!). The behavior of clang: if mcx16 is specified, cmpxchg16b is generated for x86-64 (without any calls to libatomic), otherwise -- redirection to libatomic. For arm64, ldaxp/stlxp are always generated. In my opinion, this is very logical and non-confusing.

2. Oftentimes you want to have strict guarantees (by specifying the mcx16 flag for x86-64) that the generated code is lock-free; otherwise it is useless. Double-width atomics are often used in lock-free algorithms that use tags (stamps) for pointers to resolve the ABA problem. So, it is very useful to have corresponding support in the compiler.

3. The behavior is inconsistent even within GCC. Older (and more limited, less portable, etc.) __sync builtins still use cmpxchg16b directly; newer __atomic and C11 do not. Moreover, __sync builtins are probably less suitable for arm/arm64.

4. atomic_load can be implemented using read-modify-write, as it is the only option for x86-64 and arm64 (see below).

For these reasons, it may be a good idea if GCC folks reconsider the past decision. And just to clarify: if mcx16 (x86-64) is not specified during compilation, it is totally OK to redirect to libatomic and make the final decision there on whether the target CPU supports a given instruction. But if it is specified, it makes sense for performance reasons and lock-freedom guarantees to always generate it directly.

-- Ruslan

Response from the WG14 (C11) Convener regarding DR 459:
(I asked for permission to publish this response here.)

Ruslan,

Thank you for your comments. There is no normative requirement that const objects be suitable for read-only memory. An example and a footnote refer to read-only memory as a way to illustrate a point, but examples and footnotes are not normative. The actual nature of read-only memory and how it can be used are outside the scope of the standard, so there is nothing to prevent atomic_load from being implemented as a read-modify-write operation.

David

My original email:

Dear David Keaton,

After reviewing the proposed change DR 459 for C11:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_459
I identified that adding a const qualifier to atomic_load (C11 defines it without one) may actually be harmful in some cases. Particularly, for double-width (128-bit) atomics found in x86-64 (cmpxchg16b instruction) and arm64 (ldaxp/stlxp instructions), it is currently only possible to implement a 128-bit atomic_load using the corresponding read-modify-write instructions (i.e., potentially rewriting memory with the same value, but, in essence, not changing it). But these implementations will not work on read-only memory. Similar concerns apply to some extent to x86 and arm32 for double-width (64-bit) atomics. Otherwise, there is no obstacle to implementing all C11 atomics for the corresponding types on these architectures. Moreover, the well-known clang/llvm compiler already implements all double-width operations for x86, x86-64, arm32 and arm64 (atomic_load is implemented using the corresponding read-modify-write instructions). Double-width atomics are often used in data structures that need tagging for pointers to avoid the ABA problem (e.g., in lock-free stacks and queues).

It is my understanding that C11 aimed to make atomics more or less portable across different microarchitectures, while at the same time giving a compiler the ability to optimize code well and utilize the full potential of the corresponding microarchitecture.

If it is now required to support read-only memory (i.e., the const qualifier) for atomic_load, 128-bit atomics are likely to be impossible to implement in any meaningful and portable way. Thus, anyone who wants to use them will have to go with assembly fallbacks (or compiler extensions), partially defeating the purpose of C11 atomics.

One way to address this concern would be to state that atomic_load on read-only memory is implementation-defined and may not be supported for all types. That would also mean going with the previous C11 definition (i.e., without the const qualifier) of atomic_load rather than what was proposed in the DR 459 change.

I am ready to submit a more formal proposal if this is something that can be considered by the committee.
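As an illustration of the ABA tagging mentioned in the message above, here is a minimal C11 sketch. It is not taken from the thread. To stay runnable without libatomic or -mcx16, it packs a 32-bit node index and a 32-bit tag into a single 64-bit atomic word; the double-width case under discussion would instead pair a full pointer with a full-width tag in a 16-byte object. The names `pool`, `stack_push`, `stack_pop` and the fixed pool size are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NIL UINT32_MAX          /* "no node" marker */
#define POOL_SIZE 8             /* hypothetical fixed node pool */

/* Nodes are addressed by index so that index + tag fit in 64 bits. */
static struct { int value; _Atomic uint32_t next; } pool[POOL_SIZE];

/* Head word: low 32 bits = index of the top node, high 32 bits = ABA tag. */
static _Atomic uint64_t head = NIL;

static uint64_t pack(uint32_t index, uint32_t tag) {
    return ((uint64_t)tag << 32) | index;
}
static uint32_t index_of(uint64_t word) { return (uint32_t)word; }
static uint32_t tag_of(uint64_t word)   { return (uint32_t)(word >> 32); }

/* Push node `node` (an index into pool) carrying `value`. */
static void stack_push(uint32_t node, int value) {
    uint64_t old = atomic_load(&head);
    pool[node].value = value;
    do {
        pool[node].next = index_of(old);
        /* The tag is bumped on every successful update, so a head that went
           A -> B -> A no longer compares equal to a stale snapshot. */
    } while (!atomic_compare_exchange_weak(&head, &old,
                                           pack(node, tag_of(old) + 1)));
}

/* Pop the top node; returns its index, or NIL if the stack is empty. */
static uint32_t stack_pop(void) {
    uint64_t old = atomic_load(&head);
    uint32_t top;
    do {
        top = index_of(old);
        if (top == NIL)
            return NIL;
    } while (!atomic_compare_exchange_weak(&head, &old,
                                           pack(pool[top].next, tag_of(old) + 1)));
    return top;
}
```

With a 16-byte `{pointer, tag}` struct in place of the packed word, the same compare-exchange loop is what needs cmpxchg16b or ldaxp/stlxp to be lock-free, which is the case this thread is about.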
* Re: GCC interpretation of C11 atomics (DR 459)
From: Alexander Monakov @ 2018-02-26 5:50 UTC
To: Ruslan Nikolaev
Cc: gcc

Hello,

Although I wouldn't like to fight defending GCC's design change here, let me offer a couple of corrections/additions so everyone is on the same page:

On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote:
>
> 1. Not consistent with clang/llvm which completely supports double-width atomics for arm32, arm64, x86 and x86-64 making it possible to write portable code (w/o specific extensions or assembly code) across all these architectures (which is finally possible with C11!). The behavior of clang: if mcx16 is specified, cmpxchg16b is generated for x86-64 (without any calls to libatomic), otherwise -- redirection to libatomic. For arm64, ldaxp/stlxp are always generated. In my opinion, this is very logical and non-confusing.

Note that there are more issues to that than just behavior on readonly memory: you need to ensure that the whole program, including all static and shared libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level support to ensure that), or you'd need to be sure that it's safe to mix code compiled with different -mcx16 settings because it never happens to interop on wide atomic objects.

(If you mix -mcx16 and -mno-cx16 code operating on the same 128-bit object, you get wrong code that will appear to work >99% of the time.)

> 3. The behavior is inconsistent even within GCC. Older (and more limited, less portable, etc) __sync builtins still use cmpxchg16b directly, newer __atomic and C11 -- do not. Moreover, __sync builtins are probably less suitable for arm/arm64.

Note that there's no "load" function in the __sync family, so the original concern about operations on readonly memory does not apply.

> For these reasons, it may be a good idea if GCC folks reconsider past decision. And just to clarify: if mcx16 (x86-64) is not specified during compilation, it is totally OK to redirect to libatomic, and there make the final decision if target CPU supports a given instruction or not. But if it is specified, it makes sense for performance reasons and lock-freedom guarantees to always generate it directly.

You don't mention it directly, so just to make it clear for readers: on systems where the GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to lock-free RMW implementations if so. (I don't like this solution.)

Thanks.
Alexander
* Fw: GCC interpretation of C11 atomics (DR 459)
From: Ruslan Nikolaev via gcc @ 2018-02-26 7:24 UTC
To: Alexander Monakov
Cc: gcc

Alexander,

Thank you for your comments. Please see my response below. I definitely do not want to fight for or against this change in gcc, but there are legitimate concerns to consider. I think it would really be good to consider this change to make things more compatible (i.e., at least between clang/llvm and gcc, which can both be used within the same ecosystem). There are real practical benefits to having true lock-free double-width operations when implementing algorithms that rely on ABA tagging for pointers, and C11 at last gives an opportunity to do that without resorting to assembly or platform-specific implementations.

> Note that there's more issues to that than just behavior on readonly memory: you need to ensure that the whole program, including all static and shared libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level support to ensure that), or you'd need to be sure that it's safe to mix code compiled with different -mcx16 settings because it never happens to interop on wide atomic objects.

Well, if libatomic is already doing it when the corresponding CPU feature is available (i.e., effectively implementing operations using cmpxchg16b), I do not see any problem here. mcx16 implies that you *have* cmpxchg16b, therefore other code compiled without the -mcx16 flag will go to libatomic. Inside libatomic, it will detect that cmpxchg16b *is* available, thus making code compiled with and without the -mcx16 flag completely compatible on a given system. Or am I missing something here?

If you do not have cmpxchg16b, but the program is compiled with the flag, it will simply not run (as expected).

So, in other words, libatomic should still decide whether you have cmpxchg16b or not for cases when -mcx16 is not specified. But if it is specified, cmpxchg16b can be generated unconditionally. If you want better compatibility, you will not specify the flag. A mix of -mcx16 and -mno-cx16 code will thus be binary compatible.

> Note that there's no "load" function in the __sync family, so the original concern about operations on readonly memory does not apply.

Yes, but per the clarification from WG14/C11, read-only memory should not be a concern at all, as this behavior is not specified anyway (regardless of the const specifier). Read-modify-write is allowed for atomic_load as long as there is no 'visible' change to the value being loaded.

In this sense, the bug that was filed previously regarding read-only memory accesses and the const specifier does not seem to be valid.

Additionally, it is really odd and counterintuitive to still provide support for (almost) deprecated macros while not giving such an opportunity for newer and more advanced functions.

> You don't mention it directly, so just to make it clear for readers: on systems where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to lock-free RMW implementations if so. (I don't like this solution)

Yes, but libatomic makes things slower due to indirection. Also, it is much harder to track what is going on, as there is no guarantee of lock-freedom in this case. BTW -- the fact that it currently uses cmpxchg16b if available may actually be helpful for switching to the suggested behavior without breaking binary compatibility (if I understand everything correctly).

-- Ruslan
* Re: Fw: GCC interpretation of C11 atomics (DR 459)
From: Alexander Monakov @ 2018-02-26 8:20 UTC
To: Ruslan Nikolaev
Cc: gcc

On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote:
> Well, if libatomic is already doing it when corresponding CPU feature is available (i.e., effectively implementing operations using cmpxchg16b), I do not see any problem here. mcx16 implies that you *have* cmpxchg16b, therefore other code compiled without -mcx16 flag will go to libatomic. Inside libatomic, it will detect that cmpxchg16b *is* available, thus making code compiled with and without -mcx16 flag completely compatible on a given system. Or do I miss something here?

I'd say the main issue is that libatomic is not guaranteed to work like that. Today it relies on IFUNC for redirection, so you may (and not "will") get the desired behavior on Glibc (implying Linux), not on other OSes, and neither on Linux with non-GNU libc (nor on bare metal, for that matter).

Alexander
* Re: Fw: GCC interpretation of C11 atomics (DR 459)
From: Ruslan Nikolaev via gcc @ 2018-02-26 8:43 UTC
To: Alexander Monakov
Cc: gcc

> I'd say the main issue is that libatomic is not guaranteed to work like that. Today it relies on IFUNC for redirection, so you may (and not "will") get the desired behavior on Glibc (implying Linux), not on other OSes, and neither on Linux with non-GNU libc (nor on bare metal, for that matter).

I think that if IFUNC is not available (i.e., outside glibc), redirection is still possible by introducing a regular function pointer there. Yes, it is an extra cost, but it is better than nothing (plus it gives consistent behavior on all platforms), and it probably will not add too much anyway because there is already a performance hit from going to libatomic.
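The function-pointer redirection suggested above can be sketched roughly as follows. This is an illustration under stated assumptions, not libatomic's actual code: the names `u128`, `load16`, `load16_locked` and `load16_cx16` are hypothetical, the lock-free path is hand-written inline assembly for x86-64 only, and CPU detection uses `__get_cpuid` and `bit_CMPXCHG16B` from the compiler's `<cpuid.h>`. On other targets the sketch always takes the lock-based path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <cpuid.h>
#endif

/* Hypothetical 16-byte, 16-byte-aligned value type. */
typedef struct { _Alignas(16) uint64_t lo; uint64_t hi; } u128;

static atomic_flag fallback_lock = ATOMIC_FLAG_INIT;

/* Lock-based fallback: works on any CPU, but is blocking. */
static u128 load16_locked(u128 *p) {
    while (atomic_flag_test_and_set_explicit(&fallback_lock, memory_order_acquire))
        ;
    u128 v = *p;
    atomic_flag_clear_explicit(&fallback_lock, memory_order_release);
    return v;
}

#if defined(__x86_64__)
/* Lock-free load via cmpxchg16b: compare with 0 and "store" 0 on a match.
   The value in memory never changes, but the page must be writable. */
static u128 load16_cx16(u128 *p) {
    u128 v = {0, 0};
    __asm__ __volatile__("lock cmpxchg16b %2"
                         : "+a"(v.lo), "+d"(v.hi), "+m"(*p)
                         : "b"((uint64_t)0), "c"((uint64_t)0)
                         : "cc", "memory");
    return v;
}

static int cpu_has_cx16(void) {
    unsigned eax, ebx, ecx, edx;
    return __get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & bit_CMPXCHG16B);
}
#endif

typedef u128 (*load16_fn)(u128 *);
static u128 load16_resolve(u128 *p);

/* Starts out pointing at the resolver; replaced on first use. */
static _Atomic load16_fn load16_impl = load16_resolve;

static u128 load16_resolve(u128 *p) {
    load16_fn fn = load16_locked;
#if defined(__x86_64__)
    if (cpu_has_cx16())
        fn = load16_cx16;
#endif
    atomic_store(&load16_impl, fn);
    return fn(p);
}

/* Public entry point: one extra indirect call compared with inline code. */
static u128 load16(u128 *p) {
    return atomic_load(&load16_impl)(p);
}
```

A real library would have to route stores and compare-exchange through the same decision: a locked store racing with a cmpxchg16b load on the same object is the kind of mixing Alexander warns about earlier in the thread.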
* Re: Fw: GCC interpretation of C11 atomics (DR 459)
From: Torvald Riegel @ 2018-02-26 19:07 UTC
To: Ruslan Nikolaev
Cc: Alexander Monakov, GCC Patches

On Mon, 2018-02-26 at 07:24 +0000, Ruslan Nikolaev via gcc wrote:
> Alexander,
> Thank you for your comments. Please see my response below. I definitely do not want to fight for or against this change in gcc, but there are definitely legitimate concerns to consider. I think, it would really be good to consider this change to make things more compatible (i.e., at least between clang/llvm and gcc which can be both used within the same ecosystem). There are real practical benefits of having true lock-free double-width operations when implementing algorithms that rely on ABA tagging for pointers, and C11 at last gives an opportunity to do that without resorting to assembly or platform-specific implementations.

I agree a wide CAS can be useful, but that has to be weighed against all other use cases and how they'd be affected by the change you propose. Not getting the performance usually associated with atomic loads can be a big problem for code that tries to be portable.

> > Note that there's more issues to that than just behavior on readonly memory: you need to ensure that the whole program, including all static and shared libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level support to ensure that), or you'd need to be sure that it's safe to mix code compiled with different -mcx16 settings because it never happens to interop on wide atomic objects.
>
> Well, if libatomic is already doing it when corresponding CPU feature is available (i.e., effectively implementing operations using cmpxchg16b), I do not see any problem here. mcx16 implies that you *have* cmpxchg16b, therefore other code compiled without -mcx16 flag will go to libatomic. Inside libatomic, it will detect that cmpxchg16b *is* available, thus making code compiled with and without -mcx16 flag completely compatible on a given system. Or do I miss something here?

I think I now remember why we "didn't fix" libatomic: there might be compiled code out there that does use the wide CAS, so changing libatomic from the status quo to using its internal locks could break programs. In contrast, only redirecting to libatomic and not promising lock-freedom anymore doesn't break these programs, but it gives us the opportunity to fix this in the future; because we don't advertise those operations as lock-free anymore, we also make new programs aware that they won't get the default set of native atomic operations, and thus prevent new programs from running into this problem.

> If you do not have cmpxchg16b, but the program is compiled with the flag, it will simply not run (as expected).
>
> So, in other words, libatomic should still decide whether you have cmpxchg16b or not for cases when -mcx16 is not specified. But if it is specified, cmpxchg16b can be generated unconditionally. If you want better compatibility, you will not specify the flag. Mix of -mcx16 and -mno-cx16 will be, thus, binary compatible.
>
> > Note that there's no "load" function in the __sync family, so the original concern about operations on readonly memory does not apply.
> Yes, but per clarification from WG14/C11, read-only memory should not be a concern at all,

No, they only said that it doesn't need to be a concern for the standard. Implementations have to pay attention to more things, so it is a concern for implementations.

> as this behavior is not specified anyway (regardless of the const specifier). Read-modify-write is allowed for atomic_load as long as there is no 'visible' change on the value being loaded.

It's not "visible" in the abstract machine under some setting of the as-if rule. But it is definitely visible in an implementation in which the effects of read-only memory are visible (see my example of mapping memory from another process read-only so as to read data from that process).

> In this sense, the bug that was filed previously regarding read-only memory accesses and const specifier does not seem to be valid.
> Additionally, it is really odd and counterintuitive to still provide support for (almost) deprecated macros while not giving such an opportunity for newer and more advanced functions.
>
> > You don't mention it directly, so just to make it clear for readers: on systems where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to lock-free RMW implementations if so. (I don't like this solution)
>
> Yes, but libatomic makes things slower due to indirection. Also, it is much harder to track what is going on, as there is no guarantee of lock-freedom in this case. BTW -- The fact that it currently uses cmpxchg16b if available may actually be helpful to switch to the suggested behavior without breaking binary compatibility (if I understand everything correctly).

It's rather done that way to switch away from the previous behavior but in a manner that's less likely to break existing programs.
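The read-only-memory point can be made concrete with a small POSIX sketch. It is an assumption-laden illustration, not code from the thread: it uses an anonymous private mapping in one process rather than memory shared from another process, the function name `load_from_readonly_page` is hypothetical, and it uses a 64-bit atomic, which mainstream targets load with a plain read. A load implemented as a read-modify-write (for example a cmpxchg16b-based 16-byte load on x86-64) would fault on the same mapping, because the instruction always performs a write access.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc under strict -std=c11 */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Initialise an atomic on a fresh page, make the page read-only, then load
   through a const-qualified pointer. Returns 0 on success, -1 on error. */
static int load_from_readonly_page(uint64_t *out) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    _Atomic uint64_t *obj = page;
    atomic_init(obj, 42);

    /* From here on, any write access to the page raises SIGSEGV. */
    if (mprotect(page, len, PROT_READ) != 0) {
        munmap(page, len);
        return -1;
    }

    /* DR 459 allows atomic_load through a pointer to const. */
    const _Atomic uint64_t *ro = obj;
    *out = atomic_load(ro);

    munmap(page, len);
    return 0;
}
```

Whether the 16-byte equivalent of this function works therefore depends on how the implementation performs the load, which is the implementation-level concern raised in the reply above.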
* Re: Fw: GCC interpretation of C11 atomics (DR 459)
From: Ruslan Nikolaev via gcc @ 2018-02-26 19:43 UTC
To: Torvald Riegel
Cc: Alexander Monakov, GCC Patches

Torvald,

I definitely do not want to insist on this design choice, but it makes sense to at least seriously consider it given the concerns I described. And especially because IFUNC in libatomic already redirects to cmpxchg16b, so it just adds extra cost and indirection. Quite frankly, I do not even see any serious problem here with respect to binary compatibility. Even if cmpxchg16b was not used on some platforms outside Linux, old binaries will go to libatomic, which can now be updated to simply use cmpxchg16b. (Even statically linked binaries should not be an issue -- they will not have any direct interaction with newer binaries.)

> Not getting the performance usually associated with atomic loads can be a big problem for code that tries to be portable.

I do not think it is a common use case anyway. How often is atomic_load used on double-width objects? If a programmer needs some guarantees and does not care about lock-freedom, why not use a regular lock here? This way nothing magical happens. Otherwise, he may hit unexpected issues in places like signal handlers (which are hard to debug since the program will hang only once in a while). With cmpxchg16b, it is at least more or less reproducible: if you try to use it on read-only memory, you will immediately get a segfault.

> I think I now remember why we "didn't fix" libatomic: There might be compiled code out there that does use the wide CAS, so changing libatomic from the status quo to using its internal locks could break programs.

Well, it already happens for Linux and glibc. Nothing will break there. For other architectures, it would be good to implement the same, so that consistent behavior is observed everywhere.

> No, they only said that it doesn't need to be a concern for the standard. Implementations have to pay attention to more things, so it is a concern for implementation.

Yes, but the only problem I see is that the object is currently placed in .rodata when const is used. It is easy to resolve: just do not place it there for _Atomic objects > 8 bytes. Then also clarify that a programmer cannot safely cast some arbitrary object that can be placed in .rodata for use with atomic_load. It needs to be addressed anyway, as there is already a segfault for the provided example on x86-64 and Linux even with redirection to libatomic.

> It's not "visible" in the abstract machine under some setting of the as-if rule. But it is definitely visible in an implementation in which the effects of read-only memory are visible (see my example of mapping memory from another process read-only so as to read data from that process).

True, but it is not defined for read-only memory anyway, and no assumptions can be made in portable code.

-- Ruslan
* Re: Fw: GCC interpretation of C11 atomics (DR 459)
From: Ruslan Nikolaev via gcc @ 2018-02-26 22:49 UTC
To: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc
Cc: GCC Patches

Thanks, everyone, for the input, it is very useful. I am just proposing to consider the change unless there are clear roadblocks. (Either design choice is probably OK with respect to the standard, formally speaking, but there are some clear advantages also.) I wrote a summary of pros & cons (which, of course, is slightly biased towards the change :) )

I also opened Bug 84563 with the rationale.

Pros of the proposed approach:

1. Ability to use guaranteed lock-free double-width atomics (when mcx16 is specified for x86-64, and always for arm64) in a more or less portable manner across different supported architectures (without resorting to non-standard extensions or writing separate assembly code for each architecture). Hopefully, the behavior may also be made more or less consistent across different compilers over time. It is already the case for clang/llvm. As mentioned, double-width lock-free atomics have real practical use (ABA tags for pointers).

2. More likely to find a bug immediately if a programmer tries to do something that is not guaranteed by the standard (i.e., getting a segfault on read-only memory when using double-width atomic_load). This is true even if mcx16 is not used, as most CPUs have cmpxchg16b, and libatomic will use it. On the other hand, atomic_load implemented through locks may have hard-to-find and hard-to-debug issues in signal handlers, interrupt contexts, etc. when a programmer erroneously assumes that atomic_load is non-blocking.

3. For arm64 the corresponding instructions are always available; no need for the mcx16 flag or redirection to libatomic at all (libatomic may still keep the old implementation for backward compatibility).

4. Faster and easier-to-analyze code when mcx16 is specified.

5. Ability to tell for sure if the implementation is lock-free by checking the corresponding C11 flag when mcx16 is specified. When unspecified, the flag will be false to accommodate the worst-case scenario.

6. Consistent behavior everywhere on all platforms regardless of IFUNC, the mcx16 flag, etc. If cmpxchg16b is available, it is always used (platforms that do not support IFUNC will use function pointers for redirection). The only thing the mcx16 flag changes is removing indirection to libatomic and giving a guaranteed lock_free flag for the corresponding types. (BTW, in practice, if you use the flag, you should know what you are doing already.)

7. Ability to finally deprecate old __sync builtins, and use new and more advanced __atomic everywhere.

Cons of the proposed approach:

1. The compiler may place const atomic objects in .rodata. (Avoided by making sure _Atomic objects with size > 8 are not placed in .rodata, plus clarifying that casting random .rodata objects for double-width atomics is undefined and is not allowed.)

2. Backward compatibility concerns if used outside glibc/IFUNC. Most likely, even in this case, not an issue since all calls there are already redirected to libatomic anyway, and statically linked binaries will not interact with new binaries directly.

3. Read-only memory for atomic_load will not be supported for double-width types. But it is actually better than sweeping the problem under the carpet (current behavior is actually even worse because it is inconsistent across different platforms, i.e. different for x86-64 in Linux and arm64). Anyway, it is better to use a lock-based approach explicitly if for whatever reason it is preferable (read-only memory, performance (?), etc).

-- Ruslan
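Item 5 of the pros list refers to querying lock-freedom at compile time. A minimal sketch of such a query is shown below. It assumes the GCC/Clang builtin `__atomic_always_lock_free` and the standard C11 macro `ATOMIC_POINTER_LOCK_FREE`; the `struct tagged_ptr` type and the function names are hypothetical. What the first function returns for a 16-byte object depends on the compiler, its version and flags such as -mcx16, so no particular result is claimed here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical double-width object: a pointer plus an ABA tag,
   aligned to its own size. */
struct tagged_ptr {
    _Alignas(2 * sizeof(void *)) void *ptr;
    uintptr_t tag;
};

/* 1 if the compiler promises lock-free atomics of this size on every CPU
   this translation unit targets, 0 if it may fall back to a library call. */
static int tagged_ptr_always_lock_free(void) {
    return __atomic_always_lock_free(sizeof(struct tagged_ptr), 0);
}

/* The same question for an ordinary pointer, via the standard C11 macro:
   0 = never, 1 = sometimes, 2 = always lock-free. */
static int pointer_lock_free_level(void) {
    return ATOMIC_POINTER_LOCK_FREE;
}
```

Code that requires lock-freedom (for example, code called from signal handlers) could check the first function's result and refuse to use the double-width path when it is 0.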
* Re: Fw: GCC interpretation of C11 atomics (DR 459)
From: Ruslan Nikolaev via gcc @ 2018-02-27 3:33 UTC
To: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc
Cc: GCC Patches
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-26 22:49 ` Ruslan Nikolaev via gcc 2018-02-27 3:33 ` Ruslan Nikolaev via gcc @ 2018-02-27 10:34 ` Ramana Radhakrishnan 2018-02-27 11:14 ` Torvald Riegel 2018-02-27 12:39 ` Torvald Riegel 2 siblings, 1 reply; 38+ messages in thread From: Ramana Radhakrishnan @ 2018-02-27 10:34 UTC (permalink / raw) To: Ruslan Nikolaev Cc: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc On Mon, Feb 26, 2018 at 10:45 PM, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote: > Thanks, everyone, for the output, it is very useful. I am just proposing to consider the change unless there are clear roadblocks. (Either design choice is probably OK with respect to the standard formally speaking, but there are some clear advantages also.) I wrote a summary of pros & cons (which, of course, is slightly biased towards the change :) ) > I also opened Bug 84563 with the rationale. > > > Pros of the proposed approach: > 1. Ability to use guaranteed lock-free double-width atomics (when mcx16 is specified for x86-64, and always for arm64) in more or less portable manner across different supported architectures (without resorting to non-standard extensions or writing separate assembly code for each architecture). Hopefully, the behavior may also be made more or less consistent across different compilers over time. It is already the case for clang/llvm. As mentioned, double-width lock-free atomics have real practical use (ABA tags for pointers). > > 2. More likely to find a bug immediately if a programmer tries to do something that is not guaranteed by the standard (i.e., getting segfault on read-only memory when using double-width atomic_load). 
This is true even if mcx16 is not used, as most CPUs have cmpxchg16b, and libatomic will use it. On the other hand, atomic_load implemented through locks may have hard-to-find and debug issues in signal handlers, interrupt contexts, etc when a programmer erroneously assumes that atomic_load is non-blocking > > 3. For arm64 the corresponding instructions are always available, no need for mcx16 flag or redirection to libatomic at all (libatomic may still keep old implementation for backward compatibility). That is going to create an ABI break on AArch64. Think about binaries produced by old releases of GCC that use locks in libatomic and those produced by new GCC. The way to fix this in AArch64 if there is a guarantee from the standard that there are no problems with read-only locations is to implement the change in libatomic. You cannot have the same region of memory protected by locks in older binaries and the appropriate load / store instructions in new binaries. Ramana > 4. Faster & easy to analyze code when mcx16 is specified. > > 5. Ability to tell for sure if the implementation is lock-free by checking corresponding C11 flag when mcx16 is specified. When unspecified, the flag will be false to accommodate the worst-case scenario. > > 6. Consistent behavior everywhere on all platforms regardless of IFUNC, mcx16 flag, etc. If cmpxchg16b is available, it is always used (platforms that do not support IFUNC will use function pointers for redirection). The only thing the mcx16 flag changes is removing indirection to libatomic and giving guaranteed lock_free flag for corresponding types. (BTW, in practice, if you use the flag, you should know what you are doing already) > > 7. Ability to finally deprecate old __sync builtins, and use new and more advanced __atomic everywhere. > > > Cons of the proposed approach: > > 1. Compiler may place const atomic objects in .rodata.
(Avoided by making sure _Atomic objects with the size > 8 are not placed in .rodata + clarifying that casting random .rodata objects for double-width atomics is undefined and is not allowed.) > > 2. Backward compatibility concerns if used outside glibc/IFUNC. Most likely, even in this case, not an issue since all calls there are already redirected to libatomic anyway, and statically-linked binaries will not interact with new binaries directly. > 3. Read-only memory for atomic_load will not be supported for double-width types. But it is actually better than sweeping the problem under the carpet (current behavior is actually even worse because it is inconsistent across different platforms, i.e. different for x86-64 in Linux and arm64). Anyway, it is better to use a lock-based approach explicitly if for whatever reason it is preferable (read-only memory, performance (?), etc). > -- Ruslan ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 10:34 ` Ramana Radhakrishnan @ 2018-02-27 11:14 ` Torvald Riegel 0 siblings, 0 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 11:14 UTC (permalink / raw) To: Ramana Radhakrishnan Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer, Szabolcs Nagy, GCC Patches On Tue, 2018-02-27 at 10:22 +0000, Ramana Radhakrishnan wrote: > The way to fix this in AArch64 if there is a > guarantee from the standard that there are no problems with read-only > locations is to implement the change in libatomic. Even though the standard doesn't specify read-only memory, I think that consensus in ISO C++ SG1 (ie, the concurrency study group) exists that it makes sense for implementations to not declare something lock-free if the hardware doesn't provide a true atomic load for the particular size/alignment. It is an implementation-level decision though (given that the details of the as-if rule depend on what's doable on the particular implementation), and I do not see a reason to change GCC's stance on this. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-26 22:49 ` Ruslan Nikolaev via gcc 2018-02-27 3:33 ` Ruslan Nikolaev via gcc 2018-02-27 10:34 ` Ramana Radhakrishnan @ 2018-02-27 12:39 ` Torvald Riegel 2018-02-27 13:04 ` Ruslan Nikolaev via gcc 2 siblings, 1 reply; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 12:39 UTC (permalink / raw) To: Ruslan Nikolaev; +Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc On Mon, 2018-02-26 at 22:45 +0000, Ruslan Nikolaev via gcc wrote: > Thanks, everyone, for the output, it is very useful. I am just proposing to consider the change unless there are clear roadblocks. (Either design choice is probably OK with respect to the standard formally speaking, but there are some clear advantages also.) I wrote a summary of pros & cons (which, of course, is slightly biased towards the change :) ) > I also opened Bug 84563 with the rationale. This bug summarizes your perspective on the matter. I'd call that not just slightly biased :) I do not see a reason to change GCC's position regarding this topic. We should update the docs though to clarify the intent and guarantees GCC's implementation gives, I suppose. The reasons have been discussed elsewhere in this thread already, so I'm not going to repeat them here. > 1. Ability to use guaranteed lock-free double-width atomics (when mcx16 is specified for x86-64, and always for arm64) in more or less portable manner across different supported architectures (without resorting to non-standard extensions or writing separate assembly code for each architecture). That's a valid goal, but it does not imply that we should mess with how atomics are implemented by default, nor should we mess with the default use cases. This goal wants something special, and that is exposing the fact that *only* a CAS is available to synchronize atomically on a particular type. That is an extension of the existing atomics design. 
There are different ways to expose such an extension, with one being to simply provide a __atomic_special_cas builtin or something like that. It would have the same synchronization semantics as the normal CAS, but concurrent access between the special CAS and the normal atomics would be considered a data race (so making sure that there's no guaranteed atomicity between them). It could have a fallback to libatomic for ease of use, and it could be defined for smaller types too. This would be portable, and would allow us to separate the different use cases. > 7. Ability to finally deprecate old __sync builtins, and use new and more advanced __atomic everywhere. The topic we're currently discussing does not significantly affect when we can remove __sync builtins, IMO. ^ permalink raw reply [flat|nested] 38+ messages in thread
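The CAS-only extension outlined above can be approximated today with ordinary C11 atomics. The following is a minimal sketch under stated assumptions, not the proposed builtin: `__atomic_special_cas` does not exist, and the names `cas_only_u64`, `cas_only_init`, `cas_only_cas` and `cas_only_read` are hypothetical. A 64-bit word is used so the code runs anywhere; the same shape would apply to a double-width type.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CAS-only wrapper. The struct hides the atomic so callers
   are steered away from mixing it with plain atomic_load/atomic_store. */
typedef struct { _Atomic uint64_t word; } cas_only_u64;

static void cas_only_init(cas_only_u64 *obj, uint64_t value)
{
    atomic_init(&obj->word, value);
}

/* The single primitive, with the semantics of a strong compare-and-swap. */
static bool cas_only_cas(cas_only_u64 *obj, uint64_t *expected, uint64_t desired)
{
    return atomic_compare_exchange_strong(&obj->word, expected, desired);
}

/* A "load" built from the CAS alone. If the object holds 0 the CAS rewrites
   0 with 0; otherwise it fails and copies the current value into `seen`.
   Either way `seen` ends up holding the current value. This is a
   read-modify-write, so it needs write access and takes a non-const pointer. */
static uint64_t cas_only_read(cas_only_u64 *obj)
{
    uint64_t seen = 0;
    cas_only_cas(obj, &seen, 0);
    return seen;
}
```

The non-const parameter of `cas_only_read` makes the cost visible in the interface: this read cannot be used on read-only memory and contends like a store.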
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 12:39 ` Torvald Riegel @ 2018-02-27 13:04 ` Ruslan Nikolaev via gcc 2018-02-27 13:08 ` Szabolcs Nagy ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Ruslan Nikolaev via gcc @ 2018-02-27 13:04 UTC (permalink / raw) To: Torvald Riegel; +Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc Formally speaking, either implementation satisfies C11 because the standard allows much leeway in the interpretation here. But, of course, it is kind of annoying that double-width types (and that also includes potentially 64-bit on some 32-bit processors, e.g. i586 also has cmpxchg8b and no official way to read atomically otherwise) need special handling and compiler extensions which basically means that in a number of cases I cannot write portable code, I need to put a bunch of architecture-dependent ifdefs, for say, 64 bit atomics even. (And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway.) Particularly, imagine when someones writes some lock-free code for different types (in templates, macros, etc). It basically uses same C11 atomic primitives but for various integer sizes. Now I need special handling for larger types because whatever libatomic provides does not guarantee lock-freedom (i.e., useless) which otherwise I do not need. True that wider types may not be available across all architectures, but I would prefer to have generic and standard-conformant code at least for those that have them. > That's a valid goal, but it does not imply that we should mess with how > atomics are implemented by default, nor should we mess with the default > use cases. This goal wants something special, and that is exposing the > fact that *only* a CAS is available to synchronize atomically on a >particular type. That is an extension of the existing atomics design. 
See above > The standard doesn't specify read-only memory, so it also doesn't forbid > the concept. The implementation takes it into account though, and thus > it's defined in that context. But my point is that a programmer cannot rely on this feature anyway unless she/he wants to write code which compiles only with gcc. It is unspecified by the standard and implementations that use read-modify-write for atomic_load are perfectly valid. The whole point to have this standard in the first place is to allow code be compiled by different compilers, otherwise people can just rely on gcc-specific extensions. > The topic we're currently discussing does not significantly affect when > we can remove __sync builtins, IMO. They are the only builtins that directly expose double-width operations. Short of using assembly fall-backs, they are the only option right now. > They do care about whether atomic operations are natively supported on > that particular type -- and that should include a load. I think, the whole point to have atomic operations is ability to provide lock-free operations whenever possible. Even though standard does not guarantee it, that is almost the only sane use case. Otherwise, there is no point -- you can always use locks. If they do not care about lock-freedom, they should just use locks. > Nobody is proposing to mark things as lock-free if they aren't. Thus, I > don't see any change to what's usable in signal handlers. It is not obvious to anyone that atomic_load will block. It will *not* for single-width types. So, again we see differences for single- and double-width types. Even though you do not have problems with read-only memory, you have another problem for double-width types which may be even more subtle and much harder to debug in a number of cases. Of course, no one can make an assumption that it will not block, but the same can be said about read-only memory. Anyway, I do not have a horse in the race... 
I just proposed to consider this change for a number of legitimate use cases, but it is eventually up to the gcc developers to decide. -- Ruslan ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 13:04 ` Ruslan Nikolaev via gcc @ 2018-02-27 13:08 ` Szabolcs Nagy 2018-02-27 13:17 ` Ruslan Nikolaev via gcc 2018-02-27 16:21 ` Torvald Riegel 2018-02-27 16:16 ` Torvald Riegel 2018-02-27 16:46 ` Simon Wright 2 siblings, 2 replies; 38+ messages in thread From: Szabolcs Nagy @ 2018-02-27 13:08 UTC (permalink / raw) To: Ruslan Nikolaev, Torvald Riegel Cc: nd, Alexander Monakov, Florian Weimer, gcc On 27/02/18 12:56, Ruslan Nikolaev wrote: > Formally speaking, either implementation satisfies C11 because the standard allows much leeway in the interpretation here. no, 1) your proposal would make gcc non-conforming to iso c unless it changes how static const objects are emitted. 2) the two implementations are not abi compatible, the choice is already made, changing it is an abi break. 3) Torvald pointed out further considerations such as users expecting lock-free atomic loads to be faster than stores. the solution is to add a language extension, but that requires careful design. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 13:08 ` Szabolcs Nagy @ 2018-02-27 13:17 ` Ruslan Nikolaev via gcc 2018-02-27 16:40 ` Torvald Riegel 2018-02-27 16:21 ` Torvald Riegel 1 sibling, 1 reply; 38+ messages in thread From: Ruslan Nikolaev via gcc @ 2018-02-27 13:17 UTC (permalink / raw) To: Szabolcs Nagy, Torvald Riegel; +Cc: nd, Alexander Monakov, Florian Weimer, gcc > 1) your proposal would make gcc non-conforming to iso c unless it changes how static const objects are emitted. I do not think ISO C requires const objects to be put in .rodata. And it is easily solved by not placing it there for _Atomic objects that cannot be safely loaded from read-only memory. > 2) the two implementations are not abi compatible, the choice is already made, changing it is an abi break. Since the current implementation redirects to libatomic anyway, almost nothing should break. The only case it will break -- if somebody erroneously used atomic_load for 128-bit type on read-only memory (which is, again, not guaranteed by the standard). In practice, this case is almost non-existent. The worst that may happen -- you will get a segfault right away. > 3) Torvald pointed out further considerations such as users expecting lock-free atomic loads to be faster than stores. Is it even true? Is it faster to use some global lock (implemented through RMW) than a single RMW operation? If you use this global lock, you will not get loads faster than stores. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 13:17 ` Ruslan Nikolaev via gcc @ 2018-02-27 16:40 ` Torvald Riegel 2018-02-27 17:07 ` Ruslan Nikolaev via gcc 0 siblings, 1 reply; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 16:40 UTC (permalink / raw) To: Ruslan Nikolaev; +Cc: Szabolcs Nagy, nd, Alexander Monakov, Florian Weimer, gcc On Tue, 2018-02-27 at 13:16 +0000, Ruslan Nikolaev via gcc wrote: > > 3) Torvald pointed out further considerations such as users expecting lock-free atomic loads to be faster than stores. > > Is it even true? Is it faster to use some global lock (implemented through RMW) than a single RMW operation? If you use this global lock, you will not get loads faster than stores. If GCC declares a type as lock-free, atomic loads on this type will be natively supported through some sort of load instruction. That means they are faster than stores under concurrent accesses, in particular when there are concurrent atomic loads (for all major HW we care about). If there is no natively supported atomic load, GCC will not declare the type to be lock-free. Nobody made statement about performance of locks vs. RMWs. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 16:40 ` Torvald Riegel @ 2018-02-27 17:07 ` Ruslan Nikolaev via gcc 0 siblings, 0 replies; 38+ messages in thread From: Ruslan Nikolaev via gcc @ 2018-02-27 17:07 UTC (permalink / raw) To: Torvald Riegel; +Cc: Szabolcs Nagy, nd, Alexander Monakov, Florian Weimer, gcc Torvald, thank you for your output, but I think, this discussion gets a little pointless. There is nothing else I can add since gcc folks are reluctant to this change anyway. In my opinion, there is no compelling reason against such an implementation (it is perfectly fine with the standard, read-only memory is not guaranteed for atomic_load anyway). Even binary compatibility that was mentioned is unlikely to be an issue if implemented as I described. And finally this is something that can actually be useful in practice (at least as far as I can judge from my experience). By the way, this issue was already raised multiple times during last couple of years by different people who actually use it for various real projects (bugs were eventually closed as 'INVALID'). All described challenges are purely technical and can easily be resolved. Moreover, clang/llvm chose this implementation, and it seems very logical and non-confusing to me. It certainly makes sense to expose hardware capabilities through standard interfaces whenever possible. For my projects, I will simply fall back to my own implementation using inline assembly (at least for now) because, unfortunately, it is the only thing that is guaranteed to work outside of clang/llvm in the foreseeable future (__sync functions have some limitations and do not look like an attractive option either, by the way). On Tuesday, February 27, 2018 11:21 AM, Torvald Riegel <triegel@redhat.com> wrote: On Tue, 2018-02-27 at 13:16 +0000, Ruslan Nikolaev via gcc wrote: > > 3) Torvald pointed out further considerations such as users expecting lock-free atomic loads to be faster than stores. 
> > Is it even true? Is it faster to use some global lock (implemented through RMW) than a single RMW operation? If you use this global lock, you will not get loads faster than stores. If GCC declares a type as lock-free, atomic loads on this type will be natively supported through some sort of load instruction. That means they are faster than stores under concurrent accesses, in particular when there are concurrent atomic loads (for all major HW we care about). If there is no natively supported atomic load, GCC will not declare the type to be lock-free. Nobody made statement about performance of locks vs. RMWs. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 13:08 ` Szabolcs Nagy 2018-02-27 13:17 ` Ruslan Nikolaev via gcc @ 2018-02-27 16:21 ` Torvald Riegel 1 sibling, 0 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 16:21 UTC (permalink / raw) To: Szabolcs Nagy; +Cc: Ruslan Nikolaev, nd, Alexander Monakov, Florian Weimer, gcc On Tue, 2018-02-27 at 13:04 +0000, Szabolcs Nagy wrote: > the solutions is to add a language extension I think this only needs a library interface, at least when we're just considering the __atomic builtins. On the C/C++ level, it might amount to just another atomic type, which only has a CAS however; this could be probably modelled entirely through default atomics (not implemented though), and so wouldn't need a language or memory model extension. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-27 13:04 ` Ruslan Nikolaev via gcc 2018-02-27 13:08 ` Szabolcs Nagy @ 2018-02-27 16:16 ` Torvald Riegel 2018-02-27 16:46 ` Simon Wright 2 siblings, 0 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 16:16 UTC (permalink / raw) To: Ruslan Nikolaev; +Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc On Tue, 2018-02-27 at 12:56 +0000, Ruslan Nikolaev via gcc wrote: > But, of course, it is kind of annoying that double-width types (and that also includes potentially 64-bit on some 32-bit processors, e.g. i586 also has cmpxchg8b and no official way to read atomically otherwise) need special handling and compiler extensions which basically means that in a number of cases I cannot write portable code, I need to put a bunch of architecture-dependent ifdefs, for say, 64 bit atomics even. The extension I outlined gives you a portable way to use a wide CAS, provided that the particular implementation supports this extension. Whether the standard covers this extension is a different matter. You can certainly propose it to the C and/or C++ committees, but that gets easier if you can show existing practice. > (And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway.) You keep repeating that claim. You also keep ignoring the point about what kind of performance programs can expect atomic loads to have, if they are declared lock-free. Also note that the use case is *not* about wider-than-machine-word accesses on read-only memory, but whether a portable use of atomics (which doesn't have to consider machine word size) can use atomic loads on read-only memory. > Particularly, imagine when someones writes some lock-free code for different types (in templates, macros, etc). It basically uses same C11 atomic primitives but for various integer sizes.
Now I need special handling for larger types because whatever libatomic provides does not guarantee lock-freedom (i.e., useless) which otherwise I do not need. If you use C11 atomics, you're bound to the usage intended by the standard. Which means that if you need lock freedom, you must check or ensure that the implementation provides it to you. If you make an implicit assumption there beyond what the standard promises you, you are not writing portable C11 concurrent code -- you are writing architecture/platform-specific code. Now, imagine someone writes atomic code using C11, and expects atomic loads to actually perform like loads and not like RMWs -- which is what most concurrent code does, in particular lots of nonblocking code ... > True that wider types may not be available across all architectures, but I would prefer to have generic and standard-conformant code at least for those that have them. It is conforming to the standard. > > The standard doesn't specify read-only memory, so it also doesn't forbid > > the concept. The implementation takes it into account though, and thus > > it's defined in that context. > But my point is that a programmer cannot rely on this feature anyway unless she/he wants to write code which compiles only with gcc. She/he wants to rely on implementation-specific behavior, so that's not a problem. > It is unspecified by the standard and implementations that use read-modify-write for atomic_load are perfectly valid. The whole point to have this standard in the first place is to allow code be compiled by different compilers, otherwise people can just rely on gcc-specific extensions. And they can, because GCC's behavior conforms to the standard. It doesn't have the implementation-specific properties you prefer, but that's not about the standard but about your personal preferences. > > The topic we're currently discussing does not significantly affect when > > we can remove __sync builtins, IMO.
> > They are the only builtins that directly expose double-width operations. Short of using assembly fall-backs, they are the only option right now. We can still have an extension such as the one I outlined. > > They do care about whether atomic operations are natively supported on > > that particular type -- and that should include a load. > I think, the whole point to have atomic operations is ability to provide lock-free operations whenever possible. Even though standard does not guarantee it, that is almost the only sane use case. Otherwise, there is no point -- you can always use locks. If they do not care about lock-freedom, they should just use locks. The standards actually just promise you obstruction freedom. Forward progress guarantees are a part of the intention behind the lock-free class, but not all of it. There's address-freedom too, and an implicit assumption about what rough class of performance a particular operation is in. The majority of synchronization code will care much more about performance than about the operation being actually lock-free or not (things like signal handlers or the C++ unsequenced execution policy are the exceptions). Case in point: Lots of concurrent code built out of lock-free atomics is actually not lock-free but has blocking parts -- so it couldn't be used in things like signal handlers anyway. > > > Nobody is proposing to mark things as lock-free if they aren't. Thus, I > > don't see any change to what's usable in signal handlers. > It is not obvious to anyone that atomic_load will block. What's your point? Iff atomic_load will never block, it will be marked lock-free. Programs can check for this, so it will be obvious to them whether a certain operation might block. > It will *not* for single-width types. So, again we see differences for single- and double-width types. So? Lock-freedom is a per-type property. ^ permalink raw reply [flat|nested] 38+ messages in thread
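The check referred to above ("Programs can check for this") uses two standard C11 facilities: the `ATOMIC_*_LOCK_FREE` macros and `atomic_is_lock_free`. A minimal sketch follows; the function names `llong_lock_free_level` and `counter_is_lock_free` are hypothetical, and a 64-bit counter is used because querying a 16-byte object may require linking against libatomic.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Compile-time view for long long: 0 = never lock-free,
   1 = sometimes lock-free, 2 = always lock-free. */
static int llong_lock_free_level(void)
{
    return ATOMIC_LLONG_LOCK_FREE;
}

/* Run-time query for one particular object. */
static bool counter_is_lock_free(_Atomic uint64_t *counter)
{
    return atomic_is_lock_free(counter);
}
```

A portable program would branch on these results, for example falling back to a lock or to narrower operations when the answer is not "always".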
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-27 13:04 ` Ruslan Nikolaev via gcc 2018-02-27 13:08 ` Szabolcs Nagy 2018-02-27 16:16 ` Torvald Riegel @ 2018-02-27 16:46 ` Simon Wright 2018-02-27 16:52 ` Florian Weimer 2018-02-27 17:30 ` Torvald Riegel 2 siblings, 2 replies; 38+ messages in thread From: Simon Wright @ 2018-02-27 16:46 UTC (permalink / raw) To: Ruslan Nikolaev Cc: Torvald Riegel, Alexander Monakov, Florian Weimer, Szabolcs Nagy, gcc On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote: > > And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious) ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-27 16:46 ` Simon Wright @ 2018-02-27 16:52 ` Florian Weimer 2018-02-27 17:30 ` Torvald Riegel 1 sibling, 0 replies; 38+ messages in thread From: Florian Weimer @ 2018-02-27 16:52 UTC (permalink / raw) To: Simon Wright, Ruslan Nikolaev Cc: Torvald Riegel, Alexander Monakov, Szabolcs Nagy, gcc On 02/27/2018 05:40 PM, Simon Wright wrote: > Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious) On many systems, the read-only nature of a memory region is a thread-local or process-local attribute. Other parts of the system might have a different view on the same memory region. Some CPUs support memory protection keys, which provide a cheap way to switch memory from read-only to read-write and back, and those switching operations deliberately do not involve a memory barrier. Thanks, Florian ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-27 16:46 ` Simon Wright 2018-02-27 16:52 ` Florian Weimer @ 2018-02-27 17:30 ` Torvald Riegel 2018-02-27 17:33 ` Ruslan Nikolaev via gcc 2018-02-27 17:59 ` Simon Wright 1 sibling, 2 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 17:30 UTC (permalink / raw) To: Simon Wright Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer, Szabolcs Nagy, GCC Patches On Tue, 2018-02-27 at 16:40 +0000, Simon Wright wrote: > On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote: > > > > And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway > > Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious) Consider a producer-consumer relationship between two processes where the producer doesn't want to wait for the consumer. For example, the producer could be an application that's being traced, and the consumer is a trace aggregation tool. The producer can provide a read-only mapping to the consumer, and put a nonblocking ring buffer or something similar in there. That allows the consumer to read, but it still needs atomic access because the consumer is modifying the ring buffer concurrently. ^ permalink raw reply [flat|nested] 38+ messages in thread
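The read-only mapping scenario can be reproduced in a single process by mapping the same page twice. This is only a sketch under assumptions: it relies on POSIX `mmap` and on the GCC/Clang `__atomic` builtins, uses a temporary file as hypothetical backing storage, and the names `map_two_views`, `publish` and `observe` are made up for illustration. A real tracer would hand the read-only view to a separate process.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: back one page with a temporary file and map it twice,
   once writable (producer side) and once read-only (consumer side).
   Returns 0 on success, -1 on failure. The file is left open on purpose;
   cleanup is omitted to keep the sketch short. */
static int map_two_views(uint64_t **producer_view, const uint64_t **consumer_view)
{
    FILE *backing = tmpfile();
    if (backing == NULL)
        return -1;
    int fd = fileno(backing);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, (off_t)page) != 0)
        return -1;
    void *rw = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *ro = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (rw == MAP_FAILED || ro == MAP_FAILED)
        return -1;
    *producer_view = rw;
    *consumer_view = ro;
    return 0;
}

/* Producer side: a plain atomic store through the writable view. */
static void publish(uint64_t *slot, uint64_t value)
{
    __atomic_store_n(slot, value, __ATOMIC_RELEASE);
}

/* Consumer side: a true load through the read-only view. A load implemented
   as a read-modify-write would fault on this mapping. */
static uint64_t observe(const uint64_t *slot)
{
    return __atomic_load_n(slot, __ATOMIC_ACQUIRE);
}
```

The two views live at different addresses yet refer to the same object, which is the situation the term "address-free" is about.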
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-27 17:30 ` Torvald Riegel @ 2018-02-27 17:33 ` Ruslan Nikolaev via gcc 2018-02-27 19:32 ` Torvald Riegel 2018-02-27 17:59 ` Simon Wright 1 sibling, 1 reply; 38+ messages in thread From: Ruslan Nikolaev via gcc @ 2018-02-27 17:33 UTC (permalink / raw) To: Torvald Riegel, Simon Wright Cc: Alexander Monakov, Florian Weimer, Szabolcs Nagy, GCC Patches > Consider a producer-consumer relationship between two processes where > the producer doesn't want to wait for the consumer. For example, the > producer could be an application that's being traced, and the consumer > is a trace aggregation tool. The producer can provide a read-only > mapping to the consumer, and put a nonblocking ring buffer or something > similar in there. That allows the consumer to read, but it still needs > atomic access because the consumer is modifying the ring buffer > concurrently. Sorry for getting into someone's else conversation... And what good solution gcc offers right now? It forces producer and consumer to use lock-based (BTW: global lock!) approach for *both* producer and consumer if we are talking about 128-bit types. Therefore, sometimes producers *will* wait (by, effectively, blocking). Basically, it becomes useless. In this case, I would rather use a lock-based approach which at least does not use a global lock. On the contrary, the alternative implementation would have been at least useful when both producers and consumers have full (RW) access. Anyway, I already said that I personally will go with assembly inlines for right now. I just wanted to raise this concern since other people may find it useful in their projects. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-27 17:33 ` Ruslan Nikolaev via gcc @ 2018-02-27 19:32 ` Torvald Riegel 0 siblings, 0 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 19:32 UTC (permalink / raw) To: Ruslan Nikolaev Cc: Simon Wright, Alexander Monakov, Florian Weimer, Szabolcs Nagy, GCC Patches On Tue, 2018-02-27 at 17:29 +0000, Ruslan Nikolaev wrote: > > > > Consider a producer-consumer relationship between two processes where > > the producer doesn't want to wait for the consumer. For example, the > > producer could be an application that's being traced, and the consumer > > is a trace aggregation tool. The producer can provide a read-only > > mapping to the consumer, and put a nonblocking ring buffer or something > > similar in there. That allows the consumer to read, but it still needs > > atomic access because the consumer is modifying the ring buffer > > concurrently. > Sorry for getting into someone's else conversation... And what good solution gcc offers right now? It forces producer and consumer to use lock-based (BTW: global lock!) It's not one global lock, but a lock from an array of locks (global per process, though). > approach for *both* producer and consumer if we are talking about 128-bit types. But we're not talking about that special case of 128b types here. The majority of synchronization doesn't need more than machine word size. > Therefore, sometimes producers *will* wait (by, effectively, blocking). Basically, it becomes useless. No, such a program would have a bug anyway. It wouldn't even synchronize properly. Please make yourself familiar with what the standard means by "address-free". This use case needs address-free, so that's what the program has to ensure (and it can test that portably). Only lock-free gives you address-free. > In this case, I would rather use a lock-based approach which at least does not use a global lock. The lock would need to be shared between processes in the example I gave. 
You have to build your own lock for that currently, because C/C++ don't give you any process-shared locks. ^ permalink raw reply [flat|nested] 38+ messages in thread
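A hand-built lock of the kind mentioned here is often a spinlock over `atomic_flag`, the one type C11 guarantees to be lock-free. The sketch below is an assumption-laden illustration, not a recommendation: the names are hypothetical, it busy-waits without backoff, and it presumes the flag's operations are address-free on the target so that the lock can live in a shared mapping.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal lock, intended to be placed inside a shared mapping. */
typedef struct { atomic_flag held; } shared_spinlock;

static void shared_spinlock_init(shared_spinlock *lock)
{
    atomic_flag_clear(&lock->held);
}

/* Returns true if the lock was acquired, false if someone else holds it. */
static bool shared_spinlock_trylock(shared_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

static void shared_spinlock_lock(shared_spinlock *lock)
{
    while (!shared_spinlock_trylock(lock))
        ; /* busy-wait until the holder releases the flag */
}

static void shared_spinlock_unlock(shared_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

Every participant must be able to write to the flag, so a consumer that only has a read-only mapping cannot take such a lock at all.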
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-27 17:30 ` Torvald Riegel 2018-02-27 17:33 ` Ruslan Nikolaev via gcc @ 2018-02-27 17:59 ` Simon Wright 1 sibling, 0 replies; 38+ messages in thread From: Simon Wright @ 2018-02-27 17:59 UTC (permalink / raw) To: Torvald Riegel Cc: Ruslan Nikolaev, Alexander Monakov, Florian Weimer, Szabolcs Nagy, GCC Patches On 27 Feb 2018, at 17:07, Torvald Riegel <triegel@redhat.com> wrote: > > On Tue, 2018-02-27 at 16:40 +0000, Simon Wright wrote: >> On 27 Feb 2018, at 12:56, Ruslan Nikolaev via gcc <gcc@gcc.gnu.org> wrote: >>> >>> And all this mess to accommodate almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, in which no good solution exists anyway >> >> Sorry to butt in, but - if it's ROM why would you need atomic load anyway? (of course, if it's just a constant view of the object, reason is obvious) > > Consider a producer-consumer relationship between two processes where > the producer doesn't want to wait for the consumer. For example, the > producer could be an application that's being traced, and the consumer > is a trace aggregation tool. The producer can provide a read-only > mapping to the consumer, and put a nonblocking ring buffer or something > similar in there. That allows the consumer to read, but it still needs > atomic access because the consumer is modifying the ring buffer > concurrently. OK, got that, thanks (this is what I meant by "just a constant view of the object", btw). Misled by "read-only memory" since in the embedded world ROM (usually actually in Flash) is effectively read-only to all. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Fw: GCC interpretation of C11 atomics (DR 459) 2018-02-26 19:43 ` Ruslan Nikolaev via gcc 2018-02-26 22:49 ` Ruslan Nikolaev via gcc @ 2018-02-27 10:40 ` Torvald Riegel 1 sibling, 0 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-27 10:40 UTC (permalink / raw) To: Ruslan Nikolaev; +Cc: Alexander Monakov, GCC Patches On Mon, 2018-02-26 at 19:39 +0000, Ruslan Nikolaev via gcc wrote: > Torvald, I definitely do not want to insist on this design choice, but it makes sense to at least seriously consider it given the concerns I described. And especially because IFUNC in libatomic already redirects to cmpxchg16b, That's because we want to keep the old (but wrong) behavior unchanged, at least for existing code, until there's a time when we switch it. Given that we also don't declare those types lock-free anymore, new code will know that it's not safe to use them for cases such as inter-process communication (because if not lock-free, it's also not address-free anymore). > > Not getting the performance usually associated with atomic loads can be > > a big problem for code that tries to be portable. > > I do not think it is a common use case anyway. How often is atomic_load used on double-width operations? A portable program doesn't have to think about things like double-width, or whether the platform is something like x32 vs. x86-64. What a portable program cares about is whether atomic ops are lock-free on a particular 64b integer type or not. If they are, you want to use them to synchronize (e.g., counters), and then it can matter a lot whether a load is actually a load or just creates lots of contention. If they aren't available, the program knows that it has to find a different way to synchronize (e.g., build the 64b counter out of 32b operations). > If a programmer needs some guarantees and does not care about lock-freedom, why not use a regular lock here? 
They do care about whether atomic operations are natively supported on that particular type -- and that should include a load. > This way nothing magical happens. Otherwise, he may hit unexpected issues in places like signal handlers (which is hard to debug since it will hang only once in a while). With cmpxchg16b, it is at least more or less reproducible: if you try to use it on read-only memory, you will immediately get a segfault. Nobody is proposing to mark things as lock-free if they aren't. Thus, I don't see any change to what's usable in signal handlers. > > I think I now remember why we "didn't fix" libatomic: There might be > > compiled code out there that does use the wide CAS, so changing > > libatomic from the status quo to using its internal locks could break > > programs. > Well, it already happens for Linux and glibc. There nothing will break. For other architectures, it would be good to implement the same, so that consistent behavior is observed everywhere. It's not about consistency across archs, but consistency for existing code. New code or new implementations should just do the right thing, which is requiring a natively supported atomic load of the particular size/alignment. > > > No, they only said that it doesn't need to be a concern for the > > standard. Implementations have to pay attention to more things, so it > > is a concern for implementation. > Yes, but the only problem I see is that it is currently placed in .rodata when const is used. I and others are of a different opinion: Load performance matters, inter-process communication on read-only memory matters, and it's useful to have the builtins work on not just _Atomic types but general integer types with proper alignment (e.g., look at how glibc uses the builtins in a code base that is not C11 or more recent). > It is easy to resolve: just do not place it there for _Atomic objects > 8 bytes. 
Then also clarify that a programmer cannot safely cast some arbitrary object that can be placed in .rodata to use with atomic_load. That doesn't help with the use cases I listed previously. > It needs to be addressed anyway, as there is already a segfault for provided example in x86-64 and Linux even with redirection to libatomic. > > > It's not "visible" in the abstract machine under some setting of the > > as-if rule. But it is definitely visible in an implementation in which > > the effects of read-only memory are visible (see my example of mapping > > memory from another process read-only so as to read data from that > > process). > True but it is not defined for read-only memory anyway, The standard doesn't specify read-only memory, so it also doesn't forbid the concept. The implementation takes it into account though, and thus it's defined in that context. > and no assumptions can be made in portable code. No you can make assumptions, given what we want the implementation to do. We might need to explain that better (or at all) in the docs, but the idea is that *new* code can expect lock-free atomics to both have a true atomic load (ie, performance-wise) and have loads work on read-only-mapped memory. ^ permalink raw reply [flat|nested] 38+ messages in thread
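A small sketch of the check a portable program might make before committing to a 64-bit atomic counter, as described in the message above. The names `counter_is_native`, `counter_add`, `counter_read` and `event_counter` are hypothetical. `ATOMIC_LLONG_LOCK_FREE` is the standard C11 macro; the fallback strategy (for example, building the counter out of 32-bit operations) is deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* ATOMIC_LLONG_LOCK_FREE is 0 (never), 1 (sometimes) or 2 (always)
   lock-free. Only the value 2 is treated as "natively supported". */
static bool counter_is_native(void)
{
    return ATOMIC_LLONG_LOCK_FREE == 2;
}

/* Hypothetical shared counter; static storage is zero-initialized. */
static atomic_ullong event_counter;

static void counter_add(unsigned long long n)
{
    atomic_fetch_add_explicit(&event_counter, n, memory_order_relaxed);
}

/* Expected to be a real load (no contention between concurrent
   readers) when counter_is_native() returns true. */
static unsigned long long counter_read(void)
{
    return atomic_load_explicit(&event_counter, memory_order_relaxed);
}
```

When `counter_is_native()` is false the code above still works on GCC (through libatomic), but the program knows it should pick a different synchronization scheme if it cares about load performance or cross-process use.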
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 5:50 ` Alexander Monakov 2018-02-26 7:24 ` Fw: " Ruslan Nikolaev via gcc @ 2018-02-26 18:56 ` Torvald Riegel 1 sibling, 0 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-26 18:56 UTC (permalink / raw) To: Alexander Monakov; +Cc: Ruslan Nikolaev, gcc On Mon, 2018-02-26 at 08:50 +0300, Alexander Monakov wrote: > > For these reasons, it may be a good idea if GCC folks reconsider past > > decision. And just to clarify: if mcx16 (x86-64) is not specified during > > compilation, it is totally OK to redirect to libatomic, and there make the > > final decision if target CPU supports a given instruction or not. But if it is > > specified, it makes sense for performance reasons and lock-freedom guarantees > > to always generate it directly. > > You don't mention it directly, so just to make it clear for readers: on systems > where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do > exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to > lock-free RMW implementations if so. (I don't like this solution) I thought we had fixed that to not use the wide CAS? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 4:01 ` GCC interpretation of C11 atomics (DR 459) Ruslan Nikolaev via gcc 2018-02-26 5:50 ` Alexander Monakov @ 2018-02-26 12:30 ` Szabolcs Nagy 2018-02-26 13:57 ` Alexander Monakov 2018-02-26 18:16 ` Florian Weimer 2 siblings, 1 reply; 38+ messages in thread From: Szabolcs Nagy @ 2018-02-26 12:30 UTC (permalink / raw) To: Ruslan Nikolaev, gcc; +Cc: nd On 26/02/18 04:00, Ruslan Nikolaev via gcc wrote: > 1. Not consistent with clang/llvm which completely supports double-width atomics for arm32, arm64, x86 and x86-64 making it possible to write portable code (w/o specific extensions or assembly code) across all these architectures (which is finally possible with C11!) this should be reported as a bug against clang. there is no abi guarantee that double-width atomics will be able to synchronize with code in other modules, you have to introduce a new abi to do this whatever that takes (new elf flag, new dynamic linker name,..). > 4. atomic_load can be implemented using read-modify-write as it is the only option for x86-64 and arm64 (see below). > no, it can't be. > [..] The actual nature of read-only memory and how it can be used are outside the scope of the standard, so there is nothing to prevent atomic_load from being implemented as a read-modify-write operation. > rmw load is only valid if the implementation can guarantee that atomic objects are never read-only. current implementations on linux (including clang) don't do that, so an rmw load can observably break conforming c code: a static global const object is placed in .rodata section and thus rmw on it is a crash at runtime contrary to c standard requirements. 
on an aarch64 machine clang miscompiles this code:

$ cat a.c
#include <stdatomic.h>
static const _Atomic struct S {long i,j;} x;
int f(const _Atomic struct S *p)
{
	struct S y = *p;
	return y.i;
}
int main()
{
	return f(&x);
}
$ gcc a.c -latomic
$ ./a.out
$ clang a.c -latomic
$ ./a.out
Segmentation fault (core dumped)

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 12:30 ` Szabolcs Nagy @ 2018-02-26 13:57 ` Alexander Monakov 2018-02-26 14:51 ` Szabolcs Nagy 2018-02-26 14:53 ` Ruslan Nikolaev via gcc 0 siblings, 2 replies; 38+ messages in thread From: Alexander Monakov @ 2018-02-26 13:57 UTC (permalink / raw) To: Szabolcs Nagy; +Cc: Ruslan Nikolaev, gcc, nd On Mon, 26 Feb 2018, Szabolcs Nagy wrote: > > rmw load is only valid if the implementation can > guarantee that atomic objects are never read-only. OK, but that sounds like a matter of not emitting atomic objects into .rodata, which shouldn't be a big problem, if not for backwards compatibility concern? > current implementations on linux (including clang) > don't do that, so an rmw load can observably break > conforming c code: a static global const object is > placed in .rodata section and thus rmw on it is a > crash at runtime contrary to c standard requirements. Note that in your example GCC emits 'x' as a common symbol, you need '... x = { 0 };' for it to appear in .rodata, > on an aarch64 machine clang miscompiles this code: [...] and then with new enough libatomic on Glibc this segfaults with GCC on x86_64 too due to IFUNC redirection mentioned in the other subthread. Alexander ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 13:57 ` Alexander Monakov @ 2018-02-26 14:51 ` Szabolcs Nagy 2018-02-26 14:53 ` Ruslan Nikolaev via gcc 1 sibling, 0 replies; 38+ messages in thread From: Szabolcs Nagy @ 2018-02-26 14:51 UTC (permalink / raw) To: Alexander Monakov; +Cc: nd, Ruslan Nikolaev, gcc On 26/02/18 13:56, Alexander Monakov wrote: > On Mon, 26 Feb 2018, Szabolcs Nagy wrote: >> >> rmw load is only valid if the implementation can >> guarantee that atomic objects are never read-only. > > OK, but that sounds like a matter of not emitting atomic > objects into .rodata, which shouldn't be a big problem, > if not for backwards compatibility concern? > well gcc wants to allow atomic access on non-atomic objects too, otherwise public interfaces may need to change to use the _Atomic qualifier (which is not even valid in c++ so it would cause all sorts of breakage). i think it would be valid to put _Atomic stuff in writable section and then say atomic load is only supported on const objects if it is declared with _Atomic, this would make all strictly conforming c code work as well as most code that ppl write in practice (they probably don't use atomics on global consts). >> current implementations on linux (including clang) >> don't do that, so an rmw load can observably break >> conforming c code: a static global const object is >> placed in .rodata section and thus rmw on it is a >> crash at runtime contrary to c standard requirements. > > Note that in your example GCC emits 'x' as a common symbol, > you need '... x = { 0 };' for it to appear in .rodata, > i see. static ... x = {0}; and static ... x; are equivalent in c, so if gcc treats them differently that's a gcc weirdness, but does not change the issue that there is no guarantee about readonlyness. >> on an aarch64 machine clang miscompiles this code: > [...] 
> > and then with new enough libatomic on Glibc this segfaults > with GCC on x86_64 too due to IFUNC redirection mentioned > in the other subthread. > that's yet another issue, that this is not fully fixed in x86 gcc. ^ permalink raw reply [flat|nested] 38+ messages in thread
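The point about allowing atomic access on objects that are not declared `_Atomic` can be illustrated with GCC's `__atomic` builtins. This is only a sketch: `plain_counter` and the two helpers are hypothetical names, and it assumes the object is naturally aligned, which holds for a file-scope `uint64_t` on common targets.

```c
#include <stdint.h>

/* Plain (non-_Atomic) object, e.g. part of a public interface that
   cannot use the _Atomic qualifier because it is also seen by C++. */
static uint64_t plain_counter;

/* Atomic increment through the GCC/clang builtin; returns the new value. */
static uint64_t plain_counter_bump(void)
{
    return __atomic_add_fetch(&plain_counter, 1, __ATOMIC_RELAXED);
}

/* Atomic load of the same plain object. */
static uint64_t plain_counter_read(void)
{
    return __atomic_load_n(&plain_counter, __ATOMIC_ACQUIRE);
}
```

Because the compiler cannot tell from the type which objects will be accessed this way, a rule such as "never place `_Atomic` objects in .rodata" does not cover this usage.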
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 13:57 ` Alexander Monakov 2018-02-26 14:51 ` Szabolcs Nagy @ 2018-02-26 14:53 ` Ruslan Nikolaev via gcc 2018-02-26 18:35 ` Torvald Riegel 1 sibling, 1 reply; 38+ messages in thread From: Ruslan Nikolaev via gcc @ 2018-02-26 14:53 UTC (permalink / raw) To: Alexander Monakov, Szabolcs Nagy; +Cc: gcc, nd Thank you for more comments, my response is below. On Mon, 26 Feb 2018, Szabolcs Nagy wrote:> > rmw load is only valid if the implementation can > guarantee that atomic objects are never read-only. But per response from WG14 regarding DR 459 which I quoted, the standard does not seem to define behavior for read-only memory (and const qualifier should not suggest that). RMW, according to them, is fine for atomic_load. > current implementations on linux (including clang) > don't do that, so an rmw load can observably break > conforming c code: a static global const object is > placed in .rodata section and thus rmw on it is a > crash at runtime contrary to c standard requirements. I have just tried to compile the code using clang. Latest stable version of clang seems to emit cmpxchg16b for the code you mentioned if I specify mcx16. If I do not, it redirects to libatomic. (I have not tried the version from the trunk, though.) On Monday, February 26, 2018 8:57 AM, Alexander Monakov wrote: > OK, but that sounds like a matter of not emitting atomic > objects into .rodata, which shouldn't be a big problem, > if not for backwards compatibility concern? I agree, sounds like a good idea. Certainly for _Atomic objects > 8 bytes. > and then with new enough libatomic on Glibc this segfaults > with GCC on x86_64 too due to IFUNC redirection mentioned > in the other subthread. Seems like it is a problem anyway. Another reason to never emit _Atomic inside .rodata ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 14:53 ` Ruslan Nikolaev via gcc @ 2018-02-26 18:35 ` Torvald Riegel 2018-02-26 18:59 ` Ruslan Nikolaev via gcc 0 siblings, 1 reply; 38+ messages in thread From: Torvald Riegel @ 2018-02-26 18:35 UTC (permalink / raw) To: Ruslan Nikolaev; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd On Mon, 2018-02-26 at 14:53 +0000, Ruslan Nikolaev via gcc wrote: > Thank you for more comments, my response is below. > > > > On Mon, 26 Feb 2018, Szabolcs Nagy wrote:> > > rmw load is only valid if the implementation can > > guarantee that atomic objects are never read-only. > But per response from WG14 regarding DR 459 which I quoted, the standard does not seem to define behavior for read-only memory This ... > (and const qualifier should not suggest that). RMW, according to them, is fine for atomic_load. ... does not imply this latter statement. The statement you cited is about what the standard itself requires, not what makes sense for a particular implementation. For example, one could build an implementation that does not have any read-only memory and doesn't distinguish between loads and atomic RMW operations; in such a case, it wouldn't make sense for the standard to require it. OTOH though, if read-only memory exists, it makes sense for an implementation to try to respect it. Consider trying to use atomics for memory mapped read-only from another process, for example to observe output from that other process. You don't want to make it read-write for security reasons, for example. Atomic operations designated as lock-free by the implementation are supposed to be address-free too, which targets the use case of mapping memory from somewhere else. So, in such a case, using the wide CAS for atomic loads breaks a reasonable assumption. Moreover, it's also a special case, in that 32b atomics do work as intended. 
Also, I believe the vast majority of synchronization code makes implicit assumptions about the performance of atomic load operations, notably that concurrent loads don't create contention, or at least much less than concurrent writes. The behavior you favor would violate that, and there's no portable way to distinguish one from the other. Thus, GCC only declares operations as lock-free if atomic loads of the particular size/alignment are natively supported, and with the performance properties one would associate with just a load on the particular arch. If an atomic load and an atomic CAS are supported, that's fine; if there's just a CAS, that's not enough. I see your point in wanting to have a builtin or such for the 64b atomic CAS. However, IMO, this doesn't fit into the world of C11/C++11 atomics, and thus rather should be accessible through a separate interface. ^ permalink raw reply [flat|nested] 38+ messages in thread
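To make the "load emulated through a CAS" idea concrete, here is a sketch of the technique on a 64-bit object. The thread is about 128-bit objects, where x86-64 would use cmpxchg16b; 64 bits is used here only so that the example builds without `-mcx16` or libatomic. The function name `load_via_cas` is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Emulates an atomic load with a compare-and-swap.
   - If *p == 0, the CAS succeeds and stores 0 over 0.
   - Otherwise the CAS fails and copies the current value into `expected`.
   Either way `expected` ends up holding the value that was read. */
static uint64_t load_via_cas(_Atomic uint64_t *p)
{
    uint64_t expected = 0;
    atomic_compare_exchange_strong(p, &expected, 0);
    return expected;
}
```

Note that the parameter cannot be a pointer to const: the operation is a read-modify-write, so it needs a writable location and competes for the location the way a writer does. Those are the two properties (read-only mappings and load performance) discussed in the message above.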
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 18:35 ` Torvald Riegel @ 2018-02-26 18:59 ` Ruslan Nikolaev via gcc 2018-02-26 19:20 ` Torvald Riegel 0 siblings, 1 reply; 38+ messages in thread From: Ruslan Nikolaev via gcc @ 2018-02-26 18:59 UTC (permalink / raw) To: Torvald Riegel; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd Torvald, thank you for your output. See my response below. On Monday, February 26, 2018 1:35 PM, Torvald Riegel <triegel@redhat.com> wrote: > ... does not imply this latter statement. The statement you cited is > about what the standard itself requires, not what makes sense for a > particular implementation. True but makes sense to provide true atomics when they are available. Since the standard seems to allow atomic_load implementation using RMW, does not seem to be a problem. In fact, lock_free flag for this type can return true only if mcx16 is specified; otherwise -- it returns false (since it can only be determined during runtime, assuming worst case scenario) > So, in such a case, using the wide CAS for > atomic loads breaks a reasonable assumption. Moreover, it's also a > special case, in that 32b atomics do work as intended. But in this case a programmer already makes an assumption that atomic_load does not use RMW which C11 does not seem to guarantee. Of course, for single-width operations, the programmer may in most practical cases assume it (even though there is no guarantee). Anyway, there is no good solution here for double-width operations, and the programmer should not assume it is possible when writing portable code. In fact, lock-based solution is even more confusing and potentially error-prone (e.g., cannot be safely used inside signal handlers since it is not lock-free, etc) > The behavior you favor would violate that, and > there's no portable way to distinguish one from the other. There is already a similar problem with IFUNC (when used with Linux and glibc). In fact, I do not see any difference here. 
Redirection to libatomic when mcx16 is specified just adds extra cost + less predictable behavior. Moreover, it seems counterintuitive -- I specify a flag that mcx16 is supported but gcc still does not use it (at least directly). It is possible to make a change to libatomic to always use cmpxchg16b when available (even on systems without IFUNC), this way it is totally consistent and binary compatible for code compiled with and without mcx16. > I see your point in wanting to have a builtin or such for the 64b atomic > CAS. However, IMO, this doesn't fit into the world of C11/C++11 > atomics, and thus rather should be accessible through a separate > interface. Why not? If atomic_load is not really an issue, then it may be good to use standardized interface. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 18:59 ` Ruslan Nikolaev via gcc @ 2018-02-26 19:20 ` Torvald Riegel 0 siblings, 0 replies; 38+ messages in thread From: Torvald Riegel @ 2018-02-26 19:20 UTC (permalink / raw) To: Ruslan Nikolaev; +Cc: Alexander Monakov, Szabolcs Nagy, gcc, nd On Mon, 2018-02-26 at 18:55 +0000, Ruslan Nikolaev via gcc wrote: > Torvald, thank you for your output. See my response below. > > On Monday, February 26, 2018 1:35 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > ... does not imply this latter statement. The statement you cited is > > about what the standard itself requires, not what makes sense for a > > particular implementation. > > True but makes sense to provide true atomics when they are available. What do you mean by "true atomics"? For me, that includes an atomic load that is not emulated through an RMW. > Since the standard seem to allow atomic_load implementation using RMW, does not seem to be a problem. I believe that in the C++ committee, we have consensus that the intent for lock-free atomics is that they should have an atomic load available that behaves like a typical natively-supported atomic load. I can't speak for the C committee, but at least the memory models are supposed to be the same. This is a decision that implementations ultimately make, however. > In fact, lock_free flag for this type can return true only if mcx16 is specified; otherwise -- it returns false (since it can only be determined during runtime, assuming worst case scenario) But then -mcx16 is a different ABI effectively, and it also changes what (portable) synchronization code can expect when it sees an atomic type declared as lock-free. > > So, in such a case, using the wide CAS for > > atomic loads breaks a reasonable assumption. Moreover, it's also a > > special case, in that 32b atomics do work as intended. 
> > But in this case a programmer already makes an assumption that atomic_load does not use RMW which C11 does not seem to guarantee. It makes sense for GCC as an implementation to guarantee that. > Of course, for single-width operations, the programmer may in most practical cases assume it (even though there is no guarantee). Requiring programs to consider what is "single-width" for a particular platform, instead of just being able to test the lock-free property, decreases portability. > Anyway, there is no good solution here for double-width operations, and the programmer should not assume it is possible when writing portable code. That's an argument in favor of splitting wide CAS out into a separate interface -- C11 atomics are portable from the perspective of the major use cases, and they should stay that way. > In fact, lock-based solution is even more confusing and potentially error-prone (e.g., cannot be safely used inside signal handlers since it is not lock-free, etc) > > > The behavior you favor would violate that, and > > there's no portable way to distinguish one from the other. > > There is already a similar problem with IFUNC (when used with Linux and glibc). In fact, I do not see any difference here. Redirection to libatomic when mcx16 is specified just adds extra cost + less predictable behavior. Moreover, it seems counterintuitive -- I specify a flag that mcx16 is supported but gcc still does not use it (at least directly). It is possible to make a change to libatomic to always use cmpxchg16b when available (even on systems without IFUNC), this way it is totally consistent and binary compatible for code compiled with and without mcx16. I've commented on that elsewhere in the thread. > > I see your point in wanting to have a builtin or such for the 64b atomic > > CAS. However, IMO, this doesn't fit into the world of C11/C++11 > > atomics, and thus rather should be accessible through a separate > > interface. > Why not? 
If atomic_load is not really an issue, then it may be good to use standardized interface. See above. The atomic builtins are a package that, at least on GCC's implementation, gives you a set of properties you can rely on in a portable way (in particular when used through the C11/C++11 atomic ops). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 4:01 ` GCC interpretation of C11 atomics (DR 459) Ruslan Nikolaev via gcc 2018-02-26 5:50 ` Alexander Monakov 2018-02-26 12:30 ` Szabolcs Nagy @ 2018-02-26 18:16 ` Florian Weimer 2018-02-26 18:34 ` Ruslan Nikolaev via gcc 2018-02-26 18:36 ` Janne Blomqvist 2 siblings, 2 replies; 38+ messages in thread From: Florian Weimer @ 2018-02-26 18:16 UTC (permalink / raw) To: nruslan_devel; +Cc: gcc On 02/26/2018 05:00 AM, Ruslan Nikolaev via gcc wrote: > If I understand correctly, the redirection to libatomic was made for 2 reasons: > 1. cmpxchg16b is not available on early amd64 processors. (However, mcx16 flag already specifies that you use CPUs that have this instruction, so it should not be a concern when the flag is specified.) > 2. atomic_load on read-only memory. I think x86-64 should be able to do atomic load and store via SSE2 registers, but perhaps if the memory is suitably aligned (which is the other problem—the libatomic code will work irrespective of alignment, as far as I understand it). Thanks, Florian ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 18:16 ` Florian Weimer @ 2018-02-26 18:34 ` Ruslan Nikolaev via gcc 2018-02-26 18:36 ` Janne Blomqvist 1 sibling, 0 replies; 38+ messages in thread From: Ruslan Nikolaev via gcc @ 2018-02-26 18:34 UTC (permalink / raw) To: Florian Weimer; +Cc: gcc On Monday, February 26, 2018 1:15 PM, Florian Weimer <fweimer@redhat.com> wrote: > I think x86-64 should be able to do atomic load and store via SSE2 > registers, but perhaps if the memory is suitably aligned (which is the > other problem—the libatomic code will work irrespective of alignment, as > far as I understand it). IIRC, it is not always guaranteed to be atomic, so RMW is probably the only safe option for x86-64. And for ARM64, too, as far as I understand. Just to summarize what can be done if the proposed change is accepted (from the discussion so far): 1. _Atomic on objects larger than 8 bytes should not be placed in .rodata even if declared as const. It can also be specified that atomic_load should not be used on read-only memory with double-width operations. 2. libatomic can be modified to redirect to functions that use cmpxchg16b (whenever available on target CPU) through regular function pointers even if IFUNC is not available. This will provide consistent behavior everywhere, and binary compatibility for mcx16 and mno-cx16. 3. never redirect to libatomic for arm64 (since ldaxp/staxp are available), redirect for x86-64 only if mcx16 is not specified. For ARM64, there is no mcx16 option at all. -- Ruslan ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 18:16 ` Florian Weimer 2018-02-26 18:34 ` Ruslan Nikolaev via gcc @ 2018-02-26 18:36 ` Janne Blomqvist 2018-02-27 10:22 ` Florian Weimer 1 sibling, 1 reply; 38+ messages in thread From: Janne Blomqvist @ 2018-02-26 18:36 UTC (permalink / raw) To: Florian Weimer; +Cc: nruslan_devel, gcc mailing list On Mon, Feb 26, 2018 at 8:15 PM, Florian Weimer <fweimer@redhat.com> wrote: > On 02/26/2018 05:00 AM, Ruslan Nikolaev via gcc wrote: >> >> If I understand correctly, the redirection to libatomic was made for 2 >> reasons: >> 1. cmpxchg16b is not available on early amd64 processors. (However, mcx16 >> flag already specifies that you use CPUs that have this instruction, so it >> should not be a concern when the flag is specified.) >> 2. atomic_load on read-only memory. > > > I think x86-64 should be able to do atomic load and store via SSE2 > registers, There is no such architectural guarantee. At least on some micro-architecture (AMD Opteron "Istanbul") it's possible to construct a test which fails, proving that at least on that micro-arch SSE2 load/store isn't guaranteed to be atomic. See https://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations/7647825 for more discussion and a testcase. -- Janne Blomqvist ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: GCC interpretation of C11 atomics (DR 459) 2018-02-26 18:36 ` Janne Blomqvist @ 2018-02-27 10:22 ` Florian Weimer 0 siblings, 0 replies; 38+ messages in thread From: Florian Weimer @ 2018-02-27 10:22 UTC (permalink / raw) To: Janne Blomqvist; +Cc: nruslan_devel, gcc mailing list On 02/26/2018 07:36 PM, Janne Blomqvist wrote: > There is no such architectural guarantee. At least on some > micro-architecture (AMD Opteron "Istanbul") it's possible to construct > a test which fails, proving that at least on that micro-arch SSE2 > load/store isn't guaranteed to be atomic. Looks like I was wrong. Ugh. Thanks, Florian ^ permalink raw reply [flat|nested] 38+ messages in thread