* [Bug c/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
@ 2013-12-10 14:19 ` rearnsha at gcc dot gnu.org
2013-12-10 16:43 ` algrant at acm dot org
` (23 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2013-12-10 14:19 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target| |arm
Component|target |c
Summary|ARM code generation doesn't |Code generation doesn't
|respect C11 |respect C11
|address-dependency |address-dependency
--- Comment #1 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
I suspect this will affect any target with relaxed HW ordering of loads.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug c/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
2013-12-10 14:19 ` [Bug c/59448] Code " rearnsha at gcc dot gnu.org
@ 2013-12-10 16:43 ` algrant at acm dot org
2013-12-10 17:47 ` joseph at codesourcery dot com
` (22 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: algrant at acm dot org @ 2013-12-10 16:43 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #2 from algrant at acm dot org ---
Just realised that that last example is bogus, it should read:
inline int const *makedep(int const *a, ...) { return a; } // variadic
int f3(int const *p, int const *q) {
int flag = *p;
return flag ? *makedep(q, flag) : 0;
}
In C++, makedep could be a template - the counterpart of kill_dependency.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug c/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
2013-12-10 14:19 ` [Bug c/59448] Code " rearnsha at gcc dot gnu.org
2013-12-10 16:43 ` algrant at acm dot org
@ 2013-12-10 17:47 ` joseph at codesourcery dot com
2013-12-12 18:07 ` [Bug middle-end/59448] " algrant at acm dot org
` (21 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: joseph at codesourcery dot com @ 2013-12-10 17:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #3 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
The code generated appears fully in accordance with the semantics of C11.
You refer to 5.1.2.4#14, the definition of "carries a dependency". The
term "carries a dependency" is used outside its own definition only in the
definition of "dependency-ordered before". This testcase contains no
instances of "dependency-ordered before", because any such instance must
directly or indirectly involve a case of "A performs a release operation
on an atomic object M, and, in another thread, B performs a consume
operation on M and reads a value written by any side effect in the release
sequence headed by A", and the test involves no atomic objects.
"dependency-ordered before", in turn, is only used in the definition of
"inter-thread happens before", which is only used in the definition of
"happens before".
Please provide a complete testcase, using _Atomic or <stdatomic.h> (and so
using GCC 4.9, of course) that demonstrates any bug: that is, that does
not contain a data race according to the C11 definition, but where GCC has
reordered code in a way that introduces one (and so one can be
demonstrated by enough iterations of the threads in the testcase) - or
where the code generated fails some other semantics associated with
"happens before", such as those for "visible side effect" and "visible
sequence of side effects". I expect any such bug, and fix, would probably
not be a front-end issue but an issue with the atomic built-in functions
failing to ensure appropriate ordering in the presence of dependencies
involving atomic operations (dependencies not involving such operations
having no effects on C11 semantics).
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (2 preceding siblings ...)
2013-12-10 17:47 ` joseph at codesourcery dot com
@ 2013-12-12 18:07 ` algrant at acm dot org
2013-12-16 14:38 ` joseph at codesourcery dot com
` (20 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: algrant at acm dot org @ 2013-12-12 18:07 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #4 from algrant at acm dot org ---
So using g++,
#include <atomic>
int f1(std::atomic<int> const *p, std::atomic<int> const *q)
{
int flag = p->load(std::memory_order_consume);
return flag ? (q + flag - flag)->load(std::memory_order_relaxed) : 0;
}
demonstrates the same lack of ordering. You suggest that this might
be a problem with the atomic built-ins - and yes, if this had been a
load-acquire, it would be a problem with the built-in not introducing a
barrier or using a load-acquire instruction. But for a load-consume on
this architecture, no barrier is necessary to separate the load-consume
from a load that is address-dependent on it. The programmer wrote a
dependency but the compiler lost track of it.
It's not necessary to demonstrate failure - there's an architectural
race condition here. Even if it doesn't fail now there's no guarantee
it will never fail on future more aggressively reordering cores.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (3 preceding siblings ...)
2013-12-12 18:07 ` [Bug middle-end/59448] " algrant at acm dot org
@ 2013-12-16 14:38 ` joseph at codesourcery dot com
2014-01-20 8:52 ` algrant at acm dot org
` (19 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: joseph at codesourcery dot com @ 2013-12-16 14:38 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #5 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Thu, 12 Dec 2013, algrant at acm dot org wrote:
> demonstrates the same lack of ordering. You suggest that this might
> be a problem with the atomic built-ins - and yes, if this had been a
> load-acquire, it would be a problem with the built-in not introducing a
> barrier or using a load-acquire instruction. But for a load-consume on
> this architecture, no barrier is necessary to separate the load-consume
> from a load that is address-dependent on it. The programmer wrote a
> dependency but the compiler lost track of it.
"address-dependent" is not a C standard concept. As far as I can tell, at
least as regards C there are no such ordering constraints between
non-atomic operations, only between operations at least one of which is
atomic - thus, it is the responsibility of the atomic built-in functions
to ensure whatever ordering may be required. (Whereas the parts of the
memory model defining what counts as a "memory location" *do* have
implications in the absence of atomics, restricting the code sequences
that can be used for struct modifications and preventing speculative
stores.)
> It's not necessary to demonstrate failure - there's an architectural
> race condition here. Even if it doesn't fail now there's no guarantee
> it will never fail on future more aggressively reordering cores.
You still need to provide a testcase (a complete program that can be
compiled and linked with current GCC) that (a) does not show undefined
behavior, (b) that, you justify by reference to the standard definitions
and how they apply to source code constructs, must exhibit specific
observable behavior (values printed, assertions passed, etc. - *not* just
ordering of loads at the architectural level), and (c) that, you justify
by reference to the architecture definition if not to actual observed
failure, could fail to meet the requirements for observable behavior,
given the code generated by GCC and the behavior permitted by the
architecture for that code.
I am unable to tell what code you envisage running in another thread or
what observable failure you think could result from "lack of ordering",
because you have not provided a complete testcase. Thus, I am unable to
tell if there is a genuine bug here at all. The standard definitions
associated with atomicity are extremely complicated; you need to be very
careful about identifying exactly how particular definitions apply to
particular source code constructs and so how you deduce the requirements
on behavior of a particular program, for a bug report to be of any use.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (4 preceding siblings ...)
2013-12-16 14:38 ` joseph at codesourcery dot com
@ 2014-01-20 8:52 ` algrant at acm dot org
2014-01-20 9:02 ` pinskia at gcc dot gnu.org
` (18 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: algrant at acm dot org @ 2014-01-20 8:52 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #6 from algrant at acm dot org ---
Here is a complete C++11 test case. The code generated for the two loads
in thread B doesn't maintain the required ordering:
...
ldrb w1, [x0]
uxtb w1, w1
adrp x0, .LANCHOR0
ldr w2, [x3]
...
According to the architecture specification, to achieve the ordering it's
sufficient to use the result of the first load in the calculation of the
address of the second, even if it's done in such a way as to have no
dependence on the value.
/* Reproducer for 59448: compile with -std=c++11 -lpthread */
#include <atomic>
#include <thread>
#include <stdio.h>
#include <assert.h>
static std::atomic<bool> flag(0);
static int data;
/*
Thread A tries to release a data value 1 to thread B via a
release/consume sequence.
In the demonstration, this is tried repeatedly. The iterations of
the two threads are synchronized via a release/acquire sequence
from B to A. This is not relevant to the ordering problem.
*/
void threadA(void)
{
for (;;) {
data = 1; // A.A
flag.store(1, std::memory_order_release); // A.B
/* By 1.9#14, A.A is sequenced before A.B */
/* Now wait for thread B to accept the data. */
while (flag.load(std::memory_order_acquire) == 1);
assert(data == 0);
}
}
void threadB(void)
{
for (;;) {
int f, d;
do {
f = flag.load(std::memory_order_consume); // B.A
d = *(&data + f - f); // B.B
/* By 1.10#9, B.A carries a dependency to B.B */
/* By 1.10#10, A.B is dependency-ordered before B.B */
/* By 1.10#11, A.A intra-thread happens before B.B */
/* By 1.10#12, A.A happens before B.B */
/* By 1.10#13, A.A is a visible side-effect with respect to B.B
and the value determined by B.B shall be the
value (1) stored by A.A. */
} while (!f);
assert(d == 1);
/* Release thread A to start another iteration. */
data = 0;
flag.store(0, std::memory_order_release);
}
}
int main(void)
{
std::thread a(&threadA);
std::thread b(&threadB);
a.join();
b.join();
return 0;
}
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (5 preceding siblings ...)
2014-01-20 8:52 ` algrant at acm dot org
@ 2014-01-20 9:02 ` pinskia at gcc dot gnu.org
2014-01-20 10:13 ` algrant at acm dot org
` (17 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-01-20 9:02 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>*(&data + f - f);
This could be undefined behavior if f is not 0 or 1.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (6 preceding siblings ...)
2014-01-20 9:02 ` pinskia at gcc dot gnu.org
@ 2014-01-20 10:13 ` algrant at acm dot org
2014-01-20 14:21 ` joseph at codesourcery dot com
` (16 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: algrant at acm dot org @ 2014-01-20 10:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #8 from algrant at acm dot org ---
I don't see how f can not be 0 or 1 here, but to make this even more clear
that there is no UB, define flag this way:
static std::atomic<unsigned int> flag(0);
and calculate the address this way:
d = *(&data + (f - f)); // B.B
The ordering requirements are the same, and there is no possibility of UB.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (7 preceding siblings ...)
2014-01-20 10:13 ` algrant at acm dot org
@ 2014-01-20 14:21 ` joseph at codesourcery dot com
2014-01-23 22:28 ` torvald at gcc dot gnu.org
` (15 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: joseph at codesourcery dot com @ 2014-01-20 14:21 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #9 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
Thanks for the testcase. At this point I think we need Andrew MacLeod or
Torvald Riegel to review it and assess what should happen where to fix
this (both for direct uses of the atomic built-in functions, and for
_Atomic / std::atomic).
Since the original bug report it has come to my attention that there are
known ambiguities in how the C11/C++11 memory model is mapped to processor
operations on some architectures, which will need to be addressed in
platform ABIs for interoperation between implementations (when one thread
is built with one implementation and the other thread with another
implementation, for example).
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
(It is not clear to me if research in this area is sufficiently advanced
to provide a good way for ABIs to specify this that is unambiguous and
does not inhibit optimizations within a single implementation.)
It will also of course be necessary for ABIs to specify such things as
alignment for C11 _Atomic for that to interoperate between implementations
(for GCC at present, if the size of the type is 8 / 16 / 32 / 64/ 128
bits, the alignment gets increased to match that of the corresponding core
type). But first it may take a while for things to work reliably within
single implementations.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (8 preceding siblings ...)
2014-01-20 14:21 ` joseph at codesourcery dot com
@ 2014-01-23 22:28 ` torvald at gcc dot gnu.org
2014-02-17 10:26 ` algrant at acm dot org
` (14 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: torvald at gcc dot gnu.org @ 2014-01-23 22:28 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
torvald at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rth at gcc dot gnu.org
--- Comment #10 from torvald at gcc dot gnu.org ---
(In reply to algrant from comment #6)
> Here is a complete C++11 test case. The code generated for the two loads
> in thread B doesn't maintain the required ordering:
The C++ test has a race condition because the nonatomic load from data is not
ordered wrt. the nonatomic store to data in the other thread. Please fix that
and report back whether this changes anything. (I believe the use of
relaxed-MO atomic accesses should be sufficient.)
Nonetheless, if there is a fixed (data-race-free) similar test, then I believe
this fixed test case to be valid in the sense that AFAIU the standards'
requirements (both C++11's and C11's; they are equal in this aspect), code as
in this test case is supposed to be carrying the dependencies forward.
However, I cannot see a *practical* implementation of this requirement. AFAIU,
dependencies are carried forward through loads/stores too (5.1.2.4#14):
"An evaluation A carries a dependency to an evaluation B if: [...]A writes a
scalar object or bit-field M, B reads from M the value written by A, and A is
sequenced before B".
Because a compiler cannot, in general, analyze which piece of code stored last
to M, it would have to conservatively assume that a dependency is carried
through M. In turn, it could never optimize expressions such as "(x-x)", at
least if those affect atomic operations on M in some way, irrespective of how
remotely that might have happened. (I'm restricting my statement to atomic
accesses because I'm not quite sure whether there's some other reason why
nonatomic accesses couldn't be affected by this.) OTOH, maybe there are
factors that I'm not aware of and that effectively constrain
carries-a-dependency to something that doesn't widely prohibit optimizations.
I don't have significant background information on the design of this part of
the memory model (nor the implementability of this part), but I'll ask around
at the C++ meeting in two weeks.
> According to the architecture specification, to achieve the ordering it's
> sufficient to use the result of the first load in the calculation of the
> address of the second, even if it's done in such a way as to have no
> dependence on the value.
And that's certainly a feasible *architecture* specification. ISTM that the
language standard tried to model this, but then didn't give the compiler enough
leeway to still be able to optimize code. The carries-a-depency rules seem
like an unusual shortcut from the language level to requirements on the
generated code, without the usual indirection using the observable behavior of
the abstract machine and as-if. For example -- and this isn't a precise
definition of course -- if the standard would require that "f" (in your test
case) is needed to process the evaluation, then things like "f-f" could be
optimized out, and the compiler might just do the right thing with standard
optimizations.
As a short-term fix, I believe we can promote all memory_order_consume atomics
to memory_order_acquire, when generating code with consume-MO atomics. GCC is
at fault here, not the code in the builtins or in libatomic. Thus, this would
prevent errors when GCC compiled the code and a correct atomics implementation
is used (e.g., third-party libatomic).
This won't help when another (potentially correct) compiler generated the code
with the consume-MO atomics, and GCC-compiled code fails to carry the
dependencies forward. We could try to change libatomic's consume-MO
implementation to prevent that, but this would any catch cases in which
libatomic is actually used and not the other compiler's builtins or similar.
Thus, changing libatomic does not seem to be necessary.
Richard, Andrew: Thoughts?
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (9 preceding siblings ...)
2014-01-23 22:28 ` torvald at gcc dot gnu.org
@ 2014-02-17 10:26 ` algrant at acm dot org
2014-02-17 21:03 ` torvald at gcc dot gnu.org
` (13 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: algrant at acm dot org @ 2014-02-17 10:26 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #11 from algrant at acm dot org ---
Where do you get that this is racy if the access to data is not atomic? By
design, release/acquire and release/consume sequences don't require wholesale
changes to the way the data payload (in the general case, multiple fields
within a structure) is first constructed and then used. 1.10#13 makes clear
that as a result of the intra-thread sequencing between atomic and non-atomic
operations (1.9#14), and the inter-thread ordering between atomic operations
(1.10 various), there is a resulting ordering on operations to "ordinary" (sic)
objects. Please see the references to the C++ standard in the source example,
for the chain of reasoning here.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (10 preceding siblings ...)
2014-02-17 10:26 ` algrant at acm dot org
@ 2014-02-17 21:03 ` torvald at gcc dot gnu.org
2014-02-17 22:22 ` algrant at acm dot org
` (12 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: torvald at gcc dot gnu.org @ 2014-02-17 21:03 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #12 from torvald at gcc dot gnu.org ---
(In reply to algrant from comment #11)
> Where do you get that this is racy if the access to data is not atomic?
In threadB(), you do:
f = flag.load(std::memory_order_consume); // B.A
d = *(&data + f - f); // B.B
That reads from data irrespective of the value of f, so will read when data is
actually not "owned" by threadB. You can either make the accesses to data
atomic, or move the load from data to after checking that flag equals 1.
> By
> design, release/acquire and release/consume sequences don't require
> wholesale changes to the way the data payload (in the general case, multiple
> fields within a structure) is first constructed and then used. 1.10#13
> makes clear that as a result of the intra-thread sequencing between atomic
> and non-atomic operations (1.9#14), and the inter-thread ordering between
> atomic operations (1.10 various), there is a resulting ordering on
> operations to "ordinary" (sic) objects. Please see the references to the
> C++ standard in the source example, for the chain of reasoning here.
I'm very much aware of the rules, I believe. But as far as you can see,
there's a data race in your test program, which results in undefined behavior.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (11 preceding siblings ...)
2014-02-17 21:03 ` torvald at gcc dot gnu.org
@ 2014-02-17 22:22 ` algrant at acm dot org
2014-10-28 10:56 ` ramana at gcc dot gnu.org
` (11 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: algrant at acm dot org @ 2014-02-17 22:22 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #13 from algrant at acm dot org ---
I see what you mean - there is a race if threadB reads the data when the
flag is not set, i.e. in the case when the read value is never used.
Moving the read of data after the test (as in the fragment in comment 4)
would fix the race.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (12 preceding siblings ...)
2014-02-17 22:22 ` algrant at acm dot org
@ 2014-10-28 10:56 ` ramana at gcc dot gnu.org
2014-10-28 12:48 ` amacleod at redhat dot com
` (10 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: ramana at gcc dot gnu.org @ 2014-10-28 10:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2014-10-28
CC| |ramana at gcc dot gnu.org
Ever confirmed|0 |1
--- Comment #14 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
This appears to have now morphed into a standards issue and possibly affects
all targets that have a weak memory model. Anyway confirming this.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (13 preceding siblings ...)
2014-10-28 10:56 ` ramana at gcc dot gnu.org
@ 2014-10-28 12:48 ` amacleod at redhat dot com
2014-10-28 13:43 ` joseph at codesourcery dot com
` (9 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: amacleod at redhat dot com @ 2014-10-28 12:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #15 from Andrew Macleod <amacleod at redhat dot com> ---
Created attachment 33831
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33831&action=edit
promote memory order consume to acquire
So have we concluded that we should promote memory_order_consume to
memory_order_acquire for now?
That change should be fairly trivial. builtins.c::get_memmodel would simply
change any consumes to acquire. All consumes would then be promoted to
acquire.
I don't know if libatomic does much with consume, but once compiled with this
patch in the compiler, it will also generate acquire code for consumes.
Here's an untested patch provided if someone wants to experiment.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (14 preceding siblings ...)
2014-10-28 12:48 ` amacleod at redhat dot com
@ 2014-10-28 13:43 ` joseph at codesourcery dot com
2014-10-28 17:37 ` t.p.northover at gmail dot com
` (8 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: joseph at codesourcery dot com @ 2014-10-28 13:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #16 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
My view is promote consume to acquire until the standards committees and
formal model people have sorted out what to do with it.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (15 preceding siblings ...)
2014-10-28 13:43 ` joseph at codesourcery dot com
@ 2014-10-28 17:37 ` t.p.northover at gmail dot com
2014-10-29 1:48 ` joseph at codesourcery dot com
` (7 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: t.p.northover at gmail dot com @ 2014-10-28 17:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #18 from Tim Northover <t.p.northover at gmail dot com> ---
> I am not aware of any shipping compiler that would actually try to preserve dependencies, and nobody else mentioned any during the discussion of this topic in ISO C++ SG1.
In case the data point is useful, Clang promotes consume to acquire at the
moment (though I wouldn't rule out bugs even in that choice).
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (16 preceding siblings ...)
2014-10-28 17:37 ` t.p.northover at gmail dot com
@ 2014-10-29 1:48 ` joseph at codesourcery dot com
2014-10-29 9:23 ` torvald at gcc dot gnu.org
` (6 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: joseph at codesourcery dot com @ 2014-10-29 1:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #19 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Tue, 28 Oct 2014, torvald at gcc dot gnu.org wrote:
> Alternatively, we could try to be conservative and add an acquire barrier
> before the function body if any parameter of the function has
> carries_dependency; and, likewise, add an acquire barrier after every call to a
> function which has carries_dependency.
>
> I don't have more input from the ISO C side, but I would guess that the
> situation there is similar.
As previously noted:
* ISO C does not have carries_dependency (or attributes at all).
* carries_dependency is about increasing optimization, not decreasing it.
If it affects when barriers are added, it does so by allowing some
barriers to be omitted that would otherwise be required.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (17 preceding siblings ...)
2014-10-29 1:48 ` joseph at codesourcery dot com
@ 2014-10-29 9:23 ` torvald at gcc dot gnu.org
2014-10-30 21:08 ` torvald at gcc dot gnu.org
` (5 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: torvald at gcc dot gnu.org @ 2014-10-29 9:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #20 from torvald at gcc dot gnu.org ---
(In reply to joseph@codesourcery.com from comment #19)
> * carries_dependency is about increasing optimization, not decreasing it.
> If it affects when barriers are added, it does so by allowing some
> barriers to be omitted that would otherwise be required.
That's not quite true, unfortunately, AFAIU. I agree that generally,
attributes are supposed to be "ignorable" -- but in the case of
carries_dependency, C++11 explicitly requires that all declarations of a
function either have the attribute or none has (see 7.6.4p2). This is because
you need that to actually exploit the attribute; it's a contract between
callers and callees. If a compiler does try to preserve dependencies (or not
across function boundaries), then ignoring the attribute is fine. But if there
should be a compiler out there that doesn't, and GCC-generated code is supposed
to link to that other compiler's code, then we need to do something to make
this work.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (18 preceding siblings ...)
2014-10-29 9:23 ` torvald at gcc dot gnu.org
@ 2014-10-30 21:08 ` torvald at gcc dot gnu.org
2014-10-30 22:16 ` filter-gcc at preshing dot com
` (4 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: torvald at gcc dot gnu.org @ 2014-10-30 21:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #21 from torvald at gcc dot gnu.org ---
(In reply to torvald from comment #17)
> (In reply to Andrew Macleod from comment #15)
> > So have we concluded that we should promote memory_order_consume to
> > memory_order_acquire for now?
>
> I also think that this is the best way forward. I believe everyone in ISO
> C++ SG1 agreed that this is basically a defect in the standard.
To clarify, my impression from the discussion was that there was general
consensus that memory_order_consume as specified now wouldn't achieve what it
was intended to in practice, due to the implementation issues. IOW, it's not
useful even though it's not a defect in the sense of being inconsistent or
plain wrong.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (19 preceding siblings ...)
2014-10-30 21:08 ` torvald at gcc dot gnu.org
@ 2014-10-30 22:16 ` filter-gcc at preshing dot com
2014-11-24 12:13 ` filter-gcc at preshing dot com
` (3 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: filter-gcc at preshing dot com @ 2014-10-30 22:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
preshing <filter-gcc at preshing dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |filter-gcc at preshing dot com
--- Comment #22 from preshing <filter-gcc at preshing dot com> ---
(In reply to algrant from comment #13)
> I see what you mean - there is a race if threadB reads the data when the
> flag is not set, i.e. in the case when the read value is never used.
> Moving the read of data after the test (as in the fragment in comment 4)
> would fix the race.
algrant, could you post the complete assembly listing for threadB(), using your
version of AArch64 g++, after all the corrections to the example have been
taken into account, with -O2? For convenience, I've posted the corrected
example below.
When I compile the example using PowerPC g++ 4.9.1, it seems that
memory_order_consume is already promoted to memory_order_acquire, as evidenced
by the cmpw;bne-;isync sequence following the load from flag, which acts as an
acquire barrier. I've posted the PowerPC assembly listing below as well.
Is the bug specific to AArch64?
-------------
/* Reproducer for 59448: compile with -std=c++11 -lpthread -O2 */
#include <atomic>
#include <thread>
#include <stdio.h>
#include <assert.h>
static std::atomic<unsigned int> flag(0);
static int data;
/*
Thread A tries to release a data value 1 to thread B via a
release/consume sequence.
In the demonstration, this is tried repeatedly. The iterations of
the two threads are synchronized via a release/acquire sequence
from B to A. This is not relevant to the ordering problem.
*/
void threadA(void)
{
for (;;) {
data = 1; // A.A
flag.store(1, std::memory_order_release); // A.B
/* By 1.9#14, A.A is sequenced before A.B */
/* Now wait for thread B to accept the data. */
while (flag.load(std::memory_order_acquire) == 1);
assert(data == 0);
}
}
void threadB(void)
{
for (;;) {
int f, d;
do {
f = flag.load(std::memory_order_consume); // B.A
} while (!f);
/* By 1.10#9, B.A carries a dependency to B.B */
/* By 1.10#10, A.B is dependency-ordered before B.B */
/* By 1.10#11, A.A intra-thread happens before B.B */
/* By 1.10#12, A.A happens before B.B */
/* By 1.10#13, A.A is a visible side-effect with respect to B.B
and the value determined by B.B shall be the
value (1) stored by A.A. */
d = *(&data + (f - f)); // B.B
assert(d == 1);
/* Release thread A to start another iteration. */
data = 0;
flag.store(0, std::memory_order_release);
}
}
int main(void)
{
std::thread a(&threadA);
std::thread b(&threadB);
a.join();
b.join();
return 0;
}
------------- PowerPC listing of threadB():
_Z7threadBv:
.LFB2302:
.cfi_startproc
lis 10,.LANCHOR0@ha
li 8,0
la 10,.LANCHOR0@l(10)
.L16:
lwz 9,4(10)
cmpw 7,9,9 <------- acquire barrier
bne- 7,$+4 <-------
isync <-------
cmpwi 7,9,0
beq+ 7,.L16
lwz 9,0(10)
cmpwi 7,9,1
bne- 7,.L22
stw 8,0(10)
lwsync
stw 8,4(10)
b .L16
.L22:
mflr 0
lis 6,.LANCHOR1@ha
stwu 1,-16(1)
.cfi_def_cfa_offset 16
.cfi_register 65, 0
lis 3,.LC2@ha
lis 4,.LC1@ha
la 6,.LANCHOR1@l(6)
la 3,.LC2@l(3)
la 4,.LC1@l(4)
li 5,47
addi 6,6,16
stw 0,20(1)
.cfi_offset 65, 4
bl __assert_fail
.cfi_endproc
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (20 preceding siblings ...)
2014-10-30 22:16 ` filter-gcc at preshing dot com
@ 2014-11-24 12:13 ` filter-gcc at preshing dot com
2015-01-14 13:59 ` amacleod at redhat dot com
` (2 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: filter-gcc at preshing dot com @ 2014-11-24 12:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #23 from preshing <filter-gcc at preshing dot com> ---
Hi,
I went ahead and verified this bug using a cross-compiler built from GCC 4.9.2
sources. The bug indeed exists and happens when compiling for AArch64, but not
PowerPC. Andrew's patch fixes it (changing the first ldr instruction to an ldar
in this case). Full AArch64 assembly listings below.
I've also written a blog post on this subject in the hope of clarifying the
issue for anyone determined enough to make sense of it:
http://preshing.com/20141124/fixing-gccs-implementation-of-memory_order_consume
Andrew's patch, if it works the way I understand it, seems like the correct
thing for GCC to do until somebody figures out how to safely implement the
"efficient" compiler strategy for consume semantics.
I guess the next step is to run the test suite on a few platforms to make sure
there are no regressions, then submit?
Cheers,
Jeff
------------------ AArch64 listing of threadB() without Andrew's patch:
_Z7threadBv:
.LFB2304:
.cfi_startproc
adrp x1, .LANCHOR0
stp x29, x30, [sp, -16]!
.cfi_def_cfa_offset 16
.cfi_offset 29, -16
.cfi_offset 30, -8
add x1, x1, :lo12:.LANCHOR0
add x29, sp, 0
.cfi_def_cfa_register 29
.L10:
add x0, x1, 8
ldr w0, [x0]
cbz w0, .L10
ldr w0, [x1]
cmp w0, 1
bne .L15
str wzr, [x1]
add x0, x1, 8
stlr wzr, [x0]
b .L10
.L15:
adrp x3, .LANCHOR1
adrp x0, .LC2
adrp x1, .LC1
add x3, x3, :lo12:.LANCHOR1
add x0, x0, :lo12:.LC2
add x1, x1, :lo12:.LC1
mov w2, 47
add x3, x3, 16
bl __assert_fail
.cfi_endproc
------------------ AArch64 listing of threadB() with Andrew's patch:
_Z7threadBv:
.LFB2304:
.cfi_startproc
adrp x1, .LANCHOR0
stp x29, x30, [sp, -16]!
.cfi_def_cfa_offset 16
.cfi_offset 29, -16
.cfi_offset 30, -8
add x1, x1, :lo12:.LANCHOR0
add x29, sp, 0
.cfi_def_cfa_register 29
.L10:
add x0, x1, 8
ldar w0, [x0]
cbz w0, .L10
ldr w0, [x1]
cmp w0, 1
bne .L15
str wzr, [x1]
add x0, x1, 8
stlr wzr, [x0]
b .L10
.L15:
adrp x3, .LANCHOR1
adrp x0, .LC2
adrp x1, .LC1
add x3, x3, :lo12:.LANCHOR1
add x0, x0, :lo12:.LC2
add x1, x1, :lo12:.LC1
mov w2, 47
add x3, x3, 16
bl __assert_fail
.cfi_endproc
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (21 preceding siblings ...)
2014-11-24 12:13 ` filter-gcc at preshing dot com
@ 2015-01-14 13:59 ` amacleod at redhat dot com
2022-01-15 1:47 ` pinskia at gcc dot gnu.org
2022-01-15 1:53 ` pinskia at gcc dot gnu.org
24 siblings, 0 replies; 26+ messages in thread
From: amacleod at redhat dot com @ 2015-01-14 13:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #24 from Andrew Macleod <amacleod at redhat dot com> ---
Author: amacleod
Date: Wed Jan 14 13:58:35 2015
New Revision: 219601
URL: https://gcc.gnu.org/viewcvs?rev=219601&root=gcc&view=rev
Log:
2015-01-14 Andrew MacLeod <amacleod@redhat.com>
PR middle-end/59448
* builtins.c (get_memmodel): Promote consume to acquire always.
* testsuite/gcc.dg/atomic-invalid.c: Remove obselete test for illegal
consume in an atomic_exchange.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/builtins.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/atomic-invalid.c
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (22 preceding siblings ...)
2015-01-14 13:59 ` amacleod at redhat dot com
@ 2022-01-15 1:47 ` pinskia at gcc dot gnu.org
2022-01-15 1:53 ` pinskia at gcc dot gnu.org
24 siblings, 0 replies; 26+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-15 1:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |wrong-code
Target Milestone|--- |5.0
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
2013-12-10 12:07 [Bug target/59448] New: ARM code generation doesn't respect C11 address-dependency algrant at acm dot org
` (23 preceding siblings ...)
2022-01-15 1:47 ` pinskia at gcc dot gnu.org
@ 2022-01-15 1:53 ` pinskia at gcc dot gnu.org
24 siblings, 0 replies; 26+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-15 1:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448
--- Comment #26 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html
^ permalink raw reply [flat|nested] 26+ messages in thread