public inbox for gcc@gcc.gnu.org
* Re: should sync builtins be full optimization barriers?
@ 2011-09-15 16:20 Richard Henderson
  2011-09-15 16:26 ` Paolo Bonzini
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Henderson @ 2011-09-15 16:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gcc

> > I'd say they should be optimization barriers too (and at the tree level
> > I think they work that way, being represented as function calls), so if
> > they don't act as memory barriers in RTL, the *.md patterns should be
> > fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> > variants - if the CPU can reorder memory accesses across them at will,
> > why shouldn't the compiler be able to do the same as well?
> 
> Agreed, so we have a bug in all released versions of GCC. :(

I wouldn't go that far.  They *used* to be compiler barriers,
but clearly something broke at some point without anyone noticing.
We don't know how many versions are affected until we debug it.
For all we know it broke in 4.5 and 4.4 is fine.

There's no reference to a GCC bug report about this in the thread.
Did the folks over at the libdispatch project never think to file one?
Or does a bug report exist and my search skills are weak?


r~

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-15 16:20 should sync builtins be full optimization barriers? Richard Henderson
@ 2011-09-15 16:26 ` Paolo Bonzini
  2011-09-20  7:56   ` Paolo Bonzini
  2011-09-24  9:24   ` Richard Guenther
  0 siblings, 2 replies; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-15 16:26 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc

On 09/15/2011 06:19 PM, Richard Henderson wrote:
> I wouldn't go that far.  They *used* to be compiler barriers,
> but clearly something broke at some point without anyone noticing.
> We don't know how many versions are affected until we debug it.
> For all we know it broke in 4.5 and 4.4 is fine.

4.4 is not necessarily fine, it may also be that an unrelated 4.5 change 
exposed a latent bug.

But indeed Richard Sandiford mentioned offlist that perhaps 
ALIAS_SET_MEMORY_BARRIER machinery broke.  Fixing the bug in 4.5/4.6/4.7 
will definitely shed more light.

> There's no reference to a GCC bug report about this in the thread.
> Did the folks over at the libdispatch project never think to file one?

I asked them to attach a preprocessed testcase somewhere, but they 
haven't done so yet. :(

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-15 16:26 ` Paolo Bonzini
@ 2011-09-20  7:56   ` Paolo Bonzini
  2011-09-24  9:24   ` Richard Guenther
  1 sibling, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-20  7:56 UTC (permalink / raw)
  To: gcc; +Cc: Richard Henderson

On 09/15/2011 06:26 PM, Paolo Bonzini wrote:
>
>> There's no reference to a GCC bug report about this in the thread.
>> Did the folks over at the libdispatch project never think to file one?
>
> I asked them to attach a preprocessed testcase somewhere, but they
> haven't done so yet. :(

They have now attached it, and the bug turns out to be a missing parenthesis 
in an #ifdef.  This made libdispatch compile the xchg as an asm rather 
than a sync builtin.  And of course the asm was wrong.
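
A hypothetical reconstruction of that failure mode, for the archives (this 
is not the actual libdispatch source; dispatch_xchg and my_asm_xchg are 
made-up names):

    #if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
    /* intended path: the builtin */
    #  define dispatch_xchg(p, v)  __sync_lock_test_and_set((p), (v))
    #else
    /* a parenthesis slip in the condition above can silently select
       this fallback; one plausible way for the asm to be wrong is a
       missing "memory" clobber, i.e. no compiler barrier */
    #  define dispatch_xchg(p, v)  my_asm_xchg((p), (v))
    #endif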

Apparently, Apple people on the mailing list were looking at the Apple 
trunk, but the reporter was obviously compiling from the public trunk.

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-15 16:26 ` Paolo Bonzini
  2011-09-20  7:56   ` Paolo Bonzini
@ 2011-09-24  9:24   ` Richard Guenther
  2011-09-26 16:18     ` Richard Guenther
  1 sibling, 1 reply; 40+ messages in thread
From: Richard Guenther @ 2011-09-24  9:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Henderson, gcc

On Thu, Sep 15, 2011 at 6:26 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
> On 09/15/2011 06:19 PM, Richard Henderson wrote:
>>
>> I wouldn't go that far.  They *used* to be compiler barriers,
>> but clearly something broke at some point without anyone noticing.
>> We don't know how many versions are affected until we debug it.
>> For all we know it broke in 4.5 and 4.4 is fine.
>
> 4.4 is not necessarily fine, it may also be that an unrelated 4.5 change
> exposed a latent bug.
>
> But indeed Richard Sandiford mentioned offlist that perhaps
> ALIAS_SET_MEMORY_BARRIER machinery broke.  Fixing the bug in 4.5/4.6/4.7
> will definitely shed more light.

ALIAS_SET_MEMORY_BARRIER?!  Eh, well - that will make TBAA consider
something a memory barrier (I suppose the value of the macro is just zero),
but clearly non-TBAA alias analysis will happily disambiguate things.

An alias-set is certainly not the correct way of making a MEM a memory
barrier.

But well, I might have missed some of GCC's history here and broken things.

Please CC me on the bug that was eventually created.

Richard.

>> There's no reference to a GCC bug report about this in the thread.
>> Did the folks over at the libdispatch project never think to file one?
>
> I asked them to attach a preprocessed testcase somewhere, but they haven't
> done so yet. :(
>
> Paolo
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-24  9:24   ` Richard Guenther
@ 2011-09-26 16:18     ` Richard Guenther
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Guenther @ 2011-09-26 16:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Henderson, gcc

On Sat, Sep 24, 2011 at 11:24 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Thu, Sep 15, 2011 at 6:26 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
>> On 09/15/2011 06:19 PM, Richard Henderson wrote:
>>>
>>> I wouldn't go that far.  They *used* to be compiler barriers,
>>> but clearly something broke at some point without anyone noticing.
>>> We don't know how many versions are affected until we debug it.
>>> For all we know it broke in 4.5 and 4.4 is fine.
>>
>> 4.4 is not necessarily fine, it may also be that an unrelated 4.5 change
>> exposed a latent bug.
>>
>> But indeed Richard Sandiford mentioned offlist that perhaps
>> ALIAS_SET_MEMORY_BARRIER machinery broke.  Fixing the bug in 4.5/4.6/4.7
>> will definitely shed more light.
>
> ALIAS_SET_MEMORY_BARRIER?!  Eh, well - that will make TBAA consider
> something a memory barrier (I suppose the value of the macro is just zero),
> but clearly non-TBAA alias analysis will happily disambiguate things.

Nope, it's implemented/tested in a way that should work (on RTL, that is).

Richard.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-27  5:26                           ` James Dennett
@ 2011-09-27  8:19                             ` Andrew MacLeod
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-27  8:19 UTC (permalink / raw)
  To: James Dennett
  Cc: Michael Matz, Geert Bosch, Paolo Bonzini, Jakub Jelinek,
	GCC Mailing List, Aldy Hernandez, Peter Sewell, Jaroslav Sevcik

On 09/26/2011 01:31 PM, James Dennett wrote:
> On Mon, Sep 26, 2011 at 9:57 AM, Andrew MacLeod<amacleod@redhat.com>  wrote:
>>
>> The C++11 memory model asserts that a program containing data races
>> involving *non-atomic* variables has undefined semantics. The compiler is
>> not allowed to introduce any data races into an otherwise correct program.
> C++11 specifies data races in terms of properties of the source code.
>
> A conforming implementation may translate that code into something
> that races on actual hardware if the race is benign _on that
> hardware_.  For example, it might read from a memory address that's
> being written to concurrently if it knows that the write cannot
> materially affect the value read.  A user's C++ code that attempted to
> do so would have undefined behavior, but the C++ compiler is
> generating code for some more concrete platform that will likely have
> a range of possible behaviors for such races.

I'm only talking about detectable data races, although I didn't clarify 
that here.   We're allowing load races for the most part since they are 
pretty much benign everywhere.  We have to avoid introducing new races 
with stores though because they can usually be detected.
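
As a concrete sketch of the kind of store we must not introduce (global is 
just an ordinary shared int, added here for illustration):

    int global;

    void f(int cond) {           /* source: no store when cond is false */
        if (cond)
            global = 42;
    }

    void f_bad(int cond) {       /* invalid transformation: the         */
        int tmp = global;        /* speculative store + restore writes  */
        global = 42;             /* (and reads) global even when cond   */
        if (!cond)               /* is false, racing with threads that  */
            global = tmp;        /* only touch global when it is false  */
    }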

Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-26 18:10                         ` Andrew MacLeod
@ 2011-09-27  5:26                           ` James Dennett
  2011-09-27  8:19                             ` Andrew MacLeod
  0 siblings, 1 reply; 40+ messages in thread
From: James Dennett @ 2011-09-27  5:26 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Michael Matz, Geert Bosch, Paolo Bonzini, Jakub Jelinek,
	GCC Mailing List, Aldy Hernandez, Peter Sewell, Jaroslav Sevcik

On Mon, Sep 26, 2011 at 9:57 AM, Andrew MacLeod <amacleod@redhat.com> wrote:
>> Hi,
>>
>> On Tue, 13 Sep 2011, Andrew MacLeod wrote:
>>
>>> Your example was not about regular stores, it used atomic variables.
>>
>> This reads as if there exist non-atomic variables in the new C++
>> mem-model.  Assuming that this is so, why do those ugly requirements of
>> not introducing new data races also apply to that non-atomic data?
>>
>>
> Why is it ugly to avoid introducing a data race into a race-free program?  I
> would think that is a basic necessity for a multi-threaded program.
>
> There are normal everyday shared variables like we've always had, and there
> are the new atomic variables which have slightly different characteristics.
>
> The C++11 memory model asserts that a program containing data races
> involving *non-atomic* variables has undefined semantics. The compiler is
> not allowed to introduce any data races into an otherwise correct program.

C++11 specifies data races in terms of properties of the source code.

A conforming implementation may translate that code into something
that races on actual hardware if the race is benign _on that
hardware_.  For example, it might read from a memory address that's
being written to concurrently if it knows that the write cannot
materially affect the value read.  A user's C++ code that attempted to
do so would have undefined behavior, but the C++ compiler is
generating code for some more concrete platform that will likely have
a range of possible behaviors for such races.

Of course on a platform that actually diagnosed concurrent accesses
such games would be disallowed, and we'd be back to a land in which
the C++ compiler really could not introduce races unless they already
existed in the user's code.

-- James

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-26 16:17                       ` Michael Matz
  2011-09-26 17:32                         ` Ian Lance Taylor
@ 2011-09-26 18:10                         ` Andrew MacLeod
  2011-09-27  5:26                           ` James Dennett
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-26 18:10 UTC (permalink / raw)
  To: Michael Matz
  Cc: Geert Bosch, Paolo Bonzini, Jakub Jelinek, GCC Mailing List,
	Aldy Hernandez, Peter Sewell, Jaroslav Sevcik

> Hi,
>
> On Tue, 13 Sep 2011, Andrew MacLeod wrote:
>
>> Your example was not about regular stores, it used atomic variables.
> This reads as if there exist non-atomic variables in the new C++
> mem-model.  Assuming that this is so, why do those ugly requirements of
> not introducing new data races also apply to that non-atomic data?
>
>
Why is it ugly to avoid introducing a data race into a race-free 
program?  I would think that is a basic necessity for a multi-threaded 
program.

There are normal everyday shared variables like we've always had, and 
there are the new atomic variables which have slightly different 
characteristics.

The C++11 memory model asserts that a program containing data races 
involving *non-atomic* variables has undefined semantics. The compiler 
is not allowed to introduce any data races into an otherwise correct 
program.

Atomic variables are effectively serialized across the system.  When 2 
threads write to an atomic, one will fully happen before the other and 
*all* threads see it happen in that order. The order may not be 
predictable from one run of the program to another, but all the threads 
in a system will see a consistent view of an atomic.  This may 
make them more expensive to use since writes can't be delayed, cache 
lines may have to be flushed, or other memory subsystems may need to get 
involved to execute the operation properly.

All atomic operations also have a memory model parameter which 
specifies one of 6 synchronization modes. When the atomic value is being 
read or written,  it controls how other outstanding shared memory 
operations may also be flushed into the system at the same time, 
assuring their visibility in other threads.  Since atomic operations may 
have these side effects, there are serious restrictions on how they can 
be moved and modified by the compiler, as well as what optimizations can 
be performed around them.  For now, the optimizers are simply treating 
them as function calls with side effects and doing very little.
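
For illustration, a publish/consume pair in C++11 std::atomic syntax (a 
sketch; the builtins under discussion take an equivalent memory model 
argument):

    #include <atomic>

    std::atomic<int> flag(0);
    int data;                    // ordinary shared variable

    void producer() {
        data = 42;                                  // plain store
        flag.store(1, std::memory_order_release);   // publishes 'data'
    }

    void consumer() {
        while (flag.load(std::memory_order_acquire) != 1)
            ;                                       // spin
        int v = data;    // race-free: the acquire load that returned 1
        (void)v;         // synchronizes with the release store above
    }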

Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-26 16:17                       ` Michael Matz
@ 2011-09-26 17:32                         ` Ian Lance Taylor
  2011-09-26 18:10                         ` Andrew MacLeod
  1 sibling, 0 replies; 40+ messages in thread
From: Ian Lance Taylor @ 2011-09-26 17:32 UTC (permalink / raw)
  To: Michael Matz
  Cc: Andrew MacLeod, Geert Bosch, Paolo Bonzini, Jakub Jelinek,
	GCC Mailing List, Aldy Hernandez, Peter Sewell, Jaroslav Sevcik

Michael Matz <matz@suse.de> writes:

> Hi,
>
> On Tue, 13 Sep 2011, Andrew MacLeod wrote:
>
>> Your example was not about regular stores, it used atomic variables.
>
> This reads as if there exist non-atomic variables in the new C++ 
> mem-model.  Assuming that this is so, why do those ugly requirements of 
> not introducing new data races also apply to that non-atomic data?

My understanding is that the C++ memory model bans speculative stores
even for non-atomic variables.  I believe this is because of the
acquire/release semantics they have adopted--a release of an atomic
variable applies to all writes that occur before the release.

Ian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-13 16:16                     ` Andrew MacLeod
@ 2011-09-26 16:17                       ` Michael Matz
  2011-09-26 17:32                         ` Ian Lance Taylor
  2011-09-26 18:10                         ` Andrew MacLeod
  0 siblings, 2 replies; 40+ messages in thread
From: Michael Matz @ 2011-09-26 16:17 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Geert Bosch, Paolo Bonzini, Jakub Jelinek, GCC Mailing List,
	Aldy Hernandez, Peter Sewell, Jaroslav Sevcik

Hi,

On Tue, 13 Sep 2011, Andrew MacLeod wrote:

> Your example was not about regular stores, it used atomic variables.

This reads as if there exist non-atomic variables in the new C++ 
mem-model.  Assuming that this is so, why do those ugly requirements of 
not introducing new data races also apply to that non-atomic data?


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-13 14:58                   ` Geert Bosch
@ 2011-09-13 16:16                     ` Andrew MacLeod
  2011-09-26 16:17                       ` Michael Matz
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-13 16:16 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Paolo Bonzini, Jakub Jelinek, GCC Mailing List, Aldy Hernandez,
	Peter Sewell, Jaroslav Sevcik

On 09/13/2011 10:58 AM, Geert Bosch wrote:
>
> On Sep 13, 2011, at 08:08, Andrew MacLeod wrote:
>
>> On 09/12/2011 09:52 PM, Geert Bosch wrote:
>>> No that's false. Even on systems with nice memory models, such as x86 and SPARC with a TSO model, you need a fence to ensure that a write-load of the same location is forced to
> Note that here with write-load I meant a write instruction *and* a subsequent load instruction.
>>>   make it all the way to coherent memory and not forwarded directly from the write buffer or L1 cache. The reason that fences are expensive is exactly that they require system-wide agreement.
>>
>> On x86, all the atomic operations are prefixed with LOCK which is supposed to grant them exclusive use of shared memory. Ken's comments would appear to indicate that this imposes a total order across all processors.
> Yes, that's right. All atomic read-modify-write operations have an implicit full barrier on x86 and on SPARC. However, my example was about regular stores and loads from an atomic int using the C++ relaxed memory model. Indeed, just using XCHG (or SWAP on SPARC) instructions for writes and regular loads for reads is sufficient to establish a total order.
>


Your example was not about regular stores, it used atomic variables. 
*ALL* atomic variable writes are prefixed by lock on x86. This is one 
reason we have built-ins for all atomic reads and writes, to let targets 
define the appropriate sequences to ensure atomicity.

Additional costs may come with synchronizing the *other* shared memory 
variables in a thread, which is what the memory models are there for.

For relaxed mode, no other shared memory values have to be 
flushed/sorted out because relaxed doesn't synchronize.

When you switch to release/acquire, then 2 threads have to get 
themselves into a consistent state, so any other pending writes before 
an atomic release operation in one thread must be flushed back to shared 
memory, possibly requiring extra instruction(s) on some architectures. 
The simple use of the lock prefix on x86 writes satisfies this 
constraint as well.

And then seq-cst requires pretty much everyone in the system to get 
straightened away which could be a very expensive operation.  As it 
turns out, x86 is still satisfied by just using a lock on the atomic write.

x86 obviously doesn't benefit much from the more relaxed models since 
it's pretty much seq-cst by default, but some other archs do. There's a 
table being built here:

http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

which shows what the sequences should be for various architectures, and 
that's what I'm planning to use for the atomic sequences on each of 
those targets.  As you can see, some architectures more closely match 
the various memory models than x86.
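
From memory, the x86 column of that table looks roughly like this (the 
page above is the authoritative version; the seq-cst load can be a plain 
mov only in the variant where the seq-cst store carries the fence):

    load  (relaxed/acquire/seq-cst):  mov
    store (relaxed/release):          mov
    store (seq-cst):                  xchg        ; or: mov + mfence
    fetch_add (any model):            lock xadd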

The optimizers are still free to do shared memory optimizations subject 
to the memory model restrictions. (ie, all sorts of code motion can 
happen across relaxed atomics, and none across seq-cst).  This is where 
x86 might benefit in performance.
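
For instance (a sketch, with g an ordinary global):

    #include <atomic>

    std::atomic<int> a;
    int g;

    void f() {
        g = 1;                                    // may sink below the
        a.store(1, std::memory_order_relaxed);    // relaxed store
    }

    void h() {
        g = 1;                                    // must stay above the
        a.store(1, std::memory_order_seq_cst);    // seq-cst store
    }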

Andrew

BTW, if someone cares about the sequences for their favourite 
architecture, and it isn't listed there, I encourage you to contact 
Peter or Jaroslav with the relevant information to get it added to this 
page.  (I CC'd them on this reply.)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-13 12:09                 ` Andrew MacLeod
@ 2011-09-13 14:58                   ` Geert Bosch
  2011-09-13 16:16                     ` Andrew MacLeod
  0 siblings, 1 reply; 40+ messages in thread
From: Geert Bosch @ 2011-09-13 14:58 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Paolo Bonzini, Jakub Jelinek, GCC Mailing List, Aldy Hernandez


On Sep 13, 2011, at 08:08, Andrew MacLeod wrote:

> On 09/12/2011 09:52 PM, Geert Bosch wrote:
>> No that's false. Even on systems with nice memory models, such as x86 and SPARC with a TSO model, you need a fence to ensure that a write-load of the same location is forced to
Note that here with write-load I meant a write instruction *and* a subsequent load instruction.
>>  make it all the way to coherent memory and not forwarded directly from the write buffer or L1 cache. The reason that fences are expensive is exactly that they require system-wide agreement.
> 
> On x86, all the atomic operations are prefixed with LOCK which is supposed to grant them exclusive use of shared memory. Ken's comments would appear to indicate that this imposes a total order across all processors.
Yes, that's right. All atomic read-modify-write operations have an implicit full barrier on x86 and on SPARC. However, my example was about regular stores and loads from an atomic int using the C++ relaxed memory model. Indeed, just using XCHG (or SWAP on SPARC) instructions for writes and regular loads for reads is sufficient to establish a total order.

These are expensive synchronizing instructions though, with full barrier semantics. For the relaxed memory model, the compiler would be able to optimize away redundant loads and stores, as you indicated before.

> I presume other architectures have similar mechanisms if they support atomic operations.  You have to have *some* way of having 2 threads which simultaneously perform read/modify/write atomic instructions work properly...
Yes, read-modify-write instructions also function as full barrier.
> 
> Assume x=0, and 2 threads both execute a single atomic increment operation:
>  { read x, add 1, write result back to x }
> When both threads have finished, the result *has* to be x == 2.  So the 2 threads must be able to see some sort of coherent value for x.
Indeed. The trouble is with regular reads and writes.
> 
> If coherency is provided for read/modify/write, it should also be available for read or write as well...


No, unless you replace writes by read-modify-write instructions, or you insert additional fences. Regular writes are buffered, and initially only visible to the processor itself. The reason regular writes to memory are so fast is that the processor doesn't have to wait for the write to percolate down the memory hierarchy, but can continue processing using *its* last written value.

  -Geert

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-13  6:35                 ` Paolo Bonzini
@ 2011-09-13 14:46                   ` Eric Botcazou
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Botcazou @ 2011-09-13 14:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: gcc, Geert Bosch, Andrew MacLeod, Jakub Jelinek, Aldy Hernandez

> You need fences on x86 to implement Peterson or Dekker spin locks but
> only because they involve write-read ordering to different memory
> locations (I'm mentioning those spin lock algorithms because they do
> not require locked memory accesses).  Write-write, read-read and for
> the same location write-read ordering are guaranteed by the processor.
>  Same for coherency which is a looser property.
>
> However, accesses in those spin lock algorithms are definitely
> _not_ relaxed; not all of them, at least.
>
> > No that's false. Even on systems with nice memory models, such as x86
> > and SPARC with a TSO model, you need a fence to ensure that a write-load
> > of the same location is forced to make it all the way to coherent memory
> > and not forwarded directly from the write buffer or L1 cache.
>
> Not sure about SPARC, but this is definitely false on x86.

My understanding is that SPARC is on par with x86 here.  In particular, I don't 
think that accesses to the same location can be ordered differently depending 
on the processor.  That admittedly isn't very clear in the V8 manual, but much 
more in the V9 manual.  Quoting it:

"The order of memory operations observed at a single location is a total order 
that preserves the partial orderings of each processor’s transactions to this 
address. There may be many legal total orders for a given program’s execution"

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-13  1:52               ` Geert Bosch
  2011-09-13  6:35                 ` Paolo Bonzini
@ 2011-09-13 12:09                 ` Andrew MacLeod
  2011-09-13 14:58                   ` Geert Bosch
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-13 12:09 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Paolo Bonzini, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On 09/12/2011 09:52 PM, Geert Bosch wrote:
>
>
> No that's false. Even on systems with nice memory models, such as x86 and SPARC with a TSO model, you need a fence to ensure that a write-load of the same location is forced to make it all the way to coherent memory and not forwarded directly from the write buffer or L1 cache. The reason that fences are expensive is exactly that they require system-wide agreement.

On x86, all the atomic operations are prefixed with LOCK which is 
supposed to grant them exclusive use of shared memory. Ken's comments 
would appear to indicate that this imposes a total order across all processors.

I presume other architectures have similar mechanisms if they support 
atomic operations.  You have to have *some* way of having 2 threads 
which simultaneously perform read/modify/write atomic instructions work 
properly...

Assume x=0, and 2 threads both execute a single atomic increment operation:
   { read x, add 1, write result back to x }
When both threads have finished, the result *has* to be x == 2.  So the 
2 threads must be able to see some sort of coherent value for x.

If coherency is provided for read/modify/write, it should also be 
available for read or write as well...
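
For concreteness, the increment above in terms of the existing builtins 
(sketch):

    long x = 0;

    void bump(void) {                    /* run once from each thread */
        __sync_fetch_and_add(&x, 1);     /* atomic read-modify-write; */
    }                                    /* lock xadd on x86          */

    /* after both threads run bump(), x == 2 is guaranteed; with a
       plain x = x + 1 in each thread, x == 1 is a possible outcome */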

Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-13  1:52               ` Geert Bosch
@ 2011-09-13  6:35                 ` Paolo Bonzini
  2011-09-13 14:46                   ` Eric Botcazou
  2011-09-13 12:09                 ` Andrew MacLeod
  1 sibling, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-13  6:35 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Andrew MacLeod, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On Tue, Sep 13, 2011 at 03:52, Geert Bosch <bosch@adacore.com> wrote:
> No, it is possible, and actually likely. Basically, the issue is write buffers.
> The coherency mechanisms come into play at a lower level in the
> hierarchy (typically at the last-level cache), which is why we need fences
> to start with to implement things like spin locks.

You need fences on x86 to implement Peterson or Dekker spin locks but
only because they involve write-read ordering to different memory
locations (I'm mentioning those spin lock algorithms because they do
not require locked memory accesses).  Write-write, read-read and for
the same location write-read ordering are guaranteed by the processor.
 Same for coherency which is a looser property.

However, accesses in those spin lock algorithms are definitely
_not_ relaxed; not all of them, at least.
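
For reference, a minimal sketch of Peterson's lock for two threads 
(volatile stands in for proper atomic accesses here, a simplification):

    volatile int interested[2];
    volatile int turn;

    void lock(int self) {
        int other = 1 - self;
        interested[self] = 1;
        turn = other;
        __sync_synchronize();       /* the store-load fence discussed:
                                       the store to interested[self]
                                       must reach memory before the
                                       load of interested[other] */
        while (interested[other] && turn == other)
            ;                       /* spin */
    }

    void unlock(int self) {
        interested[self] = 0;
    }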

> No that's false. Even on systems with nice memory models, such as x86
> and SPARC with a TSO model, you need a fence to ensure that a write-load
> of the same location is forced to make it all the way to coherent memory
> and not forwarded directly from the write buffer or L1 cache.

Not sure about SPARC, but this is definitely false on x86.

Granted, even if you do not have to put fences, those writes are likely
_not_ free.  The processor needs to do more than, say, on PPC, so I
wouldn't be surprised if conflicting memory accesses are quite a bit more
expensive on x86 than on PPC.  Recently, a colleague of mine tried
replacing optimization barriers with full barriers in one of two
threads implementing a ring buffer; that thread became 30% slower,
but the other thread sped up by basically the same amount.

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 14:13     ` Andrew MacLeod
  2011-09-11 18:23       ` Paolo Bonzini
  2011-09-11 19:00       ` Geert Bosch
@ 2011-09-13  6:31       ` Lawrence Crowl
  2 siblings, 0 replies; 40+ messages in thread
From: Lawrence Crowl @ 2011-09-13  6:31 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Geert Bosch, Jakub Jelinek, Paolo Bonzini, GCC Mailing List,
	Aldy Hernandez

On 9/11/11, Andrew MacLeod <amacleod@redhat.com> wrote:
> On 09/09/2011 09:09 PM, Geert Bosch wrote:
>> For the C++0x atomic types there are:
>>
>> void A::store(C desired, memory_order order = memory_order_seq_cst)
>> volatile;
>> void A::store(C desired, memory_order order = memory_order_seq_cst);
>>
>> where the first variant (with order = memory_order_relaxed)
>> would allow fences to be omitted, while still preventing the compiler from
>> reordering memory accesses, IIUC.
>
>> I thought the volatile tags were actually for type correctness so the
>> compiler wouldn't complain when used on volatile objects....  I.e., you
>> can't call a non-volatile method with a volatile object, or something
>> like that.

Volatile means the same thing with atomics that it does without the
atomics: that an external agent may affect the value, or that writes
to the value may affect external agents.

The C++ standards committee has anticipated optimizing atomic
operations, and even eliminating them entirely when it eliminates
entire threads.  However, one cannot do that with volatile because
you might fail to coordinate with the external agent.

I think the most likely use of volatile atomics is in communicating
between two processes sharing memory.  Note, though, that such
atomics may need to be lock-free and/or address-free.
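
A sketch of that use, assuming the atomic object lives in memory mapped 
into both processes:

    #include <atomic>

    // volatile keeps the compiler from eliding or combining operations
    // that the other process must observe; the volatile-qualified
    // std::atomic member functions are selected here
    void publish(volatile std::atomic<unsigned>* seq) {
        seq->fetch_add(1);
    }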

-- 
Lawrence Crowl

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-10  1:09   ` Geert Bosch
                       ` (2 preceding siblings ...)
  2011-09-11 14:13     ` Andrew MacLeod
@ 2011-09-13  6:20     ` Lawrence Crowl
  3 siblings, 0 replies; 40+ messages in thread
From: Lawrence Crowl @ 2011-09-13  6:20 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Jakub Jelinek, Paolo Bonzini, GCC Mailing List, Aldy Hernandez, amacleod

On 9/9/11, Geert Bosch <bosch@adacore.com> wrote:
> To be honest, I can't quite see the use of completely unordered
> atomic operations, where we not even prohibit compiler optimizations.
> It would seem if we guarantee that a variable will not be accessed
> concurrently from any other thread, we wouldn't need the operation
> to be atomic in the first place. That said, it's quite likely I'm
> missing something here.

The memory_order_relaxed is useful in at least two situations.
First, when maintaining an atomic counter of events, you may not need
to synchronize with any other atomic variables.  Second, some
algorithms need more than one atomic operation, but are only
'effective' on one of them, and only that one needs to synchronize
other memory.
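
The first case as a sketch:

    #include <atomic>

    std::atomic<unsigned long> events(0);

    void on_event() {                 // called from many threads
        events.fetch_add(1, std::memory_order_relaxed);
    }                                 // atomic, but synchronizes nothing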

-- 
Lawrence Crowl

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-12 23:19             ` Andrew MacLeod
  2011-09-13  0:31               ` Ken Raeburn
@ 2011-09-13  1:52               ` Geert Bosch
  2011-09-13  6:35                 ` Paolo Bonzini
  2011-09-13 12:09                 ` Andrew MacLeod
  1 sibling, 2 replies; 40+ messages in thread
From: Geert Bosch @ 2011-09-13  1:52 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Paolo Bonzini, Jakub Jelinek, GCC Mailing List, Aldy Hernandez


On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:

> Let's simplify it slightly.  The compiler can optimize away x=1 and x=3 as dead stores (even valid on atomics!), leaving us with 2 modification orders...
>     2,4 or 4,2
> and what you are getting at is you don't think we should ever see
> r1==2, r2==4  and r3==4, r4==2
Right, I agree that the compiler can optimize away both the
double writes and double reads.
> 
> Let's say the order of the writes turns out to be 2,4...  Is it possible for both writes to be travelling around some bus and have thread 4 actually read the second one first, followed by the first one?  It would imply a lack of memory coherency in the system, wouldn't it?  My simple understanding is that the hardware gives us this sort of minimum guarantee on all shared memory, which means we should never see that happen.

No, it is possible, and actually likely. Basically, the issue is write buffers. The coherency mechanisms come into play at a lower level in the hierarchy (typically at the last-level cache), which is why we need fences to start with to implement things like spin locks.

Threads running on the same CPU may share the same caches (think about a thread switch or hyper-threading). Now both processors may have a copy of the same cache line and both try to do a write to some location in that line. Then they'll both try to get exclusive access to the cache line. One CPU will succeed, the other will have a cache miss.

However, while all this is going on, the write is just sitting in a write buffer, and any references from the same processor will just get forwarded the value of the outstanding write. 

> And if we can't see that, then I don't see how we can see your example...  *One* of those modification orders has to be what is actually written to x, and reads from that memory location will not be able to see something else.  (I.e., if it was 1,2,3,4 then thread 4 would not be able to see r3==4,r4==1, thanks to memory coherency.)

No that's false. Even on systems with nice memory models, such as x86 and SPARC with a TSO model, you need a fence to ensure that a write-load of the same location is forced to make it all the way to coherent memory and not forwarded directly from the write buffer or L1 cache. The reason that fences are expensive is exactly that they require system-wide agreement.

 -Geert

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-13  0:31               ` Ken Raeburn
@ 2011-09-13  0:39                 ` Andy Lutomirski
  0 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2011-09-13  0:39 UTC (permalink / raw)
  To: Ken Raeburn; +Cc: Andrew MacLeod, GCC Mailing List

On 09/12/2011 05:30 PM, Ken Raeburn wrote:
> On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:
>> Let's say the order of the writes turns out to be 2,4...  Is it possible for both writes to be travelling around some bus and have thread 4 actually read the second one first, followed by the first one?  It would imply a lack of memory coherency in the system, wouldn't it?  My simple understanding is that the hardware gives us this sort of minimum guarantee on all shared memory, which means we should never see that happen.
> 
> According to section 8.2.3.5 "Intra-Processor Forwarding Is Allowed" of "Intel 64 and IA-32 Architectures Software Developer's Manual" volume 3A, December 2009, a processor can see its own store happening before another's, though the example works on two different memory locations.  If at least one of the threads reading the values was on the same processor as one of the writing threads, perhaps it could see the locally-issued store first, unless thread-switching is presumed to include a memory fence.  Consistency of order is guaranteed *from the point of view of other processors* (8.2.3.7), which is not necessarily the case here.  A total order across all processors is imposed for locked instructions (8.2.3.8), but I'm not sure whether their use is assumed here.  I'm still reading up on caching protocols, write-back memory, etc.  Still not sure either way whether the original example can work...

Presumably any sensible operating system inserts a fence whenever it
switches between threads to prevent exactly this issue.  Otherwise it
could be nearly impossible to write correct code.

(TBH, it was never entirely clear to me that mfence is guaranteed to
flush the store buffer and force everything to be re-read from the
coherency domain, but if that's not true then it's pretty much
impossible to get this right.)

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-12 23:19             ` Andrew MacLeod
@ 2011-09-13  0:31               ` Ken Raeburn
  2011-09-13  0:39                 ` Andy Lutomirski
  2011-09-13  1:52               ` Geert Bosch
  1 sibling, 1 reply; 40+ messages in thread
From: Ken Raeburn @ 2011-09-13  0:31 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC Mailing List

On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:
> Let's say the order of the writes turns out to be 2,4...  Is it possible for both writes to be travelling around some bus and have thread 4 actually read the second one first, followed by the first one?  It would imply a lack of memory coherency in the system, wouldn't it?  My simple understanding is that the hardware gives us this sort of minimum guarantee on all shared memory, which means we should never see that happen.

According to section 8.2.3.5 "Intra-Processor Forwarding Is Allowed" of "Intel 64 and IA-32 Architectures Software Developer's Manual" volume 3A, December 2009, a processor can see its own store happening before another's, though the example works on two different memory locations.  If at least one of the threads reading the values was on the same processor as one of the writing threads, perhaps it could see the locally-issued store first, unless thread-switching is presumed to include a memory fence.  Consistency of order is guaranteed *from the point of view of other processors* (8.2.3.7), which is not necessarily the case here.  A total order across all processors is imposed for locked instructions (8.2.3.8), but I'm not sure whether their use is assumed here.  I'm still reading up on caching protocols, write-back memory, etc.  Still not sure either way whether the original example can work...

Ken

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-12 18:40           ` Geert Bosch
  2011-09-12 20:54             ` Paolo Bonzini
@ 2011-09-12 23:19             ` Andrew MacLeod
  2011-09-13  0:31               ` Ken Raeburn
  2011-09-13  1:52               ` Geert Bosch
  1 sibling, 2 replies; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-12 23:19 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Paolo Bonzini, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On 09/12/2011 02:40 PM, Geert Bosch wrote:
>
>
> thread 1  thread 2  thread 3  thread 4
> --------  --------  --------  --------
>    x=1;      r1=x      x=3;      r3=x;
>    x=2;      r2=x      x=4;      r4=x;
>
> Even with relaxed memory ordering, all modifications to x have to occur in some particular total order, called  the modification order of x.
>
> So, even if each thread preserves its store order, the modification order of x can be any of:
>    1,2,3,4
>    1,3,2,4
>    1,3,4,2
>    3,1,2,4
>    3,1,4,2
>    3,4,1,2
>
> Because there is a single modification order for x, it would be an error for thread 2 and thread 4 to see a different update order.

Let's simplify it slightly.  The compiler can optimize away x=1 and x=3 
as dead stores (even valid on atomics!), leaving us with 2 modification 
orders...
      2,4 or 4,2
and what you are getting at is you don't think we should ever see
r1==2, r2==4  and r3==4, r4==2

Let's say the order of the writes turns out to be 2,4...  Is it possible 
for both writes to be travelling around some bus and have thread 4 
actually read the second one first, followed by the first one?  It 
would imply a lack of memory coherency in the system, wouldn't it?  My 
simple understanding is that the hardware gives us this sort of minimum 
guarantee on all shared memory, which means we should never see that happen.

And if we can't see that, then I don't see how we can see your example... 
*One* of those modification orders has to be what is actually written 
to x, and reads from that memory location will not be able to see 
something else.  (I.e., if it was 1,2,3,4 then thread 4 would not be able 
to see r3==4,r4==1, thanks to memory coherency.)
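
The surviving example in C++11 syntax, everything relaxed (sketch):

    #include <atomic>

    std::atomic<int> x(0);

    void t1() { x.store(2, std::memory_order_relaxed); }
    void t3() { x.store(4, std::memory_order_relaxed); }
    void t2(int& r1, int& r2) { r1 = x.load(std::memory_order_relaxed);
                                r2 = x.load(std::memory_order_relaxed); }
    void t4(int& r3, int& r4) { r3 = x.load(std::memory_order_relaxed);
                                r4 = x.load(std::memory_order_relaxed); }

    // x has a single modification order, so r1==2 && r2==4 (thread 2
    // sees 2 before 4) together with r3==4 && r4==2 is forbidden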

Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-12 18:40           ` Geert Bosch
@ 2011-09-12 20:54             ` Paolo Bonzini
  2011-09-12 23:19             ` Andrew MacLeod
  1 sibling, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-12 20:54 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Andrew MacLeod, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On Mon, Sep 12, 2011 at 20:40, Geert Bosch <bosch@adacore.com> wrote:

> Assuming that statement is true, that would imply that even for relaxed
> ordering there has to be an optimization barrier. Clearly fences need to be
> used for any atomic accesses, including those with relaxed memory order.
>
> Consider 4 threads and an atomic int x:
>
> thread 1  thread 2  thread 3  thread 4
> --------  --------  --------  --------
>  x=1;      r1=x      x=3;      r3=x;
>  x=2;      r2=x      x=4;      r4=x;
>
> Even with relaxed memory ordering, all modifications to x have to occur in some particular total order, called  the modification order of x.
>
> So, if r1==2,r2==3 and r3==4,r4==1, that would be an error. However,
> without fences, this can easily happen on an SMP machine, even one with
> a nice memory model such as the x86.

How?  (Honest question).  All stores are to the same location.  I
don't see how that can happen without processor fences, much less
without optimization fences.

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-12  7:03         ` Paolo Bonzini
@ 2011-09-12 18:40           ` Geert Bosch
  2011-09-12 20:54             ` Paolo Bonzini
  2011-09-12 23:19             ` Andrew MacLeod
  0 siblings, 2 replies; 40+ messages in thread
From: Geert Bosch @ 2011-09-12 18:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew MacLeod, Jakub Jelinek, GCC Mailing List, Aldy Hernandez


On Sep 12, 2011, at 03:02, Paolo Bonzini wrote:

> On 09/11/2011 09:00 PM, Geert Bosch wrote:
>> So, if I understand correctly, then operations using relaxed memory
>> order will still need fences, but indeed do not require any
>> optimization barrier. For memory_order_seq_cst we'll need a full
>> barrier, and for the others there is a partial barrier.
> 
> If you do not need an optimization barrier, you do not need a processor barrier either, and vice versa.  Optimizations are just another factor that can lead to reordered loads and stores.

Assuming that statement is true, that would imply that even for relaxed ordering there has to be an optimization barrier. Clearly fences need to be used for any atomic accesses, including those with relaxed memory order.

Consider 4 threads and an atomic int x:

thread 1  thread 2  thread 3  thread 4
--------  --------  --------  --------
  x=1;      r1=x      x=3;      r3=x;
  x=2;      r2=x      x=4;      r4=x;

Even with relaxed memory ordering, all modifications to x have to occur in some particular total order, called  the modification order of x.

So, even if each thread preserves its store order, the modification order of x can be any of:
  1,2,3,4
  1,3,2,4
  1,3,4,2
  3,1,2,4
  3,1,4,2
  3,4,1,2

Because there is a single modification order for x, it would be an error for thread 2 and thread 4 to see a different update order.

So, if r1==2,r2==3 and r3==4,r4==1, that would be an error. However, without fences, this can easily happen on an SMP machine, even one with a nice memory model such as the x86.

IIUC, the relaxed memory model mostly seems to allow movement (by compiler and CPU) of unrelated memory operations, but still requires fences between subsequent atomic operations on the same object. 

In other words, while atomic operations with relaxed memory order on some atomic object X cannot be used to synchronize any operations on objects other than X, they themselves cannot cause data races.

  -Geert

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 23:22         ` Andrew MacLeod
@ 2011-09-12  7:07           ` Paolo Bonzini
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-12  7:07 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Geert Bosch, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On 09/12/2011 01:22 AM, Andrew MacLeod wrote:
>> You're right that using lock_test_and_set as an exchange is very wrong
>> because of the compiler barrier semantics, but I think this is
>> entirely a red herring in this case.  The same problem could happen
>> with a fetch_and_add or even a lock_release operation.
>
> My point is that even once we get the right barriers in place, due to
> its definition as acquire, this testcase could actually still fail, AND
> the optimization is valid...

Ah, sure.

> unless we decide to retroactively make
> all the original sync routines seq_cst.

I've certainly seen code using lock_test_and_set to avoid asm for xchg. 
  That would be very much against the documentation with respect to the 
values of the second parameter, and that's also why clang introduced 
__sync_swap.  However, perhaps it makes sense to make lock_test_and_set 
provide sequential consistency.

Probably not much so for lock_release, which is quite clearly a 
store-release.
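
The documented pairing, as a sketch:

    static int lock_word;

    void spin_lock(void) {
        while (__sync_lock_test_and_set(&lock_word, 1))  /* acquire */
            ;                                            /* spin    */
    }

    void spin_unlock(void) {
        __sync_lock_release(&lock_word);    /* store-release; writes 0 */
    }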

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 19:00       ` Geert Bosch
  2011-09-11 19:12         ` Jakub Jelinek
@ 2011-09-12  7:03         ` Paolo Bonzini
  2011-09-12 18:40           ` Geert Bosch
  1 sibling, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-12  7:03 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Andrew MacLeod, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On 09/11/2011 09:00 PM, Geert Bosch wrote:
> So, if I understand correctly, then operations using relaxed memory
> order will still need fences, but indeed do not require any
> optimization barrier. For memory_order_seq_cst we'll need a full
> barrier, and for the others there is a partial barrier.

If you do not need an optimization barrier, you do not need a processor 
barrier either, and vice versa.  Optimizations are just another factor 
that can lead to reordered loads and stores.

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 18:23       ` Paolo Bonzini
@ 2011-09-11 23:22         ` Andrew MacLeod
  2011-09-12  7:07           ` Paolo Bonzini
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-11 23:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Geert Bosch, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On 09/11/2011 02:22 PM, Paolo Bonzini wrote:
> On 09/11/2011 04:12 PM, Andrew MacLeod wrote:
>> tail->value = othervalue                   // global variable write
>> atomic_exchange (&var, tail)           // acquire operation
>>
>> although the optimizer moving the store of tail->value to AFTER the
>> exchange seems very wrong on the surface, it's really emulating what
>> another thread could possibly see.    When another thread synchronizes
>> and reads 'var', an acquire operation doesn't cause outstanding stores
>> to be fully flushed, so the other process has no guarantee that the
>> store to tail->value has happened yet even though it gets the expected
>> value of 'var'.
>
> You're right that using lock_test_and_set as an exchange is very wrong 
> because of the compiler barrier semantics, but I think this is 
> entirely a red herring in this case.  The same problem could happen 
> with a fetch_and_add or even a lock_release operation.

My point is that even once we get the right barriers in place, due to 
its definition as acquire, this testcase could actually still fail, AND 
the optimization is valid...   unless we decide to retroactively make 
all the original sync routines seq_cst.   I'm not saying we don't have 
other issues with rtl optimizations right now.

Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 19:31           ` Geert Bosch
@ 2011-09-11 19:44             ` Jakub Jelinek
  0 siblings, 0 replies; 40+ messages in thread
From: Jakub Jelinek @ 2011-09-11 19:44 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Andrew MacLeod, Paolo Bonzini, GCC Mailing List, Aldy Hernandez

On Sun, Sep 11, 2011 at 03:31:15PM -0400, Geert Bosch wrote:
> > On Sun, Sep 11, 2011 at 03:00:11PM -0400, Geert Bosch wrote:
> >> Also, for relaxed order atomic operations we would only need a single
> >> fence between two accesses (by a thread) to the same atomic object.
> > 
> > I'm not aware of any CPUs that would need any kind of fences for that.
> Nor should the compiler need any fences for that; MEMs that may be (or even
> are known to be) aliased aren't reordered.
> 
> I guess for CPUs with TSO that might be right wrt. the hardware.
> I wouldn't say it is true in general.

You mean you are aware of CPUs that can reorder accesses to the same memory
location?  That if the assembly has X = 5; Y = X; X = 6; where X is some
memory location, the operations don't happen in program order?

> But all atomic operations on an atomic object M should have
> a total order. That means the compiler must preserve that order.
> 
> So for some atomic int X, with relaxed ordering:
> 
>   if (X == 0) X = 1;
>   else X = 2;
> 
> we can't optimize that to:
> 
>  X = 1;
>  if (X != 0) X = 2;

It depends on what exactly you mean by X here.  If it is some C++0x class
where X = value is __sync_mem_store, then even if it is a relaxed store,
nothing will try to optimize any of the stores or loads from it
(though, of course, such code is quite useless, because X may be changed
in between the test and the store).

	Jakub

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 19:12         ` Jakub Jelinek
@ 2011-09-11 19:31           ` Geert Bosch
  2011-09-11 19:44             ` Jakub Jelinek
  0 siblings, 1 reply; 40+ messages in thread
From: Geert Bosch @ 2011-09-11 19:31 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andrew MacLeod, Paolo Bonzini, GCC Mailing List, Aldy Hernandez


On Sep 11, 2011, at 15:11, Jakub Jelinek wrote:

> On Sun, Sep 11, 2011 at 03:00:11PM -0400, Geert Bosch wrote:
>> Also, for relaxed order atomic operations we would only need a single
>> fence between two accesses (by a thread) to the same atomic object.
> 
> I'm not aware of any CPUs that would need any kind of fences for that.
> Nor should the compiler need any fences for that; MEMs that may be (or even
> are known to be) aliased aren't reordered.

I guess for CPUs with TSO that might be right wrt. the hardware.
I wouldn't say it is true in general.
But all atomic operations on an atomic object M should have 
a total order. That means the compiler must preserve that order.

So for some atomic int X, with relaxed ordering:

  if (X == 0) X = 1;
  else X = 2;

we can't optimize that to:

 X = 1;
 if (X != 0) X = 2;

Do you agree?

-Geert

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 19:00       ` Geert Bosch
@ 2011-09-11 19:12         ` Jakub Jelinek
  2011-09-11 19:31           ` Geert Bosch
  2011-09-12  7:03         ` Paolo Bonzini
  1 sibling, 1 reply; 40+ messages in thread
From: Jakub Jelinek @ 2011-09-11 19:12 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Andrew MacLeod, Paolo Bonzini, GCC Mailing List, Aldy Hernandez

On Sun, Sep 11, 2011 at 03:00:11PM -0400, Geert Bosch wrote:
> Also, for relaxed order atomic operations we would only need a single
> fence between two accesses (by a thread) to the same atomic object.

I'm not aware of any CPUs that would need any kind of fences for that.
Nor should the compiler need any fences for that; MEMs that may be (or even
are known to be) aliased aren't reordered.

	Jakub

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 14:13     ` Andrew MacLeod
  2011-09-11 18:23       ` Paolo Bonzini
@ 2011-09-11 19:00       ` Geert Bosch
  2011-09-11 19:12         ` Jakub Jelinek
  2011-09-12  7:03         ` Paolo Bonzini
  2011-09-13  6:31       ` Lawrence Crowl
  2 siblings, 2 replies; 40+ messages in thread
From: Geert Bosch @ 2011-09-11 19:00 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Jakub Jelinek, Paolo Bonzini, GCC Mailing List, Aldy Hernandez


On Sep 11, 2011, at 10:12, Andrew MacLeod wrote:

>> To be honest, I can't quite see the use of completely unordered
>> atomic operations, where we not even prohibit compiler optimizations.
>> It would seem if we guarantee that a variable will not be accessed
>> concurrently from any other thread, we wouldn't need the operation
>> to be atomic in the first place. That said, it's quite likely I'm
>> missing something here.
>> 
> there is no guarantee it isn't being accessed concurrently,  we are only guaranteeing that if it is accessed from another thread, it won't be a partially written value...  if you read a 64-bit value on a 32-bit machine, you need to guarantee that both halves are fully written before any read can happen. That's the bare minimum guarantee of an atomic.
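
(A concrete sketch of that minimum guarantee on a 32-bit target:)

    #include <atomic>
    #include <stdint.h>

    uint64_t plain;              // a plain 64-bit store is two 32-bit
                                 // stores on a 32-bit machine, so a
                                 // concurrent reader can see a torn,
                                 // half-written value
    std::atomic<uint64_t> safe;  // never observed half-written

    void writer() {
        plain = 0x1111111122222222ull;
        safe.store(0x1111111122222222ull, std::memory_order_relaxed);
    }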

OK, I now see (in §1.10(5) of the n3225 draft) that “relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races. 

However the next paragraph says: 
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. [...] There is a separate order for each atomic object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different objects in inconsistent orders.

So, if I understand correctly, then operations using relaxed memory order will still need fences, but indeed do not require any optimization barrier. For memory_order_seq_cst we'll need a full barrier, and for the others there is a partial barrier.

Also, for relaxed order atomic operations we would only need a single fence between two accesses (by a thread) to the same atomic object. 
> 
>> For Ada, all atomic accesses are always memory_order_seq_cst, and we
>> just care about being able to optimize accesses if we know they'll be
>> done from the same processor. For the C++11 model, thinking about
>> the semantics of any memory orders other than memory_order_seq_cst
>> and their interaction with operations with different ordering semantics
>> makes my head hurt.
> I had many headaches over a long period wrapping my head around it, but ultimately it maps pretty closely to various hardware implementations. Best bet?  Just use seq-cst until you discover you have a performance problem!!  I expect that's why it's the default :-)

We've already discovered that. Atomic types are used quite a bit in Ada code. Unfortunately, many of the uses are just for accesses to memory-mapped I/O devices, single write. On many systems I/O locations can't be used for synchronization anyway, and only regular cacheable memory can be used for that.

For such operations you don't want the compiler to reorder accesses to different I/O locations, but mutual exclusion wrt. other threads is already taken care of. It seems this is precisely the opposite of what the relaxed memory order provides.

Regards,
  -Geert

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-11 14:13     ` Andrew MacLeod
@ 2011-09-11 18:23       ` Paolo Bonzini
  2011-09-11 23:22         ` Andrew MacLeod
  2011-09-11 19:00       ` Geert Bosch
  2011-09-13  6:31       ` Lawrence Crowl
  2 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-11 18:23 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Geert Bosch, Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On 09/11/2011 04:12 PM, Andrew MacLeod wrote:
> tail->value = othervalue                   // global variable write
> atomic_exchange (&var, tail)           // acquire operation
>
> Although the optimizer moving the store of tail->value to AFTER the
> exchange seems very wrong on the surface, it's really emulating what
> another thread could possibly see.  When another thread synchronizes
> and reads 'var', an acquire operation doesn't cause outstanding stores
> to be fully flushed, so the other thread has no guarantee that the
> store to tail->value has happened yet even though it gets the expected
> value of 'var'.

You're right that using lock_test_and_set as an exchange is very wrong 
because of the compiler barrier semantics, but I think this is entirely 
a red herring in this case.  The same problem could happen with a 
fetch_and_add or even a lock_release operation.

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-10  1:09   ` Geert Bosch
  2011-09-10  5:40     ` Paolo Bonzini
  2011-09-10  6:18     ` Jakub Jelinek
@ 2011-09-11 14:13     ` Andrew MacLeod
  2011-09-11 18:23       ` Paolo Bonzini
                         ` (2 more replies)
  2011-09-13  6:20     ` Lawrence Crowl
  3 siblings, 3 replies; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-11 14:13 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Jakub Jelinek, Paolo Bonzini, GCC Mailing List, Aldy Hernandez

On 09/09/2011 09:09 PM, Geert Bosch wrote:
> For the C++0x atomic types there are:
>
> void A::store(C desired, memory_order order = memory_order_seq_cst) volatile;
> void A::store(C desired, memory_order order = memory_order_seq_cst);
>
> where the first variant (with order = memory_order_relaxed)
> would allow fences to be omitted, while still preventing the compiler from
> reordering memory accesses, IIUC.

I thought the volatile tags were actually for type correctness, so the 
compiler wouldn't complain when they are used on volatile objects...  
I.e., you can't call a non-volatile method on a volatile object, or 
something like that.

The different memory models are meant to provide some level of 
consistency to how these atomic operations are treated.

If you use seq-cst, all shared-memory optimizations will be inhibited 
across the operation, and you will see the behaviour you are expecting 
across the system.  The cost can be significant on some architectures 
if the code is at all performance-sensitive.

The memory models expose the different types of lower-cost 
synchronizations available in the hardware.  The behaviour potentially 
seen across threads by these different models can also be reflected in 
the optimizations which are allowed.

Back to the original example:

tail->value = othervalue                   // global variable write
atomic_exchange (&var, tail)           // acquire operation

Although the optimizer moving the store of tail->value to AFTER the 
exchange seems very wrong on the surface, it's really emulating what 
another thread could possibly see.  When another thread synchronizes 
and reads 'var', an acquire operation doesn't cause outstanding stores 
to be fully flushed, so the other thread has no guarantee that the 
store to tail->value has happened yet even though it gets the expected 
value of 'var'.  That is why it is valid for the optimizer to move the 
store.  In order for this program to work as the user expects, this 
atomic exchange has to have at least release semantics, if not something 
stronger.  Using the new builtins, specifying a more appropriate memory 
model would resolve the issue.
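
For illustration, a sketch of the corrected code in C++0x syntax (the 
function and type names are mine; the new builtins are assumed to take 
the same memory-model argument):

#include <atomic>

struct node { node *next; int value; };
std::atomic<node *> var(0);

void publish (node *tail, int othervalue)
{
  tail->value = othervalue;   // ordinary shared store
  // Release (or seq-cst) semantics: neither the compiler nor the
  // hardware may sink the tail->value store below this exchange.
  var.exchange (tail, std::memory_order_release);
}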

As it turns out, the sample program would never have failed on x86 
without the optimizer, since XCHG has an implicit lock and is really 
seq-cst by nature.  But if this program were compiled on another 
architecture where the instruction actually DID have only the documented 
acquire semantics, this exact same failure could be triggered by the 
hardware rather than the optimizer, so the bug would still be there and 
bloody hard to find.

Allowing the optimizers to move things based on the memory model 
actually increases the chances of detecting an error :-)  I've 
started a summary of what the optimizers can and can't do here: 
http://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations/Details  It's further 
down on the list of to-dos, but eventually we'll get there.

Note that this code movement the optimizer performed cannot be detected 
by a single-threaded program.  It satisfies all the various data 
dependencies in order to move it, and any operation which utilizes the 
value will see the store as it should.  So as expected, this code "bug" 
would still only show up with multiple threads; it's just more likely to 
with optimization.


> To be honest, I can't quite see the use of completely unordered
> atomic operations, where we do not even prohibit compiler optimizations.
> It would seem that if we guarantee that a variable will not be accessed
> concurrently from any other thread, we wouldn't need the operation
> to be atomic in the first place. That said, it's quite likely I'm
> missing something here.
>
There is no guarantee it isn't being accessed concurrently; we are only 
guaranteeing that if it is accessed from another thread, it won't be a 
partially written value...  If you read a 64-bit value on a 32-bit 
machine, you need to guarantee that both halves are fully written before 
any read can happen.  That's the bare minimum guarantee of an atomic.
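
For illustration, that bare-minimum guarantee in C++0x syntax (a 
sketch; the counter is mine):

#include <atomic>
#include <stdint.h>

std::atomic<uint64_t> counter(0);

void tick ()
{
  // Even on a 32-bit target no reader can see a half-written value;
  // relaxed order promises atomicity of the access and nothing else.
  counter.fetch_add (1, std::memory_order_relaxed);
}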

> For Ada, all atomic accesses are always memory_order_seq_cst, and we
> just care about being able to optimize accesses if we know they'll be
> done from the same processor. For the C++11 model, thinking about
> the semantics of any memory orders other than memory_order_seq_cst
> and their interaction with operations with different ordering semantics
> makes my head hurt.
I had many headaches over a long period wrapping my head around it, but 
ultimately it maps pretty closely to various hardware implementations. 
Best bet?  Just use seq-cst until you discover you have a performance 
problem!!  I expect that's why it's the default :-)

There is a longer-term plan to optimize the actual atomic operations as 
well, but that's still drawing-board stuff until we have a solid 
implementation.

Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-10  1:09   ` Geert Bosch
  2011-09-10  5:40     ` Paolo Bonzini
@ 2011-09-10  6:18     ` Jakub Jelinek
  2011-09-11 14:13     ` Andrew MacLeod
  2011-09-13  6:20     ` Lawrence Crowl
  3 siblings, 0 replies; 40+ messages in thread
From: Jakub Jelinek @ 2011-09-10  6:18 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Paolo Bonzini, GCC Mailing List, Aldy Hernandez, amacleod

On Fri, Sep 09, 2011 at 09:09:27PM -0400, Geert Bosch wrote:
> To be honest, I can't quite see the use of completely unordered
> atomic operations, where we do not even prohibit compiler optimizations.
> It would seem that if we guarantee that a variable will not be accessed
> concurrently from any other thread, we wouldn't need the operation
> to be atomic in the first place. That said, it's quite likely I'm 
> missing something here. 

E.g. OpenMP #pragma omp atomic just documents that the operation performed
on the variable is atomic, but places no requirement on it being any kind of
barrier for stores/loads to/from other memory locations.  That is what I'd
like to use the relaxed sync operations for.  Say
  var2 = 5;
#pragma omp atomic update
  var = var + 6;
  var3 = 7;
only guarantees that you atomically increment var by 6; the var2 store can
happen after it or the var3 store before it (only var stores/loads should be
before/after the atomic operation in program order, but you don't need any
barriers for it).
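
For illustration, the C++0x spelling of the same thing (a sketch, 
assuming the relaxed memory order is what the OpenMP wording maps to):

#include <atomic>

int var2, var3;
std::atomic<int> var(0);

void update ()
{
  var2 = 5;
  var.fetch_add (6, std::memory_order_relaxed); // atomic, no barrier
  var3 = 7;
  // The var2 and var3 stores may legally move across the increment;
  // only the accesses to var itself keep their program order.
}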

Of course if you use atomic operations for locking etc. you want to
serialize other memory accesses too (say acquire, or release, or full
barriers).

	Jakub

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-10  1:09   ` Geert Bosch
@ 2011-09-10  5:40     ` Paolo Bonzini
  2011-09-10  6:18     ` Jakub Jelinek
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-10  5:40 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Jakub Jelinek, GCC Mailing List, Aldy Hernandez, amacleod

On Sat, Sep 10, 2011 at 03:09, Geert Bosch <bosch@adacore.com> wrote:
> For example, for atomic objects accessed only from a single processor
> (but possibly multiple threads), you'd not want the compiler to reorder
> memory accesses to global variables across the atomic operations, but
> you wouldn't have to emit the expensive fences.

I am not 100% sure, but I tend to disagree.  The original bug report
can be represented as

   node->next = NULL [relaxed];
   xchg(tail, node) [seq_cst];

and the problem was that the two operations were swapped.  But that's
not a problem with the first access, but rather with the second.  So
it should be fine if the [relaxed] access does not include a barrier,
because it relies on the [seq_cst] access providing it later.

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-09  8:17 ` Jakub Jelinek
  2011-09-09  8:18   ` Paolo Bonzini
  2011-09-09 14:23   ` Andrew MacLeod
@ 2011-09-10  1:09   ` Geert Bosch
  2011-09-10  5:40     ` Paolo Bonzini
                       ` (3 more replies)
  2 siblings, 4 replies; 40+ messages in thread
From: Geert Bosch @ 2011-09-10  1:09 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Paolo Bonzini, GCC Mailing List, Aldy Hernandez, amacleod


On Sep 9, 2011, at 04:17, Jakub Jelinek wrote:

> I'd say they should be optimization barriers too (and at the tree level
> they, I think, work that way, being represented as function calls), so if
> they don't act as memory barriers in RTL, the *.md patterns should be
> fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> variants - if the CPU can reorder memory accesses across them at will,
> why shouldn't the compiler be able to do the same as well?

They are different concepts. If a program runs on a single processor,
all memory operations will appear to be sequentially consistent, even if
the CPU reorders them at the hardware level.  However, compiler 
optimizations can still cause multiple threads to see the accesses 
as not sequentially consistent. 

For example, for atomic objects accessed only from a single processor 
(but possibly multiple threads), you'd not want the compiler to reorder 
memory accesses to global variables across the atomic operations, but 
you wouldn't have to emit the expensive fences.

For the C++0x atomic types there are:

void A::store(C desired, memory_order order = memory_order_seq_cst) volatile;
void A::store(C desired, memory_order order = memory_order_seq_cst);

where the first variant (with order = memory_order_relaxed) 
would allow fences to be omitted, while still preventing the compiler from
reordering memory accesses, IIUC.

To be honest, I can't quite see the use of completely unordered
atomic operations, where we do not even prohibit compiler optimizations.
It would seem that if we guarantee that a variable will not be accessed
concurrently from any other thread, we wouldn't need the operation
to be atomic in the first place. That said, it's quite likely I'm 
missing something here. 

For Ada, all atomic accesses are always memory_order_seq_cst, and we
just care about being able to optimize accesses if we know they'll be
done from the same processor. For the C++11 model, thinking about
the semantics of any memory orders other than memory_order_seq_cst
and their interaction with operations with different ordering semantics
makes my head hurt.

Regards,
  -Geert

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-09 14:23   ` Andrew MacLeod
@ 2011-09-09 14:27     ` Paolo Bonzini
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-09 14:27 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: Jakub Jelinek, GCC Mailing List, Aldy Hernandez

On 09/09/2011 04:22 PM, Andrew MacLeod wrote:
>>
> Yeah, some of this is part of the ongoing C++0x work... the memory-model
> parameter is going to allow certain types of code movement in optimizers
> based on whether it's an acquire operation, a release operation, neither,
> or both.  It is ongoing, and hopefully we will eventually have proper
> consistency.  The older __sync builtins are eventually going to invoke
> the new __sync_mem routines and their new patterns, but will fall back to
> the old ones if the new patterns aren't specified.
>
> In the case of your program, this would in fact be a valid
> transformation I believe...  __sync_lock_test_and_set is documented to
> only have ACQUIRE semantics.

Yes, that's true.  However, there's nothing special in the compiler to 
handle __sync_lock_test_and_set differently (optimization-wise) from, 
say, __sync_fetch_and_add.

> I don't see anything in this pattern, however, that would enforce acquire
> mode and prevent the reverse operation... moving something from after to
> before it... so there may be a bug there anyway.

Yes.

> And I suspect most people actually expect all the old __sync routines to
> be full optimization barriers all the time...  maybe we should consider
> just doing that...

That would be very nice.  I would like to introduce that kind of data 
structure in QEMU, too. :)

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-09  8:17 ` Jakub Jelinek
  2011-09-09  8:18   ` Paolo Bonzini
@ 2011-09-09 14:23   ` Andrew MacLeod
  2011-09-09 14:27     ` Paolo Bonzini
  2011-09-10  1:09   ` Geert Bosch
  2 siblings, 1 reply; 40+ messages in thread
From: Andrew MacLeod @ 2011-09-09 14:23 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Paolo Bonzini, GCC Mailing List, Aldy Hernandez

On 09/09/2011 04:17 AM, Jakub Jelinek wrote:
> On Fri, Sep 09, 2011 at 10:07:30AM +0200, Paolo Bonzini wrote:
>> sync builtins are described in the documentation as being full
>> memory barriers, with the possible exception of
>> __sync_lock_test_and_set. However, GCC is not enforcing the fact
>> that they are also full _optimization_ barriers.  The RTL produced
>> by builtins does not in general include a memory optimization
>> barrier such as a set of (mem/v:BLK (scratch:P)).
>>
>> This can cause problems with lock-free algorithms, for example this:
>>
>> http://libdispatch.macosforge.org/trac/ticket/35
>>
>> This can be solved either in generic code, by wrapping sync builtins
>> (before and after) with an asm("":::"memory"), or in the individual
>> machine descriptions by adding a memory barrier in parallel to the
>> locked instructions or with the ll/sc instructions.
>>
>> Is the above analysis correct?  Or should the users put explicit
>> compiler barriers?
> I'd say they should be optimization barriers too (and at the tree level
> they, I think, work that way, being represented as function calls), so if
> they don't act as memory barriers in RTL, the *.md patterns should be
> fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> variants - if the CPU can reorder memory accesses across them at will,
> why shouldn't the compiler be able to do the same as well?
> 	
Yeah, some of this is part of the ongoing C++0x work... the memory-model 
parameter is going to allow certain types of code movement in optimizers 
based on whether it's an acquire operation, a release operation, neither, 
or both.  It is ongoing, and hopefully we will eventually have proper 
consistency.  The older __sync builtins are eventually going to invoke 
the new __sync_mem routines and their new patterns, but will fall back to 
the old ones if the new patterns aren't specified.

In the case of your program, this would in fact be a valid 
transformation, I believe...  __sync_lock_test_and_set is documented to 
have only ACQUIRE semantics.  This does not guarantee that a store BEFORE 
the operation will be visible in another thread, which means it is 
possible to reorder it.  (A summary of the different modes can be found 
at http://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations.)  So you would 
require a barrier before this code anyway for the behaviour you are 
looking for.  Once the new routines are available and implemented, you 
could simply specify the SEQ_CST model and then it should, in theory, 
work properly with a barrier being emitted for you.

I don't see anything in this pattern, however, that would enforce acquire 
mode and prevent the reverse operation... moving something from after to 
before it... so there may be a bug there anyway.

And I suspect most people actually expect all the old __sync routines to 
be full optimization barriers all the time...  maybe we should consider 
just doing that...

Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-09  8:17 ` Jakub Jelinek
@ 2011-09-09  8:18   ` Paolo Bonzini
  2011-09-09 14:23   ` Andrew MacLeod
  2011-09-10  1:09   ` Geert Bosch
  2 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-09  8:18 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Mailing List, Aldy Hernandez, amacleod

On 09/09/2011 10:17 AM, Jakub Jelinek wrote:
>> >  Is the above analysis correct?  Or should the users put explicit
>> >  compiler barriers?
> I'd say they should be optimization barriers too (and at the tree level
> they, I think, work that way, being represented as function calls), so if
> they don't act as memory barriers in RTL, the *.md patterns should be
> fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> variants - if the CPU can reorder memory accesses across them at will,
> why shouldn't the compiler be able to do the same as well?

Agreed, so we have a bug in all released versions of GCC. :(

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: should sync builtins be full optimization barriers?
  2011-09-09  8:07 Paolo Bonzini
@ 2011-09-09  8:17 ` Jakub Jelinek
  2011-09-09  8:18   ` Paolo Bonzini
                     ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Jakub Jelinek @ 2011-09-09  8:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Mailing List, Aldy Hernandez, amacleod

On Fri, Sep 09, 2011 at 10:07:30AM +0200, Paolo Bonzini wrote:
> sync builtins are described in the documentation as being full
> memory barriers, with the possible exception of
> __sync_lock_test_and_set. However, GCC is not enforcing the fact
> that they are also full _optimization_ barriers.  The RTL produced
> by builtins does not in general include a memory optimization
> barrier such as a set of (mem/v:BLK (scratch:P)).
> 
> This can cause problems with lock-free algorithms, for example this:
> 
> http://libdispatch.macosforge.org/trac/ticket/35
> 
> This can be solved either in generic code, by wrapping sync builtins
> (before and after) with an asm("":::"memory"), or in the individual
> machine descriptions by adding a memory barrier in parallel to the
> locked instructions or with the ll/sc instructions.
> 
> Is the above analysis correct?  Or should the users put explicit
> compiler barriers?

I'd say they should be optimization barriers too (and at the tree level
they, I think, work that way, being represented as function calls), so if
they don't act as memory barriers in RTL, the *.md patterns should be
fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
variants - if the CPU can reorder memory accesses across them at will,
why shouldn't the compiler be able to do the same as well?

	Jakub

^ permalink raw reply	[flat|nested] 40+ messages in thread

* should sync builtins be full optimization barriers?
@ 2011-09-09  8:07 Paolo Bonzini
  2011-09-09  8:17 ` Jakub Jelinek
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2011-09-09  8:07 UTC (permalink / raw)
  To: GCC Mailing List, Jakub Jelinek, Aldy Hernandez, amacleod

Hi all,

sync builtins are described in the documentation as being full memory 
barriers, with the possible exception of __sync_lock_test_and_set. 
However, GCC is not enforcing the fact that they are also full 
_optimization_ barriers.  The RTL produced by builtins does not in 
general include a memory optimization barrier such as a set of 
(mem/v:BLK (scratch:P)).

This can cause problems with lock-free algorithms, for example this:

http://libdispatch.macosforge.org/trac/ticket/35

This can be solved either in generic code, by wrapping sync builtins 
(before and after) with an asm("":::"memory"), or in the individual 
machine descriptions by adding a memory barrier in parallel to the locked 
instructions or with the ll/sc instructions.
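
For illustration, a minimal sketch of the generic-code variant (the 
wrapper name is mine, not an existing GCC interface):

static inline long
xchg_full_barrier (long *p, long v)
{
  long old;
  __asm__ __volatile__ ("" : : : "memory"); /* compiler barrier before */
  old = __sync_lock_test_and_set (p, v);
  __asm__ __volatile__ ("" : : : "memory"); /* compiler barrier after */
  return old;
}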

Is the above analysis correct?  Or should the users put explicit 
compiler barriers?

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-09-26 18:10 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-15 16:20 should sync builtins be full optimization barriers? Richard Henderson
2011-09-15 16:26 ` Paolo Bonzini
2011-09-20  7:56   ` Paolo Bonzini
2011-09-24  9:24   ` Richard Guenther
2011-09-26 16:18     ` Richard Guenther
  -- strict thread matches above, loose matches on Subject: below --
2011-09-09  8:07 Paolo Bonzini
2011-09-09  8:17 ` Jakub Jelinek
2011-09-09  8:18   ` Paolo Bonzini
2011-09-09 14:23   ` Andrew MacLeod
2011-09-09 14:27     ` Paolo Bonzini
2011-09-10  1:09   ` Geert Bosch
2011-09-10  5:40     ` Paolo Bonzini
2011-09-10  6:18     ` Jakub Jelinek
2011-09-11 14:13     ` Andrew MacLeod
2011-09-11 18:23       ` Paolo Bonzini
2011-09-11 23:22         ` Andrew MacLeod
2011-09-12  7:07           ` Paolo Bonzini
2011-09-11 19:00       ` Geert Bosch
2011-09-11 19:12         ` Jakub Jelinek
2011-09-11 19:31           ` Geert Bosch
2011-09-11 19:44             ` Jakub Jelinek
2011-09-12  7:03         ` Paolo Bonzini
2011-09-12 18:40           ` Geert Bosch
2011-09-12 20:54             ` Paolo Bonzini
2011-09-12 23:19             ` Andrew MacLeod
2011-09-13  0:31               ` Ken Raeburn
2011-09-13  0:39                 ` Andy Lutomirski
2011-09-13  1:52               ` Geert Bosch
2011-09-13  6:35                 ` Paolo Bonzini
2011-09-13 14:46                   ` Eric Botcazou
2011-09-13 12:09                 ` Andrew MacLeod
2011-09-13 14:58                   ` Geert Bosch
2011-09-13 16:16                     ` Andrew MacLeod
2011-09-26 16:17                       ` Michael Matz
2011-09-26 17:32                         ` Ian Lance Taylor
2011-09-26 18:10                         ` Andrew MacLeod
2011-09-27  5:26                           ` James Dennett
2011-09-27  8:19                             ` Andrew MacLeod
2011-09-13  6:31       ` Lawrence Crowl
2011-09-13  6:20     ` Lawrence Crowl
