public inbox for gcc@gcc.gnu.org
* Fwd: Re: GCC libatomic questions
       [not found] <cbd2c83a-b50b-b2ac-b62d-b2d26178c2b1@oracle.com>
@ 2016-07-06 17:50 ` Richard Henderson
  2016-07-06 19:41   ` Richard Henderson
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2016-07-06 17:50 UTC (permalink / raw)
  To: gcc, Torvald Riegel; +Cc: Bin Fan

[-- Attachment #1: Type: text/plain, Size: 1145 bytes --]

Redirecting to the gcc list for discussion.
I'll follow up on that thread directly.


r~

-------- Forwarded Message --------
Subject: 	Re: GCC libatomic questions
Date: 	Wed, 6 Jul 2016 10:27:20 -0700
From: 	Bin Fan <bin.x.fan@oracle.com>
Organization: 	Oracle Corporation
To: 	Richard Henderson <rth@redhat.com>



Hello Richard,

This is Bin from the Sun/Oracle compiler group. Sorry about the long 
delay on the libatomic ABI specification I mentioned quite a while 
ago; I was assigned to some other tasks.

Please find a draft of the libatomic ABI specification attached. The text is 
also pasted at the end of the email.

The goal of the ABI specification is twofold. The first goal is to 
check with the GCC community that the ABI matches the latest GCC 
libatomic implementation. This would make sure that GCC and the Oracle 
Developer Studio C/C++ compiler can work well together without any 
compatibility issues on Solaris/Linux + SPARC/x86. The second, 
longer-term goal is to integrate the libatomic ABI into the current 
SPARC/x86 ABI specifications.

Could you please review the draft and/or forward it to the community for review?

Thanks,
- Bin


[-- Attachment #2: ABI.txt --]
[-- Type: text/plain, Size: 40223 bytes --]

1. Overview

1.1. Why we need an ABI for atomics

The C11 standard allows atomic types to differ from the corresponding
non-atomic types in size, representation and alignment [1].
The size, representation and alignment of atomic types therefore 
need to be specified in the ABI specification.

A runtime support library, libatomic, already exists on Solaris 
and Linux. The interface of this library needs to be standardized 
as part of the ABI specification, so that

- On a system that supplies libatomic, all compilers in compliance 
  with the ABI can generate compatible binaries linking against this 
  library.

- Binaries remain backward compatible across different versions of 
  the system, as long as those versions support the same ABI.

1.2. What does the atomics ABI specify

The ABI specifies the following

- Data representation of the atomic types.

- The names and behaviors of the implementation-specific support
  functions.

- The atomic types for which the compiler may generate inlined code. 

- Lock-free property of the inlined atomic operations.

Note that the names and behaviors of the libatomic functions specified 
in the C standard do not need to be part of this ABI, because they 
are already required to meet the specification in the standard.

1.3. Affected platforms

The following platforms are affected by this ABI specification.

SPARC (32-bit and 64-bit)
x86 (32-bit and 64-bit)

Sections 1.1 and 1.2, and the Rationale, Notes and Appendix sections 
in the rest of the document, are for explanatory purposes only; they 
are not considered part of the formal ABI specification.

2. Data Representation

2.1. General Rules

The general rules for size, representation and alignment of the data
representation of atomic types are the following

1) Atomic types have the same size as the corresponding non-atomic 
   types.

2) Atomic types have the same representation as the corresponding 
   non-atomic types.

3) Atomic types have the same alignment as the corresponding 
   non-atomic types, with the following exceptions:

   On 32- and 64-bit x86 platforms and on 64-bit SPARC platforms, 
   atomic types of size 1, 2, 4, 8 or 16 bytes have an alignment 
   that matches their size.

   On 32-bit SPARC platforms, atomic types of size 1, 2, 4 or 8 bytes
   have an alignment that matches their size. If the alignment of a 
   16-byte non-atomic type is less than 8 bytes, the alignment of the 
   corresponding atomic type is increased to 8 bytes.

Note 

The above rules apply to both scalar types and aggregate types.

2.2. Atomic scalar types

x86

                                          LP64 (AMD64)                     ILP32 (i386)
C Type                          sizeof    Alignment  Inlineable  sizeof    Alignment  Inlineable
atomic_flag                     1         1          Y           1         1          Y
_Atomic _Bool                   1         1          Y           1         1          Y
_Atomic char                    1         1          Y           1         1          Y
_Atomic signed char             1         1          Y           1         1          Y
_Atomic unsigned char           1         1          Y           1         1          Y
_Atomic short                   2         2          Y           2         2          Y
_Atomic signed short            2         2          Y           2         2          Y
_Atomic unsigned short          2         2          Y           2         2          Y
_Atomic int                     4         4          Y           4         4          Y
_Atomic signed int              4         4          Y           4         4          Y
_Atomic enum                    4         4          Y           4         4          Y
_Atomic unsigned int            4         4          Y           4         4          Y
_Atomic long                    8         8          Y           4         4          Y
_Atomic signed long             8         8          Y           4         4          Y
_Atomic unsigned long           8         8          Y           4         4          Y
_Atomic long long               8         8          Y           8         8          Y
_Atomic signed long long        8         8          Y           8         8          Y
_Atomic unsigned long long      8         8          Y           8         8          Y
_Atomic __int128                16        16         N               not applicable
any-type _Atomic *              8         8          Y           4         4          Y
_Atomic float                   4         4          Y           4         4          Y
_Atomic double                  8         8          Y           8         8          Y
_Atomic long double             16        16         N           12        4          N
_Atomic float _Complex          8         8(4)       Y           8         8(4)       Y
_Atomic double _Complex         16        16(8)      N           16        16(8)      N
_Atomic long double _Complex    32        16         N           24        4          N
_Atomic float _Imaginary        4         4          Y           4         4          Y
_Atomic double _Imaginary       8         8          Y           8         8          Y
_Atomic long double _Imaginary  16        16         N           12        4          N

SPARC

                                          LP64 (v9)                        ILP32 (sparc)
C Type                          sizeof    Alignment  Inlineable  sizeof    Alignment  Inlineable
atomic_flag                     1         1          Y           1         1          Y
_Atomic _Bool                   1         1          Y           1         1          Y
_Atomic char                    1         1          Y           1         1          Y
_Atomic signed char             1         1          Y           1         1          Y
_Atomic unsigned char           1         1          Y           1         1          Y
_Atomic short                   2         2          Y           2         2          Y
_Atomic signed short            2         2          Y           2         2          Y
_Atomic unsigned short          2         2          Y           2         2          Y
_Atomic int                     4         4          Y           4         4          Y
_Atomic signed int              4         4          Y           4         4          Y
_Atomic enum                    4         4          Y           4         4          Y
_Atomic unsigned int            4         4          Y           4         4          Y
_Atomic long                    8         8          Y           4         4          Y
_Atomic signed long             8         8          Y           4         4          Y
_Atomic unsigned long           8         8          Y           4         4          Y
_Atomic long long               8         8          Y           8         8          Y
_Atomic signed long long        8         8          Y           8         8          Y
_Atomic unsigned long long      8         8          Y           8         8          Y
_Atomic __int128                16        16         N               not applicable
any-type _Atomic *              8         8          Y           4         4          Y
_Atomic float                   4         4          Y           4         4          Y
_Atomic double                  8         8          Y           8         8          Y
_Atomic long double             16        16         N           16        8          N
_Atomic float _Complex          8         8(4)       Y           8         8(4)       Y
_Atomic double _Complex         16        16(8)      N           16        8          N
_Atomic long double _Complex    32        16         N           32        8          N
_Atomic float _Imaginary        4         4          Y           4         4          Y
_Atomic double _Imaginary       8         8          Y           8         8          Y
_Atomic long double _Imaginary  16        16         N           16        8          N

Notes: 

The C standard also specifies some atomic integer types. They are not
listed in the above table because they have the same representation 
and alignment requirements as the corresponding direct types [2].

We will discuss the inlineable column and __int128 type in section 3.

The value in () shows the alignment of the corresponding non-atomic 
type, if it is different from the alignment of the atomic type.

Because the _Atomic specifier cannot be used on a function type [7] 
and the _Atomic qualifier cannot modify a function type [8], no 
atomic function types are listed in the above table.

On 32-bit x86 platforms, long double has a size of 12 bytes and an 
alignment of 4 bytes. This ABI specification does not increase the 
alignment of the _Atomic long double type because the type would not 
be lock-free even if it were 16-byte aligned: there is no 12-byte 
or 16-byte lock-free instruction on 32-bit x86 platforms.

2.3 Atomic Aggregates and Unions

Atomic structures or unions may have different alignment compared to
the corresponding non-atomic types, subject to rule 3) in section 2.1. 
The alignment change only affects the boundary where an entire 
structure or union is aligned. The offset of each member, the internal 
padding and the size of the structure or union are not affected.

The following table shows selective examples of the size and alignment
of atomic structure types.

x86

                                          LP64 (AMD64)                      ILP32 (i386)
C Type                          sizeof    Alignment  Inlineable   sizeof    Alignment  Inlineable
_Atomic struct {char a[2];}     2         2(1)       Y            2         2(1)       Y
_Atomic struct {char a[3];}     3         1          N            3         1          N
_Atomic struct {short a[2];}    4         4(2)       Y            4         4(2)       Y
_Atomic struct {int a[2];}      8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c[2];
                short s;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char a[16];}    16        16(1)      N            16        16(1)      N

SPARC

                                          LP64 (v9)                       ILP32 (sparc)
C Type                          sizeof    Alignment  Inlineable   sizeof    Alignment  Inlineable
_Atomic struct {char a[2];}     2         2(1)       Y            2         2(1)       Y 
_Atomic struct {char a[3];}     3         1          N            3         1          N
_Atomic struct {short a[2];}    4         4(2)       Y            4         4(2)       Y
_Atomic struct {int a[2];}      8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c[2];
                short s;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char a[16];}    16        16(1)      N            16        8(1)       N

Notes

The value in () shows the alignment of the corresponding non-atomic 
type, if it is different from the alignment of the atomic type.

Because the padding of structure types is not affected by the _Atomic 
modifier, the contents of any padding in an atomic structure object 
are still undefined; therefore, an atomic compare-and-exchange 
operation on such objects may fail due to differences in the padding.

The increased alignment of 16-byte atomic struct types may be 
useful to 
- Reduce lock sharing with other atomics. 
- Allow a more efficient implementation of the runtime support 
  functions for atomic operations on such types.

2.4. Bit-fields

The C standard leaves it implementation-defined whether atomic 
bit-field types are permitted [3]. In this ABI specification, the 
representation of atomic bit-fields is unspecified.

3. Lock-free and Inlineable Property

The implementation of atomic operations may map directly to hardware 
atomic instructions. Such an implementation is lock-free.

Lock-free atomic operations do not require runtime support functions,
so the compiler may generate inlined code for efficiency. This ABI 
specification defines a set of inlineable atomic types. An atomic 
type being inlineable means that the compiler may generate an inlined 
instruction sequence for atomic operations on objects of that type. 
The implementation of the support functions for the inlineable atomic 
types must also be lock-free.

On all affected platforms, atomic types whose size is 1, 2, 4 
or 8 bytes and whose alignment matches that size are inlineable.

If an atomic type is not inlineable, the compiler shall always generate
a support function call for atomic operations on objects of that type. 
The implementation of the support functions for non-inlineable atomic
types may be lock-free.

Rationale

It is assumed that the atomic semantics cannot be satisfied if the 
same atomic object is accessed by both lock-free and non-lock-free 
operations.

If the compiler always generates runtime support function calls for 
all atomics, the lock-free property would be hidden inside the library 
implementation. However, the compiler may inline the atomic operations, 
and we want to allow such inlining optimizations.

Compiler inlining raises the issue of mix-and-matched accesses to
the same atomic object from compiler-generated code and from the 
runtime library functions. The two have to be consistent on the 
lock-free property.

One possible way to achieve lock-free consistency is to specify 
the lock-free property on a per-type basis. The C and C++ standards 
seem to back this approach: the C++ standard provides a query that 
returns a per-type result about whether a type is lock-free [4]. The 
C standard does not guarantee that the query result is per-type [5], 
but that is the direction it is moving towards [6]. However, the query 
result does not necessarily reflect the implementation of the atomic 
operations on the queried type: the implementation may use lock-free 
instructions only for specific objects that meet certain criteria. So 
specifying the lock-free property on a per-type basis is unnecessarily 
conservative. 

It is also possible to specify the lock-free property on a per-object 
basis. But it is simpler to disallow the compiler from inlining atomic 
operations on "may be lock-free" types, which hides the lock-free 
optimization inside the library implementation.

So the ABI achieves lock-free consistency by specifying which types 
may be inlined and by specifying that those types must be lock-free. 
For the inlineable atomic types, if there are mix-and-matched 
accesses, both must be lock-free; for the non-inlineable atomic 
types, the compiler never inlines, so the mix-and-match never happens.

Notes:

Here are a few examples of small types which do not qualify as 
inlineable types:

  _Atomic struct {char a[3];} /* size = 3, alignment = 1 */
  _Atomic long double /* (on 32-bit x86) size = 12, alignment = 4 */

A smart compiler may know that such an object happens to lie within 
an 8-byte aligned window, but the ABI-compliant behavior is not to 
generate a lock-free inlined code sequence, since a lazy compiler 
may instead generate a runtime support function call which may not 
be implemented lock-free.

CMPXCHG16B is not always available on 64-bit x86 platforms, so 16-byte
naturally aligned atomics are not inlineable. The support functions for
such atomics are free to use a lock-free implementation when the 
instruction is available on a specific platform. 

4. libatomic library functions

4.1. Data Definitions

This section contains examples of the system header files that 
provide the data definitions needed by the libatomic functions.

<stdatomic.h>

typedef enum
{
    memory_order_relaxed = 0,
    memory_order_consume = 1,
    memory_order_acquire = 2,
    memory_order_release = 3,
    memory_order_acq_rel = 4,
    memory_order_seq_cst = 5
} memory_order;

typedef _Atomic struct
{
  unsigned char __flag;
} atomic_flag;

Refer to the C standard for the meaning of each enumeration constant 
of the memory_order type.

<fenv.h>

SPARC

#define FE_INEXACT    0x01
#define FE_DIVBYZERO  0x02
#define FE_UNDERFLOW  0x04
#define FE_OVERFLOW   0x08
#define FE_INVALID    0x10

x86

#define FE_INVALID    0x01
#define FE_DIVBYZERO  0x04
#define FE_OVERFLOW   0x08
#define FE_UNDERFLOW  0x10
#define FE_INEXACT    0x20

4.2. Support Functions

The runtime library supports the following kinds of atomic operations: 
load, store, exchange, compare-and-exchange and arithmetic 
read-modify-write operations. For the arithmetic read-modify-write 
operations, the following kinds of modification are supported: 
addition, subtraction, bitwise inclusive or, bitwise exclusive or, 
bitwise and, and bitwise nand. There are also classic test-and-set 
functions.

For each kind of atomic operation, libatomic provides a generic 
version that accepts a pointer to an object of any atomic type, and 
a set of size-specific functions that accept pointers to atomic types 
of size 1, 2, 4 or 8 bytes on all platforms, and of size 16 bytes on 
64-bit platforms.

Note: Section 2.1 mentions the alignment adjustment for atomic types 
of sizes 1, 2, 4, 8 and 16 bytes. For load, store, exchange and 
compare-and-exchange operations, it is safe to convert a pointer to 
any atomic type of one of those sizes to a pointer to the 
corresponding atomic integer type of the same size.

Note: The size-specific versions accept and return data by value; the 
generic versions use memory pointers to pass and return the data 
objects.

Most of the functions listed in this section map to generic functions 
with the same semantics in the C standard. Refer to the C standard 
for a description of the generic functions and of how each memory 
order works.

The following functions are available on all platforms.

void __atomic_load (size_t size, void *object, void *loaded, memory_order order);

    Atomically load the value pointed to by object. Assign the loaded
    value to the memory pointed to by loaded. The size of memory
    affected by the load is designated by size. 

int8_t __atomic_load_1 (int8_t *object, memory_order order);
int16_t __atomic_load_2 (int16_t *object, memory_order order);
int32_t __atomic_load_4 (int32_t *object, memory_order order);
int64_t __atomic_load_8 (int64_t *object, memory_order order);

    Atomically load the value pointed to by object. The loaded value is
    returned. The size of memory affected by the load is designated by
    the type of the object. If object is not aligned properly according 
    to the type of object, the behavior is undefined.

    Memory is affected according to the value of order. If order is either
    memory_order_release or memory_order_acq_rel, the behavior of the 
    function is undefined.

void __atomic_store (size_t size, void *object, void *desired, memory_order order);

    Atomically replace the value pointed to by object with the value
    pointed to by desired. The size of memory affected by the store
    is designated by size.

void __atomic_store_1 (int8_t *object, int8_t desired, memory_order order);
void __atomic_store_2 (int16_t *object, int16_t desired, memory_order order);
void __atomic_store_4 (int32_t *object, int32_t desired, memory_order order);
void __atomic_store_8 (int64_t *object, int64_t desired, memory_order order);

    Atomically replace the value pointed to by object with desired.
    The size of memory affected by the store is designated by the
    type of the object. If object is not aligned properly according 
    to the type of object, the behavior is undefined.

    Memory is affected according to the value of order. If order is one of
    memory_order_acquire, memory_order_consume or memory_order_acq_rel, the
    behavior of the function is undefined.

void __atomic_exchange (size_t size, void *object, void *desired, void *loaded, memory_order order);

    Atomically replace the value pointed to by object with the value
    pointed to by desired, and store into the memory pointed to by 
    loaded the value pointed to by object immediately before the 
    effect. The size of memory affected by the exchange is designated 
    by size.

int8_t __atomic_exchange_1 (int8_t *object, int8_t desired, memory_order order);
int16_t __atomic_exchange_2 (int16_t *object, int16_t desired, memory_order order);
int32_t __atomic_exchange_4 (int32_t *object, int32_t desired, memory_order order);
int64_t __atomic_exchange_8 (int64_t *object, int64_t desired, memory_order order);

    Atomically, replace the value pointed to by object with desired 
    and return the value pointed to by object immediately before the 
    effect. The size of memory affected by the exchange is designated 
    by the type of object. If object is not aligned properly according 
    to the type of object, the behavior is undefined.

    Memory is affected according to the value of order.

_Bool __atomic_compare_exchange (size_t size, void *object, void *expected, void *desired, memory_order success_order, memory_order failure_order);

    Atomically compares the memory pointed to by object for equality 
    with the memory pointed to by expected; if equal, replaces the 
    memory pointed to by object with the memory pointed to by desired, 
    and if not equal, updates the memory pointed to by expected with 
    the memory pointed to by object. The result of the comparison is 
    returned. The size of memory affected by the compare-and-exchange 
    is designated by size.

    The compare-and-exchange never fails spuriously, i.e. if the 
    comparison for equality returns false, the two values in the 
    comparison were not equal. [Note: this specifies that on SPARC and 
    x86, compare-exchange is always implemented with "strong" 
    semantics. The weak flavors in the C standard are translated to 
    strong.]

_Bool __atomic_compare_exchange_1 (int8_t *object, int8_t *expected, int8_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_2 (int16_t *object, int16_t *expected, int16_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_4 (int32_t *object, int32_t *expected, int32_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_8 (int64_t *object, int64_t *expected, int64_t desired, memory_order success_order, memory_order failure_order);

    Atomically compares the memory pointed to by object for equality 
    with the memory pointed to by expected; if equal, replaces the 
    memory pointed to by object with desired, and if not equal, 
    updates the memory pointed to by expected with the memory pointed 
    to by object. The result of the comparison is returned.

    The size of memory affected by the compare-and-exchange is 
    designated by the type of object. If object is not aligned 
    properly according to the type of object, the behavior is 
    undefined.

    The compare-and-exchange never fails spuriously, i.e. if the 
    comparison for equality returns false, the two values in the 
    comparison were not equal.

    If the comparison is true, memory is affected according to the 
    value of success_order; if the comparison is false, memory is 
    affected according to the value of failure_order.

int8_t __atomic_add_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_add_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_add_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_add_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object plus operand and returns the value
    pointed to by object immediately after the effects. If object is 
    not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

int8_t __atomic_fetch_add_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_add_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_add_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_add_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object plus operand and returns the value
    pointed to by object immediately before the effects. If object is 
    not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_sub_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_sub_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_sub_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_sub_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object minus operand and returns the value
    pointed to by object immediately after the effects. If object is not 
    aligned properly according to the type of object, the behavior is 
    undefined. The size of memory affected by the effects is designated 
    by the type of object.

int8_t __atomic_fetch_sub_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_sub_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_sub_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_sub_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object minus operand and returns the value
    pointed to by object immediately before the effects. If object is 
    not aligned properly according to the type of object, the behavior 
    is undefined.  The size of memory affected by the effects is 
    designated by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_and_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_and_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_and_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_and_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise and of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined.  The size of memory affected by the effects is designated 
    by the type of object.  

int8_t __atomic_fetch_and_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_and_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_and_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_and_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise and of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_or_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_or_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_or_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_or_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise or of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

int8_t __atomic_fetch_or_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_or_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_or_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_or_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise or of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_xor_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_xor_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_xor_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_xor_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise xor of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_fetch_xor_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_xor_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_xor_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_xor_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise xor of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_nand_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_nand_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_nand_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_nand_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise nand of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Bitwise operator nand is defined as the following using ANSI C 
    operators: a nand b is equivalent to ~(a & b).

    Memory is affected according to the value of order.

int8_t __atomic_fetch_nand_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_nand_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_nand_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_nand_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise nand of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Bitwise operator nand is defined as the following using ANSI C 
    operators: a nand b is equivalent to ~(a & b).

    Memory is affected according to the value of order.
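
As a concrete sketch of the nand semantics, using GCC's __atomic builtins
(which lower to the calls above); the helper names are illustrative:

```c
#include <stdint.h>

/* nand is ~(a & b).  fetch_nand returns the old value,
   nand_fetch returns the new value. */
int8_t nand_old(int8_t *p, int8_t v) {
    return __atomic_fetch_nand(p, v, __ATOMIC_SEQ_CST);  /* value before */
}

int8_t nand_new(int8_t *p, int8_t v) {
    return __atomic_nand_fetch(p, v, __ATOMIC_SEQ_CST);  /* ~(*p & v) */
}
```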

_Bool __atomic_test_and_set_1 (int8_t *object, memory_order order);
_Bool __atomic_test_and_set_2 (int16_t *object, memory_order order);
_Bool __atomic_test_and_set_4 (int32_t *object, memory_order order);
_Bool __atomic_test_and_set_8 (int64_t *object, memory_order order);

    Atomically, checks the value pointed to by object; if it is in 
    the clear state, sets the value pointed to by object to the set 
    state and returns true; if it is in the set state, returns false. 
    The size of memory affected by the effects is always one byte.

    Memory is affected according to the value of order.

    The set and clear states are the same as specified for 
    atomic_flag_test_and_set.

_Bool __atomic_is_lock_free (size_t size, void *object);

    Returns whether the object pointed to by object is lock-free. 
    The function assumes that the size of the object is size. If object 
    is NULL, then the function assumes that object is aligned on a 
    size-byte address.
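
A sketch of the NULL-object case via GCC's builtin of the same name: with a
null pointer the answer depends only on size (natural alignment is assumed),
so GCC can usually fold the query to a compile-time constant. The helper name
is illustrative:

```c
#include <stddef.h>
#include <stdbool.h>

/* NULL object: assume the (hypothetical) object is naturally aligned. */
bool int_is_lock_free(void) {
    return __atomic_is_lock_free(sizeof(int), NULL);
}
```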

void __atomic_feraiseexcept (int exception);

   Raises the floating-point exception(s) specified by exception. 
   The int argument exception represents a subset of 
   floating-point exceptions, and can be zero or the bitwise 
   OR of one or more floating-point exception macros. The macros
   are defined in fenv.h as listed in section 4.1.

4.3. 64-bit Specific Interfaces

4.3.1. Data Representation of __int128 type

On x86 platforms, __int128 type is defined in the 64-bit ABI.

On SPARC platforms, the size and alignment of __int128 type is 
specified as the following:

             sizeof    Alignment
__int128       16         16

4.3.2. Support Functions

The following functions are available only on 64-bit platforms. 

__int128 __atomic_load_16 (__int128 *object, memory_order order);
void __atomic_store_16 (__int128 *object, __int128 desired, memory_order order);
__int128 __atomic_exchange_16 (__int128 * object,  __int128 desired, memory_order order);
_Bool __atomic_compare_exchange_16 (__int128 *object, __int128 *expected, __int128 desired, memory_order success_order, memory_order failure_order);
__int128 __atomic_add_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_add_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_sub_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_sub_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_and_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_and_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_or_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_or_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_xor_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_xor_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_nand_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_nand_16 (__int128 *object, __int128 operand, memory_order order);
_Bool __atomic_test_and_set_16 (__int128 *object, memory_order order);

The description of each function is the same as that of the corresponding
set of functions specified in section 4.2.
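
Whether these 16-byte entry points can be inlined can be probed with GCC's
__atomic_always_lock_free, which folds to a compile-time constant. This is a
sketch, assuming a 64-bit target compiled without -mcx16 (so 16-byte
operations are not compile-time lock-free); the function names are
illustrative:

```c
#include <stdbool.h>

/* Folds at compile time: mirrors the "inlineable" property. */
bool eight_byte_inlineable(void) {
    return __atomic_always_lock_free(8, 0);
}

/* Only true when the compiler may assume cmpxchg16b (e.g. -mcx16). */
bool sixteen_byte_inlineable(void) {
    return __atomic_always_lock_free(16, 0);
}
```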

5. Libatomic Assumption on Non-blocking Memory Instructions

libatomic assumes that programmers or compilers properly insert 
SFENCE/MFENCE barriers for the following cases:

1) writes executed with CLFLUSH instruction
2) streaming loads/stores (V)MOVNTx, MASKMOVDQU, MASKMOVQ.
3) any other operations which reference Write Combining memory type.

Rationale

x86 has a strong memory model. Memory reads are not reordered with 
other reads, and writes are not reordered with reads or other writes. 
The three cases mentioned above are exceptions, i.e. those writes are 
weakly ordered with respect to other writes. 
The ABI specifies that code using those non-blocking writes must 
contain proper fences, so that libatomic support functions do not need 
fences to synchronize with those instructions.
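
A sketch of case 2), assuming an x86 target with SSE2 (the intrinsics are
from immintrin.h); the writer of a non-temporal store must fence itself,
because libatomic will not:

```c
#include <immintrin.h>

/* A streaming (non-temporal) store is weakly ordered, so it must be
   followed by SFENCE before other threads may rely on its ordering. */
void publish_nt(int *slot, int value) {
    _mm_stream_si32(slot, value);  /* non-temporal store, bypasses cache */
    _mm_sfence();                  /* order it before subsequent stores */
}
```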

Appendix

A.1. Compatibility Notes

On 64-bit SPARC platforms, _Atomic long double is a 16-byte naturally 
aligned atomic type. There is no lock-free instruction for such a type
in the 64-bit SPARC ISA, and it is not inlineable in this ABI 
specification, so the libatomic implementation has to use a 
non-lock-free implementation for atomic operations on this type. 

If, in the future, lock-free instructions for 16-byte naturally aligned 
objects become available in a new SPARC ISA, then libatomic could 
leverage them to implement lock-free atomic operations for _Atomic 
long double.

This would be a backward compatible libatomic change. Because the type 
is not inlineable, all atomic operations on objects of the type must go 
through libatomic function calls, so the formerly non-lock-free 
operations simply become lock-free inside those libatomic functions. 

However, if a compiler inlines an atomic operation on an _Atomic long 
double object and uses the new lock-free instructions, it could break 
compatibility if the library implementation is still non-lock-free. 
So such a compiler change must be accompanied by a library change, and 
the ABI must be updated as well.

If a compiler changes the data representation of atomic types, the 
change will produce incompatible binaries, and it would be hard to 
detect when such incompatible binaries are linked together.

References

[1] C11 Standard, 6.2.5p27
The size, representation, and alignment of an atomic type need not be 
the same as those of the corresponding unqualified type.

[2] C11 Standard, 7.17.6p1
For each line in the following table,257) the atomic type name is 
declared as a type that has the same representation and alignment 
requirements as the corresponding direct type.258)

Footnote 258 
258) The same representation and alignment requirements are meant to 
imply interchangeability as arguments to functions, return values from 
functions, and members of unions.

[3] C11 Standard, 6.7.2.1p5
A bit-field shall have a type that is a qualified or unqualified 
version of _Bool, signed int, unsigned int, or some other 
implementation-defined type. It is implementation-defined whether 
atomic types are permitted.

[4] C++11 Standard, 29.4p2
The function atomic_is_lock_free (29.6) indicates whether the object 
is lock-free. In any given program execution, the result of the 
lock-free query shall be consistent for all pointers of the same type.

[5] C11 Standard, 7.17.5.1p3
The atomic_is_lock_free generic function returns nonzero (true) if 
and only if the object's operations are lock-free. The result of a 
lock-free query on one object cannot be inferred from the result of 
a lock-free query on another object.

[6] http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_465

[7] C11 Standard, 6.7.2.4p3
The type name in an atomic type specifier shall not refer to an array 
type, a function type, an atomic type, or a qualified type.

[8] C11 Standard, 6.7.3p3
The type modified by the _Atomic qualifier shall not be an array type 
or a function type.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Re: GCC libatomic questions
  2016-07-06 17:50 ` Fwd: Re: GCC libatomic questions Richard Henderson
@ 2016-07-06 19:41   ` Richard Henderson
  2016-07-07 23:56     ` Bin Fan
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2016-07-06 19:41 UTC (permalink / raw)
  To: gcc, Torvald Riegel; +Cc: Bin Fan

> CMPXCHG16B is not always available on 64-bit x86 platforms, so 16-byte
> naturally aligned atomics are not inlineable. The support functions for
> such atomics are free to use lock-free implementation if the instruction
> is available on specific platforms.

Except that it is available on almost all 64-bit x86 platforms.  As far as I 
know, only 2004 era AMD processors didn't have it; all Intel 64-bit cpus have 
supported it.

Further, gcc will most certainly make use of it when one specifies any 
command-line option that enables it, such as -march=native.

Therefore we must specify that for x86_64, 16-byte objects are non-locking on 
cpus that support cmpxchg16b.

> However, if a compiler inlines an atomic operation on an _Atomic long
> double object and uses the new lock-free instructions, it could break
> the compatibility if the library implementation is still non-lock-free.
> So such compiler change must be accompanied by a library change, and
> the ABI must be updated as well.

The tie between gcc version and libgcc.so version is tight; I see no reason 
that the libatomic.so version should not also be tight with the compiler version.

It is sufficient that libatomic use atomic instructions when they are 
available.  If a new processor comes out with new capabilities, the compiler 
and runtime are upgraded in lock-step.

How that is selected is beyond the ABI but possible solutions are

(1) ld.so search path, based on processor capabilities,
(2) ifunc (or workalike) where the function is selected at startup,
(3) explicit runtime test within the relevant functions.

All solutions expose the same function interface so the function call ABI is 
not affected.

> _Bool __atomic_is_lock_free (size_t size, void *object);
>
>     Returns whether the object pointed to by object is lock-free.
>     The function assumes that the size of the object is size. If object
>     is NULL then the function assumes that object is aligned on an
>     size-byte address.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65033

The actual code change is completely within libstdc++, but it affects the 
description of the libatomic function.

C++ requires that is_lock_free return the same result for all objects of a 
given type.  Whereas __atomic_is_lock_free, with a non-null object, determines
if we will implement lock free for a *specific* object, using the specific 
object's alignment.

Rather than break the ABI and add a different function that passes the type 
alignment, the solution we hit upon was to pass a "fake", minimally aligned 
pointer as the object parameter: (void *)(uintptr_t)-__alignof(type).
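
The arithmetic behind that fake pointer can be sketched as follows; the
helper name is illustrative. (uintptr_t)-align is the highest address that
is align-aligned but aligned to no larger power of two, so passing it as the
object encodes exactly the type's alignment and nothing else:

```c
#include <stdint.h>
#include <stddef.h>

/* Build the "fake", minimally aligned pointer described above. */
void *fake_min_aligned(size_t align) {
    return (void *)(uintptr_t)-align;
}
```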


The final component of the ABI that you've forgotten to specify, if you want 
full compatibility of linked binaries, is symbol versioning.

We have had two ABI additions since the original release.  See

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libatomic/libatomic.map;h=39e7c2c6b9a70121b5f4031da346a27ae6c1be98;hb=HEAD


r~

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Re: GCC libatomic questions
  2016-07-06 19:41   ` Richard Henderson
@ 2016-07-07 23:56     ` Bin Fan
       [not found]       ` <ac2d60ed-a659-f018-1f11-63fa8f5847f5@oracle.com>
  0 siblings, 1 reply; 20+ messages in thread
From: Bin Fan @ 2016-07-07 23:56 UTC (permalink / raw)
  To: Richard Henderson, gcc, Torvald Riegel

[-- Attachment #1: Type: text/plain, Size: 9889 bytes --]

Hi,

I have a revised version of the libatomic ABI draft which tries to 
accommodate Richard's comments. The new version is attached. The diff is 
also appended.

Thanks,
- Bin

diff ABI.txt ABI-1.1.txt
28a29,30
 > - The versioning of the library external symbols
 >
47a50,57
 > Note
 >
 > Some 64-bit x86 ISA does not support the cmpxchg16b instruction, for
 > example, some early AMD64 processors and later Intel Xeon Phi co-
 > processor. Whether cmpxchg16b is supported may affect the ABI
 > specification for certain atomic types. We will discuss the detail
 > where it has an impact.
 >
101c111,112
< _Atomic __int128                16        16         N               not applicable
---
 > _Atomic __int128 (with at16)    16        16         Y               not applicable
 > _Atomic __int128 (w/o at16)     16        16         N               not applicable
105c116,117
< _Atomic long double             16        16         N           12        4          N
---
 > _Atomic long double (with at16) 16        16         Y           12        4          N
 > _Atomic long double (w/o at16)  16        16         N           12        4          N
106a119,120
 > _Atomic double _Complex         16        16(8)      Y           16        16(8)      N
 >                     (with at16)
107a122
 >                     (w/o at16)
110a126,127
 > _Atomic long double _Imaginary  16        16         Y           12        4          N
 >                     (with at16)
111a129
 >                     (w/o at16)
146a165,167
 > with at16 means the ISA supports cmpxchg16b, w/o at16 means the ISA
 > does not support cmpxchg16b.
 >
191a213,214
 > _Atomic struct {char a[16];}    16        16(1)      Y           16        16(1)      N
 >                     (with at16)
192a216
 >                     (w/o at16)
208a233,235
 > with at16 means the ISA supports cmpxchg16b, w/o at16 means the ISA
 > does not support cmpxchg16b.
 >
246a274,276
 > On the 64-bit x86 platform which supports the cmpxchg16b instruction,
 > 16-byte atomic types whose alignment matches the size is inlineable.
 >
303,306c333,338
< CMPXCHG16B is not always available on 64-bit x86 platforms, so 16-byte
< naturally aligned atomics are not inlineable. The support functions for
< such atomics are free to use lock-free implementation if the instruction
< is available on specific platforms.
---
 > "Inlineability" is a compile time property, which in most cases depends
 > only on the type. In a few cases it also depends on whether the target
 > ISA supports the cmpxchg16b instruction. A compiler may get the ISA
 > information by either compilation flags or inquiring the hardware
 > capabilities. When the hardware capabilities information is not available,
 > the compiler should assume the cmpxchg16b instruction is not supported.
665a698,705
 >     The function takes the size of an object and an address which
 >     is one of the following three cases
 >     - the address of the object
 >     - a faked address that solely indicates the alignment of the
 >       object's address
 >     - NULL, which means that the alignment of the object matches size
 >     and returns whether the object is lock-free.
 >
711c751
< 5. Libatomic Assumption on Non-blocking Memory Instructions
---
 > 5. Libatomic symbol versioning
712a753,868
 > Here is the mapfile for symbol versioning of the libatomic library
 > specified by this ABI specification
 >
 > LIBATOMIC_1.0 {
 >   global:
 >     __atomic_load;
 >     __atomic_store;
 >     __atomic_exchange;
 >     __atomic_compare_exchange;
 >     __atomic_is_lock_free;
 >
 >     __atomic_add_fetch_1;
 >     __atomic_add_fetch_2;
 >     __atomic_add_fetch_4;
 >     __atomic_add_fetch_8;
 >     __atomic_add_fetch_16;
 >     __atomic_and_fetch_1;
 >     __atomic_and_fetch_2;
 >     __atomic_and_fetch_4;
 >     __atomic_and_fetch_8;
 >     __atomic_and_fetch_16;
 >     __atomic_compare_exchange_1;
 >     __atomic_compare_exchange_2;
 >     __atomic_compare_exchange_4;
 >     __atomic_compare_exchange_8;
 >     __atomic_compare_exchange_16;
 >     __atomic_exchange_1;
 >     __atomic_exchange_2;
 >     __atomic_exchange_4;
 >     __atomic_exchange_8;
 >     __atomic_exchange_16;
 >     __atomic_fetch_add_1;
 >     __atomic_fetch_add_2;
 >     __atomic_fetch_add_4;
 >     __atomic_fetch_add_8;
 >     __atomic_fetch_add_16;
 >     __atomic_fetch_and_1;
 >     __atomic_fetch_and_2;
 >     __atomic_fetch_and_4;
 >     __atomic_fetch_and_8;
 >     __atomic_fetch_and_16;
 >     __atomic_fetch_nand_1;
 >     __atomic_fetch_nand_2;
 >     __atomic_fetch_nand_4;
 >     __atomic_fetch_nand_8;
 >     __atomic_fetch_nand_16;
 >     __atomic_fetch_or_1;
 >     __atomic_fetch_or_2;
 >     __atomic_fetch_or_4;
 >     __atomic_fetch_or_8;
 >     __atomic_fetch_or_16;
 >     __atomic_fetch_sub_1;
 >     __atomic_fetch_sub_2;
 >     __atomic_fetch_sub_4;
 >     __atomic_fetch_sub_8;
 >     __atomic_fetch_sub_16;
 >     __atomic_fetch_xor_1;
 >     __atomic_fetch_xor_2;
 >     __atomic_fetch_xor_4;
 >     __atomic_fetch_xor_8;
 >     __atomic_fetch_xor_16;
 >     __atomic_load_1;
 >     __atomic_load_2;
 >     __atomic_load_4;
 >     __atomic_load_8;
 >     __atomic_load_16;
 >     __atomic_nand_fetch_1;
 >     __atomic_nand_fetch_2;
 >     __atomic_nand_fetch_4;
 >     __atomic_nand_fetch_8;
 >     __atomic_nand_fetch_16;
 >     __atomic_or_fetch_1;
 >     __atomic_or_fetch_2;
 >     __atomic_or_fetch_4;
 >     __atomic_or_fetch_8;
 >     __atomic_or_fetch_16;
 >     __atomic_store_1;
 >     __atomic_store_2;
 >     __atomic_store_4;
 >     __atomic_store_8;
 >     __atomic_store_16;
 >     __atomic_sub_fetch_1;
 >     __atomic_sub_fetch_2;
 >     __atomic_sub_fetch_4;
 >     __atomic_sub_fetch_8;
 >     __atomic_sub_fetch_16;
 >     __atomic_test_and_set_1;
 >     __atomic_test_and_set_2;
 >     __atomic_test_and_set_4;
 >     __atomic_test_and_set_8;
 >     __atomic_test_and_set_16;
 >     __atomic_xor_fetch_1;
 >     __atomic_xor_fetch_2;
 >     __atomic_xor_fetch_4;
 >     __atomic_xor_fetch_8;
 >     __atomic_xor_fetch_16;
 >
 >   local:
 >     *;
 > };
 > LIBATOMIC_1.1 {
 >   global:
 >     __atomic_feraiseexcept;
 > } LIBATOMIC_1.0;
 > LIBATOMIC_1.2 {
 >   global:
 >     atomic_thread_fence;
 >     atomic_signal_fence;
 >     atomic_flag_test_and_set;
 >     atomic_flag_test_and_set_explicit;
 >     atomic_flag_clear;
 >     atomic_flag_clear_explicit;
 > } LIBATOMIC_1.1;
 >
 > 6. Libatomic Assumption on Non-blocking Memory Instructions
 >
752,753c908,910
< So such compiler change must be accompanied by a library change, and
< the ABI must be updated as well.
---
 > In such case, the libatomic library and the compiler should be upgraded
 > in lock-step, and the inlineable property for certain atomic types
 > will be changed from false to true.


On 7/6/2016 12:41 PM, Richard Henderson wrote:
>> CMPXCHG16B is not always available on 64-bit x86 platforms, so 16-byte
>> naturally aligned atomics are not inlineable. The support functions for
>> such atomics are free to use lock-free implementation if the instruction
>> is available on specific platforms.
>
> Except that it is available on almost all 64-bit x86 platforms. As far 
> as I know, only 2004 era AMD processors didn't have it; all Intel 
> 64-bit cpus have supported it.
>
> Further, gcc will most certainly make use of it when one specifies any 
> command-line option that enables it, such as -march=native.
>
> Therefore we must specify that for x86_64, 16-byte objects are 
> non-locking on cpus that support cmpxchg16b.
>
>> However, if a compiler inlines an atomic operation on an _Atomic long
>> double object and uses the new lock-free instructions, it could break
>> the compatibility if the library implementation is still non-lock-free.
>> So such compiler change must be accompanied by a library change, and
>> the ABI must be updated as well.
>
> The tie between gcc version and libgcc.so version is tight; I see no 
> reason that the libatomic.so version should not also be tight with the 
> compiler version.
>
> It is sufficient that libatomic use atomic instructions when they are 
> available.  If a new processor comes out with new capabilities, the 
> compiler and runtime are upgraded in lock-step.
>
> How that is selected is beyond the ABI but possible solutions are
>
> (1) ld.so search path, based on processor capabilities,
> (2) ifunc (or workalike) where the function is selected at startup,
> (3) explicit runtime test within the relevant functions.
>
> All solutions expose the same function interface so the function call 
> ABI is not affected.
>
>> _Bool __atomic_is_lock_free (size_t size, void *object);
>>
>>     Returns whether the object pointed to by object is lock-free.
>>     The function assumes that the size of the object is size. If object
>>     is NULL then the function assumes that object is aligned on an
>>     size-byte address.
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65033
>
> The actual code change is completely within libstdc++, but it affects 
> the description of the libatomic function.
>
> C++ requires that is_lock_free return the same result for all objects 
> of a given type.  Whereas __atomic_is_lock_free, with a non-null 
> object, determines
> if we will implement lock free for a *specific* object, using the 
> specific object's alignment.
>
> Rather than break the ABI and add a different function that passes the 
> type alignment, the solution we hit upon was to pass a "fake", 
> minimally aligned pointer as the object parameter: (void 
> *)(uintptr_t)-__alignof(type).
>
>
> The final component of the ABI that you've forgotten to specify, if 
> you want full compatibility of linked binaries, is symbol versioning.
>
> We have had two ABI additions since the original release.  See
>
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libatomic/libatomic.map;h=39e7c2c6b9a70121b5f4031da346a27ae6c1be98;hb=HEAD 
>
>
>
> r~


[-- Attachment #2: ABI-1.1.txt --]
[-- Type: text/plain, Size: 44946 bytes --]

1. Overview

1.1. Why we need an ABI for atomics

The C11 standard allows atomic types to differ in size, representation 
and alignment from the corresponding non-atomic types [1].
The size, representation and alignment of atomic types therefore need 
to be specified in the ABI specification.

A runtime support library, libatomic, already exists on Solaris 
and Linux. The interface of this library needs to be standardized 
as part of the ABI specification, so that

- On a system that supplies libatomic, all compilers in compliance 
  with the ABI can generate compatible binaries linking this library.

- Binaries can remain backward compatible across different versions of 
  the system as long as they support the same ABI.

1.2. What does the atomics ABI specify

The ABI specifies the following

- Data representation of the atomic types.

- The names and behaviors of the implementation-specific support
  functions.

- The versioning of the library external symbols

- The atomic types for which the compiler may generate inlined code. 

- Lock-free property of the inlined atomic operations.

Note that the name and behavior of the libatomic functions specified 
in the C standard do not need to be part of this ABI, because they 
are already required to meet the specification in the standard.

1.3. Affected platforms

The following platforms are affected by this ABI specification.

SPARC (32-bit and 64-bit)
x86 (32-bit and 64-bit)

Sections 1.1 and 1.2, and the Rationale, Notes and Appendix sections 
in the rest of the document, are for explanatory purposes only; they 
are not considered part of the formal ABI specification.

Note

Some 64-bit x86 ISAs do not support the cmpxchg16b instruction; for
example, some early AMD64 processors and the later Intel Xeon Phi 
coprocessor. Whether cmpxchg16b is supported may affect the ABI 
specification for certain atomic types. We will discuss the details
where it has an impact.
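
A sketch of how this capability could be probed at run time, assuming an x86
target (the intrinsics are from GCC's cpuid.h; cmpxchg16b is reported in
CPUID leaf 1, ECX bit 13). The helper name is illustrative, not part of the
ABI:

```c
#include <stdbool.h>
#include <cpuid.h>

/* Query CPUID.1:ECX.CMPXCHG16B to learn whether "at16" applies. */
bool have_cmpxchg16b(void) {
    unsigned int a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return false;
    return (c & bit_CMPXCHG16B) != 0;
}
```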

2. Data Representation

2.1. General Rules

The general rules for size, representation and alignment of the data
representation of atomic types are the following

1) Atomic types have the same size as the corresponding non-atomic 
   types.

2) Atomic types have the same representation as the corresponding 
   non-atomic types.

3) Atomic types have the same alignment as the corresponding 
   non-atomic types, with the following exceptions:

   On 32- and 64-bit x86 platforms and on 64-bit SPARC platforms, 
   atomic types of size 1, 2, 4, 8 or 16-byte have the alignment 
   that matches the size.

   On 32-bit SPARC platforms, atomic types of size 1, 2, 4 or 8-byte
   have the alignment that matches the size. If the alignment of a 
   16-byte non-atomic type is less than 8-byte, the alignment of the 
   corresponding atomic type is increased to 8-byte.

Note 

The above rules apply to both scalar types and aggregate types.

2.2. Atomic scalar types

x86

                                          LP64 (AMD64)                     ILP32 (i386)
C Type                          sizeof    Alignment  Inlineable  sizeof    Alignment  Inlineable
atomic_flag                     1         1          Y           1         1          Y
_Atomic _Bool                   1         1          Y           1         1          Y
_Atomic char                    1         1          Y           1         1          Y
_Atomic signed char             1         1          Y           1         1          Y
_Atomic unsigned char           1         1          Y           1         1          Y
_Atomic short                   2         2          Y           2         2          Y
_Atomic signed short            2         2          Y           2         2          Y
_Atomic unsigned short          2         2          Y           2         2          Y
_Atomic int                     4         4          Y           4         4          Y
_Atomic signed int              4         4          Y           4         4          Y
_Atomic enum                    4         4          Y           4         4          Y
_Atomic unsigned int            4         4          Y           4         4          Y
_Atomic long                    8         8          Y           4         4          Y
_Atomic signed long             8         8          Y           4         4          Y
_Atomic unsigned long           8         8          Y           4         4          Y
_Atomic long long               8         8          Y           8         8          Y
_Atomic signed long long        8         8          Y           8         8          Y
_Atomic unsigned long long      8         8          Y           8         8          Y
_Atomic __int128 (with at16)    16        16         Y               not applicable
_Atomic __int128 (w/o at16)     16        16         N               not applicable
any-type _Atomic *              8         8          Y           4         4          Y
_Atomic float                   4         4          Y           4         4          Y
_Atomic double                  8         8          Y           8         8          Y
_Atomic long double (with at16) 16        16         Y           12        4          N
_Atomic long double (w/o at16)  16        16         N           12        4          N
_Atomic float _Complex          8         8(4)       Y           8         8(4)       Y
_Atomic double _Complex         16        16(8)      Y           16        16(8)      N
                    (with at16)
_Atomic double _Complex         16        16(8)      N           16        16(8)      N
                    (w/o at16)
_Atomic long double _Complex    32        16         N           24        4          N
_Atomic float _Imaginary        4         4          Y           4         4          Y
_Atomic double _Imaginary       8         8          Y           8         8          Y
_Atomic long double _Imaginary  16        16         Y           12        4          N
                    (with at16)
_Atomic long double _Imaginary  16        16         N           12        4          N
                    (w/o at16)

SPARC

                                          LP64 (v9)                        ILP32 (sparc)
C Type                          sizeof    Alignment  Inlineable  sizeof    Alignment  Inlineable
atomic_flag                     1         1          Y           1         1          Y
_Atomic _Bool                   1         1          Y           1         1          Y
_Atomic char                    1         1          Y           1         1          Y
_Atomic signed char             1         1          Y           1         1          Y
_Atomic unsigned char           1         1          Y           1         1          Y
_Atomic short                   2         2          Y           2         2          Y
_Atomic signed short            2         2          Y           2         2          Y
_Atomic unsigned short          2         2          Y           2         2          Y
_Atomic int                     4         4          Y           4         4          Y
_Atomic signed int              4         4          Y           4         4          Y
_Atomic enum                    4         4          Y           4         4          Y
_Atomic unsigned int            4         4          Y           4         4          Y
_Atomic long                    8         8          Y           4         4          Y
_Atomic signed long             8         8          Y           4         4          Y
_Atomic unsigned long           8         8          Y           4         4          Y
_Atomic long long               8         8          Y           8         8          Y
_Atomic signed long long        8         8          Y           8         8          Y
_Atomic unsigned long long      8         8          Y           8         8          Y
_Atomic __int128                16        16         N               not applicable
any-type _Atomic *              8         8          Y           4         4          Y
_Atomic float                   4         4          Y           4         4          Y
_Atomic double                  8         8          Y           8         8          Y
_Atomic long double             16        16         N           16        8          N
_Atomic float _Complex          8         8(4)       Y           8         8(4)       Y
_Atomic double _Complex         16        16(8)      N           16        8          N
_Atomic long double _Complex    32        16         N           32        8          N
_Atomic float _Imaginary        4         4          Y           4         4          Y
_Atomic double _Imaginary       8         8          Y           8         8          Y
_Atomic long double _Imaginary  16        16         N           16        8          N

Here "with at16" means the ISA supports the cmpxchg16b instruction and
"w/o at16" means it does not.

Notes: 

The C standard also specifies some atomic integer types. They are not
listed in the above table because they have the same representation
and alignment requirements as the corresponding direct types [2].

The Inlineable column and the __int128 type are discussed in section 3.

The value in () shows the alignment of the corresponding non-atomic 
type, if it is different from the alignment of the atomic type.

Because the _Atomic specifier cannot be used on a function type [7] and
the _Atomic qualifier cannot modify a function type [8], no atomic
function types are listed in the above table.

On 32-bit x86 platforms, long double is 12 bytes in size and 4-byte
aligned. This ABI specification does not increase the alignment of the
_Atomic long double type: it would not be lock-free even at 16-byte
alignment, because 32-bit x86 has no 12-byte or 16-byte lock-free
instruction.

2.3 Atomic Aggregates and Unions

Atomic structures or unions may have different alignment compared to
the corresponding non-atomic types, subject to rule 3) in section 2.1. 
The alignment change only affects the boundary where an entire 
structure or union is aligned. The offset of each member, the internal 
padding and the size of the structure or union are not affected.

The following tables show selected examples of the size and alignment
of atomic structure types.

x86

                                          LP64 (AMD64)                      ILP32 (i386)
C Type                          sizeof    Alignment  Inlineable   sizeof    Alignment  Inlineable
_Atomic struct {char a[2];}     2         2(1)       Y            2         2(1)       Y
_Atomic struct {char a[3];}     3         1          N            3         1          N
_Atomic struct {short a[2];}    4         4(2)       Y            4         4(2)       Y
_Atomic struct {int a[2];}      8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c[2];
                short s;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char a[16];}    16        16(1)      Y            16        16(1)      N
                    (with at16)
_Atomic struct {char a[16];}    16        16(1)      N            16        16(1)      N
                    (w/o at16)

SPARC

                                          LP64 (v9)                       ILP32 (sparc)
C Type                          sizeof    Alignment  Inlineable   sizeof    Alignment  Inlineable
_Atomic struct {char a[2];}     2         2(1)       Y            2         2(1)       Y 
_Atomic struct {char a[3];}     3         1          N            3         1          N
_Atomic struct {short a[2];}    4         4(2)       Y            4         4(2)       Y
_Atomic struct {int a[2];}      8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c[2];
                short s;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char a[16];}    16        16(1)      N            16        8(1)       N

Here "with at16" means the ISA supports the cmpxchg16b instruction and
"w/o at16" means it does not.

Notes

The value in () shows the alignment of the corresponding non-atomic 
type, if it is different from the alignment of the atomic type.

Because the padding of structure types is not affected by the _Atomic
modifier, the contents of any padding in an atomic structure object
remain undefined; an atomic compare-and-exchange operation on such
objects may therefore fail because of differences in the padding bits.

The increased alignment of 16-byte atomic struct types might be
useful to
- reduce lock sharing with other atomics, and
- allow a more efficient implementation of the runtime support
  functions for atomic operations on such types.

2.4. Bit-fields

Whether atomic bit-field types are permitted is implementation-defined
in the C standard [3]. In this ABI specification, the representation
of atomic bit-fields is unspecified.

3. Lock-free and Inlineable Property

The implementation of atomic operations may map directly to hardware 
atomic instructions. This kind of implementation is lock-free.

Lock-free atomic operations do not require runtime support functions,
and the compiler may generate inlined code for efficiency. This ABI
specification defines a set of inlineable atomic types: an atomic type
is inlineable if the compiler may generate an inlined instruction
sequence for atomic operations on objects of that type. The
implementation of the support functions for the inlineable atomic
types must also be lock-free.

On all affected platforms, atomic types whose size is 1, 2, 4 or 8
bytes and whose alignment matches their size are inlineable.

On 64-bit x86 platforms that support the cmpxchg16b instruction,
16-byte atomic types whose alignment matches their size are also
inlineable.

If an atomic type is not inlineable, the compiler shall always generate
a support function call for atomic operations on objects of that type.
The implementation of the support functions for non-inlineable atomic
types may be lock-free.

Rationale

It is assumed that an atomic object cannot be accessed through both
lock-free and non-lock-free operations while still satisfying the
atomic semantics.

If the compiler always generates runtime support function calls for 
all atomics, the lock-free property would be hidden inside the library 
implementation. However, the compiler may inline the atomic operations, 
and we want to allow such inlining optimizations.

Compiler inlining raises the issue of mix-and-matched accesses to the
same atomic object from compiler-generated code and from the runtime
library functions; the two have to be consistent on the lock-free
property.

One possible solution to achieve lock-free consistency is to specify
the lock-free property on a per-type basis. The C and C++ standards
seem to back this approach: the C++ standard provides a query that
returns a per-type result about whether a type is lock-free [4]. The
C standard does not guarantee that the query result is per-type [5],
but that is the direction it is going [6]. However, the query result
does not necessarily reflect the implementation of atomic operations
on the queried type: an implementation may use lock-free instructions
for a specific object that meets certain criteria. So specifying the
lock-free property on a per-type basis is unnecessarily conservative.

It is possible to specify the lock-free property on a per-object basis.
But it is simpler to forbid the compiler from inlining atomic
operations on "may be lock-free" types, so that the lock-free
optimization stays hidden inside the library implementation.

So this ABI achieves lock-free consistency by specifying which types
may be inlined and requiring that those types be lock-free. For the
inlineable atomic types, any mix-and-matched accesses must then all be
lock-free; for the non-inlineable atomic types, the compiler never
inlines, so mix-and-match never happens.

Notes:

Here are a few examples of small types that do not qualify as
inlineable types:

  _Atomic struct {char a[3];} /* size = 3, alignment = 1 */
  _Atomic long double /* (on 32-bit x86) size = 12, alignment = 4 */

A smart compiler may know that such an object is located at an address
that fits in an 8-byte aligned window, but the ABI-compliant behavior
is to not generate a lock-free inlined code sequence, since a lazy
compiler may instead generate a runtime support function call that is
not implemented lock-free.

"Inlineability" is a compile-time property, which in most cases depends
only on the type. In a few cases it also depends on whether the target
ISA supports the cmpxchg16b instruction. A compiler may obtain the ISA
information either from compilation flags or by querying the hardware
capabilities. When the hardware capability information is not
available, the compiler should assume that cmpxchg16b is not supported.

4. libatomic library functions

4.1. Data Definitions

This section contains examples of system header files that provide the
data interfaces needed by the libatomic functions.

<stdatomic.h>

typedef enum
{
    memory_order_relaxed = 0,
    memory_order_consume = 1,
    memory_order_acquire = 2,
    memory_order_release = 3,
    memory_order_acq_rel = 4,
    memory_order_seq_cst = 5
} memory_order;

typedef _Atomic struct
{
  unsigned char __flag;
} atomic_flag;

Refer to the C standard for the meaning of each enumeration constant
of the memory_order type.

<fenv.h>

SPARC

#define FE_INEXACT    0x01
#define FE_DIVBYZERO  0x02
#define FE_UNDERFLOW  0x04
#define FE_OVERFLOW   0x08
#define FE_INVALID    0x10

x86

#define FE_INVALID    0x01
#define FE_DIVBYZERO  0x04
#define FE_OVERFLOW   0x08
#define FE_UNDERFLOW  0x10
#define FE_INEXACT    0x20

4.2. Support Functions

The following kinds of atomic operations are supported by the runtime
library: load, store, exchange, compare-and-exchange and arithmetic
read-modify-write operations. For the arithmetic read-modify-write
operations, the following kinds of modification are supported:
addition, subtraction, bitwise inclusive or, bitwise exclusive or,
bitwise and, and bitwise nand. There are also classic test-and-set
functions.

For each kind of atomic operation, libatomic provides a generic
version, which accepts a pointer to an object of any atomic type, and
a set of size-specific functions, which accept pointers to atomic
types of size 1, 2, 4 or 8 bytes on all platforms, plus 16 bytes on
64-bit platforms.

Note: Section 2.1 mentions the alignment adjustment for atomic types of
sizes 1, 2, 4, 8 and 16 bytes. For load, store, exchange and
compare-and-exchange operations, it is safe to convert a pointer to any
atomic type of those sizes to a pointer to the corresponding atomic
integer type of the same size.

Note: The size-specific versions accept and return data by value; the
generic version uses memory pointers to pass and return the data objects.

Most of the functions listed in this section can be mapped to the generic 
functions with the same semantics in the C standard. Refer to the C 
standard for the description of the generic functions and how each memory 
order works.

The following functions are available on all platforms.

void __atomic_load (size_t size, void *object, void *loaded, memory_order order);

    Atomically load the value pointed to by object. Assign the loaded
    value to the memory pointed to by loaded. The size of memory
    affected by the load is designated by size. 

int8_t __atomic_load_1 (int8_t *object, memory_order order);
int16_t __atomic_load_2 (int16_t *object, memory_order order);
int32_t __atomic_load_4 (int32_t *object, memory_order order);
int64_t __atomic_load_8 (int64_t *object, memory_order order);

    Atomically load the value pointed to by object. The loaded value is
    returned. The size of memory affected by the load is designated by
    the type of the object. If object is not aligned properly according 
    to the type of object, the behavior is undefined.

    Memory is affected according to the value of order. If order is either
    memory_order_release or memory_order_acq_rel, the behavior of the 
    function is undefined.

void __atomic_store (size_t size, void *object, void *desired, memory_order order)

    Atomically replace the value pointed to by object with the value
    pointed to by desired. The size of memory affected by the store
    is designated by size.

void __atomic_store_1 (int8_t *object, int8_t desired, memory_order order);
void __atomic_store_2 (int16_t *object, int16_t desired, memory_order order);
void __atomic_store_4 (int32_t *object, int32_t desired, memory_order order);
void __atomic_store_8 (int64_t *object, int64_t desired, memory_order order);

    Atomically replace the value pointed to by object with desired.
    The size of memory affected by the store is designated by the
    type of the object. If object is not aligned properly according 
    to the type of object, the behavior is undefined.

    Memory is affected according to the value of order. If order is one of
    memory_order_acquire, memory_order_consume or memory_order_acq_rel, the
    behavior of the function is undefined.

void __atomic_exchange (size_t size, void *object, void *desired, void *loaded, memory_order order);

    Atomically, replace the value pointed to by object with the value
    pointed to by desired, and assign to the memory pointed to by
    loaded the value pointed to by object as it was immediately before
    the effect. The size of memory affected by the exchange is
    designated by size.

int8_t __atomic_exchange_1 (int8_t *object, int8_t desired, memory_order order);
int16_t __atomic_exchange_2 (int16_t *object, int16_t desired, memory_order order);
int32_t __atomic_exchange_4 (int32_t *object, int32_t desired, memory_order order);
int64_t __atomic_exchange_8 (int64_t *object, int64_t desired, memory_order order);

    Atomically, replace the value pointed to by object with desired 
    and return the value pointed to by object immediately before the 
    effect. The size of memory affected by the exchange is designated 
    by the type of object. If object is not aligned properly according 
    to the type of object, the behavior is undefined.

    Memory is affected according to the value of order.

_Bool __atomic_compare_exchange (size_t size, void *object, void *expected, void *desired, memory_order success_order, memory_order failure_order);

    Atomically, compares the memory pointed to by object for equality with 
    the memory pointed to by expected, and if true, replaces the memory
    pointed to by object with the memory pointed to by desired, and if false,
    updates the memory pointed to by expected with the memory pointed to by 
    object. The result of the comparison is returned. The size of memory 
    affected by the compare and exchange is designated by size.

    The compare-and-exchange never fails spuriously, i.e. if the
    comparison for equality returns false, the two values in the
    comparison were not equal. [Note: this specifies that on SPARC and
    x86, compare-exchange is always implemented with "strong"
    semantics; the weak flavors in the C standard are translated to
    strong.]

_Bool __atomic_compare_exchange_1 (int8_t *object, int8_t *expected, int8_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_2 (int16_t *object, int16_t *expected, int16_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_4 (int32_t *object, int32_t *expected, int32_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_8 (int64_t *object, int64_t *expected, int64_t desired, memory_order success_order, memory_order failure_order);

    Atomically, compares the memory pointed to by object for equality with 
    the memory pointed to by expected, and if true, replaces the memory
    pointed to by object with desired, and if false, updates the memory
    pointed to by expected with the memory pointed to by object. The 
    result of the comparison is returned.

    The size of memory affected by the compare and exchange is designated 
    by the type of object. If object is not aligned properly according 
    to the type of object, the behavior is undefined.

    The compare-and-exchange never fails spuriously, i.e. if the
    comparison for equality returns false, the two values in the
    comparison were not equal.

    If the comparison is true, memory is affected according to the
    value of success_order; if the comparison is false, memory is
    affected according to the value of failure_order.

int8_t __atomic_add_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_add_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_add_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_add_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object plus operand and returns the value
    pointed to by object immediately after the effects. If object is 
    not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

int8_t __atomic_fetch_add_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_add_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_add_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_add_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object plus operand and returns the value
    pointed to by object immediately before the effects. If object is 
    not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_sub_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_sub_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_sub_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_sub_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object minus operand and returns the value
    pointed to by object immediately after the effects. If object is not 
    aligned properly according to the type of object, the behavior is 
    undefined. The size of memory affected by the effects is designated 
    by the type of object.

int8_t __atomic_fetch_sub_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_sub_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_sub_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_sub_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically replaces the value pointed to by object with the result of
    the value pointed to by object minus operand and returns the value
    pointed to by object immediately before the effects. If object is 
    not aligned properly according to the type of object, the behavior 
    is undefined.  The size of memory affected by the effects is 
    designated by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_and_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_and_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_and_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_and_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise and of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined.  The size of memory affected by the effects is designated 
    by the type of object.  

int8_t __atomic_fetch_and_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_and_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_and_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_and_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise and of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_or_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_or_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_or_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_or_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise or of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

int8_t __atomic_fetch_or_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_or_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_or_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_or_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise or of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_xor_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_xor_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_xor_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_xor_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise xor of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

int8_t __atomic_fetch_xor_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_xor_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_xor_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_xor_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise xor of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Memory is affected according to the value of order.

int8_t __atomic_nand_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_nand_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_nand_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_nand_fetch_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise nand of the value pointed to by object and operand and returns 
    the value pointed to by object immediately after the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Bitwise operator nand is defined as the following using ANSI C 
    operators: a nand b is equivalent to ~(a & b).

int8_t __atomic_fetch_nand_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_nand_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_nand_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_nand_8 (int64_t *object, int64_t operand, memory_order order);

    Atomically, replaces the value pointed to by object with the result of 
    bitwise nand of the value pointed to by object and operand and returns 
    the value pointed to by object immediately before the effects. If object 
    is not aligned properly according to the type of object, the behavior 
    is undefined. The size of memory affected by the effects is designated 
    by the type of object.

    Bitwise operator nand is defined as the following using ANSI C 
    operators: a nand b is equivalent to ~(a & b).

    Memory is affected according to the value of order.

_Bool __atomic_test_and_set_1 (int8_t *object, memory_order order);
_Bool __atomic_test_and_set_2 (int16_t *object, memory_order order);
_Bool __atomic_test_and_set_4 (int32_t *object, memory_order order);
_Bool __atomic_test_and_set_8 (int64_t *object, memory_order order);

    Atomically sets the byte pointed to by object to the set state and
    returns the state it was in immediately before the effect: true if
    it was in the set state and false if it was in the clear state.
    The size of memory affected by the effects is always one byte.

    Memory is affected according to the value of order.

    The set and clear state are the same as specified for 
    atomic_flag_test_and_set.

_Bool __atomic_is_lock_free (size_t size, void *object);

    Returns whether an object of the given size, located at the given
    address, is lock-free. The object argument is one of the following
    three cases:
    - the address of the object;
    - a fake address that solely indicates the alignment of the
      object's address;
    - NULL, which means that the alignment of the object matches size.

void __atomic_feraiseexcept (int exception);

    Raises the floating-point exception(s) specified by exception.
    The int argument exception represents a subset of the
    floating-point exceptions, and can be zero or the bitwise OR of
    one or more of the floating-point exception macros defined in
    fenv.h in section 4.1.

4.3. 64-bit Specific Interfaces

4.3.1. Data Representation of __int128 type

On x86 platforms, the __int128 type is defined in the 64-bit ABI.

On SPARC platforms, the size and alignment of the __int128 type are
specified as follows:

             sizeof   Alignment
__int128       16        16

4.3.2. Support Functions

The following functions are available only on 64-bit platforms. 

__int128 __atomic_load_16 (__int128 *object, memory_order order);
void __atomic_store_16 (__int128 *object, __int128 desired, memory_order order);
__int128 __atomic_exchange_16 (__int128 * object,  __int128 desired, memory_order order);
_Bool __atomic_compare_exchange_16 (__int128 *object, __int128 *expected, __int128 desired, memory_order success_order, memory_order failure_order);
__int128 __atomic_add_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_add_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_sub_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_sub_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_and_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_and_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_or_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_or_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_xor_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_xor_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_nand_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_nand_16 (__int128 *object, __int128 operand, memory_order order);
_Bool __atomic_test_and_set_16 (__int128 *object, memory_order order);

The description of each function is the same as that of the
corresponding function in section 4.2.

5. Libatomic symbol versioning

Here is the mapfile for symbol versioning of the libatomic library
specified by this ABI specification:

LIBATOMIC_1.0 {
  global:
    __atomic_load;
    __atomic_store;
    __atomic_exchange;
    __atomic_compare_exchange;
    __atomic_is_lock_free;

    __atomic_add_fetch_1;
    __atomic_add_fetch_2;
    __atomic_add_fetch_4;
    __atomic_add_fetch_8;
    __atomic_add_fetch_16;
    __atomic_and_fetch_1;
    __atomic_and_fetch_2;
    __atomic_and_fetch_4;
    __atomic_and_fetch_8;
    __atomic_and_fetch_16;
    __atomic_compare_exchange_1;
    __atomic_compare_exchange_2;
    __atomic_compare_exchange_4;
    __atomic_compare_exchange_8;
    __atomic_compare_exchange_16;
    __atomic_exchange_1;
    __atomic_exchange_2;
    __atomic_exchange_4;
    __atomic_exchange_8;
    __atomic_exchange_16;
    __atomic_fetch_add_1;
    __atomic_fetch_add_2;
    __atomic_fetch_add_4;
    __atomic_fetch_add_8;
    __atomic_fetch_add_16;
    __atomic_fetch_and_1;
    __atomic_fetch_and_2;
    __atomic_fetch_and_4;
    __atomic_fetch_and_8;
    __atomic_fetch_and_16;
    __atomic_fetch_nand_1;
    __atomic_fetch_nand_2;
    __atomic_fetch_nand_4;
    __atomic_fetch_nand_8;
    __atomic_fetch_nand_16;
    __atomic_fetch_or_1;
    __atomic_fetch_or_2;
    __atomic_fetch_or_4;
    __atomic_fetch_or_8;
    __atomic_fetch_or_16;
    __atomic_fetch_sub_1;
    __atomic_fetch_sub_2;
    __atomic_fetch_sub_4;
    __atomic_fetch_sub_8;
    __atomic_fetch_sub_16;
    __atomic_fetch_xor_1;
    __atomic_fetch_xor_2;
    __atomic_fetch_xor_4;
    __atomic_fetch_xor_8;
    __atomic_fetch_xor_16;
    __atomic_load_1;
    __atomic_load_2;
    __atomic_load_4;
    __atomic_load_8;
    __atomic_load_16;
    __atomic_nand_fetch_1;
    __atomic_nand_fetch_2;
    __atomic_nand_fetch_4;
    __atomic_nand_fetch_8;
    __atomic_nand_fetch_16;
    __atomic_or_fetch_1;
    __atomic_or_fetch_2;
    __atomic_or_fetch_4;
    __atomic_or_fetch_8;
    __atomic_or_fetch_16;
    __atomic_store_1;
    __atomic_store_2;
    __atomic_store_4;
    __atomic_store_8;
    __atomic_store_16;
    __atomic_sub_fetch_1;
    __atomic_sub_fetch_2;
    __atomic_sub_fetch_4;
    __atomic_sub_fetch_8;
    __atomic_sub_fetch_16;
    __atomic_test_and_set_1;
    __atomic_test_and_set_2;
    __atomic_test_and_set_4;
    __atomic_test_and_set_8;
    __atomic_test_and_set_16;
    __atomic_xor_fetch_1;
    __atomic_xor_fetch_2;
    __atomic_xor_fetch_4;
    __atomic_xor_fetch_8;
    __atomic_xor_fetch_16;

  local:
    *;
};
LIBATOMIC_1.1 {
  global:
    __atomic_feraiseexcept;
} LIBATOMIC_1.0;
LIBATOMIC_1.2 {
  global:
    atomic_thread_fence;
    atomic_signal_fence;
    atomic_flag_test_and_set;
    atomic_flag_test_and_set_explicit;
    atomic_flag_clear;
    atomic_flag_clear_explicit;
} LIBATOMIC_1.1;

6. Libatomic Assumption on Non-blocking Memory Instructions

libatomic assumes that programmers or compilers properly insert 
SFENCE/MFENCE barriers for the following cases:

1) writes executed with the CLFLUSH instruction;
2) streaming loads/stores ((V)MOVNTx, MASKMOVDQU, MASKMOVQ);
3) any other operations that reference the Write Combining memory type.

Rationale

x86 has a strong memory model: memory reads are not reordered with 
other reads, and writes are not reordered with reads or with other 
writes. The three cases above are exceptions, i.e. those non-blocking 
writes are not ordered with respect to other writes. 
The ABI specifies that code using those non-blocking writes must 
contain proper fences, so that libatomic support functions do not need 
fences to synchronize with those instructions.

Appendix

A.1. Compatibility Notes

On 64-bit SPARC platforms, _Atomic long double is a 16-byte naturally 
aligned atomic type. There is no lock-free instruction for such a type
in the 64-bit SPARC ISA, and the type is not inlineable in this ABI
specification, so the libatomic implementation has to use a
non-lock-free implementation for atomic operations on it.

If, in the future, lock-free instructions for 16-byte naturally 
aligned objects become available in a new SPARC ISA, libatomic could 
leverage them to implement lock-free atomic operations for _Atomic 
long double.

This would be a backward-compatible libatomic change. Because the type
is not inlineable, all atomic operations on objects of the type must go
through libatomic function calls, so the switch from non-lock-free to
lock-free implementations happens entirely inside those functions.

However, if a compiler inlines an atomic operation on an _Atomic long 
double object and uses the new lock-free instructions, it could break 
compatibility if the library implementation is still non-lock-free. 
In that case, the libatomic library and the compiler should be upgraded
in lock step, and the inlineable property of the affected atomic types
changes from false to true.

If a compiler changes the data representation of atomic types, the
change produces incompatible binaries, and it is hard to detect when
such incompatible binaries are linked together.

References

[1] C11 Standard, 6.2.5p27
The size, representation, and alignment of an atomic type need not be 
the same as those of the corresponding unqualified type.

[2] C11 Standard, 7.17.6p1
For each line in the following table,257) the atomic type name is 
declared as a type that has the same representation and alignment 
requirements as the corresponding direct type.258)

Footnote 258 
258) The same representation and alignment requirements are meant to 
imply interchangeability as arguments to functions, return values from 
functions, and members of unions.

[3] C11 Standard, 6.7.2.1p5
A bit-field shall have a type that is a qualified or unqualified 
version of _Bool, signed int, unsigned int, or some other 
implementation-defined type. It is implementation-defined whether 
atomic types are permitted.

[4] C++11 Standard, 29.4p2
The function atomic_is_lock_free (29.6) indicates whether the object 
is lock-free. In any given program execution, the result of the 
lock-free query shall be consistent for all pointers of the same type.

[5] C11 Standard, 7.17.5.1p3
The atomic_is_lock_free generic function returns nonzero (true) if 
and only if the object's operations are lock-free. The result of a 
lock-free query on one object cannot be inferred from the result of 
a lock-free query on another object.

[6] http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_465

[7] C11 Standard, 6.7.2.4p3
The type name in an atomic type specifier shall not refer to an array 
type, a function type, an atomic type, or a qualified type.

[8] C11 Standard, 6.7.3p3
The type modified by the _Atomic qualifier shall not be an array type 
or a function type.


* GCC libatomic ABI specification draft
       [not found]             ` <8317ec9d-41ad-d806-9144-eac2984cdd38@oracle.com>
@ 2016-11-17 20:12               ` Bin Fan
  2016-11-29 11:12                 ` Szabolcs Nagy
  2017-01-17 17:00                 ` Torvald Riegel
  0 siblings, 2 replies; 20+ messages in thread
From: Bin Fan @ 2016-11-17 20:12 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: text/plain, Size: 12493 bytes --]

I got an error from the gcc@gcc.gnu.org alias. Removing the PDF 
attachment and re-sending it to the alias ...

On 11/14/2016 4:34 PM, Bin Fan wrote:
> Hi All,
>
> I have an updated version of libatomic ABI specification draft. Please 
> take a look to see if it matches GCC implementation. The purpose of 
> this document is to establish an official GCC libatomic ABI, and allow 
> compatible compiler and runtime implementations on the affected 
> platforms.
>
> Compared to the last version you have reviewed, here are the major 
> updates
>
> - Rewrite the notes in N2.3.2 to explicitly mention that the 
> implementation of __atomic_compare_exchange follows memcmp/memcpy 
> semantics, and the consequences of it.
>
> - Rewrite section 3 to replace "lock-free" operations with "hardware 
> backed" instructions. The digest of this section is: 1) inlineable 
> atomics must be implemented with the hardware backed atomic 
> instructions. 2) for non-inlineable atomics, the compiler must 
> generate a runtime call, and the runtime support function is free to 
> use any implementation.
>
> - The Rationale section in section 3 is also revised to remove the 
> mentioning of "lock-free", but there is not major change of concept.
>
> - Add note N3.1 to emphasize the assumption of general hardware 
> supported atomic instruction
>
> - Add note N3.2 to discuss the issues of cmpxchg16b
>
> - Add a paragraph in section 4.1 to specify memory_order_consume must 
> be implemented through memory_order_acquire. Section 4.2 emphasizes it 
> again.
>
> - The specification of each runtime functions mostly maps to the 
> corresponding generic functions in the C11 standard. Two functions are 
> worth noting:
> 1) C11 atomic_compare_exchange compares and updates the "value" while 
> __atomic_compare_exchange functions in this ABI compare and update the 
> "memory", which implies the memcmp and memcpy semantics.
> 2) The specification of __atomic_is_lock_free allows both a per-object 
> result and a per-type result. A per-type implementation could pass 
> NULL, or a faked address as the address of the object. A per-object 
> implementation could pass the actual address of the object.
>
> Thanks,
> - Bin
>
> On 8/10/2016 3:33 PM, Bin Fan wrote:
>> Hi Torvald,
>>
>> Thanks a lot for your review. Please find my response inline...
>>
>> On 8/5/2016 8:51 AM, Torvald Riegel wrote:
>>> [CC'ing Andrew MacLeod, who has been working on the atomics too.]
>>>
>>> On Tue, 2016-08-02 at 16:28 -0700, Bin Fan wrote:
>>>> I'm wondering if you have a chance to review the revised libatomic ABI
>>>> draft. The email was rejected by the gcc alias once due to some html
>>>> stuff in the email text. Though I resend a pure txt format version, 
>>>> I'm
>>>> not sure if it worked, so this time I drop the gcc alias.
>>>>
>>>> If you do not have any issues, I'm wondering if this ABI draft 
>>>> could be
>>>> published in some GCC wiki or documentation? I'd be happy to prepare a
>>>> version without the "notes" part.
>>>
>>>
>>>
>>>> Because the padding of structure types is not affected by _Atomic
>>>> modifier, the contents of any padding in the atomic structure object
>>>> is still undefined, therefore the atomic compare and exchange 
>>>> operation
>>>> on such objects may fail due to the difference of the padding.
>>> I think this isn't quite clear.
>> This paragraph is just to clarify that _Atomic does not change (e.g. 
>> zeroing out) the padding
>> bits, whose content were undefined in the current SPARC and x86 ABI 
>> specifications, and will
>> still be undefined for _Atomic aggregates.
>>
>> This paragraph is part of "notes" rather than the main body of the 
>> ABI draft. If it is not clear,
>> I will change it by mentioning the memcmp/memcpy-like semantics.
>>
>>> Perhaps it's easier to describe it in
>>> the way that C++ does, referring to the memcmp/memcpy-like semantics of
>>> compare_exchange (e.g., see N4606 29.6.5p27).
>>> C11 isn't quite clear about this, or I am misunderstanding what they
>>> really mean by "value of the object" (see N1570 7.17.7.4p2).
>> This is the subject of C11 Defect Report 431:
>> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_431
>> which has been fixed to align with the C++ standard and closed with a
>> Proposed Technical Corrigendum which will appear in the next revision
>> of the C standard (~2017).
>>
>> Note that in section 4.2 of this ABI draft, the function description of
>> __atomic_compare_exchange uses "compares the memory pointed to by 
>> object" instead of
>> "compares the value pointed to by object" as you quoted from N1570 
>> 7.17.7.4p.
>>
>> Since you asked about whether you should review the function 
>> descriptions, this is one
>> of the two worth noticing cases. I will mention another one later in 
>> this email.
>>>
>>>> Lock-free atomic operations does not require runtime support 
>>>> functions.
>>>> The compiler may generate inlined code for efficiency. This ABI
>>>> specification defines a few inlineable atomic types. An atomic type
>>>> is inlineable means the compiler may generate inlined instruction
>>>> sequence for atomic operations on such types. The implementation of
>>>> the support functions for the inlineable atomic types must also be
>>>> lock free.
>>> I think it's better to say that the support functions must be 
>>> compatible
>>> with what the compiler would generate.  That they are "lock-free" is
>>> just a forward progress property.  This also applies to later 
>>> paragraphs
>>> in the draft.  Maybe we need to use a different term here, so we can 
>>> use
>>> it for what we want (ie, a HW-backed, inlineable operation).
>> I agree that lock-free atomic operations are not equivalent to 
>> HW-backed atomic operations. I will think about how to mention this 
>> in the ABI. My current thought is, as you suggested, to change 
>> "lock-free" to "HW-backed".
>>
>> So an example of the updated specification would be like this:
>> The implementation of the support functions for the inlineable atomic 
>> types must use
>> HW-backed atomic instructions. For atomic operations on not 
>> inlineable types, the compiler
>> must always generate support function calls.
>>>
>>>> On all affected platforms, atomic types whose size equal to 1, 2, 4
>>>> or 8 and alignment matches the size are inlineable
>>>>
>>>> On the 64-bit x86 platform which supports the cmpxchg16b instruction,
>>>> 16-byte atomic types whose alignment matches the size is inlineable.
>>> I still think making 16-byte atomic types inlined / lock-free when all
>>> we have is a wide cmpxchg is wrong.  AFAIK there is no atomic 16-byte
>>> load instruction on x86 (or is there?), even though cmpxchg16b might be
>>> available.
>> At least GCC 6.1.0 still generates cmpxchg16b for an atomic load with 
>> -march=native
>> on my haswell machine.
>>> I'd prefer if we could fix this in GCC in some way instead
>>> of requiring this by putting it into the ABI.  This also applies to the
>>> double-wide CAS on i386.
>>> IIRC, there is a BZ about this somewhere, but I don't find it.
>>> Andrew, do you remember?
>>>
>>> Basically, there is a correctness and a performance problem.
>>> The atomic variable might be in a read-only-mapped page, which isn't
>>> unreasonable given that the C/C++ standards explicitly require 
>>> lock-free
>>> atomics to be address-free too, which is a clear hint towards enabling
>>> mapping memory to more than one place in the address space. So, if the
>>> user does an atomic load on a 16-byte variable accessible through a
>>> read-only page, we'll get a segfault.
>>> One could argue that C/C++ don't provide any mmap feature, and thus you
>>> can't expect this to work.  But this doesn't seem a good argument to
>>> make from a user's perspective.
>>>
>>> Second, I'd argue that the "lock-free" property is used by most 
>>> users as
>>> an indication of which atomics might be as fast as one would expect
>>> typical HW to be -- not because they are interested in the forward
>>> progress aspect or the address-free aspect.  If atomic loads do cause
>>> writes, the performance of a load will be horrible because of the
>>> contention in cases where many threads issue loads.
>> If the 16-byte atomic read is implemented in software, the current 
>> implementation still uses a lock/mutex, meaning a write will happen 
>> somewhere, maybe not directly on the object memory but somewhere 
>> else (a spinlock or a mutex). That resolves the read-only issue you 
>> mentioned, because the write is on the lock rather than on the 
>> object, but there would still be the performance issue of contention.
>>
>> There are some advanced software algorithms that can make this
>> mostly-readers-occasional-writer scenario more efficient. (For 
>> example, the seqlock mentioned
>> here: http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf)
>> The performance of such algorithms would depend highly on the use 
>> cases, so maybe the
>> user should implement their own algorithm instead of relying on the 
>> compiler/libatomic
>> library to provide the best performance in all cases.
>>> This is even more
>>> unfortunate considering that if one has a 64b CAS, then one can
>>> increment a 64b counter which can be considered to never overflow, 
>>> which
>>> allows one to build efficient atomic snapshots of larger atomic
>>> variables.
>>> OTOH, some people would like to use the GCC builtins to get access to
>>> cmpxchg16b.
>>>
>>> Irrespective of how we deal with this, we should at least document the
>>> current state and the problems associated with it.  Maybe we should
>>> consider providing separate builtins for cmpxchg16b.
>> I'm OK with the current GCC implementation, which I believe matches 
>> the ABI draft. And
>> we can document the current issues as appendix or whatever.
>> If GCC is willing to change, I'm also OK with specifying that 16-byte 
>> atomic types are
>> not inlineable.
>>>
>>>> "Inlineability" is a compile time property, which in most cases 
>>>> depends
>>>> only on the type. In a few cases it also depends on whether the target
>>>> ISA supports the cmpxchg16b instruction. A compiler may get the ISA
>>>> information by either compilation flags or inquiring the hardware
>>>> capabilities. When the hardware capabilities information is not 
>>>> available,
>>>> the compiler should assume the cmpxchg16b instruction is not 
>>>> supported.
>>> I think that strictly speaking, it always depends on the target ISA,
>>> because we assume that it provides 1-byte atomic operations, for
>>> example.
>> Right. The ABI specification itself is ISA-specific. For example, if 
>> we call it a SPARC V9 ABI amendment, then it is safe to assume that 
>> the ISA supports 1-, 2-, 4- and 8-byte atomic hardware instructions, 
>> and therefore safe to make such a specification of "inlineable" in 
>> the ABI.
>>
>> I'm not very familiar with x86 ISA versioning. I used to assume 
>> cmpxchg16b is available
>> on all today's mainstream x86 platforms until I found Xeon Phi does 
>> not support it. That's
>> why the ABI says it depends on target ISA.
>>>
>>>>     memory_order_consume = 1,
>>> [...]
>>>> Refer to C standard for the meaning of each enumeration constants of
>>>> memory_order type.
>>> [...]
>>>> Most of the functions listed in this section can be mapped to the 
>>>> generic
>>>> functions with the same semantics in the C standard. Refer to the C
>>>> standard for the description of the generic functions and how each 
>>>> memory
>>>> order works.
>>> We need to say that memory_order_consume must be implemented through
>>> memory_order_acquire.  The compiler can't preserve dependencies
>>> correctly and will never be able to for the current specification of
>>> consume.  Thus, we must fall back to acquire MO.
>> As far as I can tell, neither SPARC or x86 has instructions that may 
>> benefit from the consume
>> order. So I'm happy to make this change.
>>>
>>> I haven't looked at the descriptions of the individual atomic 
>>> operations
>>> in detail.  Let me know if I should.
>> In the above I mentioned there may be two places in the descriptions 
>> that may be interesting.
>> I have mentioned one in the above (__atomic_compare_exchange). The 
>> other one is
>> __atomic_is_lock_free. This is based on Richard's comments.
>>
>> Thanks again for your review, I will send a new draft based on your 
>> comments. Please send me
>> any further comments/suggestions.
>>
>> Thanks,
>> - Bin
>>>
>>>
>>> Torvald
>>>
>>


[-- Attachment #2: libatomicABIdraft.txt --]
[-- Type: text/plain, Size: 50655 bytes --]

LIBATOMIC ABI SPECIFICATION DRAFT


1. Overview


1.1. Why we need an ABI for atomics


The C11 standard allows the size, representation and alignment of an atomic type to differ from those of the corresponding non-atomic type [1]. The size, representation and alignment of atomic types therefore need to be specified in the ABI specification.


A runtime support library, libatomic, already exists on Solaris and Linux. The interface of this library needs to be standardized as part of the ABI specification, so that


- On a system that supplies libatomic, all compilers complying with the ABI can generate compatible binaries that link against this library.
- Binaries remain backward compatible across different versions of the system, as long as those versions support the same ABI.


1.2. What does the atomics ABI specify


The ABI specifies the following


- Data representation of the atomic types.
- The names and behaviour of the implementation-specific support functions.
- The versioning of the library's external symbols.
- The atomic types for which the compiler may generate inlined code.
- Compatibility requirements for the inlined atomic operations.


Note that the libatomic functions specified in the C Standard are not part of this ABI, because they are not implementation-specific functions. 


1.3. Platforms affected by this ABI specification


SPARC (32-bit and 64-bit)
x86 (32-bit and 64-bit)


It is assumed that 64-bit SPARC platforms only implement the TSO (Total Store Order) memory model.


Sections 1.1 and 1.2, and the Rationale, Notes and Appendix sections, are for explanatory purposes only; they are not part of the formal ABI specification.


Notes


N1.3.1. Some 64-bit x86 platforms, such as some early AMD64 processors and the more modern Intel Xeon Phi co-processor, do not support the cmpxchg16b instruction. cmpxchg16b is discussed in detail in Section 3.


2. Data Representation


2.1. General Rules


The general rules for the size, representation and alignment of atomic types are the following:


1) Atomic types have the same size as the corresponding non-atomic types. 


2) Atomic types have the same representation as the corresponding non-atomic types. 


3) Atomic types have the same alignment as the corresponding non-atomic types, with the following exceptions:


On 32- and 64-bit x86 platforms and on 64-bit SPARC platforms, atomic types of size 1, 2, 4, 8 or 16 bytes have alignment that matches the size. 


On 32-bit SPARC platforms, atomic types of size 1, 2, 4 or 8 bytes have alignment that matches the size. If the alignment of a 16-byte non-atomic type is less than 8 bytes, the alignment of the corresponding atomic type is raised to 8 bytes.


Notes


N2.1.1. The above rules apply to both scalar types and aggregate types.


2.2. Atomic scalar types


x86


                                          LP64 (AMD64)                     ILP32 (i386)
C Type                          sizeof    Alignment  Inlineable  sizeof    Alignment  Inlineable
atomic_flag                     1         1          Y           1         1          Y
_Atomic _Bool                   1         1          Y           1         1          Y
_Atomic char                    1         1          Y           1         1          Y
_Atomic signed char             1         1          Y           1         1          Y
_Atomic unsigned char           1         1          Y           1         1          Y
_Atomic short                   2         2          Y           2         2          Y
_Atomic signed short            2         2          Y           2         2          Y
_Atomic unsigned short          2         2          Y           2         2          Y
_Atomic int                     4         4          Y           4         4          Y
_Atomic signed int              4         4          Y           4         4          Y
_Atomic enum                    4         4          Y           4         4          Y
_Atomic unsigned int            4         4          Y           4         4          Y
_Atomic long                    8         8          Y           4         4          Y
_Atomic signed long             8         8          Y           4         4          Y
_Atomic unsigned long           8         8          Y           4         4          Y
_Atomic long long               8         8          Y           8         8          Y
_Atomic signed long long        8         8          Y           8         8          Y
_Atomic unsigned long long      8         8          Y           8         8          Y
_Atomic __int128 (with at16)    16        16         Y               not applicable
_Atomic __int128 (w/o at16)     16        16         N               not applicable
any-type _Atomic *              8         8          Y           4         4          Y
_Atomic float                   4         4          Y           4         4          Y
_Atomic double                  8         8          Y           8         8          Y
_Atomic long double (with at16) 16        16         Y           12        4          N
_Atomic long double (w/o at16)  16        16         N           12        4          N
_Atomic float _Complex          8         8(4)       Y           8         8(4)       Y
_Atomic double _Complex         16        16(8)      Y           16        16(8)      N
                    (with at16)
_Atomic double _Complex         16        16(8)      N           16        16(8)      N
                    (w/o at16)
_Atomic long double _Complex    32        16         N           24        4          N
_Atomic float _Imaginary        4         4          Y           4         4          Y
_Atomic double _Imaginary       8         8          Y           8         8          Y
_Atomic long double _Imaginary  16        16         Y           12        4          N
                    (with at16)
_Atomic long double _Imaginary  16        16         N           12        4          N
                    (w/o at16)


SPARC


                                          LP64 (v9)                        ILP32 (sparc)
C Type                          sizeof    Alignment  Inlineable  sizeof    Alignment  Inlineable
atomic_flag                     1         1          Y           1         1          Y
_Atomic _Bool                   1         1          Y           1         1          Y
_Atomic char                    1         1          Y           1         1          Y
_Atomic signed char             1         1          Y           1         1          Y
_Atomic unsigned char           1         1          Y           1         1          Y
_Atomic short                   2         2          Y           2         2          Y
_Atomic signed short            2         2          Y           2         2          Y
_Atomic unsigned short          2         2          Y           2         2          Y
_Atomic int                     4         4          Y           4         4          Y
_Atomic signed int              4         4          Y           4         4          Y
_Atomic enum                    4         4          Y           4         4          Y
_Atomic unsigned int            4         4          Y           4         4          Y
_Atomic long                    8         8          Y           4         4          Y
_Atomic signed long             8         8          Y           4         4          Y
_Atomic unsigned long           8         8          Y           4         4          Y
_Atomic long long               8         8          Y           8         8          Y
_Atomic signed long long        8         8          Y           8         8          Y
_Atomic unsigned long long      8         8          Y           8         8          Y
_Atomic __int128                16        16         N               not applicable
any-type _Atomic *              8         8          Y           4         4          Y
_Atomic float                   4         4          Y           4         4          Y
_Atomic double                  8         8          Y           8         8          Y
_Atomic long double             16        16         N           16        8          N
_Atomic float _Complex          8         8(4)       Y           8         8(4)       Y
_Atomic double _Complex         16        16(8)      N           16        8          N
_Atomic long double _Complex    32        16         N           32        8          N
_Atomic float _Imaginary        4         4          Y           4         4          Y
_Atomic double _Imaginary       8         8          Y           8         8          Y
_Atomic long double _Imaginary  16        16         N           16        8          N


with at16 means the ISA supports cmpxchg16b, w/o at16 means the ISA
does not support cmpxchg16b.


Notes


N2.2.1. The C standard also specifies some atomic integer types. They are not in the above table because they have the same representation and alignment requirements as the corresponding direct types [2].


N2.2.2. The Inlineable column and the __int128 type are discussed in section 3.


N2.2.3. The value in parenthesis is the alignment of the corresponding non-atomic type, if it is different from the alignment of the atomic type.


N2.2.4. Because the _Atomic specifier cannot be used on a function type [7] and the _Atomic qualifier cannot modify a function type [8], no atomic function type is listed in the above table.


N2.2.5. On 32-bit x86 platforms, long double has a size of 12 bytes and an alignment of 4 bytes. This ABI specification does not increase the alignment of the _Atomic long double type.


2.3 Atomic Aggregates and Unions


Atomic structures or unions may have different alignment compared to the corresponding non-atomic types, subject to rule 3) in section 2.1. The alignment change only affects the boundary where an entire structure or union is aligned. The offset of each member, the internal padding and the size of the structure or union are not affected.


The following table shows selected examples of the size and alignment of atomic structure types.


x86


                                          LP64 (AMD64)                      ILP32 (i386)
C Type                          sizeof    Alignment  Inlineable   sizeof    Alignment  Inlineable
_Atomic struct {char a[2];}     2         2(1)       Y            2         2(1)       Y
_Atomic struct {char a[3];}     3         1          N            3         1          N
_Atomic struct {short a[2];}    4         4(2)       Y            4         4(2)       Y
_Atomic struct {int a[2];}      8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c[2];
                short s;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char a[16];}    16        16(1)      Y            16        16(1)      N
                    (with at16)
_Atomic struct {char a[16];}    16        16(1)      N            16        16(1)      N
                    (w/o at16)


SPARC


                                          LP64 (v9)                       ILP32 (sparc)
C Type                          sizeof    Alignment  Inlineable   sizeof    Alignment  Inlineable
_Atomic struct {char a[2];}     2         2(1)       Y            2         2(1)       Y
_Atomic struct {char a[3];}     3         1          N            3         1          N
_Atomic struct {short a[2];}    4         4(2)       Y            4         4(2)       Y
_Atomic struct {int a[2];}      8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char c[2];
                short s;
                int i;}         8         8(4)       Y            8         8(4)       Y
_Atomic struct {char a[16];}    16        16(1)      N            16        8(1)       N


with at16 means the ISA supports cmpxchg16b, w/o at16 means the ISA does not support cmpxchg16b.


Notes


N2.3.1. The value in parenthesis is the alignment of the corresponding non-atomic type, if it is different from the alignment of the atomic type.


N2.3.2. For aggregates that are not modified by _Atomic, the contents of the padding bits are undefined. For _Atomic aggregates, the contents of the padding bits are also undefined. The implementation of __atomic_compare_exchange follows the memcmp/memcpy semantics, which may result in unsuccessful comparisons due to the undefined contents of the padding bits. C11 is not clear about this. DR 431 [9] raised this issue, which has been fixed and will appear in the next revision of the C standard (~2017).


N2.3.3. The special alignment requirement on 16-byte atomic struct types might be useful for the following:
- Reducing lock sharing with other atomic objects.
- Allowing related runtime support functions to choose more efficient instructions.


2.4. Bit-fields


It is implementation-defined in the C standard whether atomic bit-field types are permitted [3]. In this ABI specification, the representation of atomic bit-fields is unspecified.


3. Inlineable Property


Some atomic operations map directly to hardware-backed atomic instructions. To implement an atomic operation, the compiler may generate inlined code using such instructions, or a support function call. This ABI specification defines a few inlineable atomic types. The inlineable attribute is specified as follows:


1. The compiler may generate inlined hardware-backed atomic instructions for atomic operations on an object of an inlineable atomic type. The compiler is also allowed to generate a support function call.


2. The implementation of the support functions for an inlineable atomic type must use hardware-backed atomic instructions, to be compatible with the inlined code the compiler may generate.


3. If an atomic type is not inlineable, the compiler shall always generate support function calls for atomic operations on objects of the type. The implementation of the support functions for the type is free to use hardware-backed atomic instructions or any other approach.


On all affected platforms, if the size of an atomic type is 1, 2, 4 or 8 bytes and its alignment matches the size, then the atomic type is inlineable.


On 64-bit x86 platforms that support the cmpxchg16b instruction, if the size of an atomic type is 16 bytes and its alignment matches its size, then the atomic type is inlineable (see the notes in this section for some caveats).


Rationale


It is assumed that an atomic object must be accessed by compatible instructions to achieve atomicity. For example, a C atomic_compare_exchange operation may be implemented by a hardware compare-and-swap instruction, or by doing the compare and the swap in two separate steps protected by a software lock. The two implementations are not compatible, because the software lock used by thread T2 is not visible to thread T1's hardware compare-and-swap instruction; the swap may therefore happen while thread T2 is holding the lock. So the two implementations should not be used to access the same object at the same time in a run of the program.


If the compiler always generates support function calls for all atomic operations, the aforementioned compatibility problem would never happen. But the compiler should be allowed, yet not be forced, to generate inlined code for some atomic operations for better performance. It should be guaranteed that if/when the compiler generates inlined code, it must be compatible with the library implementation.


So this ABI specifies a few inlineable atomic types, for which the compiler may generate inlined code, and both the inlined code and the implementation of the corresponding support functions must use hardware backed atomic instructions. 


Two alternatives considered


1. Specify a type-based criterion: for all types that meet the criterion, both the compiler and the support functions must use hardware-backed atomic instructions; for all types that do not meet the criterion, neither the compiler nor the support functions may use hardware-backed atomic instructions.
 
The C and C++ standards seem to back this approach: the C++ standard provides a query that returns a per-type result about whether a type is lock-free [4]. The C standard does not guarantee that the query result is per-type [5], but it will in the next revision [6]. The problem is that the query result does not necessarily reflect the implementation of the atomic operations on the queried type. Even if is_lock_free returns false for an object because of its type, the implementation may still use hardware-backed atomic instructions for that object. Consider an atomic type with size 3 bytes and alignment 1 byte: it cannot always use a hardware atomic instruction because of its alignment, but it can when the runtime address happens to be 4-byte aligned. So this approach is unnecessarily conservative.


The ABI differs from this alternative in that the ABI allows the runtime implementation for a non-inlineable atomic type to use hardware backed atomic instructions.


2. Specify an object-based criterion: if an atomic object meets the criterion, both the compiler and the support functions must use hardware-backed atomic instructions; otherwise, neither the compiler nor the support functions may use hardware-backed atomic instructions.


The criterion would be based on runtime information, such as the alignment of the object's address, which would be difficult for the compiler to obtain at compile time. It would be much easier for the runtime to perform such an optimization, with the compiler always generating calls for objects of such types.


Notes:


N3.1. This ABI assumes that 1-, 2-, 4- and 8-byte hardware atomic instructions are available on all relevant platforms. This means that for objects of those sizes, naturally aligned load and store instructions are guaranteed to be atomic, and variants of atomic compare-and-swap instructions are available as well.


N3.2. About cmpxchg16b


This ABI document specifies that if cmpxchg16b is supported on a 64-bit x86 platform, then 16-byte properly aligned atomics are inlineable on the platform. 


The only instruction available on such platforms to implement atomic load, store, exchange and compare_exchange operations is cmpxchg16b. One could argue that xmm registers can be used to do a 16-byte memory move, but such a move is not guaranteed to be atomic by the current Intel manual [12]. This leads to the following caveats in implementing the current ABI specification:


1. cmpxchg16b performs a write on the affected memory location. If the atomic variable is in a read-only mapped page, then using cmpxchg16b to implement the load will cause a segfault. One could argue that mmap is not part of the C/C++ specification, but some notes in the C/C++ specifications imply the mmap semantics. C11 explicitly mentions that lock-free atomic operations should be address-free: the same memory location could be mapped at two different addresses, and atomic operations on this location should still communicate atomically [10]. A similar note can be found in C++11 [11].


2. Using cmpxchg16b may not give atomic_load on a GCC _Atomic __int128 object the expected performance. One would expect that in the no-contention scenario, a hardware-backed atomic load runs at full speed, just like the 1-, 2-, 4- and 8-byte atomic loads. However, the write that cmpxchg16b performs on the affected memory location effectively turns a read-only scenario into a high-contention scenario, significantly hurting performance. One might argue that a software lock implementation is no better, because the lock implementation will probably perform a write or a compare-and-swap operation anyway. But a runtime implementation could also choose a more flexible scheme, such as a seqlock [13], to make the mostly-read scenario more efficient. Or, if the runtime simply exposed cmpxchg16b as an intrinsic, an expert user could build his or her own implementation.


Although this ABI specification specifies that 16-byte properly aligned atomics are inlineable on platforms supporting cmpxchg16b, we document these caveats here for further discussion. If we decide to change the inlineable attribute of those atomics, then this ABI, the compiler and the runtime implementation must be updated together.


The compiler and the runtime need to check the availability of cmpxchg16b to implement this ABI specification. The compiler can get this information either from compiler flags or by querying the hardware capabilities; when the information is not available, the compiler should assume that cmpxchg16b is not supported. The runtime library implementation can likewise query the hardware capabilities and choose an implementation at run time. Assuming the user provides correct compiler options and the queries return correct information, on a platform that supports cmpxchg16b both the compiler-generated code and the runtime will use cmpxchg16b; on a platform that does not support cmpxchg16b, the compiler-generated code, including code generated for a generic platform, always calls the support functions, so there is no compatibility problem.


N3.3. Here are a few examples of small types which don't qualify as inlineable:


  _Atomic struct {char a[3];} /* size = 3, alignment = 1 */
  _Atomic long double /* (on 32-bit x86) size = 12, alignment = 4 */


A smart compiler may know that such an object is located at an address that fits in an 8-byte aligned window, but the ABI does not allow the compiler to generate an inlined code sequence using hardware-backed atomic instructions. This is because another compiler, or the same compiler at a different optimization level, may generate a support function call, and the support function implementation is not required to use compatible instructions.


4. libatomic library functions


4.1. Data Definitions


This section contains examples of system header files that provide the data definitions needed by the libatomic functions.


<stdatomic.h>


typedef enum
{
    memory_order_relaxed = 0,
    memory_order_consume = 1,
    memory_order_acquire = 2,
    memory_order_release = 3,
    memory_order_acq_rel = 4,
    memory_order_seq_cst = 5
} memory_order;


typedef _Atomic struct
{
  unsigned char __flag;
} atomic_flag;


Refer to the C standard for the meaning of each enumeration constant of the
memory_order type.


memory_order_consume must be implemented through memory_order_acquire.


Notes
N4.1.1. All platforms affected by this ABI specification implement a strong memory model on which memory_order_consume does not provide any benefit over memory_order_acquire. Therefore this ABI specifies that memory_order_consume is promoted to memory_order_acquire.


<fenv.h>


SPARC


#define FE_INEXACT    0x01
#define FE_DIVBYZERO  0x02
#define FE_UNDERFLOW  0x04
#define FE_OVERFLOW   0x08
#define FE_INVALID    0x10


x86


#define FE_INVALID    0x01
#define FE_DIVBYZERO  0x04
#define FE_OVERFLOW   0x08
#define FE_UNDERFLOW  0x10
#define FE_INEXACT    0x20


4.2. Support Functions


The following kinds of atomic operations are supported by the runtime library: load, store, exchange, compare-and-exchange and arithmetic read-modify-write operations. For the arithmetic read-modify-write operations, the following kinds of modification operation are supported: addition, subtraction, bitwise inclusive or, bitwise exclusive or, bitwise and, bitwise nand. There are also test-and-set functions.


For each kind of atomic operation, libatomic provides a generic version, which accepts a pointer to an object of any atomic type, and several size-specific functions. The size-specific versions pass and return data by value; the generic version passes and returns data via pointers.


Most of the functions listed in this section map to corresponding generic functions in C11. Refer to the C11 standard for the description of the generic functions and for how each memory order works. Note that memory_order_consume must be implemented through memory_order_acquire.


The following functions are available on all platforms.


void __atomic_load (size_t size, void *object, void *loaded, memory_order order);


Atomically load the value pointed to by object. Assign the loaded value to the memory pointed to by loaded. The size of memory affected by the load is designated by size.


int8_t __atomic_load_1 (int8_t *object, memory_order order);
int16_t __atomic_load_2 (int16_t *object, memory_order order);
int32_t __atomic_load_4 (int32_t *object, memory_order order);
int64_t __atomic_load_8 (int64_t *object, memory_order order);


Atomically load the value pointed to by object. The loaded value is returned. The size of memory affected by the load is designated by the type of the object. If object is not aligned properly according to the type of object, the behavior is undefined. 


Memory is affected according to the value of order. If order is either memory_order_release or memory_order_acq_rel, the behavior of the function is undefined.


void __atomic_store (size_t size, void *object, void *desired, memory_order order)


Atomically replace the value pointed to by object with the value pointed to by desired. The size of memory affected by the store is designated by size.


void __atomic_store_1 (int8_t *object, int8_t desired, memory_order order);
void __atomic_store_2 (int16_t *object, int16_t desired, memory_order order);
void __atomic_store_4 (int32_t *object, int32_t desired, memory_order order);
void __atomic_store_8 (int64_t *object, int64_t desired, memory_order order);


Atomically replace the value pointed to by object with desired. The size of memory affected by the store is designated by the type of the object. If object is not aligned properly according to the type of object, the behavior is undefined.


Memory is affected according to the value of order. If order is one of memory_order_acquire, memory_order_consume or memory_order_acq_rel, the behavior of the function is undefined.


void __atomic_exchange (size_t size, void *object, void *desired, void *loaded, memory_model order);


Atomically, replace the value pointed to by object with the value pointed to by desired, and assign to the memory pointed to by loaded the value pointed to by object immediately before the effect. The size of memory affected by the exchange is designated by size.


int8_t __atomic_exchange_1 (int8_t * object, int8_t desired, memory_order)
int16_t __atomic_exchange_2 (int16_t * object, int16_t desired, memory_order)
int32_t __atomic_exchange_4 (int32_t * object, int32_t desired, memory_order)
int64_t __atomic_exchange_8 (int64_t * object, int64_t desired, memory_order)


Atomically, replace the value pointed to by object with desired and return the value pointed to by object immediately before the effect. The size of memory affected by the exchange is designated by the type of object. If object is not aligned properly according to the type of object, the behavior is undefined.


Memory is affected according to the value of order.


_Bool __atomic_compare_exchange (size_t size, void *object, void *expected, void *desired, memory_model success_order, memory_model failure_order);


Atomically, compares the memory pointed to by object for equality with the memory pointed to by expected; if they are equal, replaces the memory pointed to by object with the memory pointed to by desired, and if not, updates the memory pointed to by expected with the memory pointed to by object. The result of the comparison is returned. The size of memory affected by the compare and exchange is designated by size.


The compare and exchange never fails spuriously, i.e. if the comparison for equality returns false, the two values in the comparison were not equal. [Note: this specifies that on SPARC and x86, compare exchange is always implemented with "strong" semantics. The weak flavors in the C standard are translated to strong.]


_Bool __atomic_compare_exchange_1 (int8_t *object, int8_t *expected, int8_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_2 (int16_t *object, int16_t *expected, int16_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_4 (int32_t *object, int32_t *expected, int32_t desired, memory_order success_order, memory_order failure_order);
_Bool __atomic_compare_exchange_8 (int64_t *object, int64_t *expected, int64_t desired, memory_order success_order, memory_order failure_order);


Atomically, compares the memory pointed to by object for equality with the memory pointed to by expected; if they are equal, replaces the memory pointed to by object with desired, and if not, updates the memory pointed to by expected with the memory pointed to by object. The result of the comparison is returned.


The size of memory affected by the compare and exchange is designated by the type of object. If object is not aligned properly according to the type of object, the behavior is undefined.


The compare and exchange never fails spuriously, i.e. if the comparison for equality returns false, the two values in the comparison were not equal.


If the comparison is true, memory is affected according to the value of success_order; if it is false, memory is affected according to the value of failure_order.


int8_t __atomic_add_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_add_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_add_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_add_fetch_8 (int64_t *object, int64_t operand, memory_order order);


Atomically replaces the value pointed to by object with the result of the value pointed to by object plus operand and returns the value pointed to by object immediately after the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


int8_t __atomic_fetch_add_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_add_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_add_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_add_8 (int64_t *object, int64_t operand, memory_order order);


Atomically replaces the value pointed to by object with the result of the value pointed to by object plus operand and returns the value pointed to by object immediately before the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


Memory is affected according to the value of order.


int8_t __atomic_sub_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_sub_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_sub_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_sub_fetch_8 (int64_t *object, int64_t operand, memory_order order);


Atomically replaces the value pointed to by object with the result of the value pointed to by object minus operand and returns the value pointed to by object immediately after the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


int8_t __atomic_fetch_sub_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_sub_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_sub_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_sub_8 (int64_t *object, int64_t operand, memory_order order);


Atomically replaces the value pointed to by object with the result of the value pointed to by object minus operand and returns the value pointed to by object immediately before the effects. If object is not aligned properly according to the type of object, the behavior is undefined.  The size of memory affected by the effects is designated by the type of object.


Memory is affected according to the value of order.


int8_t __atomic_and_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_and_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_and_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_and_fetch_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise and of the value pointed to by object and operand and returns the value pointed to by object immediately after the effects. If object is not aligned properly according to the type of object, the behavior is undefined.  The size of memory affected by the effects is designated by the type of object.


int8_t __atomic_fetch_and_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_and_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_and_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_and_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise and of the value pointed to by object and operand and returns the value pointed to by object immediately before the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


Memory is affected according to the value of order.


int8_t __atomic_or_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_or_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_or_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_or_fetch_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise or of the value pointed to by object and operand and returns the value pointed to by object immediately after the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


int8_t __atomic_fetch_or_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_or_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_or_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_or_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise or of the value pointed to by object and operand and returns the value pointed to by object immediately before the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


Memory is affected according to the value of order.


int8_t __atomic_xor_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_xor_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_xor_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_xor_fetch_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise xor of the value pointed to by object and operand and returns the value pointed to by object immediately after the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


int8_t __atomic_fetch_xor_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_xor_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_xor_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_xor_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise xor of the value pointed to by object and operand and returns the value pointed to by object immediately before the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


Memory is affected according to the value of order.


int8_t __atomic_nand_fetch_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_nand_fetch_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_nand_fetch_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_nand_fetch_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise nand of the value pointed to by object and operand and returns the value pointed to by object immediately after the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


Bitwise operator nand is defined as the following using ANSI C operators: a nand b is equivalent to ~(a & b).


int8_t __atomic_fetch_nand_1 (int8_t *object, int8_t operand, memory_order order);
int16_t __atomic_fetch_nand_2 (int16_t *object, int16_t operand, memory_order order);
int32_t __atomic_fetch_nand_4 (int32_t *object, int32_t operand, memory_order order);
int64_t __atomic_fetch_nand_8 (int64_t *object, int64_t operand, memory_order order);


Atomically, replaces the value pointed to by object with the result of bitwise nand of the value pointed to by object and operand and returns the value pointed to by object immediately before the effects. If object is not aligned properly according to the type of object, the behavior is undefined. The size of memory affected by the effects is designated by the type of object.


Bitwise operator nand is defined as the following using ANSI C operators: a nand b is equivalent to ~(a & b).


Memory is affected according to the value of order.


_Bool __atomic_test_and_set_1 (int8_t *object, memory_order order);
_Bool __atomic_test_and_set_2 (int16_t *object, memory_order order);
_Bool __atomic_test_and_set_4 (int32_t *object, memory_order order)
_Bool __atomic_test_and_set_8 (int64_t *object, memory_order order)


Atomically, sets the value pointed to by object to the set state and returns the state of the value immediately before the effect: true if it was in the set state, false if it was in the clear state. The size of memory affected by the effects is always one byte.


Memory is affected according to the value of order.


The set and clear states are the same as those specified for atomic_flag_test_and_set.


_Bool __atomic_is_lock_free (size_t size, void *object);


Returns whether the object pointed to by object is lock-free. The function assumes that the size of the object is size. The object argument is one of the following three cases:
- the address of the object
- a fake address that solely indicates the alignment of the object's address
- NULL, which means that the alignment of the object matches size


void __atomic_feraiseexcept (int exception);


Raises the floating-point exception(s) specified by exception. The int argument exception represents a subset of the floating-point exceptions, and can be zero or the bitwise OR of one or more of the floating-point exception macros defined in fenv.h (see section 4.1).


4.3. 64-bit Specific Interfaces


4.3.1. Data Representation of __int128 type


On x86 platforms, __int128 type is defined in the 64-bit ABI.


On SPARC platforms, the size and alignment of the __int128 type are specified as follows:


             sizeof   Alignment
__int128       16        16


4.3.2. Support Functions


The following functions are available only on 64-bit platforms.


__int128 __atomic_load_16 (__int128 *object, memory_order order);
void __atomic_store_16 (__int128 *object, __int128 desired, memory_order order);
__int128 __atomic_exchange_16 (__int128 * object,  __int128 desired, memory_order order);
_Bool __atomic_compare_exchange_16 (__int128 *object, __int128 *expected, __int128 desired, memory_order success_order, memory_order failure_order);
__int128 __atomic_add_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_add_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_sub_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_sub_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_and_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_and_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_or_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_or_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_xor_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_xor_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_nand_fetch_16 (__int128 *object, __int128 operand, memory_order order);
__int128 __atomic_fetch_nand_16 (__int128 *object, __int128 operand, memory_order order);
_Bool __atomic_test_and_set_16 (__int128 *object, memory_order order);


The description of each function is the same as that of the corresponding function specified in section 4.2.


5. Libatomic symbol versioning


Here is the mapfile for symbol versioning of the libatomic library specified by this ABI specification:


LIBATOMIC_1.0 {
  global:
    __atomic_load;
    __atomic_store;
    __atomic_exchange;
    __atomic_compare_exchange;
    __atomic_is_lock_free;


    __atomic_add_fetch_1;
    __atomic_add_fetch_2;
    __atomic_add_fetch_4;
    __atomic_add_fetch_8;
    __atomic_add_fetch_16;
    __atomic_and_fetch_1;
    __atomic_and_fetch_2;
    __atomic_and_fetch_4;
    __atomic_and_fetch_8;
    __atomic_and_fetch_16;
    __atomic_compare_exchange_1;
    __atomic_compare_exchange_2;
    __atomic_compare_exchange_4;
    __atomic_compare_exchange_8;
    __atomic_compare_exchange_16;
    __atomic_exchange_1;
    __atomic_exchange_2;
    __atomic_exchange_4;
    __atomic_exchange_8;
    __atomic_exchange_16;
    __atomic_fetch_add_1;
    __atomic_fetch_add_2;
    __atomic_fetch_add_4;
    __atomic_fetch_add_8;
    __atomic_fetch_add_16;
    __atomic_fetch_and_1;
    __atomic_fetch_and_2;
    __atomic_fetch_and_4;
    __atomic_fetch_and_8;
    __atomic_fetch_and_16;
    __atomic_fetch_nand_1;
    __atomic_fetch_nand_2;
    __atomic_fetch_nand_4;
    __atomic_fetch_nand_8;
    __atomic_fetch_nand_16;
    __atomic_fetch_or_1;
    __atomic_fetch_or_2;
    __atomic_fetch_or_4;
    __atomic_fetch_or_8;
    __atomic_fetch_or_16;
    __atomic_fetch_sub_1;
    __atomic_fetch_sub_2;
    __atomic_fetch_sub_4;
    __atomic_fetch_sub_8;
    __atomic_fetch_sub_16;
    __atomic_fetch_xor_1;
    __atomic_fetch_xor_2;
    __atomic_fetch_xor_4;
    __atomic_fetch_xor_8;
    __atomic_fetch_xor_16;
    __atomic_load_1;
    __atomic_load_2;
    __atomic_load_4;
    __atomic_load_8;
    __atomic_load_16;
    __atomic_nand_fetch_1;
    __atomic_nand_fetch_2;
    __atomic_nand_fetch_4;
    __atomic_nand_fetch_8;
    __atomic_nand_fetch_16;
    __atomic_or_fetch_1;
    __atomic_or_fetch_2;
    __atomic_or_fetch_4;
    __atomic_or_fetch_8;
    __atomic_or_fetch_16;
    __atomic_store_1;
    __atomic_store_2;
    __atomic_store_4;
    __atomic_store_8;
    __atomic_store_16;
    __atomic_sub_fetch_1;
    __atomic_sub_fetch_2;
    __atomic_sub_fetch_4;
    __atomic_sub_fetch_8;
    __atomic_sub_fetch_16;
    __atomic_test_and_set_1;
    __atomic_test_and_set_2;
    __atomic_test_and_set_4;
    __atomic_test_and_set_8;
    __atomic_test_and_set_16;
    __atomic_xor_fetch_1;
    __atomic_xor_fetch_2;
    __atomic_xor_fetch_4;
    __atomic_xor_fetch_8;
    __atomic_xor_fetch_16;


  local:
    *;
};
LIBATOMIC_1.1 {
  global:
    __atomic_feraiseexcept;
} LIBATOMIC_1.0;
LIBATOMIC_1.2 {
  global:
    atomic_thread_fence;
    atomic_signal_fence;
    atomic_flag_test_and_set;
    atomic_flag_test_and_set_explicit;
    atomic_flag_clear;
    atomic_flag_clear_explicit;
} LIBATOMIC_1.1;


6. Libatomic Assumption on Non-blocking Memory Instructions


libatomic assumes that programmers or compilers properly insert
SFENCE/MFENCE barriers in the following cases:


1) writes executed with the CLFLUSH instruction
2) streaming (non-temporal) loads/stores: (V)MOVNTx, MASKMOVDQU, MASKMOVQ
3) any other operations that reference the Write Combining memory type


Rationale


x86 has a strong memory model: memory reads are not reordered with other reads, and writes are not reordered with reads or other writes. The three cases listed above are exceptions, i.e. those writes are not ordered with respect to other writes. The ABI specifies that code using those non-blocking writes must contain proper fences, so that the libatomic support functions do not need fences to synchronize with those instructions.


Appendix


A.1. Compatibility Notes


On 64-bit SPARC platforms, _Atomic long double is a 16-byte naturally aligned atomic type. There is no hardware atomic instruction for such a type in the 64-bit SPARC ISA, and the type is not inlineable in this ABI specification.


If in the future, hardware atomic instructions for 16-byte naturally aligned objects are available in a new SPARC ISA, then libatomic could leverage such instructions to implement atomic operations for _Atomic long double.


This would be a backward-compatible libatomic change. Because the type is not inlineable, all atomic operations on objects of the type must go through libatomic function calls, so all such atomic operations can be switched to the hardware atomic instructions inside the libatomic functions without breaking the compiler-library interface.


However, if a compiler inlines an atomic operation on an _Atomic long double object using the new hardware atomic instructions, it breaks compatibility if the library implementation still does not use such instructions. In that case, the libatomic library and the compiler must be upgraded in lock-step, and the inlineable property of the affected atomic types must be updated.


If the compiler changes the data representation of atomic types, the change will produce incompatible binaries, and such incompatibility would be hard to detect when the binaries are linked together.


A.2. References


[1] INCITS/ISO/IEC 9899-2011[2012], 6.2.5p27
The size, representation, and alignment of an atomic type need not be the same as those of the corresponding unqualified type.


[2] INCITS/ISO/IEC 9899-2011[2012], 7.17.6p1
For each line in the following table,257) the atomic type name is declared as a type that has the same representation and alignment requirements as the corresponding direct type.258)


Footnote 258
258) The same representation and alignment requirements are meant to imply interchangeability as arguments to functions, return values from functions, and members of unions.


[3] INCITS/ISO/IEC 9899-2011[2012], 6.7.2.1p5
A bit-field shall have a type that is a qualified or unqualified version of _Bool, signed int, unsigned int, or some other implementation-defined type. It is implementation-defined whether atomic types are permitted.


[4] INCITS/ISO/IEC 14882-2011[2012], 29.4p2
The function atomic_is_lock_free (29.6) indicates whether the object is lock-free. In any given program execution, the result of the lock-free query shall be consistent for all pointers of the same type.


[5] INCITS/ISO/IEC 9899-2011[2012], 7.17.5.1p3
The atomic_is_lock_free generic function returns nonzero (true) if and only if the object's operations are lock-free. The result of a lock-free query on one object cannot be inferred from the result of a lock-free query on another object.


[6] DR 465: http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_465


[7] INCITS/ISO/IEC 9899-2011[2012], 6.7.2.4p3
The type name in an atomic type specifier shall not refer to an array type, a function type, an atomic type, or a qualified type.


[8] INCITS/ISO/IEC 9899-2011[2012], 6.7.3p3
The type modified by the _Atomic qualifier shall not be an array type or a function type.


[9] DR 431: http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_431


[10] INCITS/ISO/IEC 9899-2011[2012], 7.17.5p2


[11] INCITS/ISO/IEC 14882-2011[2012], 29.4p3


[12] Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1, 8.1.1 Guaranteed Atomic Operations


[13] Can Seqlocks Get Along with Programming Language Memory Models? Hans-J. Boehm, http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-11-17 20:12               ` GCC libatomic ABI specification draft Bin Fan
@ 2016-11-29 11:12                 ` Szabolcs Nagy
  2016-12-01 19:14                   ` Bin Fan at Work
  2017-01-17 17:00                 ` Torvald Riegel
  1 sibling, 1 reply; 20+ messages in thread
From: Szabolcs Nagy @ 2016-11-29 11:12 UTC (permalink / raw)
  To: Bin Fan, gcc; +Cc: nd

On 17/11/16 20:12, Bin Fan wrote:
> 
> Although this ABI specification specifies that 16-byte properly aligned atomics are inlineable on platforms
> supporting cmpxchg16b, we document the caveats here for further discussion. If we decide to change the
> inlineable attribute for those atomics, then this ABI, the compiler and the runtime implementation should be
> updated together at the same time.
> 
> 
> The compiler and runtime need to check the availability of cmpxchg16b to implement this ABI specification.
> Here is how it would work: The compiler can get the information either from the compiler flags or by
> inquiring the hardware capabilities. When the information is not available, the compiler should assume that
> cmpxchg16b instruction is not supported. The runtime library implementation can also query the hardware
> compatibility and choose the implementation at runtime. Assuming the user provides correct compiler options

with this abi the runtime implementation *must* query the hardware
(because there might be inlined cmpxchg16b in use in another module
on a hardware that supports it and the runtime must be able to sync
with it).

currently gcc libatomic does not guarantee this, which is dangerously
broken: if gcc is configured with --disable-gnu-indirect-function
(or on targets without ifunc support: solaris, bsd, android, musl,..)
the compiler may inline cmpxchg16b in one translation unit but use
incompatible runtime function in another.

there is PR 70191 but this issue has wider scope.

> and the inquiry returns the correct information, on a platform that supports cmpxchg16b, the code generated
> by the compiler will both use cmpxchg16b; on a platform that does not support cmpxchg16b, the code generated
> by the compiler, including the code generated for a generic platform, always call the support function, so
> there is no compatibility problem.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-11-29 11:12                 ` Szabolcs Nagy
@ 2016-12-01 19:14                   ` Bin Fan at Work
  2016-12-02 11:13                     ` Gabriel Paubert
  0 siblings, 1 reply; 20+ messages in thread
From: Bin Fan at Work @ 2016-12-01 19:14 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Bin Fan at Work, gcc, nd

Hi Szabolcs,

> On Nov 29, 2016, at 3:11 AM, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> 
> On 17/11/16 20:12, Bin Fan wrote:
>> 
>> Although this ABI specification specifies that 16-byte properly aligned atomics are inlineable on platforms
>> supporting cmpxchg16b, we document the caveats here for further discussion. If we decide to change the
>> inlineable attribute for those atomics, then this ABI, the compiler and the runtime implementation should be
>> updated together at the same time.
>> 
>> 
>> The compiler and runtime need to check the availability of cmpxchg16b to implement this ABI specification.
>> Here is how it would work: The compiler can get the information either from the compiler flags or by
>> inquiring the hardware capabilities. When the information is not available, the compiler should assume that
>> cmpxchg16b instruction is not supported. The runtime library implementation can also query the hardware
>> compatibility and choose the implementation at runtime. Assuming the user provides correct compiler options
> 
> with this abi the runtime implementation *must* query the hardware
> (because there might be inlined cmpxchg16b in use in another module
> on a hardware that supports it and the runtime must be able to sync
> with it).

Thanks for the comment. Yes, the ABI requires that libatomic query the hardware. This is necessary if we want the compiler to generate inlined code for 16-byte atomics. Note that this particular issue only affects x86. I notice GCC already has a few builtins declared in cpuid.h. Those functions are x86-specific, so couldn’t the query be done through them?

> 
> currently gcc libatomic does not guarantee this which is dangerously
> broken: if gcc is configured with --disable-gnu-indirect-function
> (or on targets without ifunc support: solaris, bsd, android, musl,..)
> the compiler may inline cmpxchg16b in one translation unit but use
> incompatible runtime function in another.
> 
> there is PR 70191 but this issue has wider scope.

This issue was actually found by us while we were working on the ABI draft, so we filed the bug; we think it should be fixed.

Compiler inlining of 16-byte atomics has other issues, as noted in the ABI draft. The alternative is to stop inlining those atomics, but that would need a compiler fix.

Thanks,
- Bin

> 
>> and the inquiry returns the correct information, on a platform that supports cmpxchg16b, the code generated
>> by the compiler will both use cmpxchg16b; on a platform that does not support cmpxchg16b, the code generated
>> by the compiler, including the code generated for a generic platform, always call the support function, so
>> there is no compatibility problem.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-12-01 19:14                   ` Bin Fan at Work
@ 2016-12-02 11:13                     ` Gabriel Paubert
  2016-12-19 16:33                       ` Torvald Riegel
  0 siblings, 1 reply; 20+ messages in thread
From: Gabriel Paubert @ 2016-12-02 11:13 UTC (permalink / raw)
  To: Bin Fan at Work; +Cc: Szabolcs Nagy, gcc, nd

On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
> Hi Szabolcs,
> 
> > On Nov 29, 2016, at 3:11 AM, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > 
> > On 17/11/16 20:12, Bin Fan wrote:
> >> 
> >> Although this ABI specification specifies that 16-byte properly aligned atomics are inlineable on platforms
> >> supporting cmpxchg16b, we document the caveats here for further discussion. If we decide to change the
> >> inlineable attribute for those atomics, then this ABI, the compiler and the runtime implementation should be
> >> updated together at the same time.
> >> 
> >> 
> >> The compiler and runtime need to check the availability of cmpxchg16b to implement this ABI specification.
> >> Here is how it would work: The compiler can get the information either from the compiler flags or by
> >> inquiring the hardware capabilities. When the information is not available, the compiler should assume that
> >> cmpxchg16b instruction is not supported. The runtime library implementation can also query the hardware
> >> compatibility and choose the implementation at runtime. Assuming the user provides correct compiler options
> > 
> > with this abi the runtime implementation *must* query the hardware
> > (because there might be inlined cmpxchg16b in use in another module
> > on a hardware that supports it and the runtime must be able to sync
> > with it).
> 
> Thanks for the comment. Yes, the ABI requires libatomic must query the hardware. This is 
> necessary if we want the compiler to generate inlined code for 16-byte atomics. Note that 
> this particular issue only affects x86. 

Why? Power (at least recent ones) has 128 bit atomic instructions
(lqarx/stqcx.) and Z has 128 bit compare and swap. 

    Gabriel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-12-02 11:13                     ` Gabriel Paubert
@ 2016-12-19 16:33                       ` Torvald Riegel
  2016-12-20 13:27                         ` Ulrich Weigand
  0 siblings, 1 reply; 20+ messages in thread
From: Torvald Riegel @ 2016-12-19 16:33 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Bin Fan at Work, Szabolcs Nagy, gcc, nd

On Fri, 2016-12-02 at 12:13 +0100, Gabriel Paubert wrote:
> On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
> > Hi Szabolcs,
> > 
> > > On Nov 29, 2016, at 3:11 AM, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > > 
> > > On 17/11/16 20:12, Bin Fan wrote:
> > >> 
> > >> Although this ABI specification specifies that 16-byte properly aligned atomics are inlineable on platforms
> > >> supporting cmpxchg16b, we document the caveats here for further discussion. If we decide to change the
> > >> inlineable attribute for those atomics, then this ABI, the compiler and the runtime implementation should be
> > >> updated together at the same time.
> > >> 
> > >> 
> > >> The compiler and runtime need to check the availability of cmpxchg16b to implement this ABI specification.
> > >> Here is how it would work: The compiler can get the information either from the compiler flags or by
> > >> inquiring the hardware capabilities. When the information is not available, the compiler should assume that
> > >> cmpxchg16b instruction is not supported. The runtime library implementation can also query the hardware
> > >> compatibility and choose the implementation at runtime. Assuming the user provides correct compiler options
> > > 
> > > with this abi the runtime implementation *must* query the hardware
> > > (because there might be inlined cmpxchg16b in use in another module
> > > on a hardware that supports it and the runtime must be able to sync
> > > with it).
> > 
> > Thanks for the comment. Yes, the ABI requires libatomic must query the hardware. This is 
> > necessary if we want the compiler to generate inlined code for 16-byte atomics. Note that 
> > this particular issue only affects x86. 
> 
> Why? Power (at least recent ones) has 128 bit atomic instructions
> (lqarx/stqcx.) and Z has 128 bit compare and swap. 

That's not the only factor affecting whether cmpxchg16b or such is used
for atomics.  If the HW just offers a wide CAS but no wide atomic load,
then even an atomic load is not truly just a load, which breaks (1)
atomic loads on read-only mapped memory and (2) volatile atomic loads
(unless we claim that an idempotent store is like a load, which is quite
a stretch for volatile I think).


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-12-19 16:33                       ` Torvald Riegel
@ 2016-12-20 13:27                         ` Ulrich Weigand
  2016-12-20 13:58                           ` Szabolcs Nagy
  0 siblings, 1 reply; 20+ messages in thread
From: Ulrich Weigand @ 2016-12-20 13:27 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Gabriel Paubert, Bin Fan@Work, Szabolcs Nagy, gcc, nd

Torvald Riegel wrote:
> On Fri, 2016-12-02 at 12:13 +0100, Gabriel Paubert wrote:
> > On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
> > > Thanks for the comment. Yes, the ABI requires libatomic must query the hardware. This is 
> > > necessary if we want the compiler to generate inlined code for 16-byte atomics. Note that 
> > > this particular issue only affects x86. 
> > 
> > Why? Power (at least recent ones) has 128 bit atomic instructions
> > (lqarx/stqcx.) and Z has 128 bit compare and swap. 
> 
> That's not the only factor affecting whether cmpxchg16b or such is used
> for atomics.  If the HW just offers a wide CAS but no wide atomic load,
> then even an atomic load is not truly just a load, which breaks (1)
> atomic loads on read-only mapped memory and (2) volatile atomic loads
> (unless we claim that an idempotent store is like a load, which is quite
> a stretch for volatile I think).

I may have missed the context of the discussion, but just on the
specific ISA question here: both Power and z not only have the
16-byte CAS (or load-and-reserve/store-conditional), but they also both
have specific 16-byte atomic load and store instructions (lpq/stpq
on z, lq/stq on Power).

Those are available on any system supporting z/Architecture (z900 and up),
and on any Power system supporting the V2.07 ISA (POWER8 and up).  GCC
does in fact use those instructions to implement atomic operations on
16-byte data types on those machines.

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  Ulrich.Weigand@de.ibm.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-12-20 13:27                         ` Ulrich Weigand
@ 2016-12-20 13:58                           ` Szabolcs Nagy
  2016-12-22 14:29                             ` Ulrich Weigand
  0 siblings, 1 reply; 20+ messages in thread
From: Szabolcs Nagy @ 2016-12-20 13:58 UTC (permalink / raw)
  To: Ulrich Weigand, Torvald Riegel; +Cc: nd, Gabriel Paubert, Bin Fan@Work, gcc

On 20/12/16 13:26, Ulrich Weigand wrote:
> Torvald Riegel wrote:
>> On Fri, 2016-12-02 at 12:13 +0100, Gabriel Paubert wrote:
>>> On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
>>>> Thanks for the comment. Yes, the ABI requires libatomic must query the hardware. This is 
>>>> necessary if we want the compiler to generate inlined code for 16-byte atomics. Note that 
>>>> this particular issue only affects x86. 
>>>
>>> Why? Power (at least recent ones) has 128 bit atomic instructions
>>> (lqarx/stqcx.) and Z has 128 bit compare and swap. 
>>
>> That's not the only factor affecting whether cmpxchg16b or such is used
>> for atomics.  If the HW just offers a wide CAS but no wide atomic load,
>> then even an atomic load is not truly just a load, which breaks (1)
>> atomic loads on read-only mapped memory and (2) volatile atomic loads
>> (unless we claim that an idempotent store is like a load, which is quite
>> a stretch for volatile I think).
> 
> I may have missed the context of the discussion, but just on the
> specific ISA question here: both Power and z not only have the
> 16-byte CAS (or load-and-reserve/store-conditional), but they also both
> have specific 16-byte atomic load and store instructions (lpq/stpq
> on z, lq/stq on Power).
> 
> Those are available on any system supporting z/Architecture (z900 and up),
> and on any Power system supporting the V2.07 ISA (POWER8 and up).  GCC
> does in fact use those instructions to implement atomic operations on
> 16-byte data types on those machines.

that's a bug.

at least i don't see how gcc makes sure the libatomic
calls can interoperate with inlined atomics.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-12-20 13:58                           ` Szabolcs Nagy
@ 2016-12-22 14:29                             ` Ulrich Weigand
  2016-12-22 17:38                               ` Segher Boessenkool
  0 siblings, 1 reply; 20+ messages in thread
From: Ulrich Weigand @ 2016-12-22 14:29 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Torvald Riegel, nd, Gabriel Paubert, Bin Fan@Work, gcc,
	Andreas.Krebbel, dje.gcc, segher

Szabolcs Nagy wrote:
> On 20/12/16 13:26, Ulrich Weigand wrote:
> > I may have missed the context of the discussion, but just on the
> > specific ISA question here: both Power and z not only have the
> > 16-byte CAS (or load-and-reserve/store-conditional), but they also both
> > have specific 16-byte atomic load and store instructions (lpq/stpq
> > on z, lq/stq on Power).
> > 
> > Those are available on any system supporting z/Architecture (z900 and up),
> > and on any Power system supporting the V2.07 ISA (POWER8 and up).  GCC
> > does in fact use those instructions to implement atomic operations on
> > 16-byte data types on those machines.
> 
> that's a bug.
> 
> at least i don't see how gcc makes sure the libatomic
> calls can interoperate with inlined atomics.

Hmm, interesting.  On z, there is no issue with ISA levels, since *all*
64-bit platforms support the 16-byte atomics (and on non-64-bit platforms,
16-byte data types are not supported at all).

However, there still seems to be a problem, but this time related to
alignment issues.  We do have the 16-byte atomic instructions, but they
only work on 16-byte aligned data.  This is a problem in particular
since the default alignment of 16-byte data types is still 8 bytes
on our platform (since the ABI only guarantees 8-byte stack alignment).

That's why the libatomic configure check thinks it cannot use the
atomic instructions when building on z, and generates code that uses
the separate lock.  However, *if* a particular object can be proven
by the compiler to be 16-byte aligned, it will emit the inline
atomic instruction.  This means there is indeed a bug if that same
object is also operated on via the library routine.

Andreas suggested that the best way to fix this would be to add a
runtime alignment check to the libatomic routines and also use the
atomic instructions in the library whenever the object actually
happens to be correctly aligned.  It seems that this should indeed
fix the problem (and also use the most efficient way in all cases).


Not sure about Power -- adding David and Segher on CC ...


Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  Ulrich.Weigand@de.ibm.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-12-22 14:29                             ` Ulrich Weigand
@ 2016-12-22 17:38                               ` Segher Boessenkool
  2017-01-04 11:25                                 ` Szabolcs Nagy
  2017-01-19 15:18                                 ` Torvald Riegel
  0 siblings, 2 replies; 20+ messages in thread
From: Segher Boessenkool @ 2016-12-22 17:38 UTC (permalink / raw)
  To: Ulrich Weigand
  Cc: Szabolcs Nagy, Torvald Riegel, nd, Gabriel Paubert, Bin Fan@Work,
	gcc, Andreas.Krebbel, dje.gcc

On Thu, Dec 22, 2016 at 03:28:56PM +0100, Ulrich Weigand wrote:
> However, there still seems to be a problem, but this time related to
> alignment issues.  We do have the 16-byte atomic instructions, but they
> only work on 16-byte aligned data.  This is a problem in particular
> since the default alignment of 16-byte data types is still 8 bytes
> on our platform (since the ABI only guarantees 8-byte stack alignment).
> 
> That's why the libatomic configure check thinks it cannot use the
> atomic instructions when building on z, and generates code that uses
> the separate lock.  However, *if* a particular object can be proven
> by the compiler to be 16-byte aligned, it will emit the inline
> atomic instruction.  This means there is indeed a bug if that same
> object is also operated on via the library routine.
> 
> Andreas suggested that the best way to fix this would be to add a
> runtime alignment check to the libatomic routines and also use the
> atomic instructions in the library whenever the object actually
> happens to be correctly aligned.  It seems that this should indeed
> fix the problem (and also use the most efficient way in all cases).
> 
> 
> Not sure about Power -- adding David and Segher on CC ...

We do not always have all atomic instructions.  Not all processors have
all of them, and which are used depends on the compiler flags.  How would
libatomic know which compiler flags were used to compile the program it is
linked to?

Sounds like a job for multilibs?


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-12-22 17:38                               ` Segher Boessenkool
@ 2017-01-04 11:25                                 ` Szabolcs Nagy
  2017-01-19 15:18                                 ` Torvald Riegel
  1 sibling, 0 replies; 20+ messages in thread
From: Szabolcs Nagy @ 2017-01-04 11:25 UTC (permalink / raw)
  To: Segher Boessenkool, Ulrich Weigand
  Cc: nd, Torvald Riegel, Gabriel Paubert, Bin Fan@Work, gcc,
	Andreas.Krebbel, dje.gcc

On 22/12/16 17:37, Segher Boessenkool wrote:
> We do not always have all atomic instructions.  Not all processors have
> all, and it depends on the compiler flags used which are used.  How would
> libatomic know what compiler flags are used to compile the program it is
> linked to?
> 
> Sounds like a job for multilibs?

x86_64 uses ifunc dispatch to always use atomic
instructions if available (which is bad because
ifunc is not supported on all platforms).

either such runtime feature detection and dispatch
is needed in libatomic or different abis have to
be supported (with the usual hassle).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2016-11-17 20:12               ` GCC libatomic ABI specification draft Bin Fan
  2016-11-29 11:12                 ` Szabolcs Nagy
@ 2017-01-17 17:00                 ` Torvald Riegel
  2017-01-18 22:23                   ` Richard Henderson
  1 sibling, 1 reply; 20+ messages in thread
From: Torvald Riegel @ 2017-01-17 17:00 UTC (permalink / raw)
  To: Bin Fan; +Cc: gcc, Richard Henderson, Jakub Jelinek

On Thu, 2016-11-17 at 12:12 -0800, Bin Fan wrote:
> On 11/14/2016 4:34 PM, Bin Fan wrote:
> > Hi All,
> >
> > I have an updated version of libatomic ABI specification draft. Please 
> > take a look to see if it matches GCC implementation. The purpose of 
> > this document is to establish an official GCC libatomic ABI, and allow 
> > compatible compiler and runtime implementations on the affected 
> > platforms.

Thanks for the update, and sorry for the late reply.  Comments below.

> > - Rewrite section 3 to replace "lock-free" operations with "hardware 
> > backed" instructions. The digest of this section is: 1) inlineable 
> > atomics must be implemented with the hardware backed atomic 
> > instructions. 2) for non-inlineable atomics, the compiler must 
> > generate a runtime call, and the runtime support function is free to 
> > use any implementation.

OK.

I still think that using hardware-backed instructions for a particular
type requires that there is a true atomic load instruction for that
type.  Emulating a load with an idempotent store (eg, cmpxchg16b) is not
useful, overall.

One could argue that an idempotent atomic HW store such as a cmpxchg16b
in a loop is indeed lock-free.  However, IMO the intention behind
"lock-free" atomics in C and C++ is to offer atomics that are both
lock-free *and* as fast as one would assume for a fully HW-backed
solution for atomic accesses.  This includes that loads must be cheaper
than stores, in particular under contention / concurrent accesses by
several threads.
I believe that "fast" is much more often the motivation for using
lock-free atomics than the actual "lock-free" part, i.e., the
progress-guarantee aspect (which isn't even lock-free but
obstruction-free, see below).  If we do see a sufficiently strong need
for lock-free atomics, we should build something just for that (eg, if
we remove the address-free requirement, we can support lock-free (in
the progress-guarantee sense) operations for a lot more types).

Also, while that previous issue is "just" a performance issue, the fact
that we could issue a store when calling atomic_load() is a
correctness issue, I think.
One example are volatile atomic loads; while C/C++ don't really
constrain what a volatile load needs to be in the underlying
implementation, I think most users would assume that a load really means
a hardware load instruction of some sort, and nothing else.  cmpxchg16b
conflicts with such an assumption.
Another example is read-only mapped memory.

Bottom line: we shouldn't rely solely on cmpxchg16b and similar.
(Though this doesn't necessarily mean that there can't be compiler flags
that enable its use.)


I think the ABI should set a baseline for each architecture, and the
baseline decides whether something is inlinable or not.  Thus, the
x86_64 ABI would make __int128 operations not inlinable (because of the
issues with cmpxchg16b, see above).

If users want to use capabilities beyond the baseline, they can choose
to use flags that alter/extend the ABI.  For example, if they use a flag
that explicitly enables the use of cmpxchg16b for atomics, they also
need to use a libatomic implementation built in the same way (if
possible).  This then creates a new ABI(-variant), basically.


I've made a few tests on my x86_64 machine a few weeks ago, and I didn't
see cmpxchg16b being used.  IIRC, I also looked at libatomic and didn't
see it (but I don't remember for sure).  Either way, if I should have
been wrong, and we are using cmpxchg16b for loads, this should be fixed.
Ideally, this should be fixed before the stage 3 deadline this Friday.
Such a fix might potentially break existing uses, but the earlier we fix
this, the better.


Section 3 Rationale, alternative 1: I'm wondering if the example is
correct.  For a 4-byte-aligned type of size 3, the implementation cannot
simply use 4-byte hardware-backed atomics because this will inevitably
touch the 4th byte I think, and the implementation can't know whether
this is padding or not.  Or do we expect that things like packed structs
are disallowed?

N3.1:  Why do you assume that 8-byte HW atomics are available on i386?
Because cmpxchg8b is available for CPUs that are the lowest i?86 we
still intend to support?

I'd also use "hardware-backed" instead of "hardware backed".

> > - The Rationale section in section 3 is also revised to remove the 
> > mentioning of "lock-free", but there is not major change of concept.
> >
> > - Add note N3.1 to emphasize the assumption of general hardware 
> > supported atomic instruction
> >
> > - Add note N3.2 to discuss the issues of cmpxchg16b

See above.

> > - Add a paragraph in section 4.1 to specify memory_order_consume must 
> > be implemented through memory_order_acquire. Section 4.2 emphasizes it 
> > again.
> >
> > - The specification of each runtime functions mostly maps to the 
> > corresponding generic functions in the C11 standard. Two functions are 
> > worth noting:
> > 1) C11 atomic_compare_exchange compares and updates the "value" while 
> > __atomic_compare_exchange functions in this ABI compare and update the 
> > "memory", which implies the memcmp and memcpy semantics.

In Section 4, parts about atomic_compare_exchange: should there be a
back-reference to the memcmp point made earlier in the document?

> > 2) The specification of __atomic_is_lock_free allows both a per-object 
> > result and a per-type result. A per-type implementation could pass 
> > NULL, or a faked address as the address of the object. A per-object 
> > implementation could pass the actual address of the object.

The __atomic_is_lock_free description should specify that "lock-free"
refers to the definition of "lock-free" in C++14, which includes
"address-free".  I'm referring to C++14 specifically because this
contains an update which is relevant for (1) LL/SC-based architectures
(ie, that "lock-free" is actually what is called obstruction-free in the
literature) and (2) for any libatomic implementation that wants to use
HW atomics for things like the example in Section 3's Rationale,
alternative 1 (see above).


This ABI needs to also specify how hardware-backed atomics are
implemented on a particular architecture.  For example, on architectures
where there is more than one choice for how to implement certain memory
orders
(eg, ARM), the ABI should pick a certain mapping.  I guess this should
be a note in Section 4, maybe as a separate subsection and/or an
additional note around the memory_order enum description; I'd keep the
note about implementing something equivalent to C11/C++11 semantics.
What we would document is something like the possible mappings discussed
here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html


There are typos in Section 2.4.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: GCC libatomic ABI specification draft
  2017-01-17 17:00                 ` Torvald Riegel
@ 2017-01-18 22:23                   ` Richard Henderson
  2017-01-19 15:02                     ` Torvald Riegel
  2017-01-20 13:42                     ` Michael Matz
  0 siblings, 2 replies; 20+ messages in thread
From: Richard Henderson @ 2017-01-18 22:23 UTC (permalink / raw)
  To: Torvald Riegel, Bin Fan; +Cc: gcc, Jakub Jelinek

On 01/17/2017 09:00 AM, Torvald Riegel wrote:
> I think the ABI should set a baseline for each architecture, and the
> baseline decides whether something is inlinable or not.  Thus, the
> x86_64 ABI would make __int128 operations not imlinable (because of the
> issues with cmpxchg16b, see above).
>
> If users want to use capabilities beyond the baseline, they can choose
> to use flags that alter/extend the ABI.  For example, if they use a flag
> that explicitly enables the use of cmpxchg16b for atomics, they also
> need to use a libatomic implementation built in the same way (if
> possible).  This then creates a new ABI(-variant), basically.

Yes.  Other examples here are power7/power8 and armv6/armv7.

In both cases, the architecture added double-word load(-locked) and 
store(-conditional) instructions.  In order for us to use these new 
instructions inline, libatomic must be updated to use them as well.

The general principle, in my opinion, is that extensions to the ISA should 
require that libatomic either be re-built, or perform runtime detection in 
order to select the internal algorithm used.

In the case of arm, distributions normally either (1) build for a specific cpu 
revision, (2) build for old-arm + soft-fpu, (3) build for armv7 + hard-fpu.  So 
most distributions would not actually require a runtime check for arm.

In the case of power, I assume it's possible to run ppc64 on power8, but every 
power8 system to which I have access has ppc64le deployed.  Certainly ppc64le 
would not need a runtime check, but it would seem prudent for ppc64 to gain a 
runtime check for the power8 insns.

> I've made a few tests on my x86_64 machine a few weeks ago, and I didn't
> see cmpxchg16b being used.  IIRC, I also looked at libatomic and didn't
> see it (but I don't remember for sure).  Either way, if I should have
> been wrong, and we are using cmpxchg16b for loads, this should be fixed.
> Ideally, this should be fixed before the stage 3 deadline this Friday.
> Such a fix might potentially break existing uses, but the earlier we fix
> this, the better.

You needed to use -mcx16, or any other option (such as -march=native) that 
implies that.  And, you will find that expand_atomic_load does have a 
larger-than-word-size fallback path that does use expand_atomic_compare_and_swap.

So, yes, there's something here that needs adjustment.

> Section 3 Rationale, alternative 1: I'm wondering if the example is
> correct.  For a 4-byte-aligned type of size 3, the implementation cannot
> simply use 4-byte hardware-backed atomics because this will inevitably
> touch the 4th byte I think, and the implementation can't know whether
> this is padding or not.  Or do we expect that things like packed structs
> are disallowed?

If we atomically store an unchanged value into the 4th byte, can we tell?

> N3.1:  Why do you assume that 8-byte HW atomics are available on i386?
> Because cmpxchg8b is available for CPUs that are the lowest i?86 we
> still intend to support?

For various definitions of "we", I suppose.  Red Hat certainly does not support 
anything lower than i686, which does have cmpxchg8b.

I suspect that the GNU project still supports i486.  I do know that glibc has 
dropped support for i386.

I should note that supporting 64-bit atomics on i686 *is* possible, without the 
CAS problem that you describe for cmpxchg16b, because we *are* guaranteed that 
the FPU supports a 64-bit atomic load/store.  And we do already handle this; 
see the atomic_loaddi_fpu and atomic_storedi_fpu patterns.

I'll also note that, as per above, this implies that if we build for i586-*, 
libatomic should provide runtime paths that detect and use i686 insns, so that 
the library is compatible with what the compiler will generate inline given 
appropriate command-line options.


r~


* Re: GCC libatomic ABI specification draft
  2017-01-18 22:23                   ` Richard Henderson
@ 2017-01-19 15:02                     ` Torvald Riegel
  2017-01-20 13:42                     ` Michael Matz
  1 sibling, 0 replies; 20+ messages in thread
From: Torvald Riegel @ 2017-01-19 15:02 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bin Fan, gcc, Jakub Jelinek

On Wed, 2017-01-18 at 14:23 -0800, Richard Henderson wrote:
> On 01/17/2017 09:00 AM, Torvald Riegel wrote:
> > I think the ABI should set a baseline for each architecture, and the
> > baseline decides whether something is inlinable or not.  Thus, the
> > x86_64 ABI would make __int128 operations not inlinable (because of the
> > issues with cmpxchg16b, see above).
> >
> > If users want to use capabilities beyond the baseline, they can choose
> > to use flags that alter/extend the ABI.  For example, if they use a flag
> > that explicitly enables the use of cmpxchg16b for atomics, they also
> > need to use a libatomic implementation built in the same way (if
> > possible).  This then creates a new ABI(-variant), basically.
> 
> Yes.  Other examples here are power7/power8 and armv6/armv7.
> 
> In both cases, the architecture added double-word load(-locked) and 
> store(-conditional) instructions.  In order for us to use these new 
> instructions inline, libatomic must be updated to use them as well.
> 
> The general principle, in my opinion, is that extensions to the ISA should 
> require that libatomic either be re-built, or perform runtime detection in 
> order to select the internal algorithm used.

That sounds okay to me.  I think we would have to make that clear in
the ABI specification though, because this also includes requirements
for the user of the ABI (eg, if you compile for power8, you need to use
a suitably built libatomic) and for distributions.

> In the case of arm, distributions normally either (1) build for a specific cpu 
> revision, (2) build for old-arm + soft-fpu, (3) build for armv7 + hard-fpu.  So 
> most distributions would not actually require a runtime check for arm.
> 
> In the case of power, I assume it's possible to run ppc64 on power8, but every 
> power8 system to which I have access has ppc64le deployed.  Certainly ppc64le 
> would not need a runtime check, but it would seem prudent for ppc64 to gain a 
> runtime check for the power8 insns.

OK.  I think it would be good if ARM/Power people could contribute to
the ABI specification and extend it to also cover ARM/Power.

> > I've made a few tests on my x86_64 machine a few weeks ago, and I didn't
> > see cmpxchg16b being used.  IIRC, I also looked at libatomic and didn't
> > see it (but I don't remember for sure).  Either way, if I should have
> > been wrong, and we are using cmpxchg16b for loads, this should be fixed.
> > Ideally, this should be fixed before the stage 3 deadline this Friday.
> > Such a fix might potentially break existing uses, but the earlier we fix
> > this, the better.
> 
> You needed to use -mcx16, or any other option (such as -march=native) that 
> implies that.  And, you will find that expand_atomic_load does have a 
> larger-than-word-size fallback path that does use expand_atomic_compare_and_swap.
> 
> So, yes, there's something here that needs adjustment.

I'll send a separate email describing the options I see currently.

> > Section 3 Rationale, alternative 1: I'm wondering if the example is
> > correct.  For a 4-byte-aligned type of size 3, the implementation cannot
> > simply use 4-byte hardware-backed atomics because this will inevitably
> > touch the 4th byte I think, and the implementation can't know whether
> > this is padding or not.  Or do we expect that things like packed structs
> > are disallowed?
> 
> If we atomically store an unchanged value into the 4th byte, can we tell?

Probably not in terms of the value.  But race detectors, HW breakpoints
etc. could observe the store.  I'm not sure whether potentially having
to adapt these is justified by being able to optimize atomic access to
3-byte structs...

> > N3.1:  Why do you assume that 8-byte HW atomics are available on i386?
> > Because cmpxchg8b is available for CPUs that are the lowest i?86 we
> > still intend to support?
> 
> For various definitions of "we", I suppose.  Red Hat certainly does not support 
> anything lower than i686, which does have cmpxchg8b.
> 
> I suspect that the GNU project still supports i486.  I do know that glibc has 
> dropped support for i386.
> 
> I should note that supporting 64-bit atomics on i686 *is* possible, without the 
> CAS problem that you describe for cmpxchg16b, because we *are* guaranteed that 
> the FPU supports a 64-bit atomic load/store.  And we do already handle this; 
> see the atomic_loaddi_fpu and atomic_storedi_fpu patterns.
> 
> I'll also note that, as per above, this implies that if we build for i586-*, 
> libatomic should provide runtime paths that detect and use i686 insns, so that 
> the library is compatible with what the compiler will generate inline given 
> appropriate command-line options.

OK.  So these rules should be added to the ABI spec too, I suppose.


* Re: GCC libatomic ABI specification draft
  2016-12-22 17:38                               ` Segher Boessenkool
  2017-01-04 11:25                                 ` Szabolcs Nagy
@ 2017-01-19 15:18                                 ` Torvald Riegel
  1 sibling, 0 replies; 20+ messages in thread
From: Torvald Riegel @ 2017-01-19 15:18 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Ulrich Weigand, Szabolcs Nagy, nd, Gabriel Paubert, Bin Fan@Work,
	gcc, Andreas.Krebbel, dje.gcc

On Thu, 2016-12-22 at 11:37 -0600, Segher Boessenkool wrote:
> On Thu, Dec 22, 2016 at 03:28:56PM +0100, Ulrich Weigand wrote:
> > However, there still seems to be a problem, but this time related to
> > alignment issues.  We do have the 16-byte atomic instructions, but they
> > only work on 16-byte aligned data.  This is a problem in particular
> > since the default alignment of 16-byte data types is still 8 bytes
> > on our platform (since the ABI only guarantees 8-byte stack alignment).
> > 
> > That's why the libatomic configure check thinks it cannot use the
> > atomic instructions when building on z, and generates code that uses
> > the separate lock.  However, *if* a particular object can be proven
> > by the compiler to be 16-byte aligned, it will emit the inline
> > atomic instruction.  This means there is indeed a bug if that same
> > object is also operated on via the library routine.
> > 
> > Andreas suggested that the best way to fix this would be to add a
> > runtime alignment check to the libatomic routines and also use the
> > atomic instructions in the library whenever the object actually
> > happens to be correctly aligned.  It seems that this should indeed
> > fix the problem (and also use the most efficient way in all cases).
> > 
> > 
> > Not sure about Power -- adding David and Segher on CC ...
> 
> We do not always have all atomic instructions.  Not all processors have
> all of them, and which ones are used depends on the compiler flags.  How
> would libatomic know what compiler flags were used to compile the program
> it is linked to?

I think the approach would be to require the user to always use a
suitably built libatomic that's at least as capable as the code that
will use it (e.g., see Richard Henderson's comments).  Thus, if the
program uses some flags to enable a certain set of HW instructions, the
program should also use a libatomic that is built with the same (or
stronger) flags.  That keeps old code working, and new code that uses
the HW instructions directly can interoperate with old code that still
calls libatomic.

If we find consensus to follow this approach, this requirement on
libatomic builds should be made explicit in the ABI spec.


* Re: GCC libatomic ABI specification draft
  2017-01-18 22:23                   ` Richard Henderson
  2017-01-19 15:02                     ` Torvald Riegel
@ 2017-01-20 13:42                     ` Michael Matz
  2017-01-20 17:17                       ` Richard Henderson
  1 sibling, 1 reply; 20+ messages in thread
From: Michael Matz @ 2017-01-20 13:42 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Torvald Riegel, Bin Fan, gcc, Jakub Jelinek

Hi,

On Wed, 18 Jan 2017, Richard Henderson wrote:

> > Section 3 Rationale, alternative 1: I'm wondering if the example is 
> > correct.  For a 4-byte-aligned type of size 3, the implementation 
> > cannot simply use 4-byte hardware-backed atomics because this will 
> > inevitably touch the 4th byte I think, and the implementation can't 
> > know whether this is padding or not.  Or do we expect that things like 
> > packed structs are disallowed?
> 
> If we atomically store an unchanged value into the 4th byte, can we 
> tell?

You can't have a 4-aligned type of size 3.  Sizes must be multiples of 
alignment (otherwise arrays don't work).  The type of a 3-sized field in 
a packed struct that syntactically might be a 4-aligned type (e.g. by 
using attributes on char-array types) is actually a different type having 
an alignment of 1.  It's easier to simply regard all types inside packed 
structs as 1-aligned (which is IMO what we try to do).

That is, the byte after a 4-aligned "3-sized" type is always padding.


Ciao,
Michael.


* Re: GCC libatomic ABI specification draft
  2017-01-20 13:42                     ` Michael Matz
@ 2017-01-20 17:17                       ` Richard Henderson
  2017-01-23 14:00                         ` Michael Matz
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2017-01-20 17:17 UTC (permalink / raw)
  To: Michael Matz; +Cc: Torvald Riegel, Bin Fan, gcc, Jakub Jelinek

On 01/20/2017 05:41 AM, Michael Matz wrote:
> Hi,
>
> On Wed, 18 Jan 2017, Richard Henderson wrote:
>
>>> Section 3 Rationale, alternative 1: I'm wondering if the example is
>>> correct.  For a 4-byte-aligned type of size 3, the implementation
>>> cannot simply use 4-byte hardware-backed atomics because this will
>>> inevitably touch the 4th byte I think, and the implementation can't
>>> know whether this is padding or not.  Or do we expect that things like
>>> packed structs are disallowed?
>>
>> If we atomically store an unchanged value into the 4th byte, can we
>> tell?
>
> You can't have a 4-aligned type of size 3.  Sizes must be multiples of
> alignment (otherwise arrays don't work).  The type of a 3-sized field in
> a packed struct that syntactically might be a 4-aligned type (e.g. by
> using attributes on char-array types) is actually a different type having
> an alignment of 1.  It's easier to simply regard all types inside packed
> structs as 1-aligned (which is IMO what we try to do).
>
> That is, the byte after a 4-aligned "3-sized" type is always padding.

[ I read Bin Fan's original email some months ago, but I don't have it handy 
now.  Take faulty memory with a grain of salt. ]

I thought this was about libatomic being presented with an unaligned 3-byte 
structure that happens to sit within an aligned 4-byte word, and choosing to 
atomically operate on the 4-byte word instead of taking a lock on the side.


r~


* Re: GCC libatomic ABI specification draft
  2017-01-20 17:17                       ` Richard Henderson
@ 2017-01-23 14:00                         ` Michael Matz
  0 siblings, 0 replies; 20+ messages in thread
From: Michael Matz @ 2017-01-23 14:00 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Torvald Riegel, Bin Fan, gcc, Jakub Jelinek

Hi,

On Fri, 20 Jan 2017, Richard Henderson wrote:

> > You can't have a 4-aligned type of size 3.  Sizes must be multiples of 
> > alignment (otherwise arrays don't work).  The type of a 3-sized field 
> > in a packed struct that syntactically might be a 4-aligned type (e.g. 
> > by using attributes on char-array types) is actually a different type 
> > having an alignment of 1.  It's easier to simply regard all types 
> > inside packed structs as 1-aligned (which is IMO what we try to do).
> > 
> > That is, the byte after a 4-aligned "3-sized" type is always padding.
> 
> [ I read Bin Fan's original email some months ago, but I don't have it handy
> now.  Take faulty memory with a grain of salt. ]
> 
> I thought this was about libatomic being presented with an unaligned 3-byte
> structure that happens to sit within an aligned 4-byte word, and choosing to
> atomically operate on the 4-byte word instead of taking a lock on the side.

Ah well, in that case I lost context as well ;)


Ciao,
Michael.


end of thread, other threads:[~2017-01-23 14:00 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cbd2c83a-b50b-b2ac-b62d-b2d26178c2b1@oracle.com>
2016-07-06 17:50 ` Fwd: Re: GCC libatomic questions Richard Henderson
2016-07-06 19:41   ` Richard Henderson
2016-07-07 23:56     ` Bin Fan
     [not found]       ` <ac2d60ed-a659-f018-1f11-63fa8f5847f5@oracle.com>
     [not found]         ` <1470412312.14544.4.camel@localhost.localdomain>
     [not found]           ` <4a182edd-41a8-4ad9-444a-bf0af567ae98@oracle.com>
     [not found]             ` <8317ec9d-41ad-d806-9144-eac2984cdd38@oracle.com>
2016-11-17 20:12               ` GCC libatomic ABI specification draft Bin Fan
2016-11-29 11:12                 ` Szabolcs Nagy
2016-12-01 19:14                   ` Bin Fan at Work
2016-12-02 11:13                     ` Gabriel Paubert
2016-12-19 16:33                       ` Torvald Riegel
2016-12-20 13:27                         ` Ulrich Weigand
2016-12-20 13:58                           ` Szabolcs Nagy
2016-12-22 14:29                             ` Ulrich Weigand
2016-12-22 17:38                               ` Segher Boessenkool
2017-01-04 11:25                                 ` Szabolcs Nagy
2017-01-19 15:18                                 ` Torvald Riegel
2017-01-17 17:00                 ` Torvald Riegel
2017-01-18 22:23                   ` Richard Henderson
2017-01-19 15:02                     ` Torvald Riegel
2017-01-20 13:42                     ` Michael Matz
2017-01-20 17:17                       ` Richard Henderson
2017-01-23 14:00                         ` Michael Matz
