public inbox for libc-alpha@sourceware.org
* [PATCH 0/3] Improve ARM atomic performance for malloc
@ 2014-10-03 15:11 Will Newton
  2014-10-03 15:11 ` [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_* Will Newton
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Will Newton @ 2014-10-03 15:11 UTC (permalink / raw)
  To: libc-alpha

The intention of this series is to improve the performance of ARM
atomics and hence malloc.

The first patch adds a malloc microbenchmark, which is pretty much
the same code that I posted earlier in the year but with the
support for multiple threads taken out. The threaded aspect of the
benchmark appeared to be an area of contention, so hopefully this
makes things simpler.

The second patch widens the range of atomic operations supported
by the ARM port, which improves the generated code sequences for
operations such as atomic add, or, and and.

The third patch, which should really be considered more of an RFC,
implements the single-threaded atomic optimization in the same way
as the implementation for Power that was posted back in August.
There is a small performance gain at the cost of some complexity,
so I wonder whether this optimization is really worth it; I would
be interested in people's opinions on that.

The resulting atomic.h is hopefully somewhere close to a generic
implementation based on the gcc intrinsics so could potentially
be used as a base for a generic header.

Will Newton (3):
  benchtests: Add malloc microbenchmark
  sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations
  sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_*

 benchtests/Makefile       |   2 +-
 benchtests/bench-malloc.c | 219 ++++++++++++++++++++++++++++++++++++++++
 sysdeps/arm/bits/atomic.h | 248 +++++++++++++++++++++++++++++++++++++---------
 3 files changed, 420 insertions(+), 49 deletions(-)
 create mode 100644 benchtests/bench-malloc.c

-- 
1.9.3


* [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations
  2014-10-03 15:11 [PATCH 0/3] Improve ARM atomic performance for malloc Will Newton
  2014-10-03 15:11 ` [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_* Will Newton
@ 2014-10-03 15:11 ` Will Newton
  2014-10-03 16:31   ` Joseph S. Myers
  2014-10-03 15:11 ` [PATCH 1/3] benchtests: Add malloc microbenchmark Will Newton
  2014-10-03 16:27 ` [PATCH 0/3] Improve ARM atomic performance for malloc Joseph S. Myers
  3 siblings, 1 reply; 19+ messages in thread
From: Will Newton @ 2014-10-03 15:11 UTC (permalink / raw)
  To: libc-alpha

For the case where atomic operations are fully supported by the
compiler, expose more of these operations directly to glibc. For
example, instead of implementing atomic_or using the compare and
exchange compiler builtin, implement it using the atomic or
compiler builtin directly.

This results in an approximate 1kB code size reduction in libc.so and
a small improvement on the malloc benchtest:

Before: 266.279
After: 259.073
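
To illustrate the difference (a standalone sketch, not part of the
patch; the function names are invented), an atomic "or" built from
only the compare-and-exchange builtin needs a retry loop, while the
dedicated builtin lets the compiler emit a single ldrex/orr/strex
sequence:

/* Illustrative sketch only: contrasts a CAS-loop implementation of
   atomic "or" with the dedicated builtin.  Names are invented and
   are not glibc's.  */
#include <stdio.h>

static unsigned int
or_via_cas (unsigned int *mem, unsigned int mask)
{
  unsigned int old = __atomic_load_n (mem, __ATOMIC_RELAXED);
  /* Retry until no intervening modification of *mem occurred.  */
  while (!__atomic_compare_exchange_n (mem, &old, old | mask, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    ;  /* OLD is refreshed with the current value; loop.  */
  return old;
}

static unsigned int
or_via_builtin (unsigned int *mem, unsigned int mask)
{
  /* Single read-modify-write; no CAS loop around it.  */
  return __atomic_fetch_or (mem, mask, __ATOMIC_ACQUIRE);
}

int
main (void)
{
  unsigned int a = 1, b = 1;
  printf ("%u %u\n", or_via_cas (&a, 4), or_via_builtin (&b, 4));
  printf ("%u %u\n", a, b);  /* Both print 5.  */
  return 0;
}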

ChangeLog:

2014-10-03  Will Newton  <will.newton@linaro.org>

	* sysdeps/arm/bits/atomic.h [__GNUC_PREREQ (4, 7) &&
	__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4]
	(__arch_compare_and_exchange_bool_8_int): Define in terms
	of gcc atomic builtin rather than link error.
	(__arch_compare_and_exchange_bool_16_int): Likewise.
	(__arch_compare_and_exchange_bool_64_int): Likewise.
	(__arch_compare_and_exchange_val_8_int): Likewise.
	(__arch_compare_and_exchange_val_16_int): Likewise.
	(__arch_compare_and_exchange_val_64_int): Likewise.
	(__arch_exchange_8_int): Likewise.
	(__arch_exchange_16_int): Likewise.
	(__arch_exchange_64_int): Likewise.
	(__arch_exchange_and_add_8_int): New define.
	(__arch_exchange_and_add_16_int): Likewise.
	(__arch_exchange_and_add_32_int): Likewise.
	(__arch_exchange_and_add_64_int): Likewise.
	(atomic_exchange_and_add_acq): Likewise.
	(atomic_exchange_and_add_rel): Likewise.
	(catomic_exchange_and_add): Likewise.
	(__arch_exchange_and_and_8_int): New define.
	(__arch_exchange_and_and_16_int): Likewise.
	(__arch_exchange_and_and_32_int): Likewise.
	(__arch_exchange_and_and_64_int): Likewise.
	(atomic_and): Likewise.
	(atomic_and_val): Likewise.
	(catomic_and): Likewise.
	(__arch_exchange_and_or_8_int): New define.
	(__arch_exchange_and_or_16_int): Likewise.
	(__arch_exchange_and_or_32_int): Likewise.
	(__arch_exchange_and_or_64_int): Likewise.
	(atomic_or): Likewise.
	(atomic_or_val): Likewise.
	(catomic_or): Likewise.
---
 sysdeps/arm/bits/atomic.h | 203 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 153 insertions(+), 50 deletions(-)

diff --git a/sysdeps/arm/bits/atomic.h b/sysdeps/arm/bits/atomic.h
index 88cbe67..be314e4 100644
--- a/sysdeps/arm/bits/atomic.h
+++ b/sysdeps/arm/bits/atomic.h
@@ -52,84 +52,184 @@ void __arm_link_error (void);
    a pattern to do this efficiently.  */
 #if __GNUC_PREREQ (4, 7) && defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
 
-# define atomic_exchange_acq(mem, value)                                \
-  __atomic_val_bysize (__arch_exchange, int, mem, value, __ATOMIC_ACQUIRE)
+/* Compare and exchange.
+   For all "bool" routines, we return FALSE if exchange successful.  */
 
-# define atomic_exchange_rel(mem, value)                                \
-  __atomic_val_bysize (__arch_exchange, int, mem, value, __ATOMIC_RELEASE)
+# define __arch_compare_and_exchange_bool_8_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    !__atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				  model, __ATOMIC_RELAXED);		\
+  })
 
-/* Atomic exchange (without compare).  */
+# define __arch_compare_and_exchange_bool_16_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    !__atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				  model, __ATOMIC_RELAXED);		\
+  })
 
-# define __arch_exchange_8_int(mem, newval, model)      \
-  (__arm_link_error (), (typeof (*mem)) 0)
+# define __arch_compare_and_exchange_bool_32_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    !__atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				  model, __ATOMIC_RELAXED);		\
+  })
 
-# define __arch_exchange_16_int(mem, newval, model)     \
-  (__arm_link_error (), (typeof (*mem)) 0)
+#  define __arch_compare_and_exchange_bool_64_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    !__atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				  model, __ATOMIC_RELAXED);		\
+  })
 
-# define __arch_exchange_32_int(mem, newval, model)     \
-  __atomic_exchange_n (mem, newval, model)
+# define __arch_compare_and_exchange_val_8_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    __atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				 model, __ATOMIC_RELAXED);		\
+    __oldval;								\
+  })
+
+# define __arch_compare_and_exchange_val_16_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    __atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				 model, __ATOMIC_RELAXED);		\
+    __oldval;								\
+  })
+
+# define __arch_compare_and_exchange_val_32_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    __atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				 model, __ATOMIC_RELAXED);		\
+    __oldval;								\
+  })
+
+#  define __arch_compare_and_exchange_val_64_int(mem, newval, oldval, model) \
+  ({									\
+    typeof (*mem) __oldval = (oldval);					\
+    __atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,	\
+				 model, __ATOMIC_RELAXED);		\
+    __oldval;								\
+  })
 
-# define __arch_exchange_64_int(mem, newval, model)     \
-  (__arm_link_error (), (typeof (*mem)) 0)
 
 /* Compare and exchange with "acquire" semantics, ie barrier after.  */
 
-# define atomic_compare_and_exchange_bool_acq(mem, new, old)    \
-  __atomic_bool_bysize (__arch_compare_and_exchange_bool, int,  \
-                        mem, new, old, __ATOMIC_ACQUIRE)
+# define atomic_compare_and_exchange_bool_acq(mem, new, old)	\
+  __atomic_bool_bysize (__arch_compare_and_exchange_bool, int,	\
+			mem, new, old, __ATOMIC_ACQUIRE)
 
-# define atomic_compare_and_exchange_val_acq(mem, new, old)     \
-  __atomic_val_bysize (__arch_compare_and_exchange_val, int,    \
-                       mem, new, old, __ATOMIC_ACQUIRE)
+# define atomic_compare_and_exchange_val_acq(mem, new, old)	\
+  __atomic_val_bysize (__arch_compare_and_exchange_val, int,	\
+		       mem, new, old, __ATOMIC_ACQUIRE)
 
 /* Compare and exchange with "release" semantics, ie barrier before.  */
 
-# define atomic_compare_and_exchange_bool_rel(mem, new, old)    \
-  __atomic_bool_bysize (__arch_compare_and_exchange_bool, int,  \
-                        mem, new, old, __ATOMIC_RELEASE)
+# define atomic_compare_and_exchange_bool_rel(mem, new, old)	\
+  __atomic_bool_bysize (__arch_compare_and_exchange_bool, int,	\
+			mem, new, old, __ATOMIC_RELEASE)
 
-# define atomic_compare_and_exchange_val_rel(mem, new, old)      \
+# define atomic_compare_and_exchange_val_rel(mem, new, old)	 \
   __atomic_val_bysize (__arch_compare_and_exchange_val, int,    \
                        mem, new, old, __ATOMIC_RELEASE)
 
-/* Compare and exchange.
-   For all "bool" routines, we return FALSE if exchange succesful.  */
 
-# define __arch_compare_and_exchange_bool_8_int(mem, newval, oldval, model) \
-  ({__arm_link_error (); 0; })
+/* Atomic exchange (without compare).  */
 
-# define __arch_compare_and_exchange_bool_16_int(mem, newval, oldval, model) \
-  ({__arm_link_error (); 0; })
+# define __arch_exchange_8_int(mem, newval, model)	\
+  __atomic_exchange_n (mem, newval, model)
 
-# define __arch_compare_and_exchange_bool_32_int(mem, newval, oldval, model) \
-  ({                                                                    \
-    typeof (*mem) __oldval = (oldval);                                  \
-    !__atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,   \
-                                  model, __ATOMIC_RELAXED);             \
-  })
+# define __arch_exchange_16_int(mem, newval, model)	\
+  __atomic_exchange_n (mem, newval, model)
 
-# define __arch_compare_and_exchange_bool_64_int(mem, newval, oldval, model) \
-  ({__arm_link_error (); 0; })
+# define __arch_exchange_32_int(mem, newval, model)	\
+  __atomic_exchange_n (mem, newval, model)
 
-# define __arch_compare_and_exchange_val_8_int(mem, newval, oldval, model) \
-  ({__arm_link_error (); oldval; })
+#  define __arch_exchange_64_int(mem, newval, model)	\
+  __atomic_exchange_n (mem, newval, model)
 
-# define __arch_compare_and_exchange_val_16_int(mem, newval, oldval, model) \
-  ({__arm_link_error (); oldval; })
+# define atomic_exchange_acq(mem, value)				\
+  __atomic_val_bysize (__arch_exchange, int, mem, value, __ATOMIC_ACQUIRE)
 
-# define __arch_compare_and_exchange_val_32_int(mem, newval, oldval, model) \
-  ({                                                                    \
-    typeof (*mem) __oldval = (oldval);                                  \
-    __atomic_compare_exchange_n (mem, (void *) &__oldval, newval, 0,    \
-                                 model, __ATOMIC_RELAXED);              \
-    __oldval;                                                           \
-  })
+# define atomic_exchange_rel(mem, value)				\
+  __atomic_val_bysize (__arch_exchange, int, mem, value, __ATOMIC_RELEASE)
+
+
+/* Atomically add value and return the previous (unincremented) value.  */
+
+# define __arch_exchange_and_add_8_int(mem, value, model)	\
+  __atomic_fetch_add (mem, value, model)
+
+# define __arch_exchange_and_add_16_int(mem, value, model)	\
+  __atomic_fetch_add (mem, value, model)
+
+# define __arch_exchange_and_add_32_int(mem, value, model)	\
+  __atomic_fetch_add (mem, value, model)
+
+#  define __arch_exchange_and_add_64_int(mem, value, model)	\
+  __atomic_fetch_add (mem, value, model)
+
+# define atomic_exchange_and_add_acq(mem, value)			\
+  __atomic_val_bysize (__arch_exchange_and_add, int, mem, value,	\
+		       __ATOMIC_ACQUIRE)
+
+# define atomic_exchange_and_add_rel(mem, value)			\
+  __atomic_val_bysize (__arch_exchange_and_add, int, mem, value,	\
+		       __ATOMIC_RELEASE)
+
+# define catomic_exchange_and_add atomic_exchange_and_add
+
+/* Atomically bitwise and value and return the previous value.  */
+
+# define __arch_exchange_and_and_8_int(mem, value, model)	\
+  __atomic_fetch_and (mem, value, model)
+
+# define __arch_exchange_and_and_16_int(mem, value, model)	\
+  __atomic_fetch_and (mem, value, model)
 
-# define __arch_compare_and_exchange_val_64_int(mem, newval, oldval, model) \
-  ({__arm_link_error (); oldval; })
+# define __arch_exchange_and_and_32_int(mem, value, model)	\
+  __atomic_fetch_and (mem, value, model)
+
+#  define __arch_exchange_and_and_64_int(mem, value, model)	\
+  __atomic_fetch_and (mem, value, model)
+
+# define atomic_and(mem, value)						\
+  __atomic_val_bysize (__arch_exchange_and_and, int, mem, value,	\
+		       __ATOMIC_ACQUIRE)
+
+# define atomic_and_val atomic_and
+
+# define catomic_and atomic_and
+
+/* Atomically bitwise or value and return the previous value.  */
+
+# define __arch_exchange_and_or_8_int(mem, value, model)	\
+  __atomic_fetch_or (mem, value, model)
+
+# define __arch_exchange_and_or_16_int(mem, value, model)	\
+  __atomic_fetch_or (mem, value, model)
+
+# define __arch_exchange_and_or_32_int(mem, value, model)	\
+  __atomic_fetch_or (mem, value, model)
+
+#  define __arch_exchange_and_or_64_int(mem, value, model)	\
+  __atomic_fetch_or (mem, value, model)
+
+# define atomic_or(mem, value)					\
+  __atomic_val_bysize (__arch_exchange_and_or, int, mem, value,	\
+		       __ATOMIC_ACQUIRE)
+
+# define atomic_or_val atomic_or
+
+# define catomic_or atomic_or
 
 #elif defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
+
 /* Atomic compare and exchange.  */
+
 # define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
   __sync_val_compare_and_swap ((mem), (oldval), (newval))
 #else
@@ -138,8 +238,10 @@ void __arm_link_error (void);
 #endif
 
 #if !__GNUC_PREREQ (4, 7) || !defined (__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
+
 /* We don't support atomic operations on any non-word types.
    So make them link errors.  */
+
 # define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
   ({ __arm_link_error (); oldval; })
 
@@ -148,6 +250,7 @@ void __arm_link_error (void);
 
 # define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
   ({ __arm_link_error (); oldval; })
+
 #endif
 
 /* An OS-specific bits/atomic.h file will define this macro if
-- 
1.9.3


* [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_*
  2014-10-03 15:11 [PATCH 0/3] Improve ARM atomic performance for malloc Will Newton
@ 2014-10-03 15:11 ` Will Newton
  2014-10-06 13:43   ` Torvald Riegel
  2014-10-03 15:11 ` [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations Will Newton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Will Newton @ 2014-10-03 15:11 UTC (permalink / raw)
  To: libc-alpha

Using the relaxed memory model for atomics when single-threaded allows
a reduction in the number of barriers (dmb) executed and an improvement in
single thread performance on the malloc benchtest:

Before: 259.073
After: 246.749
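
As a rough illustration of where the barriers go (assumed ARMv7 code
generation; the exact output depends on the compiler and flags), only
the memory order differs between the two calls below, and the relaxed
one can drop the trailing dmb:

/* Illustrative only: on ARMv7, the acquire variant is typically an
   ldrex/strex loop followed by a dmb; the relaxed variant is the
   same loop without the dmb.  */
#include <stdio.h>

int
main (void)
{
  int counter = 0;
  __atomic_fetch_add (&counter, 1, __ATOMIC_ACQUIRE);  /* With barrier.  */
  __atomic_fetch_add (&counter, 1, __ATOMIC_RELAXED);  /* Without.  */
  printf ("%d\n", counter);  /* 2 */
  return 0;
}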

ChangeLog:

2014-10-03  Will Newton  <will.newton@linaro.org>

	* sysdeps/arm/bits/atomic.h [__GNUC_PREREQ (4, 7) &&
	__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4]
	(__atomic_is_single_thread): New define.
	(atomic_exchange_and_add_relaxed): Likewise.
	(catomic_exchange_and_add): Use relaxed memory model
	if single threaded.
	(atomic_and_relaxed): New define.
	(catomic_and): Use relaxed memory model
	if single threaded.
	(atomic_or_relaxed): New define.
	(catomic_or): Use relaxed memory model
	if single threaded.
---
 sysdeps/arm/bits/atomic.h | 55 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/sysdeps/arm/bits/atomic.h b/sysdeps/arm/bits/atomic.h
index be314e4..0fbd82b 100644
--- a/sysdeps/arm/bits/atomic.h
+++ b/sysdeps/arm/bits/atomic.h
@@ -52,6 +52,19 @@ void __arm_link_error (void);
    a pattern to do this efficiently.  */
 #if __GNUC_PREREQ (4, 7) && defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
 
+# if defined IS_IN_libpthread || !defined NOT_IN_libc
+#  ifdef IS_IN_libpthread
+extern int __pthread_multiple_threads attribute_hidden;
+#   define __atomic_is_single_thread (__pthread_multiple_threads == 0)
+#  else
+extern int __libc_multiple_threads attribute_hidden;
+#   define __atomic_is_single_thread (__libc_multiple_threads == 0)
+#  endif
+# else
+#  define __atomic_is_single_thread 0
+# endif
+
+
 /* Compare and exchange.
    For all "bool" routines, we return FALSE if exchange successful.  */
 
@@ -180,7 +193,19 @@ void __arm_link_error (void);
   __atomic_val_bysize (__arch_exchange_and_add, int, mem, value,	\
 		       __ATOMIC_RELEASE)
 
-# define catomic_exchange_and_add atomic_exchange_and_add
+# define atomic_exchange_and_add_relaxed(mem, value)			\
+  __atomic_val_bysize (__arch_exchange_and_add, int, mem, value,	\
+		       __ATOMIC_RELAXED)
+
+# define catomic_exchange_and_add(mem, value)				\
+  ({									\
+  __typeof (*(mem)) __res;						\
+  if (__atomic_is_single_thread)					\
+    __res = atomic_exchange_and_add_relaxed (mem, value);		\
+  else									\
+    __res = atomic_exchange_and_add_acq (mem, value);			\
+  __res;								\
+  })
 
 /* Atomically bitwise and value and return the previous value.  */
 
@@ -200,9 +225,21 @@ void __arm_link_error (void);
   __atomic_val_bysize (__arch_exchange_and_and, int, mem, value,	\
 		       __ATOMIC_ACQUIRE)
 
+# define atomic_and_relaxed(mem, value)					\
+  __atomic_val_bysize (__arch_exchange_and_and, int, mem, value,	\
+		       __ATOMIC_RELAXED)
+
 # define atomic_and_val atomic_and
 
-# define catomic_and atomic_and
+# define catomic_and(mem, value)					\
+  ({									\
+  __typeof (*(mem)) __res;						\
+  if (__atomic_is_single_thread)					\
+    __res = atomic_and_relaxed (mem, value);				\
+  else									\
+    __res = atomic_and (mem, value);					\
+  __res;								\
+  })
 
 /* Atomically bitwise or value and return the previous value.  */
 
@@ -222,9 +259,21 @@ void __arm_link_error (void);
   __atomic_val_bysize (__arch_exchange_and_or, int, mem, value,	\
 		       __ATOMIC_ACQUIRE)
 
+# define atomic_or_relaxed(mem, value)				\
+  __atomic_val_bysize (__arch_exchange_and_or, int, mem, value,	\
+		       __ATOMIC_RELAXED)
+
 # define atomic_or_val atomic_or
 
-# define catomic_or atomic_or
+# define catomic_or(mem, value)						\
+  ({									\
+  __typeof (*(mem)) __res;						\
+  if (__atomic_is_single_thread)					\
+    __res = atomic_or_relaxed (mem, value);				\
+  else									\
+    __res = atomic_or (mem, value);					\
+  __res;								\
+  })
 
 #elif defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
 
-- 
1.9.3


* [PATCH 1/3] benchtests: Add malloc microbenchmark
  2014-10-03 15:11 [PATCH 0/3] Improve ARM atomic performance for malloc Will Newton
  2014-10-03 15:11 ` [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_* Will Newton
  2014-10-03 15:11 ` [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations Will Newton
@ 2014-10-03 15:11 ` Will Newton
  2014-10-07 15:54   ` Siddhesh Poyarekar
  2014-10-03 16:27 ` [PATCH 0/3] Improve ARM atomic performance for malloc Joseph S. Myers
  3 siblings, 1 reply; 19+ messages in thread
From: Will Newton @ 2014-10-03 15:11 UTC (permalink / raw)
  To: libc-alpha

Add a microbenchmark for measuring malloc and free performance. The
benchmark allocates and frees buffers of random sizes in a random order
and measures the overall execution time and RSS.

The random block sizes used follow an inverse square distribution
which is intended to mimic the behaviour of real applications which
tend to allocate many more small blocks than large ones.
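
(The block sizes come from inverse transform sampling: for a density
proportional to x^-2 on [min, max], a uniform r in [0, 1] is mapped
through the inverted CDF, which is exactly what get_block_size below
computes. A quick standalone check of the resulting skew, using the
same constants:)

/* Standalone check, not part of the benchmark: with e = -2,
   size = ((max^(e+1) - min^(e+1)) * r + min^(e+1))^(1/(e+1)),
   roughly 94% of draws should be <= 64 bytes.  */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  const float e = -2, min = 4, max = 32768;
  float min_pow = powf (min, e + 1), max_pow = powf (max, e + 1);
  int small = 0, n = 100000;

  for (int i = 0; i < n; i++)
    {
      float r = (float) rand () / RAND_MAX;
      unsigned int size
        = (unsigned int) powf ((max_pow - min_pow) * r + min_pow,
                               1 / (e + 1));
      if (size <= 64)
        small++;
    }
  printf ("%.1f%% of blocks <= 64 bytes\n", 100.0 * small / n);
  return 0;
}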

ChangeLog:

2014-10-03  Will Newton  <will.newton@linaro.org>

	* benchtests/Makefile (stdlib-bench): Add malloc benchmark.
	* benchtests/bench-malloc.c: New file.
---
 benchtests/Makefile       |   2 +-
 benchtests/bench-malloc.c | 219 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-malloc.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index fd3036d..1f8eb82 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -37,7 +37,7 @@ string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
 		strspn strstr strcpy_chk stpcpy_chk memrchr strsep strtok
 string-bench-all := $(string-bench)
 
-stdlib-bench := strtod
+stdlib-bench := strtod malloc
 
 benchset := $(string-bench-all) $(stdlib-bench)
 
diff --git a/benchtests/bench-malloc.c b/benchtests/bench-malloc.c
new file mode 100644
index 0000000..54631ed
--- /dev/null
+++ b/benchtests/bench-malloc.c
@@ -0,0 +1,219 @@
+/* Benchmark malloc and free functions.
+   Copyright (C) 2013-2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <math.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <unistd.h>
+
+#include "bench-timing.h"
+#include "json-lib.h"
+
+/* Benchmark duration in seconds.  */
+#define BENCHMARK_DURATION	60
+#define RAND_SEED		88
+
+/* Maximum memory that can be allocated at any one time is:
+
+   WORKING_SET_SIZE * MAX_ALLOCATION_SIZE
+
+   However due to the distribution of the random block sizes
+   the typical amount allocated will be much smaller.  */
+#define WORKING_SET_SIZE	1024
+
+#define MIN_ALLOCATION_SIZE	4
+#define MAX_ALLOCATION_SIZE	32768
+
+/* Get a random block size with an inverse square distribution.  */
+static unsigned int
+get_block_size (unsigned int rand_data)
+{
+  /* Inverse square.  */
+  const float exponent = -2;
+  /* Minimum value of distribution.  */
+  const float dist_min = MIN_ALLOCATION_SIZE;
+  /* Maximum value of distribution.  */
+  const float dist_max = MAX_ALLOCATION_SIZE;
+
+  float min_pow = powf (dist_min, exponent + 1);
+  float max_pow = powf (dist_max, exponent + 1);
+
+  float r = (float) rand_data / RAND_MAX;
+
+  return (unsigned int) powf ((max_pow - min_pow) * r + min_pow, 1 / (exponent + 1));
+}
+
+#define NUM_BLOCK_SIZES	8000
+#define NUM_OFFSETS	((WORKING_SET_SIZE) * 4)
+
+static unsigned int random_block_sizes[NUM_BLOCK_SIZES];
+static unsigned int random_offsets[NUM_OFFSETS];
+
+static void
+init_random_values (void)
+{
+  for (size_t i = 0; i < NUM_BLOCK_SIZES; i++)
+    random_block_sizes[i] = get_block_size (rand ());
+
+  for (size_t i = 0; i < NUM_OFFSETS; i++)
+    random_offsets[i] = rand () % WORKING_SET_SIZE;
+}
+
+static unsigned int
+get_random_block_size (unsigned int *state)
+{
+  unsigned int idx = *state;
+
+  if (idx >= NUM_BLOCK_SIZES - 1)
+    idx = 0;
+  else
+    idx++;
+
+  *state = idx;
+
+  return random_block_sizes[idx];
+}
+
+static unsigned int
+get_random_offset (unsigned int *state)
+{
+  unsigned int idx = *state;
+
+  if (idx >= NUM_OFFSETS - 1)
+    idx = 0;
+  else
+    idx++;
+
+  *state = idx;
+
+  return random_offsets[idx];
+}
+
+static volatile bool timeout;
+
+static void alarm_handler (int signum)
+{
+  timeout = true;
+}
+
+/* Allocate and free blocks in a random order.  */
+static size_t
+malloc_benchmark_loop (void **ptr_arr)
+{
+  unsigned int offset_state = 0, block_state = 0;
+  size_t iters = 0;
+
+  while (!timeout)
+    {
+      unsigned int next_idx = get_random_offset (&offset_state);
+      unsigned int next_block = get_random_block_size (&block_state);
+
+      free (ptr_arr[next_idx]);
+
+      ptr_arr[next_idx] = malloc (next_block);
+
+      iters++;
+    }
+
+  return iters;
+}
+
+static timing_t
+do_benchmark (size_t *iters)
+{
+  timing_t elapsed, start, stop;
+  void *working_set[WORKING_SET_SIZE];
+
+  memset (working_set, 0, sizeof (working_set));
+
+  TIMING_NOW (start);
+  *iters = malloc_benchmark_loop (working_set);
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (elapsed, start, stop);
+
+  return elapsed;
+}
+
+int
+main (int argc, char **argv)
+{
+  timing_t cur;
+  size_t iters = 0;
+  unsigned long res;
+  json_ctx_t json_ctx;
+  double d_total_s, d_total_i;
+  struct sigaction act;
+
+  init_random_values ();
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+
+  json_attr_object_begin (&json_ctx, "malloc");
+
+  json_attr_object_begin (&json_ctx, "");
+
+  TIMING_INIT (res);
+
+  (void) res;
+
+  memset (&act, 0, sizeof (act));
+  act.sa_handler = &alarm_handler;
+
+  sigaction (SIGALRM, &act, NULL);
+
+  alarm (BENCHMARK_DURATION);
+
+  cur = do_benchmark (&iters);
+
+  struct rusage usage;
+  getrusage(RUSAGE_SELF, &usage);
+
+  d_total_s = cur;
+  d_total_i = iters;
+
+  json_attr_double (&json_ctx, "duration", d_total_s);
+  json_attr_double (&json_ctx, "iterations", d_total_i);
+  json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i);
+  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
+
+  json_attr_double (&json_ctx, "min_size", MIN_ALLOCATION_SIZE);
+  json_attr_double (&json_ctx, "max_size", MAX_ALLOCATION_SIZE);
+  json_attr_double (&json_ctx, "random_seed", RAND_SEED);
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_document_end (&json_ctx);
+
+  return 0;
+}
-- 
1.9.3


* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-03 15:11 [PATCH 0/3] Improve ARM atomic performance for malloc Will Newton
                   ` (2 preceding siblings ...)
  2014-10-03 15:11 ` [PATCH 1/3] benchtests: Add malloc microbenchmark Will Newton
@ 2014-10-03 16:27 ` Joseph S. Myers
  2014-10-06 13:31   ` Torvald Riegel
  3 siblings, 1 reply; 19+ messages in thread
From: Joseph S. Myers @ 2014-10-03 16:27 UTC (permalink / raw)
  To: Will Newton; +Cc: libc-alpha

On Fri, 3 Oct 2014, Will Newton wrote:

> The resulting atomic.h is hopefully somewhere close to a generic
> implementation based on the gcc intrinsics so could potentially
> be used as a base for a generic header.

That suggests to me that the starting point should be setting up a generic 
header that can be used for multiple architectures and making the ARM 
header inherit from it in the case where the relevant compiler support is 
available, rather than putting all this generic code in an ARM header.  
(And in turn, the starting point in the generic header could be the 
particular operations for which more or less generic code already exists 
in the ARM header, with other operations added to it later.)

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations
  2014-10-03 15:11 ` [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations Will Newton
@ 2014-10-03 16:31   ` Joseph S. Myers
  2014-10-06 13:29     ` Torvald Riegel
  0 siblings, 1 reply; 19+ messages in thread
From: Joseph S. Myers @ 2014-10-03 16:31 UTC (permalink / raw)
  To: Will Newton; +Cc: libc-alpha

On Fri, 3 Oct 2014, Will Newton wrote:

> -# define __arch_compare_and_exchange_bool_64_int(mem, newval, oldval, model) \
> -  ({__arm_link_error (); 0; })
> +# define __arch_exchange_32_int(mem, newval, model)	\
> +  __atomic_exchange_n (mem, newval, model)

I think obvious link errors are desirable for 64-bit atomics, rather than 
possibly calling libatomic functions using locks, or libgcc functions that 
would introduce a hidden dependency on a more recent kernel version (for 
64-bit atomics helpers) than we currently require (the helpers having been 
introduced in 3.1).

(So in a generic header it should be configurable whether it provides 
64-bit atomic operations or not.)
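
(For reference, a minimal standalone demonstration of the link-error
technique being defended here; the macro name is invented, but
__arm_link_error is the declaration the ARM header actually uses:)

/* Compiling this file succeeds; linking fails with "undefined
   reference to __arm_link_error" as soon as the unsupported
   operation is used anywhere.  */
void __arm_link_error (void);		/* Deliberately never defined.  */

#define unsupported_atomic_64(mem) \
  ({ __arm_link_error (); *(mem); })

int
main (void)
{
  long long x = 0;
  return (int) unsupported_atomic_64 (&x);  /* Triggers the link error.  */
}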

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations
  2014-10-03 16:31   ` Joseph S. Myers
@ 2014-10-06 13:29     ` Torvald Riegel
  2014-10-06 15:55       ` Joseph S. Myers
  0 siblings, 1 reply; 19+ messages in thread
From: Torvald Riegel @ 2014-10-06 13:29 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: Will Newton, libc-alpha

On Fri, 2014-10-03 at 16:31 +0000, Joseph S. Myers wrote:
> On Fri, 3 Oct 2014, Will Newton wrote:
> 
> > -# define __arch_compare_and_exchange_bool_64_int(mem, newval, oldval, model) \
> > -  ({__arm_link_error (); 0; })
> > +# define __arch_exchange_32_int(mem, newval, model)	\
> > +  __atomic_exchange_n (mem, newval, model)
> 
> I think obvious link errors are desirable for 64-bit atomics, rather than 
> possibly calling libatomic functions using locks, or libgcc functions that 
> would introduce a hidden dependency on a more recent kernel version (for 
> 64-bit atomics helpers) than we currently require (the helpers having been 
> introduced in 3.1).
> 
> (So in a generic header it should be configurable whether it provides 
> 64-bit atomic operations or not.)

I don't feel like we need link errors for 64b atomics, but if we do I
agree that the generic header should take care of this, under
arch-specific control.  I don't think we need to have finer granularity
than, roughly, "supports 64b atomic ops on naturally aligned 64b
integers".

Some concurrent algorithms might be written differently if an arch
provides atomic read-modify-write ops like atomic_or instead of just a
CAS; but I don't think we have such optimizations in current code, so we
can just add this later (including the arch-specific flags for this)
instead of having to consider this now.




* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-03 16:27 ` [PATCH 0/3] Improve ARM atomic performance for malloc Joseph S. Myers
@ 2014-10-06 13:31   ` Torvald Riegel
  2014-10-06 13:55     ` Will Newton
  0 siblings, 1 reply; 19+ messages in thread
From: Torvald Riegel @ 2014-10-06 13:31 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: Will Newton, libc-alpha

On Fri, 2014-10-03 at 16:27 +0000, Joseph S. Myers wrote:
> On Fri, 3 Oct 2014, Will Newton wrote:
> 
> > The resulting atomic.h is hopefully somewhere close to a generic
> > implementation based on the gcc intrinsics so could potentially
> > be used as a base for a generic header.
> 
> That suggests to me that the starting point should be setting up a generic 
> header that can be used for multiple architectures and making the ARM 
> header inherit from it in the case where the relevant compiler support is 
> available, rather than putting all this generic code in an ARM header.  
> (And in turn, the starting point in the generic header could be the 
> particular operations for which more or less generic code already exists 
> in the ARM header, with other operations added to it later.)

I agree.

In addition, I think that the best step to do this would be when we
switch to C11-like atomics because with this switch, this falls out kind
of naturally.

Will, have you looked at my suggestion and the POC patch I posted for
how C11-like atomics could look like?  I won't get to continue to work
on this topic this week, but it's still on my agenda.


* Re: [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_*
  2014-10-03 15:11 ` [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_* Will Newton
@ 2014-10-06 13:43   ` Torvald Riegel
  2014-10-06 14:13     ` Will Newton
  0 siblings, 1 reply; 19+ messages in thread
From: Torvald Riegel @ 2014-10-06 13:43 UTC (permalink / raw)
  To: Will Newton; +Cc: libc-alpha

On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote:
> Using the relaxed memory model for atomics when single-threaded allows
> a reduction in the number of barriers (dmb) executed and an improvement in
> single thread performance on the malloc benchtest:

I'm aware that we do have catomic* functions and they are being used.
However, I'm wondering whether they are the right tool for what we want
to achieve.

They simply allow avoiding some of the overhead of atomics (but not
necessarily all).  Wouldn't it be better to change the calling code to
contain optimized paths for single-threaded execution?

Also, calling code could either be reentrant or not.  For the former,
you could even avoid actual atomic accesses instead of just avoiding the
barriers.  Also, the compiler could perhaps generate more efficient code
if it doesn't have to deal with (relaxed) atomics.
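
(A sketch of what such a caller-side path could look like, under the
assumption that a "multiple threads" flag such as the patch's
__libc_multiple_threads is available; the names here are invented:)

/* Hypothetical caller-side fast path: skip the atomic, and hence its
   barrier, entirely when no second thread can exist.  */
static int multiple_threads;	/* Stand-in for __libc_multiple_threads.  */

static int
fetch_add_counter (int *counter, int val)
{
  if (multiple_threads == 0)
    {
      /* Sequential path: plain code, no atomics, no barriers.  */
      int old = *counter;
      *counter = old + val;
      return old;
    }
  /* Concurrent path: full atomic read-modify-write.  */
  return __atomic_fetch_add (counter, val, __ATOMIC_ACQUIRE);
}

int
main (void)
{
  int c = 0;
  fetch_add_counter (&c, 2);
  return c - 2;	/* 0 on success.  */
}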

> Before: 259.073
> After: 246.749

What is the performance number for a program that does have several
threads but runs with your patch (i.e., has conditional execution but
can't avoid the barriers)?

Do you have numbers for a hacked malloc that uses no atomics?



* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-06 13:31   ` Torvald Riegel
@ 2014-10-06 13:55     ` Will Newton
  2014-10-07 14:18       ` Torvald Riegel
  0 siblings, 1 reply; 19+ messages in thread
From: Will Newton @ 2014-10-06 13:55 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Joseph S. Myers, libc-alpha

On 6 October 2014 14:31, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2014-10-03 at 16:27 +0000, Joseph S. Myers wrote:
>> On Fri, 3 Oct 2014, Will Newton wrote:
>>
>> > The resulting atomic.h is hopefully somewhere close to a generic
>> > implementation based on the gcc intrinsics so could potentially
>> > be used as a base for a generic header.
>>
>> That suggests to me that the starting point should be setting up a generic
>> header that can be used for multiple architectures and making the ARM
>> header inherit from it in the case where the relevant compiler support is
>> available, rather than putting all this generic code in an ARM header.
>> (And in turn, the starting point in the generic header could be the
>> particular operations for which more or less generic code already exists
>> in the ARM header, with other operations added to it later.)
>
> I agree.
>
> In addition, I think that the best step to do this would be when we
> switch to C11-like atomics because with this switch, this falls out kind
> of naturally.
>
> Will, have you looked at my suggestion and the POC patch I posted for
> how C11-like atomics could look like?  I won't get to continue to work
> on this topic this week, but it's still on my agenda.

It's interesting, and long term seems like the best way of doing
things. However I do not see any viable chance of that work being
completed for 2.21. Do you have a timescale in mind? It seems we would
need to convert all uses of the atomic API and all the architecture
ports.

-- 
Will Newton
Toolchain Working Group, Linaro


* Re: [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_*
  2014-10-06 13:43   ` Torvald Riegel
@ 2014-10-06 14:13     ` Will Newton
  2014-10-07 14:10       ` Torvald Riegel
  0 siblings, 1 reply; 19+ messages in thread
From: Will Newton @ 2014-10-06 14:13 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: libc-alpha

On 6 October 2014 14:43, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote:
>> Using the relaxed memory model for atomics when single-threaded allows
>> a reduction in the number of barriers (dmb) executed and an improvement in
>> single thread performance on the malloc benchtest:
>
> I'm aware that we do have catomic* functions and they are being used.
> However, I'm wondering whether they are the right tool for what we want
> to achieve.

They are kind of ugly and rather undocumented as to their precise
semantics, so I share your general concerns about these functions.

> They simply allow avoiding some of the overhead of atomics (but not
> necessarily all).  Wouldn't it be better to change the calling code to
> contain optimized paths for single-threaded execution?

How would you suggest implementing that? My first instinct is that the
result would look a lot like what we have now, i.e. some kind of
wrapper round atomic functions that special-cases the single-threaded
case.

malloc is the main performance critical subsystem using catomic, so it
may be possible to do more of this work there and reduce the
complexity of the generic atomic implementation (although I believe an
earlier patch did do this to some extent but was rejected).

> Also, calling code could either be reentrant or not.  For the former,
> you could even avoid actual atomic accesses instead of just avoiding the
> barriers.  Also, the compiler could perhaps generate more efficient code
> if it doesn't have to deal with (relaxed) atomics.

Yes, that would be ideal if we had that option. It's not clear to me
what catomic_ actually means; it seems from the previous discussions
that it has to be atomic wrt. signal handlers which is why the atomic
operations remain (but the barriers can be dropped). malloc is
generally not re-entrant or AS-safe so optimizing away the atomic
instructions would be a big win here...

>> Before: 259.073
>> After: 246.749
>
> What is the performance number for a program that does have several
> threads but runs with your patch (i.e., has conditional execution but
> can't avoid the barriers)?

I don't have them, but I will look at that.

> Do you have numbers for a hacked malloc that uses no atomics?

No, I'll see what I can do.

-- 
Will Newton
Toolchain Working Group, Linaro


* Re: [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations
  2014-10-06 13:29     ` Torvald Riegel
@ 2014-10-06 15:55       ` Joseph S. Myers
  0 siblings, 0 replies; 19+ messages in thread
From: Joseph S. Myers @ 2014-10-06 15:55 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Will Newton, libc-alpha

On Mon, 6 Oct 2014, Torvald Riegel wrote:

> On Fri, 2014-10-03 at 16:31 +0000, Joseph S. Myers wrote:
> > On Fri, 3 Oct 2014, Will Newton wrote:
> > 
> > > -# define __arch_compare_and_exchange_bool_64_int(mem, newval, oldval, model) \
> > > -  ({__arm_link_error (); 0; })
> > > +# define __arch_exchange_32_int(mem, newval, model)	\
> > > +  __atomic_exchange_n (mem, newval, model)
> > 
> > I think obvious link errors are desirable for 64-bit atomics, rather than 
> > possibly calling libatomic functions using locks, or libgcc functions that 
> > would introduce a hidden dependency on a more recent kernel version (for 
> > 64-bit atomics helpers) than we currently require (the helpers having been 
> > introduced in 3.1).
> > 
> > (So in a generic header it should be configurable whether it provides 
> > 64-bit atomic operations or not.)
> 
> I don't feel like we need link errors for 64b atomics, but if we do I

It seems prudent to me to avoid non-obvious problems if such an atomic 
operation creeps in.

> agree that the generic header should take care of this, under
> arch-specific control.  I don't think we need to have finer granularity
> than, roughly, "supports 64b atomic ops on naturally aligned 64b
> integers".

Well, architectures might also configure whether 8-bit or 16-bit atomic 
operations are available (but I'm not sure there's anything in glibc that 
would try to use them even conditionally, so maybe the generic code should 
just make those into link errors unconditionally).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_*
  2014-10-06 14:13     ` Will Newton
@ 2014-10-07 14:10       ` Torvald Riegel
  0 siblings, 0 replies; 19+ messages in thread
From: Torvald Riegel @ 2014-10-07 14:10 UTC (permalink / raw)
  To: Will Newton; +Cc: libc-alpha

On Mon, 2014-10-06 at 15:13 +0100, Will Newton wrote:
> On 6 October 2014 14:43, Torvald Riegel <triegel@redhat.com> wrote:
> > On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote:
> >> Using the relaxed memory model for atomics when single-threaded allows
> >> a reduction in the number of barriers (dmb) executed and an improvement in
> >> single thread performance on the malloc benchtest:
> >
> > I'm aware that we do have catomic* functions and they are being used.
> > However, I'm wondering whether they are the right tool for what we want
> > to achieve.
> 
> They are kind of ugly and rather undocumented as to their precise
> semantics, so I share your general concerns about these functions.
> 
> > They simply allow avoiding some of the overhead of atomics (but not
> > necessarily all).  Wouldn't it be better to change the calling code to
> > contain optimized paths for single-threaded execution?
> 
> How would you suggest implementing that? My first instinct is that the
> result would look a lot like what we have now, i.e. some kind of
> wrapper round atomic functions that special-cases the single-threaded
> case.

First, I'd differentiate between functions that can be reentrant and
those that aren't.  For the former, we'd use sequential code, which
would probably also avoid a few loops and branches because there's
nothing happening concurrently.  For the latter, using catomic* or
memory_order_relaxed atomics directly (after the transition to C11
atomics for this particular piece of code) would be the way to go I
guess, because you still need atomic ops even though you don't need to
synchronize with other threads (although the rules as to which atomics
can really be used in signal handlers are still being discussed, at least
in the C++ committee).
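
(A sketch of the two single-threaded cases, with invented names;
which one applies depends on whether a signal handler can touch the
same data:)

/* Case 1: the data is never accessed from a signal handler; plain
   sequential code, no atomic instructions at all.  */
static int
seq_fetch_or (int *state, int mask)
{
  int old = *state;
  *state = old | mask;
  return old;
}

/* Case 2: the data may be accessed from a signal handler in the same
   thread; the read-modify-write must stay indivisible, but relaxed
   ordering suffices because there is no second thread to order
   against.  */
static int
sighandler_safe_fetch_or (int *state, int mask)
{
  return __atomic_fetch_or (state, mask, __ATOMIC_RELAXED);
}

int
main (void)
{
  int a = 1, b = 1;
  return (seq_fetch_or (&a, 4) + sighandler_safe_fetch_or (&b, 4)) - 2;
}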

> malloc is the main performance critical subsystem using catomic, so it
> may be possible to do more of this work there and reduce the
> complexity of the generic atomic implementation (although I believe an
> earlier patch did do this to some extent but was rejected).

Having a single-threaded malloc implementation sounds like the right
thing to me.  We'd have some additional code, but the sequential code
would only be a subset of the program logic for the concurrent case
(IOW, you just would leave out stuff that cares about other
interleavings with other threads).

> > Also, calling code could either be reentrant or not.  For the former,
> > you could even avoid actual atomic accesses instead of just avoiding the
> > barriers.  Also, the compiler could perhaps generate more efficient code
> > if it doesn't have to deal with (relaxed) atomics.
> 
> Yes, that would be ideal if we had that option. It's not clear to me
> what catomic_ actually means; it seems from the previous discussions
> that it has to be atomic wrt. signal handlers which is why the atomic
> operations remain (but the barriers can be dropped).

That's my understanding too.  IIRC, it drops the "lock" prefix on x86,
for example.

> malloc is
> generally not re-entrant or AS-safe so optimizing away the atomic
> instructions would be a big win here...
> 
> >> Before: 259.073
> >> After: 246.749
> >
> > What is the performance number for a program that does have several
> > threads but runs with your patch (i.e., has conditional execution but
> > can't avoid the barriers)?
> 
> I don't have them, but I will look at that.
> 
> > Do you have numbers for a hacked malloc that uses no atomics?
> 
> No, I'll see what I can do.
> 

Thanks.


* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-06 13:55     ` Will Newton
@ 2014-10-07 14:18       ` Torvald Riegel
  2014-10-07 16:29         ` Adhemerval Zanella
  2014-10-07 16:57         ` Joseph S. Myers
  0 siblings, 2 replies; 19+ messages in thread
From: Torvald Riegel @ 2014-10-07 14:18 UTC (permalink / raw)
  To: Will Newton; +Cc: Joseph S. Myers, libc-alpha

On Mon, 2014-10-06 at 14:55 +0100, Will Newton wrote:
> On 6 October 2014 14:31, Torvald Riegel <triegel@redhat.com> wrote:
> > On Fri, 2014-10-03 at 16:27 +0000, Joseph S. Myers wrote:
> >> On Fri, 3 Oct 2014, Will Newton wrote:
> >>
> >> > The resulting atomic.h is hopefully somewhere close to a generic
> >> > implementation based on the gcc intrinsics so could potentially
> >> > be used as a base for a generic header.
> >>
> >> That suggests to me that the starting point should be setting up a generic
> >> header that can be used for multiple architectures and making the ARM
> >> header inherit from it in the case where the relevant compiler support is
> >> available, rather than putting all this generic code in an ARM header.
> >> (And in turn, the starting point in the generic header could be the
> >> particular operations for which more or less generic code already exists
> >> in the ARM header, with other operations added to it later.)
> >
> > I agree.
> >
> > In addition, I think that the best step to do this would be when we
> > switch to C11-like atomics because with this switch, this falls out kind
> > of naturally.
> >
> > Will, have you looked at my suggestion and the POC patch I posted for
> > how C11-like atomics could look like?  I won't get to continue to work
> > on this topic this week, but it's still on my agenda.
> 
> It's interesting, and long term seems like the best way of doing
> things. However I do not see any viable chance of that work being
> completed for 2.21. Do you have a timescale in mind? It seems we would
> need to convert all uses of the atomic API and all the architecture
> ports.

I think 2.21 may be fully doable for at least a subset of this.  As I
outlined in my other email where I proposed the transition, we indeed do
have a big first step in that we need for all architectures to provide
C11-like atomics.  I've already scanned through existing code, and I
haven't seen any big issues wrt. that: x86 is clear, ARM already uses
GCC builtins, for PowerPC we have a clear mapping from C11 to HW
instructions.  Many of the "smaller" archs just have simple ops, so
there's less specific stuff to do.
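
(To make the shape of that concrete, a hedged sketch of what per-arch
C11-like wrappers could look like on an arch with the __atomic
builtins; this is illustrative only, not the actual POC:)

/* Illustrative sketch: thin macros with the memory order encoded in
   the name, mapping directly to the compiler builtins.  */
#define atomic_load_relaxed(mem) \
  __atomic_load_n ((mem), __ATOMIC_RELAXED)
#define atomic_store_release(mem, val) \
  __atomic_store_n ((mem), (val), __ATOMIC_RELEASE)
#define atomic_fetch_add_relaxed(mem, operand) \
  __atomic_fetch_add ((mem), (operand), __ATOMIC_RELAXED)

int
main (void)
{
  int flag = 0;
  atomic_fetch_add_relaxed (&flag, 1);
  atomic_store_release (&flag, 2);
  return atomic_load_relaxed (&flag) - 2;	/* 0 */
}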

The C11-like atomics would then coexist for a while with the old-style
atomics.  We can then move one piece of concurrent code (ie, a cluster
of functions that's complete in terms of including all functions that
another function in the cluster synchronizes with) to C11-style atomics
at a time.  There's no hurry to do this before 2.21, although I already
spotted a few things that are likely bugs (and they do affect ARM and
Power).

What I do need though is consensus from the community that the move
towards C11 is fine, and feedback on any patches for that.


* Re: [PATCH 1/3] benchtests: Add malloc microbenchmark
  2014-10-03 15:11 ` [PATCH 1/3] benchtests: Add malloc microbenchmark Will Newton
@ 2014-10-07 15:54   ` Siddhesh Poyarekar
  0 siblings, 0 replies; 19+ messages in thread
From: Siddhesh Poyarekar @ 2014-10-07 15:54 UTC (permalink / raw)
  To: Will Newton; +Cc: libc-alpha


On Fri, Oct 03, 2014 at 04:11:24PM +0100, Will Newton wrote:
> Add a microbenchmark for measuring malloc and free performance. The
> benchmark allocates and frees buffers of random sizes in a random order
> and measures the overall execution time and RSS.
> 
> The random block sizes used follow an inverse square distribution
> which is intended to mimic the behaviour of real applications which
> tend to allocate many more small blocks than large ones.
> 
> ChangeLog:
> 
> 2014-10-03  Will Newton  <will.newton@linaro.org>
> 
> 	* benchtests/Makefile (stdlib-bench): Add malloc benchmark.
> 	* benchtests/bench-malloc.c: New file.

Looks OK to me with one minor nit (pointed out below) fixed.

> ---
>  benchtests/Makefile       |   2 +-
>  benchtests/bench-malloc.c | 219 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 220 insertions(+), 1 deletion(-)
>  create mode 100644 benchtests/bench-malloc.c
> 
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index fd3036d..1f8eb82 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -37,7 +37,7 @@ string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
>  		strspn strstr strcpy_chk stpcpy_chk memrchr strsep strtok
>  string-bench-all := $(string-bench)
>  
> -stdlib-bench := strtod
> +stdlib-bench := strtod malloc
>  
>  benchset := $(string-bench-all) $(stdlib-bench)
>  
> diff --git a/benchtests/bench-malloc.c b/benchtests/bench-malloc.c
> new file mode 100644
> index 0000000..54631ed
> --- /dev/null
> +++ b/benchtests/bench-malloc.c
> @@ -0,0 +1,219 @@
> +/* Benchmark malloc and free functions.
> +   Copyright (C) 2013-2014 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <errno.h>
> +#include <math.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/time.h>
> +#include <sys/resource.h>
> +#include <unistd.h>
> +
> +#include "bench-timing.h"
> +#include "json-lib.h"
> +
> +/* Benchmark duration in seconds.  */
> +#define BENCHMARK_DURATION	60
> +#define RAND_SEED		88
> +
> +/* Maximum memory that can be allocated at any one time is:
> +
> +   WORKING_SET_SIZE * MAX_ALLOCATION_SIZE
> +
> +   However due to the distribution of the random block sizes
> +   the typical amount allocated will be much smaller.  */
> +#define WORKING_SET_SIZE	1024
> +
> +#define MIN_ALLOCATION_SIZE	4
> +#define MAX_ALLOCATION_SIZE	32768
> +
> +/* Get a random block size with an inverse square distribution.  */
> +static unsigned int
> +get_block_size (unsigned int rand_data)
> +{
> +  /* Inverse square.  */
> +  const float exponent = -2;
> +  /* Minimum value of distribution.  */
> +  const float dist_min = MIN_ALLOCATION_SIZE;
> +  /* Maximum value of distribution.  */
> +  const float dist_max = MAX_ALLOCATION_SIZE;
> +
> +  float min_pow = powf (dist_min, exponent + 1);
> +  float max_pow = powf (dist_max, exponent + 1);
> +
> +  float r = (float) rand_data / RAND_MAX;
> +
> +  return (unsigned int) powf ((max_pow - min_pow) * r + min_pow, 1 / (exponent + 1));
> +}
> +
> +#define NUM_BLOCK_SIZES	8000
> +#define NUM_OFFSETS	((WORKING_SET_SIZE) * 4)
> +
> +static unsigned int random_block_sizes[NUM_BLOCK_SIZES];
> +static unsigned int random_offsets[NUM_OFFSETS];
> +
> +static void
> +init_random_values (void)
> +{
> +  for (size_t i = 0; i < NUM_BLOCK_SIZES; i++)
> +    random_block_sizes[i] = get_block_size (rand ());
> +
> +  for (size_t i = 0; i < NUM_OFFSETS; i++)
> +    random_offsets[i] = rand () % WORKING_SET_SIZE;
> +}
> +
> +static unsigned int
> +get_random_block_size (unsigned int *state)
> +{
> +  unsigned int idx = *state;
> +
> +  if (idx >= NUM_BLOCK_SIZES - 1)
> +    idx = 0;
> +  else
> +    idx++;
> +
> +  *state = idx;
> +
> +  return random_block_sizes[idx];
> +}
> +
> +static unsigned int
> +get_random_offset (unsigned int *state)
> +{
> +  unsigned int idx = *state;
> +
> +  if (idx >= NUM_OFFSETS - 1)
> +    idx = 0;
> +  else
> +    idx++;
> +
> +  *state = idx;
> +
> +  return random_offsets[idx];
> +}
> +
> +static volatile bool timeout;
> +
> +static void alarm_handler (int signum)

Line break after void.

> +{
> +  timeout = true;
> +}
> +
> +/* Allocate and free blocks in a random order.  */
> +static size_t
> +malloc_benchmark_loop (void **ptr_arr)
> +{
> +  unsigned int offset_state = 0, block_state = 0;
> +  size_t iters = 0;
> +
> +  while (!timeout)
> +    {
> +      unsigned int next_idx = get_random_offset (&offset_state);
> +      unsigned int next_block = get_random_block_size (&block_state);
> +
> +      free (ptr_arr[next_idx]);
> +
> +      ptr_arr[next_idx] = malloc (next_block);
> +
> +      iters++;
> +    }
> +
> +  return iters;
> +}
> +
> +static timing_t
> +do_benchmark (size_t *iters)
> +{
> +  timing_t elapsed, start, stop;
> +  void *working_set[WORKING_SET_SIZE];
> +
> +  memset (working_set, 0, sizeof (working_set));
> +
> +  TIMING_NOW (start);
> +  *iters = malloc_benchmark_loop (working_set);
> +  TIMING_NOW (stop);
> +
> +  TIMING_DIFF (elapsed, start, stop);
> +
> +  return elapsed;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  timing_t cur;
> +  size_t iters = 0;
> +  unsigned long res;
> +  json_ctx_t json_ctx;
> +  double d_total_s, d_total_i;
> +  struct sigaction act;
> +
> +  init_random_values ();
> +
> +  json_init (&json_ctx, 0, stdout);
> +
> +  json_document_begin (&json_ctx);
> +
> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> +
> +  json_attr_object_begin (&json_ctx, "functions");
> +
> +  json_attr_object_begin (&json_ctx, "malloc");
> +
> +  json_attr_object_begin (&json_ctx, "");
> +
> +  TIMING_INIT (res);
> +
> +  (void) res;
> +
> +  memset (&act, 0, sizeof (act));
> +  act.sa_handler = &alarm_handler;
> +
> +  sigaction (SIGALRM, &act, NULL);
> +
> +  alarm (BENCHMARK_DURATION);
> +
> +  cur = do_benchmark (&iters);
> +
> +  struct rusage usage;
> +  getrusage(RUSAGE_SELF, &usage);
> +
> +  d_total_s = cur;
> +  d_total_i = iters;
> +
> +  json_attr_double (&json_ctx, "duration", d_total_s);
> +  json_attr_double (&json_ctx, "iterations", d_total_i);
> +  json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i);
> +  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
> +
> +  json_attr_double (&json_ctx, "min_size", MIN_ALLOCATION_SIZE);
> +  json_attr_double (&json_ctx, "max_size", MAX_ALLOCATION_SIZE);
> +  json_attr_double (&json_ctx, "random_seed", RAND_SEED);
> +
> +  json_attr_object_end (&json_ctx);
> +
> +  json_attr_object_end (&json_ctx);
> +
> +  json_attr_object_end (&json_ctx);
> +
> +  json_document_end (&json_ctx);
> +
> +  return 0;
> +}
> -- 
> 1.9.3
> 



* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-07 14:18       ` Torvald Riegel
@ 2014-10-07 16:29         ` Adhemerval Zanella
  2014-10-08 12:28           ` Torvald Riegel
  2014-10-15  9:48           ` Will Newton
  2014-10-07 16:57         ` Joseph S. Myers
  1 sibling, 2 replies; 19+ messages in thread
From: Adhemerval Zanella @ 2014-10-07 16:29 UTC (permalink / raw)
  To: libc-alpha

On 07-10-2014 11:18, Torvald Riegel wrote:
> On Mon, 2014-10-06 at 14:55 +0100, Will Newton wrote:
>> On 6 October 2014 14:31, Torvald Riegel <triegel@redhat.com> wrote:
>>> On Fri, 2014-10-03 at 16:27 +0000, Joseph S. Myers wrote:
>>>> On Fri, 3 Oct 2014, Will Newton wrote:
>>>>
>>>>> The resulting atomic.h is hopefully somewhere close to a generic
>>>>> implementation based on the gcc intrinsics so could potentially
>>>>> be used as a base for a generic header.
>>>> That suggests to me that the starting point should be setting up a generic
>>>> header that can be used for multiple architectures and making the ARM
>>>> header inherit from it in the case where the relevant compiler support is
>>>> available, rather than putting all this generic code in an ARM header.
>>>> (And in turn, the starting point in the generic header could be the
>>>> particular operations for which more or less generic code already exists
>>>> in the ARM header, with other operations added to it later.)
>>> I agree.
>>>
>>> In addition, I think that the best step to do this would be when we
>>> switch to C11-like atomics because with this switch, this falls out kind
>>> of naturally.
>>>
>>> Will, have you looked at my suggestion and the POC patch I posted for
>>> how C11-like atomics could look like?  I won't get to continue to work
>>> on this topic this week, but it's still on my agenda.
>> It's interesting, and long term seems like the best way of doing
>> things. However I do not see any viable chance of that work being
>> completed for 2.21. Do you have a timescale in mind? It seems we would
>> need to convert all uses of the atomic API and all the architecture
>> ports.
> I think 2.21 may be fully doable for at least a subset of this.  As I
> outlined in my other email where I proposed the transition, we indeed do
> have a big first step in that we need for all architectures to provide
> C11-like atomics.  I've already scanned through existing code, and I
> haven't seen any big issues wrt. that: x86 is clear, ARM already uses
> GCC builtins, for PowerPC we have a clear mapping from C11 to HW
> instructions.  Many of the "smaller" archs just have simple ops, so
> there's less specific stuff to do.

I am in favor of supporting this transition, since I also want to push the very
same single-thread optimization for PPC. What do you have in mind for the subset
besides your initial approach from some weeks ago?

I will try to summarize the topics raised in your thread "Transition to C11
atomics and memory model" on a wiki entry for 2.21 [1].

[1] https://sourceware.org/glibc/wiki/Release/2.21


>
> The C11-like atomics would then coexist for a while with the old-style
> atomics.  We can then move one piece of concurrent code (ie, a cluster
> of functions that's complete in terms of including all functions that
> another function in the cluster synchronizes with) to C11-style atomics
> at a time.  There's no hurry to do this before 2.21, although I already
> spotted a few things that are likely bugs (and they do affect ARM and
> Power).
>
> What I do need though is consensus from the community that the move
> towards C11 is fine, and feedback on any patches for that.


I think we can actually use the malloc code to run this experiment, since
both ARM and PPC now want to add the same single-thread optimization.
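
For concreteness, the single-thread shortcut both ports have in mind has
roughly the following shape (a sketch assuming glibc's SINGLE_THREAD_P
predicate and the old-style atomic_exchange_and_add; not the actual patch):

/* Skip the atomic read-modify-write when the process is known to be
   single-threaded; no other thread can observe the intermediate state.  */
#define catomic_exchange_and_add(mem, value)                          \
  ({                                                                  \
    __typeof (*(mem)) __result;                                       \
    if (SINGLE_THREAD_P)                                              \
      {                                                               \
        /* A plain load and store are sufficient here.  */            \
        __result = *(mem);                                            \
        *(mem) = __result + (value);                                  \
      }                                                               \
    else                                                              \
      /* Fall back to the real atomic operation.  */                  \
      __result = atomic_exchange_and_add ((mem), (value));            \
    __result;                                                         \
  })

The saving comes from eliding the exclusive-access loop (ldrex/strex on
ARM, lwarx/stwcx. on Power) entirely on the fast path.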

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-07 14:18       ` Torvald Riegel
  2014-10-07 16:29         ` Adhemerval Zanella
@ 2014-10-07 16:57         ` Joseph S. Myers
  1 sibling, 0 replies; 19+ messages in thread
From: Joseph S. Myers @ 2014-10-07 16:57 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: Will Newton, libc-alpha

On Tue, 7 Oct 2014, Torvald Riegel wrote:

> What I do need though is consensus from the community that the move
> towards C11 is fine, and feedback on any patches for that.

I think moving towards C11 atomics is desirable, provided that 
architectures can still provide their own implementations of atomic 
operations in the cases where direct use of __atomic_* would not be 
expanded inline or might otherwise be problematic or suboptimal, and can 
still control what size operands atomic operations are applied to (so that 
any attempt to use atomic operations on arguments other than 32-bit, or 
64-bit on 64-bit platforms, can continue to produce obvious compile or 
link errors rather than possibly non-obvious references to libgcc).
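
One way to preserve that property on top of the builtins is to dispatch on
the operand size and route unsupported sizes to a function that is declared
but never defined, so misuse fails at link time.  A minimal sketch (the
__atomic_link_error_bad_size name is invented for illustration):

/* Never defined anywhere: any use of an unsupported operand size
   leaves an unresolved symbol at link time.  */
extern void __atomic_link_error_bad_size (void);

#define atomic_exchange_acq(mem, value)                               \
  ({                                                                  \
    __typeof (*(mem)) __res = 0;                                      \
    if (sizeof (*(mem)) == 4)                                         \
      __res = __atomic_exchange_n ((mem), (value), __ATOMIC_ACQUIRE); \
    else                                                              \
      /* 1-, 2- and (on 32-bit) 8-byte operands are rejected.  */     \
      __atomic_link_error_bad_size ();                                \
    __res;                                                            \
  })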

I'd be happy with requiring GCC >= 4.7 to build glibc, which might reduce 
the need for architecture-specific code in some cases, but there may not 
yet be consensus for requiring something later than 4.6.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-07 16:29         ` Adhemerval Zanella
@ 2014-10-08 12:28           ` Torvald Riegel
  2014-10-15  9:48           ` Will Newton
  1 sibling, 0 replies; 19+ messages in thread
From: Torvald Riegel @ 2014-10-08 12:28 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Tue, 2014-10-07 at 13:28 -0300, Adhemerval Zanella wrote:
> On 07-10-2014 11:18, Torvald Riegel wrote:
> > On Mon, 2014-10-06 at 14:55 +0100, Will Newton wrote:
> >> On 6 October 2014 14:31, Torvald Riegel <triegel@redhat.com> wrote:
> >>> On Fri, 2014-10-03 at 16:27 +0000, Joseph S. Myers wrote:
> >>>> On Fri, 3 Oct 2014, Will Newton wrote:
> >>>>
> >>>>> The resulting atomic.h is hopefully somewhere close to a generic
> >>>>> implementation based on the gcc intrinsics so could potentially
> >>>>> be used as a base for a generic header.
> >>>> That suggests to me that the starting point should be setting up a generic
> >>>> header that can be used for multiple architectures and making the ARM
> >>>> header inherit from it in the case where the relevant compiler support is
> >>>> available, rather than putting all this generic code in an ARM header.
> >>>> (And in turn, the starting point in the generic header could be the
> >>>> particular operations for which more or less generic code already exists
> >>>> in the ARM header, with other operations added to it later.)
> >>> I agree.
> >>>
> >>> In addition, I think that the best step to do this would be when we
> >>> switch to C11-like atomics because with this switch, this falls out kind
> >>> of naturally.
> >>>
> >>> Will, have you looked at my suggestion and the POC patch I posted for
> >>> how C11-like atomics could look like?  I won't get to continue to work
> >>> on this topic this week, but it's still on my agenda.
> >> It's interesting, and long term seems like the best way of doing
> >> things. However I do not see any viable chance of that work being
> >> completed for 2.21. Do you have a timescale in mind? It seems we would
> >> need to convert all uses of the atomic API and all the architecture
> >> ports.
> > I think 2.21 may be fully doable for at least a subset of this.  As I
> > outlined in my other email where I proposed the transition, we indeed do
> > have a big first step in that we need for all architectures to provide
> > C11-like atomics.  I've already scanned through existing code, and I
> > haven't seen any big issues wrt. that: x86 is clear, ARM already uses
> > GCC builtins, for PowerPC we have a clear mapping from C11 to HW
> > instructions.  Many of the "smaller" archs just have simple ops, so
> > there's less specific stuff to do.
> 
> I am in favor of supporting this transition, since I also want to push the very
> same single-thread optimization for PPC. What do you have in mind for the subset
> besides your initial approach from some weeks ago?

The initial step that provides C11-like atomics is of course part of the
subset I mentioned.  Depending on how much time we have until 2.21, we
could also do:
* Move over pthread_once implementation.  I've reviewed (and changed)
this code already and so I'm confident how it would look like with
C11-like atomics.
* While browsing through the code base for atomic* uses, I saw a couple
of cases that looked like broken implementations of pthread_once-like
functionality.  Acquire barriers were missing in a few cases.  Moving
these over would also be a good step (see the sketch below).
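
For reference, the pattern at issue is double-checked initialization: the
fast path needs an acquire load that pairs with the initializing thread's
release store.  A simplified sketch using the GCC builtins (names are
illustrative; this is not the glibc pthread_once code):

#include <pthread.h>

static int initialized;   /* 0 = not yet, 1 = done.  */
static pthread_mutex_t once_lock = PTHREAD_MUTEX_INITIALIZER;

static void
init_once (void (*init_routine) (void))
{
  /* Acquire load: if we observe 1, we also observe every write the
     initializer made before its release store.  */
  if (__atomic_load_n (&initialized, __ATOMIC_ACQUIRE) == 0)
    {
      pthread_mutex_lock (&once_lock);
      /* Relaxed is enough here; the mutex orders this re-check.  */
      if (__atomic_load_n (&initialized, __ATOMIC_RELAXED) == 0)
        {
          init_routine ();
          /* Release store: publishes init_routine's effects.  */
          __atomic_store_n (&initialized, 1, __ATOMIC_RELEASE);
        }
      pthread_mutex_unlock (&once_lock);
    }
}

Without the acquire barrier on the fast path, a thread could see
initialized == 1 yet read stale data from before init_routine ran.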

Other pieces of concurrent code might take more time, so I can't say
yet.  If the *intended* synchronization isn't rather obvious or
documented (which is, sadly, often the case), then review will take more
time.

If any of you are familiar with a piece of code (ie, know about the
intended synchronization), I can also help you move that code over to
C11-like synchronization.

> I will try to summarize the topics raised in your thread "Transition to C11
> atomics and memory model" on a wiki entry for 2.21 [1].
> 
> [1] https://sourceware.org/glibc/wiki/Release/2.21

Thanks!  I don't see it there yet, but let me know when you have
something for review.

> 
> >
> > The C11-like atomics would then coexist for a while with the old-style
> > atomics.  We can then move one piece of concurrent code (ie, a cluster
> > of functions that's complete in terms of including all functions that
> > another function in the cluster synchronizes with) to C11-style atomics
> > at a time.  There's no hurry to do this before 2.21, although I already
> > spotted a few things that are likely bugs (and they do affect ARM and
> > Power).
> >
> > What I do need though is consensus from the community that the move
> > towards C11 is fine, and feedback on any patches for that.
> 
> 
> I think we can actually use the malloc code to run this experiment, since
> both ARM and PPC now want to add the same single-thread optimization.

Okay.  Then I'll leave it to you two to get some more insight in what we
could do about catomic*.
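
To picture what that layer might look like, the C11-like operations can be
a thin set of wrappers over the GCC builtins, roughly as below (a sketch,
not the POC patch itself; a real version would add operand-size checks and
per-architecture overrides):

/* C11-like atomics expressed with the __atomic builtins (GCC >= 4.7).  */
#define atomic_load_relaxed(mem) \
  __atomic_load_n ((mem), __ATOMIC_RELAXED)
#define atomic_load_acquire(mem) \
  __atomic_load_n ((mem), __ATOMIC_ACQUIRE)
#define atomic_store_relaxed(mem, val) \
  __atomic_store_n ((mem), (val), __ATOMIC_RELAXED)
#define atomic_store_release(mem, val) \
  __atomic_store_n ((mem), (val), __ATOMIC_RELEASE)
#define atomic_fetch_add_acq_rel(mem, val) \
  __atomic_fetch_add ((mem), (val), __ATOMIC_ACQ_REL)
#define atomic_compare_exchange_weak_acquire(mem, expected, desired)  \
  __atomic_compare_exchange_n ((mem), (expected), (desired), 1,       \
                               __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)

Each concurrent algorithm then states its required ordering explicitly at
the call site instead of relying on the full barriers implied by the
old-style atomic_* macros.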



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] Improve ARM atomic performance for malloc
  2014-10-07 16:29         ` Adhemerval Zanella
  2014-10-08 12:28           ` Torvald Riegel
@ 2014-10-15  9:48           ` Will Newton
  1 sibling, 0 replies; 19+ messages in thread
From: Will Newton @ 2014-10-15  9:48 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On 7 October 2014 17:28, Adhemerval Zanella <azanella@linux.vnet.ibm.com> wrote:
> On 07-10-2014 11:18, Torvald Riegel wrote:
>> On Mon, 2014-10-06 at 14:55 +0100, Will Newton wrote:
>>> On 6 October 2014 14:31, Torvald Riegel <triegel@redhat.com> wrote:
>>>> On Fri, 2014-10-03 at 16:27 +0000, Joseph S. Myers wrote:
>>>>> On Fri, 3 Oct 2014, Will Newton wrote:
>>>>>
>>>>>> The resulting atomic.h is hopefully somewhere close to a generic
>>>>>> implementation based on the gcc intrinsics so could potentially
>>>>>> be used as a base for a generic header.
>>>>> That suggests to me that the starting point should be setting up a generic
>>>>> header that can be used for multiple architectures and making the ARM
>>>>> header inherit from it in the case where the relevant compiler support is
>>>>> available, rather than putting all this generic code in an ARM header.
>>>>> (And in turn, the starting point in the generic header could be the
>>>>> particular operations for which more or less generic code already exists
>>>>> in the ARM header, with other operations added to it later.)
>>>> I agree.
>>>>
>>>> In addition, I think that the best step to do this would be when we
>>>> switch to C11-like atomics because with this switch, this falls out kind
>>>> of naturally.
>>>>
>>>> Will, have you looked at my suggestion and the POC patch I posted for
>>>> how C11-like atomics could look like?  I won't get to continue to work
>>>> on this topic this week, but it's still on my agenda.
>>> It's interesting, and long term seems like the best way of doing
>>> things. However I do not see any viable chance of that work being
>>> completed for 2.21. Do you have a timescale in mind? It seems we would
>>> need to convert all uses of the atomic API and all the architecture
>>> ports.
>> I think 2.21 may be fully doable for at least a subset of this.  As I
>> outlined in my other email where I proposed the transition, we indeed do
>> have a big first step in that we need for all architectures to provide
>> C11-like atomics.  I've already scanned through existing code, and I
>> haven't seen any big issues wrt. that: x86 is clear, ARM already uses
>> GCC builtins, for PowerPC we have a clear mapping from C11 to HW
>> instructions.  Many of the "smaller" archs just have simple ops, so
>> there's less specific stuff to do.
>
> I am in favor of supporting this transition, since I also want to push the very
> same single-thread optimization for PPC. What do you have in mind for the subset
> besides your initial approach from some weeks ago?
>
> I will try to summarize the topics raised in your thread "Transition to C11
> atomics and memory model" on a wiki entry for 2.21 [1].
>
> [1] https://sourceware.org/glibc/wiki/Release/2.21

Did this information ever get added to the wiki anywhere?

I would like to get consensus on what we can get done for 2.21. I will
respin my patches based on the feedback received so far but I'm also
perfectly happy to help with the C11 atomics work once the direction
is clear.

Thanks,

-- 
Will Newton
Toolchain Working Group, Linaro

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-10-15  9:48 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-03 15:11 [PATCH 0/3] Improve ARM atomic performance for malloc Will Newton
2014-10-03 15:11 ` [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_* Will Newton
2014-10-06 13:43   ` Torvald Riegel
2014-10-06 14:13     ` Will Newton
2014-10-07 14:10       ` Torvald Riegel
2014-10-03 15:11 ` [PATCH 2/3] sysdeps/arm/bits/atomic.h: Add a wider range of atomic operations Will Newton
2014-10-03 16:31   ` Joseph S. Myers
2014-10-06 13:29     ` Torvald Riegel
2014-10-06 15:55       ` Joseph S. Myers
2014-10-03 15:11 ` [PATCH 1/3] benchtests: Add malloc microbenchmark Will Newton
2014-10-07 15:54   ` Siddhesh Poyarekar
2014-10-03 16:27 ` [PATCH 0/3] Improve ARM atomic performance for malloc Joseph S. Myers
2014-10-06 13:31   ` Torvald Riegel
2014-10-06 13:55     ` Will Newton
2014-10-07 14:18       ` Torvald Riegel
2014-10-07 16:29         ` Adhemerval Zanella
2014-10-08 12:28           ` Torvald Riegel
2014-10-15  9:48           ` Will Newton
2014-10-07 16:57         ` Joseph S. Myers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).