public inbox for libc-alpha@sourceware.org
* [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
@ 2018-04-25  2:59 Kemi Wang
  2018-04-25  2:59 ` [PATCH v2 2/3] benchtests: Add pthread adaptive spin mutex microbenchmark Kemi Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Kemi Wang @ 2018-04-25  2:59 UTC (permalink / raw)
  To: Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

This patch does not change any functionality; it only adds a spin count
tunable for the pthread adaptive spin mutex. The glibc.mutex.spin_count
tunable can be used by system administrators to squeeze the best
performance out of the system according to different hardware
capabilities and workload models.

This is preparation work for the next patch, in which the adaptive spin
is changed from an expensive cmpxchg loop to a plain read while
spinning.
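
For example (./my_app here is just a hypothetical application binary),
the tunable can be set through the GLIBC_TUNABLES environment variable:

    $ GLIBC_TUNABLES=glibc.mutex.spin_count=5000 ./my_app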

   * elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
   * manual/tunables.texi: Add glibc.mutex.spin_count description.
   * nptl/Makefile: Add pthread_mutex_conf.c for compilation.
   * nptl/pthread_mutex_conf.h: New file.
   * nptl/pthread_mutex_conf.c: New file.

ChangeLog:
    V1->V2
    a) Renamed nptl/mutex-conf.h -> nptl/pthread_mutex_conf.h
    b) Renamed nptl/mutex-conf.c -> nptl/pthread_mutex_conf.c
    c) Change the Makefile to compile pthread_mutex_conf.c
    d) Modify the copyright "2013-2018" -> "2018" for newly added files
    e) Fix the indentation issue (tab -> double space) in
    elf/dl-tunables.list
    f) Remove the env alias LD_SPIN_COUNT in elf/dl-tunables.list
    g) Fix typos and refresh the glibc.mutex.spin_count tunable
    description in manual/tunables.texi.
    h) Fix the indentation issue in nptl/pthread_mutex_conf.c
    i) Fix the indentation issue for nested preprocessor (add one space for
    each level)

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 ChangeLog                 |  8 +++++
 elf/dl-tunables.list      |  9 ++++++
 manual/tunables.texi      | 21 +++++++++++++
 nptl/Makefile             |  3 +-
 nptl/pthread_mutex_conf.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++
 nptl/pthread_mutex_conf.h | 31 +++++++++++++++++++
 6 files changed, 148 insertions(+), 1 deletion(-)
 create mode 100644 nptl/pthread_mutex_conf.c
 create mode 100644 nptl/pthread_mutex_conf.h

diff --git a/ChangeLog b/ChangeLog
index efa8d39..4750b11 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2018-04-24  Kemi Wang <kemi.wang@intel.com>
+
+	* elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
+	* manual/tunables.texi: Add glibc.mutex.spin_count description.
+	* nptl/Makefile: Add pthread_mutex_conf.c for compilation.
+	* nptl/pthread_mutex_conf.h: New file.
+	* nptl/pthread_mutex_conf.c: New file.
+
 2018-04-24  Joseph Myers  <joseph@codesourcery.com>
 
 	* sysdeps/mach/hurd/dl-sysdep.c: Include <not-errno.h>.
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index 1f8ecb8..dc1e8f4 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -121,4 +121,13 @@ glibc {
       default: 3
     }
   }
+
+  mutex {
+    spin_count {
+      type: INT_32
+      minval: 0
+      maxval: 30000
+      default: 1000
+    }
+  }
 }
diff --git a/manual/tunables.texi b/manual/tunables.texi
index be33c9f..c2cf8c6 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -281,6 +281,27 @@ of try lock attempts.
 The default value of this tunable is @samp{3}.
 @end deftp
 
+@node Pthread Mutex Tunables
+@section Pthread Mutex Tunables
+@cindex pthread mutex tunables
+
+@deftp {Tunable namespace} glibc.mutex
+Behavior of pthread mutex can be tuned to gain performance improvement
+according to specific hardware capability and workload character by setting
+the following tunables in the @code{mutex} namespace.
+@end deftp
+
+@deftp Tunable glibc.mutex.spin_count
+The @code{glibc.mutex.spin_count} tunable sets the maximum spin times that
+a thread should spin on the lock before calling into the kernel to block.
+Adaptive spin is used for the mutex initialized with PTHREAD_MUTEX_ADAPTIVE_NP
+GNU extension. It affects both pthread_mutex_lock and pthread_mutex_timedlock.
+The spinning is done in case of either the maximum spin times is reached or
+the lock is acquired during spinning.
+
+The default value of this tunable is @samp{1000}.
+@end deftp
+
 @node Hardware Capability Tunables
 @section Hardware Capability Tunables
 @cindex hardware capability tunables
diff --git a/nptl/Makefile b/nptl/Makefile
index 94be92c..bd1096f 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -139,7 +139,8 @@ libpthread-routines = nptl-init vars events version pt-interp \
 		      pthread_mutex_getprioceiling \
 		      pthread_mutex_setprioceiling \
 		      pthread_setname pthread_getname \
-		      pthread_setattr_default_np pthread_getattr_default_np
+		      pthread_setattr_default_np pthread_getattr_default_np \
+		      pthread_mutex_conf
 #		      pthread_setuid pthread_seteuid pthread_setreuid \
 #		      pthread_setresuid \
 #		      pthread_setgid pthread_setegid pthread_setregid \
diff --git a/nptl/pthread_mutex_conf.c b/nptl/pthread_mutex_conf.c
new file mode 100644
index 0000000..6340b5d
--- /dev/null
+++ b/nptl/pthread_mutex_conf.c
@@ -0,0 +1,77 @@
+/* pthread_mutex_conf.c: Pthread mutex tunable parameters.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include <pthreadP.h>
+#include <init-arch.h>
+#include <pthread_mutex_conf.h>
+#include <unistd.h>
+
+#if HAVE_TUNABLES
+# define TUNABLE_NAMESPACE mutex
+#endif
+#include <elf/dl-tunables.h>
+
+
+struct mutex_config __mutex_aconf =
+  {
+    /* The maximum times a thread spin on the lock before going to block */
+    .spin_count = 1000,
+  };
+
+#if HAVE_TUNABLES
+# define TUNABLE_CALLBACK_FNDECL(__name, __type)			\
+static inline void						\
+__always_inline							\
+do_set_mutex_ ## __name (__type value)			\
+{								\
+  __mutex_aconf.__name = value;				\
+}								\
+void								\
+TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
+{								\
+  __type value = (__type) (valp)->numval;			\
+  do_set_mutex_ ## __name (value);				\
+}
+
+TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
+#endif
+
+static void
+mutex_tunables_init (int argc __attribute__ ((unused)),
+			      char **argv  __attribute__ ((unused)),
+					      char **environ)
+{
+#if HAVE_TUNABLES
+
+  TUNABLE_GET (spin_count, int32_t,
+		  TUNABLE_CALLBACK (set_mutex_spin_count));
+#endif
+}
+
+#ifdef SHARED
+# define INIT_SECTION ".init_array"
+#else
+# define INIT_SECTION ".preinit_array"
+#endif
+
+void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
+  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
+{
+  &mutex_tunables_init
+};
diff --git a/nptl/pthread_mutex_conf.h b/nptl/pthread_mutex_conf.h
new file mode 100644
index 0000000..e5b027c
--- /dev/null
+++ b/nptl/pthread_mutex_conf.h
@@ -0,0 +1,31 @@
+/* pthread_mutex_conf.h: Pthread mutex tunable parameters.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_CONF_H
+#define _PTHREAD_MUTEX_CONF_H 1
+
+#include <pthread.h>
+#include <time.h>
+
+struct mutex_config
+{
+  int spin_count;
+};
+
+extern struct mutex_config __mutex_aconf attribute_hidden;
+
+#endif
-- 
2.7.4

* [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
  2018-04-25  2:59 [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Kemi Wang
  2018-04-25  2:59 ` [PATCH v2 2/3] benchtests: Add pthread adaptive spin mutex microbenchmark Kemi Wang
@ 2018-04-25  2:59 ` Kemi Wang
  2018-05-02  8:19   ` Florian Weimer
  2018-04-25  4:02 ` [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Rical Jasan
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Kemi Wang @ 2018-04-25  2:59 UTC (permalink / raw)
  To: Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block. But in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (cmpxchg)
while the lock is contended. This is not a good idea on many targets,
as it forces expensive memory synchronization among processors and
penalizes other running threads. For example, it constantly floods the
system with "read for ownership" requests, which are much more
expensive to process than a single read. Thus, we use only a plain
memory-order-relaxed read until we observe that the lock is no longer
acquired, as suggested by Andi Kleen.

Usually it is useless to keep spinning if a thread fails to acquire the
lock at the moment it becomes available: under severe lock contention,
that spinner is unlikely to get another chance to acquire the lock
within its spin window. Therefore, it is better to call into the kernel
to block the thread, as suggested by Tim Chen; a sketch of the
resulting lock path follows the list below. We gain at least the
following benefits:
a) saving CPU time;
b) saving power budget;
c) reducing the overhead of cache line bouncing during spinning.
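
Roughly, the new path looks like the following minimal self-contained
sketch (C11 atomics, not the glibc code; sched_yield is only a stand-in
for the futex-based kernel block):

#include <stdatomic.h>
#include <sched.h>

static void
adaptive_lock_sketch (atomic_int *lock, int max_cnt)
{
  /* Fast path: one atomic exchange grabs the lock when it is free.  */
  while (atomic_exchange_explicit (lock, 1, memory_order_acquire) != 0)
    {
      int cnt = 0;
      /* Spin with plain relaxed loads; unlike a cmpxchg loop this does
         not generate "read for ownership" traffic while the lock is
         held by another thread.  */
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0
             && ++cnt < max_cnt)
        ;
      if (cnt >= max_cnt)
        sched_yield ();  /* Stand-in for blocking in the kernel.  */
    }
}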

Test machine:
2-socket Skylake platform, 112 cores, 62 GB RAM

Test case: mutex-adaptive-thread (Contended pthread adaptive spin mutex
with global update)
Usage: make bench BENCHSET=mutex-adaptive-thread
Test result:
+----------------+-----------------+-----------------+------------+
|  Configuration |      Base       |      Head       | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |           Critical section size: 1x            |
+----------------+------------------------------------------------+
|1 thread        |  7.06542e+08    |  7.08998e+08    |   +0.3%    |
|2 threads       |  5.73018e+07    |  7.20815e+07    |   +25.6%   |
|3 threads       |  3.78511e+07    |  1.15544e+08    |   +205.3%  |
|4 threads       |  2.28214e+07    |  6.57055e+07    |   +187.9%  |
|28 threads      |  1.68839e+07    |  5.19314e+07    |   +207.6%  |
|56 threads      |  1.84983e+07    |  5.06522e+07    |   +173.8%  |
|112 threads     |  2.3568e+07     |  4.95375e+07    |   +110.2%  |
+----------------+------------------------------------------------+
|                |           Critical section size: 10x           |
+----------------+------------------------------------------------+
|1 thread        |  5.40274e+08    |  5.47189e+08    |   +1.3%    |
|2 threads       |  4.55684e+07    |  6.03275e+07    |   +32.4%   |
|3 threads       |  3.05702e+07    |  1.04035e+08    |   +240.3%  |
|4 threads       |  2.17341e+07    |  5.57264e+07    |   +156.4%  |
|28 threads      |  1.39503e+07    |  4.53525e+07    |   +225.1%  |
|56 threads      |  1.50154e+07    |  4.16203e+07    |   +177.2%  |
|112 threads     |  1.90175e+07    |  3.88308e+07    |   +104.2%  |
+----------------+------------------------------------------------+
|                |           Critical section size: 100x          |
+----------------+------------------------------------------------+
|1 thread        |  7.23372e+07    | 7.25654e+07     |   +0.3%    |
|2 threads       |  2.67302e+07    | 2.40265e+07     |   -10.1%   |
|3 threads       |  1.89936e+07    | 2.70759e+07     |   +42.6%   |
|4 threads       |  1.62423e+07    | 2.25097e+07     |   +38.6%   |
|28 threads      |  9.85977e+06    | 1.59003e+07     |   +61.3%   |
|56 threads      |  8.11471e+06    | 1.6344e+07      |   +101.4%  |
|112 threads     |  8.58044e+06    | 1.53827e+07     |   +79.3%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 1000x         |
+----------------+------------------------------------------------+
|1 thread        |  8.16913e+06    |  8.16126e+06    |   -0.1%    |
|2 threads       |  5.82987e+06    |  5.92752e+06    |   +1.7%    |
|3 threads       |  6.05125e+06    |  6.37068e+06    |   +5.3%    |
|4 threads       |  5.91259e+06    |  6.27616e+06    |   +6.1%    |
|28 threads      |  2.40584e+06    |  2.60738e+06    |   +8.4%    |
|56 threads      |  2.32643e+06    |  2.3245e+06     |   -0.1%    |
|112 threads     |  2.32366e+06    |  2.30271e+06    |   -0.9%    |
+----------------+-----------------+-----------------+------------+

    * nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.

ChangeLog:
    V1->V2: fix format issues

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 ChangeLog                 |  4 ++++
 nptl/pthread_mutex_lock.c | 32 ++++++++++++++++++--------------
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 76d2628..4c81693 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2018-04-24  Kemi Wang <kemi.wang@intel.com>
 
+	* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.
+
+2018-04-24  Kemi Wang <kemi.wang@intel.com>
+
 	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for adaptive
 	spin mutex.
 	* benchtests/Makefile: Add adaptive spin mutex benchmark.
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..3442c58 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <pthread_mutex_conf.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
 	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
-	  do
-	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
-	    }
-	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			mutex->__data.__spins * 2 + 100);
+
+      /* MO read while spinning */
+      do
+        {
+         atomic_spin_nop ();
+        }
+      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
+            ++cnt < max_cnt);
+        /* Try to acquire the lock if lock is available or the spin count
+         * is run out, call into kernel to block if fails
+         */
+      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+        LLL_MUTEX_LOCK (mutex);
 
-	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
-	}
+      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+    }
       assert (mutex->__data.__owner == 0);
     }
   else
-- 
2.7.4

* [PATCH v2 2/3] benchtests: Add pthread adaptive spin mutex microbenchmark
  2018-04-25  2:59 [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Kemi Wang
@ 2018-04-25  2:59 ` Kemi Wang
  2018-04-25  2:59 ` [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm Kemi Wang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Kemi Wang @ 2018-04-25  2:59 UTC (permalink / raw)
  To: Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

Add a microbenchmark for measuring mutex lock and unlock performance
with a varying number of threads and a varying size of critical
section. The benchmark uses the mutex lock and unlock operations to
protect the critical section and measures the minimum execution time,
maximum execution time, and total execution time within a fixed
duration. Variants of the benchmark are run with 1, 2, 3, 4, nproc/4,
nproc/2, and nproc threads.

The size of the critical section is determined by the number of times a
global variable is incremented, which is intended to emulate the
critical regions of real applications. In this microbenchmark, the
values 1, 10, 100, and 1000 are used to represent critical sections of
different sizes in the working set.
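
The measured inner loop boils down to the following sketch, where size
is one of the working-set values above and val is the shared global
counter:

  count = size;
  pthread_mutex_lock (&mutex);
  while (count-- > 0)
    val++;                    /* Emulate the critical section.  */
  pthread_mutex_unlock (&mutex);
  iters++;                    /* One lock/unlock round completed.  */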

    * benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for
    adaptive spin mutex.
    * benchtests/Makefile: Add adaptive spin mutex benchmark.

ChangeLog:
    V1->V2: newly added microbenchmark, as requested by Adhemerval

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 ChangeLog                                |   6 +
 benchtests/Makefile                      |  36 ++++-
 benchtests/bench-mutex-adaptive-thread.c | 230 +++++++++++++++++++++++++++++++
 3 files changed, 265 insertions(+), 7 deletions(-)
 create mode 100644 benchtests/bench-mutex-adaptive-thread.c

diff --git a/ChangeLog b/ChangeLog
index 4750b11..76d2628 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,11 @@
 2018-04-24  Kemi Wang <kemi.wang@intel.com>
 
+	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for adaptive
+	spin mutex.
+	* benchtests/Makefile: Add adaptive spin mutex benchmark.
+
+2018-04-24  Kemi Wang <kemi.wang@intel.com>
+
 	* elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
 	* manual/tunables.texi: Add glibc.mutex.spin_count description.
 	* nptl/Makefile: Add pthread_mutex_conf.c for compilation.
diff --git a/benchtests/Makefile b/benchtests/Makefile
index bcd6a9c..fcc768f 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -95,10 +95,17 @@ else
 bench-malloc := $(filter malloc-%,${BENCHSET})
 endif
 
+ifeq (${BENCHSET},)
+bench-mutex := mutex-adaptive-thread
+else
+bench-mutex := $(filter mutex-%,${BENCHSET})
+endif
+
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
 $(objpfx)bench-malloc-thread: $(shared-thread-library)
+$(addprefix $(objpfx)bench-,$(bench-mutex)): $(shared-thread-library)
 
 \f
 
@@ -119,6 +126,7 @@ include ../Rules
 binaries-bench := $(addprefix $(objpfx)bench-,$(bench))
 binaries-benchset := $(addprefix $(objpfx)bench-,$(benchset))
 binaries-bench-malloc := $(addprefix $(objpfx)bench-,$(bench-malloc))
+binaries-bench-mutex := $(addprefix $(objpfx)bench-,$(bench-mutex))
 
 # The default duration: 10 seconds.
 ifndef BENCH_DURATION
@@ -142,7 +150,7 @@ endif
 # This makes sure CPPFLAGS-nonlib and CFLAGS-nonlib are passed
 # for all these modules.
 cpp-srcs-left := $(binaries-benchset:=.c) $(binaries-bench:=.c) \
-		 $(binaries-bench-malloc:=.c)
+	$(binaries-bench-malloc:=.c) $(binaries-bench-mutex:=.c)
 lib := nonlib
 include $(patsubst %,$(..)libof-iterator.mk,$(cpp-srcs-left))
 
@@ -158,6 +166,7 @@ bench-clean:
 	rm -f $(binaries-bench) $(addsuffix .o,$(binaries-bench))
 	rm -f $(binaries-benchset) $(addsuffix .o,$(binaries-benchset))
 	rm -f $(binaries-bench-malloc) $(addsuffix .o,$(binaries-bench-malloc))
+	rm -f $(binaries-bench-mutex) $(addsuffix .o,$(binaries-bench-mutex))
 	rm -f $(timing-type) $(addsuffix .o,$(timing-type))
 	rm -f $(addprefix $(objpfx),$(bench-extra-objs))
 
@@ -165,7 +174,7 @@ bench-clean:
 ifneq ($(strip ${BENCHSET}),)
 VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \
    wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \
-   malloc-thread
+   malloc-thread mutex-adaptive-thread
 INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET})
 ifneq (${INVALIDBENCHSETNAMES},)
 $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES})
@@ -176,7 +185,7 @@ endif
 
 # Define the bench target only if the target has a usable python installation.
 ifdef PYTHON
-bench: bench-build bench-set bench-func bench-malloc
+bench: bench-build bench-set bench-func bench-malloc bench-mutex
 else
 bench:
 	@echo "The bench target needs python to run."
@@ -187,10 +196,10 @@ endif
 # only if we're building natively.
 ifeq (no,$(cross-compiling))
 bench-build: $(gen-locales) $(timing-type) $(binaries-bench) \
-	$(binaries-benchset) $(binaries-bench-malloc)
+	$(binaries-benchset) $(binaries-bench-malloc) $(binaries-bench-mutex)
 else
 bench-build: $(timing-type) $(binaries-bench) $(binaries-benchset) \
-	$(binaries-bench-malloc)
+	$(binaries-bench-malloc) $(binaries-bench-mutex)
 endif
 
 bench-set: $(binaries-benchset)
@@ -207,6 +216,19 @@ bench-malloc: $(binaries-bench-malloc)
 	  done;\
 	done
 
+# Run benchmark with 1, 2, 3, 4, nproc/4, nproc/2, and nproc threads
+bench-mutex: $(binaries-bench-mutex)
+	for run in $^; do \
+		prev=0; \
+		for thr in 1 2 3 4 $$((`nproc` / 4)) $$((`nproc` / 2)) `nproc`; do \
+			if [ $$thr -gt $$prev -a $$thr -lt `nproc` ]; then \
+			echo "Running $${run} $${thr}"; \
+			fi; \
+			prev=$$thr; \
+	  $(run-bench) $${thr} > $${run}-$${thr}.out; \
+	  done;\
+	done
+
 # Build and execute the benchmark functions.  This target generates JSON
 # formatted bench.out.  Each of the programs produce independent JSON output,
 # so one could even execute them individually and process it using any JSON
@@ -236,8 +258,8 @@ bench-func: $(binaries-bench)
 	fi
 
 $(timing-type) $(binaries-bench) $(binaries-benchset) \
-	$(binaries-bench-malloc): %: %.o $(objpfx)json-lib.o \
-	$(link-extra-libs-tests) \
+	$(binaries-bench-malloc) $(binaries-bench-mutex): \
+	%: %.o $(objpfx)json-lib.o $(link-extra-libs-tests) \
   $(sort $(filter $(common-objpfx)lib%,$(link-libc))) \
   $(addprefix $(csu-objpfx),start.o) $(+preinit) $(+postinit)
 	$(+link-tests)
diff --git a/benchtests/bench-mutex-adaptive-thread.c b/benchtests/bench-mutex-adaptive-thread.c
new file mode 100644
index 0000000..51a92a7
--- /dev/null
+++ b/benchtests/bench-mutex-adaptive-thread.c
@@ -0,0 +1,230 @@
+/* Benchmark pthread adaptive spin mutex lock and unlock functions.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "bench-timing.h"
+#include "json-lib.h"
+
+/* Benchmark duration in seconds.  */
+#define BENCHMARK_DURATION	15
+
+#define TYPE PTHREAD_MUTEX_ADAPTIVE_NP
+
+static unsigned long long val;
+static pthread_mutexattr_t attr;
+static pthread_mutex_t mutex;
+
+#define WORKING_SET_SIZE  4
+int working_set[] = {1, 10, 100, 1000};
+
+struct thread_args
+{
+  unsigned long long iters;
+  int working_set;
+  timing_t elapsed;
+};
+
+static void init_mutex (void)
+{
+	pthread_mutexattr_init (&attr);
+	pthread_mutexattr_settype (&attr, TYPE);
+	pthread_mutex_init (&mutex, &attr);
+}
+
+static void init_parameter (int size, struct thread_args *args,
+		int num_thread)
+{
+  int i;
+  for (i = 0; i < num_thread; i++)
+{
+  memset(&args[i], 0, sizeof(struct thread_args));
+  args[i].working_set = size;
+}
+}
+
+static volatile bool timeout;
+
+static void
+alarm_handler (int signum)
+{
+  timeout = true;
+}
+
+/* Lock and unlock for protecting the critical section. */
+static unsigned long long
+mutex_benchmark_loop (int size)
+{
+  volatile int count;
+  unsigned long long iters = 0;
+
+  while (!timeout)
+    {
+      count = size;
+      pthread_mutex_lock (&mutex);
+      while (count > 0)
+        {
+          val++;
+          count--;
+        }
+      pthread_mutex_unlock (&mutex);
+      iters++;
+    }
+  return iters;
+}
+
+static void *
+benchmark_thread (void *arg)
+{
+  struct thread_args *args = (struct thread_args *) arg;
+  unsigned long long iters;
+  timing_t start, stop;
+
+  TIMING_NOW (start);
+  iters = mutex_benchmark_loop (args->working_set);
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (args->elapsed, start, stop);
+  args->iters = iters;
+
+  return NULL;
+}
+
+static void
+do_benchmark (size_t num_thread, struct thread_args *args)
+{
+
+  if (num_thread == 1)
+    {
+      timing_t start, stop;
+
+      TIMING_NOW (start);
+      args->iters = mutex_benchmark_loop (args->working_set);
+      TIMING_NOW (stop);
+
+      TIMING_DIFF (args->elapsed, start, stop);
+    }
+  else
+    {
+      pthread_t threads[num_thread];
+
+      for (size_t i = 0; i < num_thread; i++)
+        pthread_create(&threads[i], NULL, benchmark_thread, args + i);
+
+      for (size_t i = 0; i < num_thread; i++)
+        pthread_join(threads[i], NULL);
+    }
+}
+
+static void usage(const char *name)
+{
+  fprintf (stderr, "%s: <num_thread>\n", name);
+  exit (1);
+}
+
+int
+main (int argc, char **argv)
+{
+  int i, j, num_thread = 1;
+  json_ctx_t json_ctx;
+  struct sigaction act;
+
+  if (argc == 1)
+    num_thread = 1;
+  else if (argc == 2)
+    {
+      long ret;
+
+      errno = 0;
+      ret = strtol(argv[1], NULL, 10);
+
+      if (errno || ret == 0)
+	    usage(argv[0]);
+
+      num_thread = ret;
+    }
+  else
+    usage(argv[0]);
+
+  /* Benchmark for different critical section size */
+  for (i = 0; i < WORKING_SET_SIZE; i++)
+{
+  int size = working_set[i];
+  struct thread_args args[num_thread];
+  unsigned long long iters = 0, min_iters = -1ULL, max_iters = 0;
+  double d_total_s = 0, d_total_i = 0;
+
+  timeout = false;
+  init_mutex ();
+  init_parameter (size, args, num_thread);
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+
+  json_attr_object_begin (&json_ctx, "mutex");
+
+  json_attr_object_begin (&json_ctx, "");
+
+  memset (&act, 0, sizeof (act));
+  act.sa_handler = &alarm_handler;
+
+  sigaction (SIGALRM, &act, NULL);
+
+  alarm (BENCHMARK_DURATION);
+
+  do_benchmark (num_thread, args);
+
+  for (j = 0; j < num_thread; j++)
+{
+  iters = args[j].iters;
+  if (iters < min_iters)
+    min_iters = iters;
+  if (iters >= max_iters)
+    max_iters = iters;
+  d_total_i += iters;
+  TIMING_ACCUM (d_total_s, args[j].elapsed);
+}
+  json_attr_double (&json_ctx, "duration", d_total_s);
+  json_attr_double (&json_ctx, "total_iterations", d_total_i);
+  json_attr_double (&json_ctx, "min_iteration", min_iters);
+  json_attr_double (&json_ctx, "max_iteration", max_iters);
+  json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i);
+  json_attr_double (&json_ctx, "threads", num_thread);
+  json_attr_double (&json_ctx, "critical_section_size", size);
+
+  json_attr_object_end (&json_ctx);
+  json_attr_object_end (&json_ctx);
+  json_attr_object_end (&json_ctx);
+
+  json_document_end (&json_ctx);
+  fputs("\n", (&json_ctx)->fp);
+}
+  return 0;
+}
-- 
2.7.4

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-04-25  2:59 [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Kemi Wang
  2018-04-25  2:59 ` [PATCH v2 2/3] benchtests: Add pthread adaptive spin mutex microbenchmark Kemi Wang
  2018-04-25  2:59 ` [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm Kemi Wang
@ 2018-04-25  4:02 ` Rical Jasan
  2018-04-25  5:14   ` kemi
  2018-05-02  1:54 ` kemi
  2018-05-02  8:04 ` Florian Weimer
  4 siblings, 1 reply; 17+ messages in thread
From: Rical Jasan @ 2018-04-25  4:02 UTC (permalink / raw)
  To: Kemi Wang, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

On 04/24/2018 07:56 PM, Kemi Wang wrote:
...
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index be33c9f..c2cf8c6 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -281,6 +281,27 @@ of try lock attempts.
>  The default value of this tunable is @samp{3}.
>  @end deftp
>  
> +@node Pthread Mutex Tunables
> +@section Pthread Mutex Tunables
> +@cindex pthread mutex tunables
> +
> +@deftp {Tunable namespace} glibc.mutex
> +Behavior of pthread mutex can be tuned to gain performance improvement
> +according to specific hardware capability and workload character by setting
> +the following tunables in the @code{mutex} namespace.

The behavior of pthread mutexes can be tuned to gain performance
improvements according to specific hardware capabilities and workload
characteristics by setting the following tunables in the @code{mutex}
namespace:

> +@end deftp
> +
> +@deftp Tunable glibc.mutex.spin_count
> +The @code{glibc.mutex.spin_count} tunable sets the maximum spin times that
> +a thread should spin on the lock before calling into the kernel to block.

"maximum number of times a thread should spin"

> +Adaptive spin is used for the mutex initialized with PTHREAD_MUTEX_ADAPTIVE_NP

"used for mutexes initialized with the"

> +GNU extension. It affects both pthread_mutex_lock and pthread_mutex_timedlock.

Double spaces between sentences.

> +The spinning is done in case of either the maximum spin times is reached or
> +the lock is acquired during spinning.

Why would we spin after we've maxed out spinning (or acquired the lock)?
 Perhaps this should read:

"Spinning is done until either the maximum spin count is reached or the
lock is acquired."

> +
> +The default value of this tunable is @samp{1000}.
> +@end deftp
> +
>  @node Hardware Capability Tunables
>  @section Hardware Capability Tunables
>  @cindex hardware capability tunables

Rical

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-04-25  4:02 ` [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Rical Jasan
@ 2018-04-25  5:14   ` kemi
  0 siblings, 0 replies; 17+ messages in thread
From: kemi @ 2018-04-25  5:14 UTC (permalink / raw)
  To: Rical Jasan, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

Hi, Rical
   Thanks for your help polishing this description.
That's much better than before; I will fold these changes into the next version.

On 2018-04-25 12:02, Rical Jasan wrote:
> On 04/24/2018 07:56 PM, Kemi Wang wrote:
> ...
>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>> index be33c9f..c2cf8c6 100644
>> --- a/manual/tunables.texi
>> +++ b/manual/tunables.texi
>> @@ -281,6 +281,27 @@ of try lock attempts.
>>  The default value of this tunable is @samp{3}.
>>  @end deftp
>>  
>> +@node Pthread Mutex Tunables
>> +@section Pthread Mutex Tunables
>> +@cindex pthread mutex tunables
>> +
>> +@deftp {Tunable namespace} glibc.mutex
>> +Behavior of pthread mutex can be tuned to gain performance improvement
>> +according to specific hardware capability and workload character by setting
>> +the following tunables in the @code{mutex} namespace.
> 
> The behavior of pthread mutexes can be tuned to gain performance
> improvements according to specific hardware capabilities and workload
> characteristics by setting the following tunables in the @code{mutex}
> namespace:
> 

OK.

>> +@end deftp
>> +
>> +@deftp Tunable glibc.mutex.spin_count
>> +The @code{glibc.mutex.spin_count} tunable sets the maximum spin times that
>> +a thread should spin on the lock before calling into the kernel to block.
> 
> "maximum number of times a thread should spin"
>

OK

>> +Adaptive spin is used for the mutex initialized with PTHREAD_MUTEX_ADAPTIVE_NP
> 
> "used for mutexes initialized with the"
> 
>> +GNU extension. It affects both pthread_mutex_lock and pthread_mutex_timedlock.
> 
> Double spaces between sentences.
> 

Sure.

>> +The spinning is done in case of either the maximum spin times is reached or
>> +the lock is acquired during spinning.
> 
> Why would we spin after we've maxed out spinning (or acquired the lock)?
>  Perhaps this should read:
> 
> "Spinning is done until either the maximum spin count is reached or the
> lock is acquired."
> 

more readable, thanks.

>> +
>> +The default value of this tunable is @samp{1000}.
>> +@end deftp
>> +
>>  @node Hardware Capability Tunables
>>  @section Hardware Capability Tunables
>>  @cindex hardware capability tunables
> 
> Rical
> 

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-04-25  2:59 [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Kemi Wang
                   ` (2 preceding siblings ...)
  2018-04-25  4:02 ` [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Rical Jasan
@ 2018-05-02  1:54 ` kemi
  2018-05-02  8:04 ` Florian Weimer
  4 siblings, 0 replies; 17+ messages in thread
From: kemi @ 2018-05-02  1:54 UTC (permalink / raw)
  To: Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

PING for series

On 2018-04-25 10:56, Kemi Wang wrote:
> [...]

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-04-25  2:59 [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Kemi Wang
                   ` (3 preceding siblings ...)
  2018-05-02  1:54 ` kemi
@ 2018-05-02  8:04 ` Florian Weimer
  2018-05-02 11:08   ` kemi
  4 siblings, 1 reply; 17+ messages in thread
From: Florian Weimer @ 2018-05-02  8:04 UTC (permalink / raw)
  To: Kemi Wang, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

On 04/25/2018 04:56 AM, Kemi Wang wrote:

> +  mutex {
> +    spin_count {
> +      type: INT_32
> +      minval: 0
> +      maxval: 30000
> +      default: 1000
> +    }

How did you come up with the default and maximum values?  Larger maximum 
values might be useful for testing boundary conditions.

> +# define TUNABLE_CALLBACK_FNDECL(__name, __type)			\
> +static inline void						\
> +__always_inline							\
> +do_set_mutex_ ## __name (__type value)			\
> +{								\
> +  __mutex_aconf.__name = value;				\
> +}								\
> +void								\
> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
> +{								\
> +  __type value = (__type) (valp)->numval;			\
> +  do_set_mutex_ ## __name (value);				\
> +}
> +
> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);

I'm not sure if the macro is helpful in this context.

> +static void
> +mutex_tunables_init (int argc __attribute__ ((unused)),
> +			      char **argv  __attribute__ ((unused)),
> +					      char **environ)
> +{
> +#if HAVE_TUNABLES
> +
> +  TUNABLE_GET (spin_count, int32_t,
> +		  TUNABLE_CALLBACK (set_mutex_spin_count));
> +#endif
> +}
> +
> +#ifdef SHARED
> +# define INIT_SECTION ".init_array"
> +#else
> +# define INIT_SECTION ".preinit_array"
> +#endif
> +
> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
> +{
> +  &mutex_tunables_init
> +};

Can't you perform the initialization as part of overall pthread 
initialization?  This would avoid the extra relocation.
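
Something like the following might work (an untested sketch; whether
this is the right init hook is an assumption on my part):

void
__pthread_tunables_init (void)
{
#if HAVE_TUNABLES
  TUNABLE_GET (spin_count, int32_t,
               TUNABLE_CALLBACK (set_mutex_spin_count));
#endif
}

called from __pthread_initialize_minimal_internal in nptl/nptl-init.c,
instead of registering a separate init_array entry.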

Thanks,
Florian

* Re: [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
  2018-04-25  2:59 ` [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm Kemi Wang
@ 2018-05-02  8:19   ` Florian Weimer
  2018-05-02 11:07     ` kemi
  0 siblings, 1 reply; 17+ messages in thread
From: Florian Weimer @ 2018-05-02  8:19 UTC (permalink / raw)
  To: Kemi Wang, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

On 04/25/2018 04:56 AM, Kemi Wang wrote:
> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>         if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>   	{
>   	  int cnt = 0;
…
> +	  int max_cnt = MIN (__mutex_aconf.spin_count,
> +			mutex->__data.__spins * 2 + 100);
> +
> +      /* MO read while spinning */
> +      do
> +        {
> +         atomic_spin_nop ();
> +        }
> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
> +            ++cnt < max_cnt);
> +        /* Try to acquire the lock if lock is available or the spin count
> +         * is run out, call into kernel to block if fails
> +         */
> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
> +        LLL_MUTEX_LOCK (mutex);
>   
…
> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> +    }

The indentation is off.  Comments should end with a ”.  ” (dot and two 
spaces).  Multi-line comments do not start with “*” on subsequent lines. 
  We don't use braces when we can avoid them.  Operators such as “&&” 
should be on the following line when breaking up lines.

Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't an
unconditional call to LLL_MUTEX_LOCK be sufficient?

But the real question is if the old way of doing CAS in a loop is 
beneficial on other, non-Intel architectures.  You either need get broad 
consensus from the large SMP architectures (various aarch64 
implementations, IBM POWER and Z), or somehow make this opt-in at the 
source level.

Ideally, we would also get an ack from AMD for the x86 change, but they 
seem not particularly active on libc-alpha, so that may not be realistic.

Thanks,
Florian

* Re: [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
  2018-05-02  8:19   ` Florian Weimer
@ 2018-05-02 11:07     ` kemi
  2018-05-08 15:08       ` Florian Weimer
  0 siblings, 1 reply; 17+ messages in thread
From: kemi @ 2018-05-02 11:07 UTC (permalink / raw)
  To: Florian Weimer, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey



On 2018-05-02 16:19, Florian Weimer wrote:
> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>         if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>       {
>>         int cnt = 0;
> …
>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>> +            mutex->__data.__spins * 2 + 100);
>> +
>> +      /* MO read while spinning */
>> +      do
>> +        {
>> +         atomic_spin_nop ();
>> +        }
>> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>> +            ++cnt < max_cnt);
>> +        /* Try to acquire the lock if lock is available or the spin count
>> +         * is run out, call into kernel to block if fails
>> +         */
>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>> +        LLL_MUTEX_LOCK (mutex);
>>   
> …
>> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>> +    }
> 
> The indentation is off.  Comments should end with a ”.  ” (dot and two spaces).  Multi-line comments do not start with “*” on subsequent lines.  We don't use braces when we can avoid them.  Operators such as “&&” should be on the following line when breaking up lines.
> 

Will fold these changes into the next version.
I am not familiar with the glibc coding style; apologies for that.

> Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't an unconditional call to LLL_MUTEX_LOCK be sufficient?
> 

The purpose of calling LLL_MUTEX_TRYLOCK here is to try to acquire the lock in
user space without blocking once we have observed that the lock is available.
Thus, in case of multiple spinners contending for the lock, only one spinner
can acquire the lock successfully and the others fall into blocking.

I am not sure an unconditional call to LLL_MUTEX_LOCK as you mentioned here can
satisfy this purpose.

> But the real question is if the old way of doing CAS in a loop is beneficial on other, non-Intel architectures.  You either need to get broad consensus from the large SMP architectures (various aarch64 implementations, IBM POWER and Z), or somehow make this opt-in at the source level.
> 

That would be a platform-specific change with an obvious performance improvement on the x86 architecture.
And according to Adhemerval, this change could also bring some improvement on the aarch64 architecture.
If you or someone else still have concerns about a performance regression on other architectures, making
this opt-in could eliminate those worries.

"
I checked the change on a 64 cores aarch64 machine, but
differently than previous patch this one seems to show improvements:

nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
1               27566206        28776779 (4.206770)  28778073 (4.211078)
2               8498813         9129102 (6.904173)   7042975 (-20.670782)
7               5019434         5832195 (13.935765)  5098511 (1.550982)
14              4379155         6507212 (32.703053)  5200018 (15.785772)
28              4397464         4584480 (4.079329)   4456767 (1.330628)
56              4020956         3534899 (-13.750237) 4096197 (1.836850)
"

Also, I would appreciate it if someone could test this patch on other
architectures using the benchmark provided in the second patch.

> Ideally, we would also get an ack from AMD for the x86 change, but they seem not particularly active on libc-alpha, so that may not be realistic.
> 
> Thanks,
> Florian

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-05-02  8:04 ` Florian Weimer
@ 2018-05-02 11:08   ` kemi
  2018-05-08 15:44     ` Florian Weimer
  0 siblings, 1 reply; 17+ messages in thread
From: kemi @ 2018-05-02 11:08 UTC (permalink / raw)
  To: Florian Weimer, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

Hi, Florian
    Thanks for your time to review.

On 2018-05-02 16:04, Florian Weimer wrote:
> On 04/25/2018 04:56 AM, Kemi Wang wrote:
> 
>> +  mutex {
>> +    spin_count {
>> +      type: INT_32
>> +      minval: 0
>> +      maxval: 30000
>> +      default: 1000
>> +    }
> 
> How did you come up with the default and maximum values?  Larger maximum values might be useful for testing boundary conditions.
> 

For the maximum value of spin count:
Please notice that mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8, and the variable *cnt* can reach
the value of spin count when spinning times out. In that case, mutex->__data.__spins is increased and can get close
to *cnt* (close to the value of spin count). Because __spins only ever moves toward *cnt*, it never exceeds the
maximum spin count, so keeping the spin count below SHRT_MAX (32767) avoids overflowing the
mutex->__data.__spins variable, which has type short.

For the default value of spin count:
I referred to the previous limit of 100 trylock attempts in the loop. When this mode is changed to read-only
spinning, I suppose the value can be larger because a read has lower overhead and latency than cmpxchg.

Perhaps we should choose the default value of spin count differently per architecture.

>> +# define TUNABLE_CALLBACK_FNDECL(__name, __type)            \
>> +static inline void                        \
>> +__always_inline                            \
>> +do_set_mutex_ ## __name (__type value)            \
>> +{                                \
>> +  __mutex_aconf.__name = value;                \
>> +}                                \
>> +void                                \
>> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
>> +{                                \
>> +  __type value = (__type) (valp)->numval;            \
>> +  do_set_mutex_ ## __name (value);                \
>> +}
>> +
>> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
> 
> I'm not sure if the macro is helpful in this context.
> 

It is a matter of taste.
But perhaps we will have other mutex tunables in the future; see the example below.
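
For example, a hypothetical glibc.mutex.spin_timeout tunable (purely an
illustration, not a proposal; it would also need a matching field in
struct mutex_config) would only cost one more line plus its TUNABLE_GET
call:

TUNABLE_CALLBACK_FNDECL (spin_timeout, int32_t);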

>> +static void
>> +mutex_tunables_init (int argc __attribute__ ((unused)),
>> +                  char **argv  __attribute__ ((unused)),
>> +                          char **environ)
>> +{
>> +#if HAVE_TUNABLES
>> +
>> +  TUNABLE_GET (spin_count, int32_t,
>> +          TUNABLE_CALLBACK (set_mutex_spin_count));
>> +#endif
>> +}
>> +
>> +#ifdef SHARED
>> +# define INIT_SECTION ".init_array"
>> +#else
>> +# define INIT_SECTION ".preinit_array"
>> +#endif
>> +
>> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
>> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
>> +{
>> +  &mutex_tunables_init
>> +};
> 
> Can't you perform the initialization as part of overall pthread initialization?  This would avoid the extra relocation.
> 

Thanks for your suggestion. I am not sure yet how to do that, and will take a look at it.

> Thanks,
> Florian

* Re: [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
  2018-05-02 11:07     ` kemi
@ 2018-05-08 15:08       ` Florian Weimer
  2018-05-14  8:12         ` kemi
  0 siblings, 1 reply; 17+ messages in thread
From: Florian Weimer @ 2018-05-08 15:08 UTC (permalink / raw)
  To: kemi, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

On 05/02/2018 01:04 PM, kemi wrote:
> 
> 
On 2018-05-02 16:19, Florian Weimer wrote:
>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>          if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>        {
>>>          int cnt = 0;
>> …
>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>> +            mutex->__data.__spins * 2 + 100);
>>> +
>>> +      /* MO read while spinning */
>>> +      do
>>> +        {
>>> +         atomic_spin_nop ();
>>> +        }
>>> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>> +            ++cnt < max_cnt);
>>> +        /* Try to acquire the lock if lock is available or the spin count
>>> +         * is run out, call into kernel to block if fails
>>> +         */
>>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>> +        LLL_MUTEX_LOCK (mutex);
>>>    
>> …
>>> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>> +    }
>>
>> The indentation is off.  Comments should end with a ”.  ” (dot and two spaces).  Multi-line comments do not start with “*” on subsequent lines.  We don't use braces when we can avoid them.  Operators such as “&&” should be on the following line when breaking up lines.
>>
> 
> Will fold these changes in next version.
> I am not familiar with glibc coding style, apologize for that.

No apology needed, it takes some time to get used to.

>> Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't be an unconditional call to LLL_MUTEX_LOCK be sufficient?
>>
> 
> The purpose of calling LLL_MUTEX_TRYLOCK here is to try to acquire the lock at user
> space without block when we observed the lock is available. Thus, in case of multiple
> spinners contending for the lock,  only one spinner can acquire the lock successfully
> and others fall into block.
> 
> I am not sure an unconditional call to LLL_MUTEX_LOCK as you mentioned here can satisfy
> this purpose.

It's what we use for the default case.  It expands to lll_lock, so it 
should try atomic_compare_and_exchange_bool_acq first and only perform a 
futex syscall in case there is contention.  So I do think that 
LLL_MUTEX_TRYLOCK is redundant here.  Perhaps manually review the 
disassembly to make sure?
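
In other words, roughly this (a simplified, self-contained C11 sketch of the idea, not the actual sysdeps code; the real slow path is a futex syscall in __lll_lock_wait):

#include <stdatomic.h>

/* Placeholder for __lll_lock_wait: retry here instead of doing a
   real futex syscall, just to keep the sketch self-contained.  */
static void
slow_path_wait (atomic_int *word)
{
  for (;;)
    {
      int expected = 0;
      if (atomic_compare_exchange_weak_explicit (word, &expected, 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
        return;
    }
}

static void
lll_lock_sketch (atomic_int *futex_word)
{
  int expected = 0;
  /* Uncontended fast path: a single userspace CAS, 0 -> 1.  */
  if (!atomic_compare_exchange_strong_explicit (futex_word, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed))
    slow_path_wait (futex_word);  /* contended: would block in kernel */
}

Since the first thing lll_lock does is the same CAS a trylock would do, a preceding LLL_MUTEX_TRYLOCK just repeats it.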

> 
>> But the real question is if the old way of doing CAS in a loop is beneficial on other, non-Intel architectures.  You either need get broad consensus from the large SMP architectures (various aarch64 implementations, IBM POWER and Z), or somehow make this opt-in at the source level.
>>
> 
> That would be a platform-specific change and have obvious performance improvement for x86 architecture.
> And according to Adhemerval, this change could also have some improvement for aarch64 architecture.
> If you or someone else still have some concern of performance regression on other architecture, making
> this opt-in could eliminate people's worries.
> 
> "
> I checked the change on a 64 cores aarch64 machine, but
> differently than previous patch this one seems to show improvements:
> 
> nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
> 1               27566206        28776779 (4.206770)  28778073 (4.211078)
> 2               8498813         9129102 (6.904173)   7042975 (-20.670782)
> 7               5019434         5832195 (13.935765)  5098511 (1.550982)
> 14              4379155         6507212 (32.703053)  5200018 (15.785772)
> 28              4397464         4584480 (4.079329)   4456767 (1.330628)
> 56              4020956         3534899 (-13.750237) 4096197 (1.836850)
> "

Ah, nice, I had missed that.  I suppose this means we can risk enabling 
it by default.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-05-02 11:08   ` kemi
@ 2018-05-08 15:44     ` Florian Weimer
  2018-05-14  4:06       ` kemi
  0 siblings, 1 reply; 17+ messages in thread
From: Florian Weimer @ 2018-05-08 15:44 UTC (permalink / raw)
  To: kemi, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

On 05/02/2018 01:06 PM, kemi wrote:
> Hi, Florian
>      Thanks for your time to review.
> 
> On 2018年05月02日 16:04, Florian Weimer wrote:
>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>
>>> +  mutex {
>>> +    spin_count {
>>> +      type: INT_32
>>> +      minval: 0
>>> +      maxval: 30000
>>> +      default: 1000
>>> +    }
>>
>> How did you come up with the default and maximum values?  Larger maximum values might be useful for testing boundary conditions.
>>
> 
> For the maximum value of spin count:
> Please notice that mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8, and the variable *cnt* could reach
> the value of spin count due to spinning timeout. In such case, mutex->__data.__spins is increased and could be close to *cnt*
> (close to the value of spin count). Keeping the value of spin count less than MAX_SHORT can avoid the overflow of
>   mutex->__data.__spins variable with the possible type of short.

Could you add this as a comment, please?

> For the default value of spin count:
> I referred to the previous number of 100 times for trylock in the loop. When this mode is changed to read only while spinning.
> I suppose the value could be larger because of lower overhead and latency of read compared with cmpxchg.

Ahh, makes sense.  Perhaps put this information into the commit message.

> Perhaps we should make the default value of spin count differently according to architecture.

Sure, or if there is just a single good choice for the tunable, just use 
that and remove the tunable again.  I guess one aspect here is to 
experiment with different values and see if there's a clear winner.

>>> +# define TUNABLE_CALLBACK_FNDECL(__name, __type)            \
>>> +static inline void                        \
>>> +__always_inline                            \
>>> +do_set_mutex_ ## __name (__type value)            \
>>> +{                                \
>>> +  __mutex_aconf.__name = value;                \
>>> +}                                \
>>> +void                                \
>>> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
>>> +{                                \
>>> +  __type value = (__type) (valp)->numval;            \
>>> +  do_set_mutex_ ## __name (value);                \
>>> +}
>>> +
>>> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
>>
>> I'm not sure if the macro is helpful in this context.

> It is a matter of taste.
> But, perhaps we have other mutex tunables in future.

We can still macroize the code at that point.  But no strong preference 
here.

>>> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
>>> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
>>> +{
>>> +  &mutex_tunables_init
>>> +};
>>
>> Can't you perform the initialization as part of overall pthread initialization?  This would avoid the extra relocation.

> Thanks for your suggestion. I am not sure how to do it now and will take a look at it.

The code would go into nptl/nptl-init.c.  It's just an idea, but I think 
it should be possible to make it work.
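
Something like the following, perhaps (my untested sketch; it assumes mutex_tunables_init is reworked to take no arguments):

/* In nptl/nptl-init.c (sketch only; surrounding code elided).  */
extern void mutex_tunables_init (void);

void
__pthread_initialize_minimal_internal (void)
{
  /* ... existing nptl initialization ... */

  /* Called directly instead of via a .preinit_array/.init_array
     entry, avoiding the extra relocation.  */
  mutex_tunables_init ();
}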

Thanks,
Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-05-08 15:44     ` Florian Weimer
@ 2018-05-14  4:06       ` kemi
  2018-05-14  5:05         ` kemi
  2018-05-14  7:30         ` Florian Weimer
  0 siblings, 2 replies; 17+ messages in thread
From: kemi @ 2018-05-14  4:06 UTC (permalink / raw)
  To: Florian Weimer, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey



On 2018年05月08日 23:44, Florian Weimer wrote:
> On 05/02/2018 01:06 PM, kemi wrote:
>> Hi, Florian
>>      Thanks for your time to review.
>>
>> On 2018年05月02日 16:04, Florian Weimer wrote:
>>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>>
>>>> +  mutex {
>>>> +    spin_count {
>>>> +      type: INT_32
>>>> +      minval: 0
>>>> +      maxval: 30000
>>>> +      default: 1000
>>>> +    }
>>>
>>> How did you come up with the default and maximum values?  Larger maximum values might be useful for testing boundary conditions.
>>>
>>
>> For the maximum value of spin count:
>> Please notice that mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8, and the variable *cnt* could reach
>> the value of spin count due to spinning timeout. In such case, mutex->__data.__spins is increased and could be close to *cnt*
>> (close to the value of spin count). Keeping the value of spin count less than MAX_SHORT can avoid the overflow of
>>   mutex->__data.__spins variable with the possible type of short.
> 
> Could you add this as a comment, please?
> 

Sure:)

>> For the default value of spin count:
>> I referred to the previous number of 100 times for trylock in the loop. When this mode is changed to read only while spinning.
>> I suppose the value could be larger because of lower overhead and latency of read compared with cmpxchg.
> 
> Ahh, makes sense.  Perhaps put this information into the commit message.
> 

I recently investigated the default value of spin count further.

It's obvious that we should provide a larger default value, since the spinning mode changes from "TRYLOCK" (cmpxchg) to
read-only spinning. But it's not trivial to determine the best value, if one exists.
The latency of an atomic operation such as cmpxchg, compared with a plain read, depends on many factors: which core owns
the cache line, the state of the cache line (M/E/S/I, or even O), cache line transfers, and so on.
And one research report [1] (Fig. 2 in that paper) has shown that on Haswell the latency of cmpxchg is 1.5x that of a
single read under the same conditions.

So let's set the default value of spin count to 150, and run some benchmarks to test it.

What do you think?

[1] Lesani, Mohsen, Todd Millstein, and Jens Palsberg. "Automatic Atomicity Verification for Clients of Concurrent Data Structures." 
International Conference on Computer Aided Verification. Springer, Cham, 2014.
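
A rough way to check such numbers on a given machine is a single-threaded microbenchmark along these lines (my sketch; it measures only the uncontended same-core case, so it understates the gap on contended lines):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int word;

static uint64_t
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return (uint64_t) ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int
main (void)
{
  enum { N = 100000000 };
  long acc = 0;

  uint64_t t0 = now_ns ();
  for (int i = 0; i < N; i++)
    acc += __atomic_load_n (&word, __ATOMIC_RELAXED);
  uint64_t t1 = now_ns ();

  for (int i = 0; i < N; i++)
    {
      int expected = 0;
      /* Always succeeds (0 -> 0) but still executes lock cmpxchg.  */
      __atomic_compare_exchange_n (&word, &expected, 0, 0,
                                   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    }
  uint64_t t2 = now_ns ();

  printf ("load: %.2f ns/op, cmpxchg: %.2f ns/op (acc=%ld)\n",
          (t1 - t0) / (double) N, (t2 - t1) / (double) N, acc);
  return 0;
}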

>> Perhaps we should make the default value of spin count differently according to architecture.
> 
> Sure, or if there is just a single good choice for the tunable, just use that and remove the tunable again.  I guess one aspect here is to experiment with different values and see if there's a clear winner.
>

Two reasons for keeping the tunable here:
1) Instruction overhead is architecture-specific, so it is hard, or even impossible, to pick one default value that fits all architectures well.
E.g. the pause instruction on the Skylake platform is about 10x more expensive than on earlier generations.

2) Different kinds of workload may need different spin timeouts.
I have heard many complaints from customers that the pthread adaptive spin mutex does not work well in their practical workloads.
Let's keep a tunable here for them.
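
With the tunable in place, such customers can adjust the timeout per workload without rebuilding glibc or the application, e.g. (using the tunable name from patch 1/3): GLIBC_TUNABLES=glibc.mutex.spin_count=150 ./app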

>>>> +# define TUNABLE_CALLBACK_FNDECL(__name, __type)            \
>>>> +static inline void                        \
>>>> +__always_inline                            \
>>>> +do_set_mutex_ ## __name (__type value)            \
>>>> +{                                \
>>>> +  __mutex_aconf.__name = value;                \
>>>> +}                                \
>>>> +void                                \
>>>> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
>>>> +{                                \
>>>> +  __type value = (__type) (valp)->numval;            \
>>>> +  do_set_mutex_ ## __name (value);                \
>>>> +}
>>>> +
>>>> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
>>>
>>> I'm not sure if the macro is helpful in this context.
> 
>> It is a matter of taste.
>> But, perhaps we have other mutex tunables in future.
> 
> We can still macroize the code at that point.  But no strong preference here.
> 

That's all right.

>>>> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
>>>> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
>>>> +{
>>>> +  &mutex_tunables_init
>>>> +};
>>>
>>> Can't you perform the initialization as part of overall pthread initialization?  This would avoid the extra relocation.
> 
>> Thanks for your suggestion. I am not sure how to do it now and will take a look at it.
> 
> The code would go into nptl/nptl-init.c.  It's just an idea, but I think it should be possible to make it work.
> 

Thanks, I will take a look at it and see whether we can get a benefit.
> Thanks,
> Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-05-14  4:06       ` kemi
@ 2018-05-14  5:05         ` kemi
  2018-05-14  7:30         ` Florian Weimer
  1 sibling, 0 replies; 17+ messages in thread
From: kemi @ 2018-05-14  5:05 UTC (permalink / raw)
  To: Florian Weimer, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey



On 2018年05月14日 12:03, kemi wrote:
> 
> 
> On 2018年05月08日 23:44, Florian Weimer wrote:
>> On 05/02/2018 01:06 PM, kemi wrote:
>>> Hi, Florian
>>>      Thanks for your time to review.
>>>
>>> On 2018年05月02日 16:04, Florian Weimer wrote:
>>>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>>>
>>>>> +  mutex {
>>>>> +    spin_count {
>>>>> +      type: INT_32
>>>>> +      minval: 0
>>>>> +      maxval: 30000
>>>>> +      default: 1000
>>>>> +    }
>>>>
>>>> How did you come up with the default and maximum values?  Larger maximum values might be useful for testing boundary conditions.
>>>>
>>>
>>> For the maximum value of spin count:
>>> Please notice that mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8, and the variable *cnt* could reach
>>> the value of spin count due to spinning timeout. In such case, mutex->__data.__spins is increased and could be close to *cnt*
>>> (close to the value of spin count). Keeping the value of spin count less than MAX_SHORT can avoid the overflow of
>>>   mutex->__data.__spins variable with the possible type of short.
>>
>> Could you add this as a comment, please?
>>
> 
> Sure:)
> 
>>> For the default value of spin count:
>>> I referred to the previous number of 100 times for trylock in the loop. When this mode is changed to read only while spinning.
>>> I suppose the value could be larger because of lower overhead and latency of read compared with cmpxchg.
>>
>> Ahh, makes sense.  Perhaps put this information into the commit message.
>>
> 
> I investigated more on the default value of spin count recently.
> 
> It's obvious that we should provide a larger default value since the spinning way would be changed from "TRYLOCK"(cmpxchg) to 
> read only while spinning. But it's not a trivial issue to determine which one is the best if possible.
> The latency of atomic operation and read (e.g. cmpxchg) is determined by many factors, such as the position of cache line by which
> the data is owned, the state of cache line(M/E/S/I or even O), cache line transformation and etc.
> And some research report[1](Fig2 in that paper) before has shown that the latency of cmpxchg is 1.5x longer than single read at
> the same condition in Haswell.
> 
> So, let's set the default value of spin count as 150, and run some benchmark to test it.
> 
> What's your idea?
> 
> [1] Lesani, Mohsen, Todd Millstein, and Jens Palsberg. "Automatic Atomicity Verification for Clients of Concurrent Data Structures." 
> International Conference on Computer Aided Verification. Springer, Cham, 2014.
> 

The following paper is the correct reference; I cited the wrong one before.
Schweizer, Hermann, Maciej Besta, and Torsten Hoefler. "Evaluating the cost of atomic operations on modern architectures." Parallel 
Architecture and Compilation (PACT), 2015 International Conference on. IEEE, 2015.

>>> Perhaps we should make the default value of spin count differently according to architecture.
>>
>> Sure, or if there is just a single good choice for the tunable, just use that and remove the tunable again.  I guess one aspect here is to experiment with different values and see if there's a clear winner.
>>
> 
> Two reasons for keeping the tunables here:
> 1) The overhead of instructions are architecture-specific, so, it is hard or even impossible to have a perfect default value that fits all the architecture well.
> E.g. The pause instruction in the Skylake platform is 10x expensive than before.
> 
> 2) There are kinds of workload which may need a different spin timeout.
> I have heard many grumble of the pthread adaptive spin mutex from customers that does not work well in their practical workload. 
> Let's keep a tunable here for them.
> 
>>>>> +# define TUNABLE_CALLBACK_FNDECL(__name, __type)            \
>>>>> +static inline void                        \
>>>>> +__always_inline                            \
>>>>> +do_set_mutex_ ## __name (__type value)            \
>>>>> +{                                \
>>>>> +  __mutex_aconf.__name = value;                \
>>>>> +}                                \
>>>>> +void                                \
>>>>> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
>>>>> +{                                \
>>>>> +  __type value = (__type) (valp)->numval;            \
>>>>> +  do_set_mutex_ ## __name (value);                \
>>>>> +}
>>>>> +
>>>>> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
>>>>
>>>> I'm not sure if the macro is helpful in this context.
>>
>>> It is a matter of taste.
>>> But, perhaps we have other mutex tunables in future.
>>
>> We can still macroize the code at that point.  But no strong preference here.
>>
> 
> That's all right.
> 
>>>>> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
>>>>> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
>>>>> +{
>>>>> +  &mutex_tunables_init
>>>>> +};
>>>>
>>>> Can't you perform the initialization as part of overall pthread initialization?  This would avoid the extra relocation.
>>
>>> Thanks for your suggestion. I am not sure how to do it now and will take a look at it.
>>
>> The code would go into nptl/nptl-init.c.  It's just an idea, but I think it should be possible to make it work.
>>
> 
> thanks, will take a look at it and see whether we can get benefit.
>> Thanks,
>> Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-05-14  4:06       ` kemi
  2018-05-14  5:05         ` kemi
@ 2018-05-14  7:30         ` Florian Weimer
  2018-05-14  7:39           ` kemi
  1 sibling, 1 reply; 17+ messages in thread
From: Florian Weimer @ 2018-05-14  7:30 UTC (permalink / raw)
  To: kemi, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

On 05/14/2018 06:03 AM, kemi wrote:

> So, let's set the default value of spin count as 150, and run some benchmark to test it.
> 
> What's your idea?

Picking an arbitrary value for now is good enough, and using benchmarks 
to tune it in the future certainly makes sense.

> Two reasons for keeping the tunables here:
> 1) The overhead of instructions are architecture-specific, so, it is hard or even impossible to have a perfect default value that fits all the architecture well.
> E.g. The pause instruction in the Skylake platform is 10x expensive than before.

If we have particularly good numbers for one architecture, we could 
supply an architecture-specific default.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex
  2018-05-14  7:30         ` Florian Weimer
@ 2018-05-14  7:39           ` kemi
  0 siblings, 0 replies; 17+ messages in thread
From: kemi @ 2018-05-14  7:39 UTC (permalink / raw)
  To: Florian Weimer, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey



On 2018年05月14日 15:30, Florian Weimer wrote:
> On 05/14/2018 06:03 AM, kemi wrote:
> 
>> So, let's set the default value of spin count as 150, and run some benchmark to test it.
>>
>> What's your idea?
> 
> Picking an arbitrary value for now is good enough, and using benchmarks to tune it in the future certainly makes sense.
> 

Agreed.
I will also enhance the lock benchmark in V3.

>> Two reasons for keeping the tunables here:
>> 1) The overhead of instructions are architecture-specific, so, it is hard or even impossible to have a perfect default value that fits all the architecture well.
>> E.g. The pause instruction in the Skylake platform is 10x expensive than before.
> 
> If we have particularly good numbers for one architecture, we could supply an architecture-specific default.
> 

Agreed, and I will do that later for the x86 architecture.

> Thanks,
> Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
  2018-05-08 15:08       ` Florian Weimer
@ 2018-05-14  8:12         ` kemi
  0 siblings, 0 replies; 17+ messages in thread
From: kemi @ 2018-05-14  8:12 UTC (permalink / raw)
  To: Florian Weimer, Adhemerval Zanella, Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey



On 2018年05月08日 23:08, Florian Weimer wrote:
> On 05/02/2018 01:04 PM, kemi wrote:
>>
>>
>> On 2018年05月02日 16:19, Florian Weimer wrote:
>>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>>> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>>          if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>>        {
>>>>          int cnt = 0;
>>> …
>>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>>> +            mutex->__data.__spins * 2 + 100);
>>>> +
>>>> +      /* MO read while spinning */
>>>> +      do
>>>> +        {
>>>> +         atomic_spin_nop ();
>>>> +        }
>>>> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>>> +            ++cnt < max_cnt);
>>>> +        /* Try to acquire the lock if lock is available or the spin count
>>>> +         * is run out, call into kernel to block if fails
>>>> +         */
>>>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>> +        LLL_MUTEX_LOCK (mutex);
>>>>    
>>> …
>>>> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>>> +    }
>>>
>>> The indentation is off.  Comments should end with a ”.  ” (dot and two spaces).  Multi-line comments do not start with “*” on subsequent lines.  We don't use braces when we can avoid them.  Operators such as “&&” should be on the following line when breaking up lines.
>>>
>>
>> Will fold these changes in next version.
>> I am not familiar with glibc coding style, apologize for that.
> 
> No apology needed, it takes some time to get use to.
> 
>>> Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't be an unconditional call to LLL_MUTEX_LOCK be sufficient?
>>>
>>
>> The purpose of calling LLL_MUTEX_TRYLOCK here is to try to acquire the lock at user
>> space without block when we observed the lock is available. Thus, in case of multiple
>> spinners contending for the lock,  only one spinner can acquire the lock successfully
>> and others fall into block.
>>
>> I am not sure an unconditional call to LLL_MUTEX_LOCK as you mentioned here can satisfy
>> this purpose.
> 
> It's what we use for the default case.  It expands to lll_lock, so it should try atomic_compare_and_exchange_bool_acq first and only perform a futex syscall in case there is contention.  So I do think that LLL_MUTEX_TRYLOCK is redundant here.  Perhaps manually review the disassembly to make sure?
> 

Makes sense. I think you are right.

78            /* Normal mutex.  */
   0x00007ffff7989b20 <+80>:    and    $0x80,%edx
   0x00007ffff7989b26 <+86>:    mov    $0x1,%edi
   0x00007ffff7989b2b <+91>:    xor    %eax,%eax
   0x00007ffff7989b2d <+93>:    mov    %edx,%esi
   *0x00007ffff7989b2f <+95>:    lock cmpxchg %edi,(%r8)*
   *0x00007ffff7989b34 <+100>:   je     0x7ffff7989b4c <__GI___pthread_mutex_lock+124>*
   0x00007ffff7989b36 <+102>:   lea    (%r8),%rdi
   0x00007ffff7989b39 <+105>:   sub    $0x80,%rsp
   0x00007ffff7989b40 <+112>:   callq  0x7ffff7990830 <__lll_lock_wait>
   0x00007ffff7989b45 <+117>:   add    $0x80,%rsp

79            LLL_MUTEX_LOCK (mutex);
   0x00007ffff7989b4c <+124>:   mov    0x8(%r8),%edx
   0x00007ffff7989b50 <+128>:   test   %edx,%edx
   0x00007ffff7989b52 <+130>:   jne    0x7ffff7989cc9 <__GI___pthread_mutex_lock+505>
   0x00007ffff7989cc9 <+505>:   lea    0xa090(%rip),%rcx        # 0x7ffff7993d60 <__PRETTY_FUNCTION__.8768>
   0x00007ffff7989cd0 <+512>:   lea    0x9f05(%rip),%rsi        # 0x7ffff7993bdc
   0x00007ffff7989cd7 <+519>:   lea    0x9f1b(%rip),%rdi        # 0x7ffff7993bf9
   0x00007ffff7989cde <+526>:   mov    $0x4f,%edx
   0x00007ffff7989ce3 <+531>:   callq  0x7ffff7985840 <__assert_fail@plt>

80            assert (mutex->__data.__owner == 0);


>>
>>> But the real question is if the old way of doing CAS in a loop is beneficial on other, non-Intel architectures.  You either need get broad consensus from the large SMP architectures (various aarch64 implementations, IBM POWER and Z), or somehow make this opt-in at the source level.
>>>
>>
>> That would be a platform-specific change and have obvious performance improvement for x86 architecture.
>> And according to Adhemerval, this change could also have some improvement for aarch64 architecture.
>> If you or someone else still have some concern of performance regression on other architecture, making
>> this opt-in could eliminate people's worries.
>>
>> "
>> I checked the change on a 64 cores aarch64 machine, but
>> differently than previous patch this one seems to show improvements:
>>
>> nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>> 1               27566206        28776779 (4.206770)  28778073 (4.211078)
>> 2               8498813         9129102 (6.904173)   7042975 (-20.670782)
>> 7               5019434         5832195 (13.935765)  5098511 (1.550982)
>> 14              4379155         6507212 (32.703053)  5200018 (15.785772)
>> 28              4397464         4584480 (4.079329)   4456767 (1.330628)
>> 56              4020956         3534899 (-13.750237) 4096197 (1.836850)
>> "
> 
> Ah, nice, I had missed that.  I suppose this means we can risk enabling it by default.
> 
There are two major changes in this patch.
1) Change the spinning method from trylock to read-only spinning (see the sketch below). This should benefit many
architectures, since it avoids the expensive memory synchronization among processors caused by
read-for-ownership requests. A similar approach has been adopted by the kernel spinlock and pthread spinlock.

2) Put a thread to sleep if it fails to acquire the lock even though it observed the lock to be available.
This is a radical change. It can bring significant performance improvement in the case of severe
lock contention (many threads contending for one lock), because a spinner then almost never acquires
the lock within its spinning window. Thus, putting it on a waiting list saves CPU time and power
budget and eliminates the overhead of the atomic operation.
For the case in which a thread may well acquire the lock while spinning, I understand it is controversial
to call into the kernel to block rather than keep spinning until timeout.
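
The gist of the two spinning strategies in 1) (my simplified rendering; the new form is the patch hunk quoted above):

/* Old adaptive spin, roughly: a CAS (trylock) on every iteration,
   so each attempt may issue a read-for-ownership request and bounce
   the cache line between spinners.  */
do
  atomic_spin_nop ();
while (LLL_MUTEX_TRYLOCK (mutex) != 0 && ++cnt < max_cnt);

/* New: spin on a relaxed load, which lets the cache line stay in
   shared state across all spinners, and attempt the atomic only
   once at the end.  */
do
  atomic_spin_nop ();
while (atomic_load_relaxed (&mutex->__data.__lock) != 0
       && ++cnt < max_cnt);
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
  LLL_MUTEX_LOCK (mutex);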

Therefore, I will drop the second change from the V3 series. More investigation and testing will be done before
pushing it.

I will send V3 for that if no one has a different idea. Thanks

> Thanks,
> Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-05-14  8:12 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-25  2:59 [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Kemi Wang
2018-04-25  2:59 ` [PATCH v2 2/3] benchtests: Add pthread adaptive spin mutex microbenchmark Kemi Wang
2018-04-25  2:59 ` [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm Kemi Wang
2018-05-02  8:19   ` Florian Weimer
2018-05-02 11:07     ` kemi
2018-05-08 15:08       ` Florian Weimer
2018-05-14  8:12         ` kemi
2018-04-25  4:02 ` [PATCH v2 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Rical Jasan
2018-04-25  5:14   ` kemi
2018-05-02  1:54 ` kemi
2018-05-02  8:04 ` Florian Weimer
2018-05-02 11:08   ` kemi
2018-05-08 15:44     ` Florian Weimer
2018-05-14  4:06       ` kemi
2018-05-14  5:05         ` kemi
2018-05-14  7:30         ` Florian Weimer
2018-05-14  7:39           ` kemi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).