public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex
@ 2018-03-30  7:17 Kemi Wang
  2018-03-30  7:17 ` [PATCH 2/3] Mutex: Only read while spinning Kemi Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Kemi Wang @ 2018-03-30  7:17 UTC (permalink / raw)
  To: Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

This patch does not change any functionality; it only adds a spin count
tunable for the pthread adaptive spin mutex. The tunable
glibc.mutex.spin_count can be used by a system administrator to tune
system performance according to different hardware capabilities and
workload models.

This is preparation work for the next patch, in which the adaptive
spinning is changed from an expensive cmpxchg loop to a plain read while
spinning.

   * elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
   * manual/tunables.texi: Add glibc.mutex.spin_count description.
   * nptl/Makefile: Add mutex-conf.c for compilation.
   * nptl/mutex-conf.h: New file.
   * nptl/mutex-conf.c: New file.

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 ChangeLog            | 10 ++++++-
 elf/dl-tunables.list | 10 +++++++
 manual/tunables.texi | 17 ++++++++++++
 nptl/Makefile        |  3 +-
 nptl/mutex-conf.c    | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 nptl/mutex-conf.h    | 31 +++++++++++++++++++++
 6 files changed, 147 insertions(+), 2 deletions(-)
 create mode 100644 nptl/mutex-conf.c
 create mode 100644 nptl/mutex-conf.h

diff --git a/ChangeLog b/ChangeLog
index 1f98425..472657c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
-2018-03-29  Florian Weimer  <fweimer@redhat.com>
+2018-03-30 Kemi Wang <kemi.wang@intel.com>
+
+	* elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
+	* manual/tunables.texi: Add glibc.mutex.spin_count description.
+	* nptl/Makefile: Add mutex-conf.c for compilation.
+	* nptl/mutex-conf.h: New file.
+	* nptl/mutex-conf.c: New file.
+	* nptl/Makefile: Add new file compilation.
 
+2018-03-29  Florian Weimer  <fweimer@redhat.com>
 	* sysdeps/unix/sysv/linux/i386/tst-bz21269.c (do_test): Also
 	capture SIGBUS.
 
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index 1f8ecb8..0c27c14 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -121,4 +121,14 @@ glibc {
       default: 3
     }
   }
+
+  mutex {
+	  spin_count {
+		  type: INT_32
+		  minval: 0
+		  maxval: 30000
+		  env_alias: LD_SPIN_COUNT
+		  default: 1000
+	  }
+  }
 }
diff --git a/manual/tunables.texi b/manual/tunables.texi
index be33c9f..9c6a9f1 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -281,6 +281,23 @@ of try lock attempts.
 The default value of this tunable is @samp{3}.
 @end deftp
 
+@node Pthread Mutex Tunables
+@section Pthread Mutex Tunables
+@cindex pthread mutex tunables
+
+@deftp {Tunable namespace} glibc.mutex
+Behavior of ptherad mutex can be tuned to acquire performance improvement
+according to specific hardware capablity and workload character by setting
+the following tunables in the @code{mutex} namespace.
+@end deftp
+
+@deftp Tunable glibc.mutex.spin_count
+The @code{glibc.mutex.spin_count} tunable set the maximum times the thread
+should spin on the lock before going to sleep.
+
+The default value of this tunable is @samp{1000}.
+@end deftp
+
 @node Hardware Capability Tunables
 @section Hardware Capability Tunables
 @cindex hardware capability tunables
diff --git a/nptl/Makefile b/nptl/Makefile
index 94be92c..5edacea 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -139,7 +139,8 @@ libpthread-routines = nptl-init vars events version pt-interp \
 		      pthread_mutex_getprioceiling \
 		      pthread_mutex_setprioceiling \
 		      pthread_setname pthread_getname \
-		      pthread_setattr_default_np pthread_getattr_default_np
+		      pthread_setattr_default_np pthread_getattr_default_np \
+		      mutex-conf
 #		      pthread_setuid pthread_seteuid pthread_setreuid \
 #		      pthread_setresuid \
 #		      pthread_setgid pthread_setegid pthread_setregid \
diff --git a/nptl/mutex-conf.c b/nptl/mutex-conf.c
new file mode 100644
index 0000000..f4ffd6d
--- /dev/null
+++ b/nptl/mutex-conf.c
@@ -0,0 +1,78 @@
+/* mutex-conf.c: Pthread mutex tunable parameters.
+   Copyright (C) 2013-2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include <pthreadP.h>
+#include <init-arch.h>
+#include <mutex-conf.h>
+#include <unistd.h>
+
+#if HAVE_TUNABLES
+# define TUNABLE_NAMESPACE mutex
+#endif
+#include <elf/dl-tunables.h>
+
+
+struct mutex_config __mutex_aconf =
+  {
+	  /* The maximum times a thread spin on the lock before
+	   * going to sleep */
+	  .spin_count = 1000,
+  };
+
+#if HAVE_TUNABLES
+#define TUNABLE_CALLBACK_FNDECL(__name, __type)			\
+static inline void						\
+__always_inline							\
+do_set_mutex_ ## __name (__type value)			\
+{								\
+  __mutex_aconf.__name = value;				\
+}								\
+void								\
+TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
+{								\
+  __type value = (__type) (valp)->numval;			\
+  do_set_mutex_ ## __name (value);				\
+}
+
+TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
+#endif
+
+static void
+mutex_tunables_init (int argc __attribute__ ((unused)),
+			      char **argv  __attribute__ ((unused)),
+					      char **environ)
+{
+#if HAVE_TUNABLES
+
+	TUNABLE_GET (spin_count, int32_t,
+			TUNABLE_CALLBACK (set_mutex_spin_count));
+#endif
+}
+
+#ifdef SHARED
+# define INIT_SECTION ".init_array"
+#else
+# define INIT_SECTION ".preinit_array"
+#endif
+
+void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
+  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
+{
+  &mutex_tunables_init
+};
diff --git a/nptl/mutex-conf.h b/nptl/mutex-conf.h
new file mode 100644
index 0000000..babefe3
--- /dev/null
+++ b/nptl/mutex-conf.h
@@ -0,0 +1,31 @@
+/* mutex-conf.h: Pthread mutex tunable parameters.
+   Copyright (C) 2013-2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+#ifndef _MUTEX_CONF_H
+#define _MUTEX_CONF_H 1
+
+#include <pthread.h>
+#include <time.h>
+
+struct mutex_config
+{
+  int spin_count;
+};
+
+extern struct mutex_config __mutex_aconf attribute_hidden;
+
+#endif
-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] Mutex: Only read while spinning
  2018-03-30  7:17 [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Kemi Wang
@ 2018-03-30  7:17 ` Kemi Wang
  2018-04-05 20:55   ` Adhemerval Zanella
  2018-03-30  7:17 ` [PATCH 3/3] Mutex: Avoid useless spinning Kemi Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Kemi Wang @ 2018-03-30  7:17 UTC (permalink / raw)
  To: Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

The pthread adaptive spin mutex spins on the lock for a while before going
to sleep. While the lock is contended and we need to wait, going straight
back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
that forces expensive memory synchronization among processors and penalizes
other running threads. For example, it constantly floods the system with
"read for ownership" requests, which are much more expensive to process
than a single read. Thus, we use only a relaxed MO read until we observe
that the lock is no longer held, as suggested by Andi Kleen.

Test machine:
2-socket Skylake platform, 112 cores with 62G RAM

Test case: contended pthread adaptive spin mutex with a global update.
Each thread of the workload does:
a) Lock the mutex (adaptive spin type)
b) Increment a global variable
c) Unlock the mutex
in a loop until timeout, and the main thread reports the total iteration
count of all the threads in one second.

This test case is the same as Will-it-scale.pthread_mutex3 except that the
mutex type is changed to PTHREAD_MUTEX_ADAPTIVE_NP.
github: https://github.com/antonblanchard/will-it-scale.git

nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
1               51644585        51307573(-0.7%)    51323778(-0.6%)
2               7914789         10011301(+26.5%)   9867343(+24.7%)
7               1687620         4224135(+150.3%)   3430504(+103.3%)
14              1026555         3784957(+268.7%)   1843458(+79.6%)
28              962001          2886885(+200.1%)   681965(-29.1%)
56              883770          2740755(+210.1%)   364879(-58.7%)
112             1150589         2707089(+135.3%)   415261(-63.9%)

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..c3aca93 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <mutex-conf.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
 	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			mutex->__data.__spins * 2 + 100);
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
+		if (cnt >= max_cnt)
+		  {
+		    LLL_MUTEX_LOCK (mutex);
+		    break;
+		  }
+		/* MO read while spinning */
+		do
+		  {
+		    atomic_spin_nop ();
+		  }
+		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
+			++cnt < max_cnt);
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
-- 
2.7.4

* [PATCH 3/3] Mutex: Avoid useless spinning
  2018-03-30  7:17 [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Kemi Wang
  2018-03-30  7:17 ` [PATCH 2/3] Mutex: Only read while spinning Kemi Wang
@ 2018-03-30  7:17 ` Kemi Wang
  2018-04-05 20:59   ` Adhemerval Zanella
  2018-04-02 15:19 ` [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Adhemerval Zanella
  2018-04-04 10:27 ` kemi
  3 siblings, 1 reply; 14+ messages in thread
From: Kemi Wang @ 2018-03-30  7:17 UTC (permalink / raw)
  To: Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

Usually, we cannot set too short a timeout for spinning on the lock; that
would likely make a thread trying to acquire the lock go to sleep quickly
and thus weaken the benefit of the pthread adaptive spin lock.

However, there is also a problem if we set the timeout too large when a
small critical section is protected under severe lock contention. As the
test results in the last patch show, performance is highly affected by the
spin count tunable: the smaller the spin count, the better the performance
improvement. This is because, under severe lock contention, a thread
probably spins on the lock until timeout before going to sleep.

In this patch, we avoid useless spinning by making the spinner go to sleep
if it fails to acquire the lock once the lock is observed to be available,
as suggested by Tim Chen.

nr_threads    base         COUNT=1000(head~1)   COUNT=1000(head)
1             51644585     51323778(-0.6%)      51378551(-0.5%)
2             7914789      9867343(+24.7%)      11503559(+45.3%)
7             1687620      3430504(+103.3%)     7817383(+363.2%)
14            1026555      1843458(+79.6%)      7360883(+617.0%)
28            962001       681965(-29.1%)       5681945(+490.6%)
56            883770       364879(-58.7%)       3416068(+286.5%)
112           1150589      415261(-63.9%)       3255700(+183.0%)

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_lock.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index c3aca93..0faee1a 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -127,22 +127,19 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (__mutex_aconf.spin_count,
 			mutex->__data.__spins * 2 + 100);
+
+     	/* MO read while spinning */
 	  do
-	    {
-		if (cnt >= max_cnt)
-		  {
-		    LLL_MUTEX_LOCK (mutex);
-		    break;
-		  }
-		/* MO read while spinning */
-		do
-		  {
-		    atomic_spin_nop ();
-		  }
-		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
+    	{
+		  atomic_spin_nop ();
+		}
+	  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
 			++cnt < max_cnt);
-	    }
-	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
+	    /* Try to acquire the lock if lock is available or the spin count
+	     * is run out, go to sleep if fails
+	     */
+	  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+		  LLL_MUTEX_LOCK (mutex);
 
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
-- 
2.7.4

* Re: [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex
  2018-03-30  7:17 [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Kemi Wang
  2018-03-30  7:17 ` [PATCH 2/3] Mutex: Only read while spinning Kemi Wang
  2018-03-30  7:17 ` [PATCH 3/3] Mutex: Avoid useless spinning Kemi Wang
@ 2018-04-02 15:19 ` Adhemerval Zanella
  2018-04-04 10:27 ` kemi
  3 siblings, 0 replies; 14+ messages in thread
From: Adhemerval Zanella @ 2018-04-02 15:19 UTC (permalink / raw)
  To: libc-alpha



On 30/03/2018 04:14, Kemi Wang wrote:
> This patch does not change any functionality; it only adds a spin count
> tunable for the pthread adaptive spin mutex. The tunable
> glibc.mutex.spin_count can be used by a system administrator to tune
> system performance according to different hardware capabilities and
> workload models.
> 
> This is preparation work for the next patch, in which the adaptive
> spinning is changed from an expensive cmpxchg loop to a plain read while
> spinning.
> 
>    * elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
>    * manual/tunables.texi: Add glibc.mutex.spin_count description.
>    * nptl/Makefile: Add mutex-conf.c for compilation.
>    * nptl/mutex-conf.h: New file.
>    * nptl/mutex-conf.c: New file.
> 
> Suggested-by: Andi Kleen <andi.kleen@intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>

> ---
>  ChangeLog            | 10 ++++++-
>  elf/dl-tunables.list | 10 +++++++
>  manual/tunables.texi | 17 ++++++++++++
>  nptl/Makefile        |  3 +-
>  nptl/mutex-conf.c    | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  nptl/mutex-conf.h    | 31 +++++++++++++++++++++
>  6 files changed, 147 insertions(+), 2 deletions(-)
>  create mode 100644 nptl/mutex-conf.c
>  create mode 100644 nptl/mutex-conf.h
> 
> diff --git a/ChangeLog b/ChangeLog
> index 1f98425..472657c 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,5 +1,13 @@
> -2018-03-29  Florian Weimer  <fweimer@redhat.com>
> +2018-03-30 Kemi Wang <kemi.wang@intel.com>
> +
> +	* elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
> +	* manual/tunables.texi: Add glibc.mutex.spin_count description.
> +	* nptl/Makefile: Add mutex-conf.c for compilation.
> +	* nptl/mutex-conf.h: New file.
> +	* nptl/mutex-conf.c: New file.
> +	* nptl/Makefile: Add new file compilation.
>  
> +2018-03-29  Florian Weimer  <fweimer@redhat.com>
>  	* sysdeps/unix/sysv/linux/i386/tst-bz21269.c (do_test): Also
>  	capture SIGBUS.
>  
> diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
> index 1f8ecb8..0c27c14 100644
> --- a/elf/dl-tunables.list
> +++ b/elf/dl-tunables.list
> @@ -121,4 +121,14 @@ glibc {
>        default: 3
>      }
>    }
> +
> +  mutex {
> +	  spin_count {
> +		  type: INT_32
> +		  minval: 0
> +		  maxval: 30000
> +		  env_alias: LD_SPIN_COUNT
> +		  default: 1000
> +	  }
> +  }

Indentation seems off here; the rest of the file uses double spaces while
your patch uses tabs.

Also I am not sure it is worth adding an environment variable for this
tunable; I would rather avoid adding new ones (and the naming seems off,
since afaik LD_* is meant for parameters that affect the loader).

>  }
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index be33c9f..9c6a9f1 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -281,6 +281,23 @@ of try lock attempts.
>  The default value of this tunable is @samp{3}.
>  @end deftp
>  
> +@node Pthread Mutex Tunables
> +@section Pthread Mutex Tunables
> +@cindex pthread mutex tunables
> +
> +@deftp {Tunable namespace} glibc.mutex
> +Behavior of ptherad mutex can be tuned to acquire performance improvement
> +according to specific hardware capablity and workload character by setting
> +the following tunables in the @code{mutex} namespace.
> +@end deftp

There is a typo (s/ptherad/pthread) and I think 'acquire' is not the best
word here, I would use 'increase' (I am not a native speaker, so someone
might suggest a better wording here).

> +
> +@deftp Tunable glibc.mutex.spin_count
> +The @code{glibc.mutex.spin_count} tunable set the maximum times the thread
> +should spin on the lock before going to sleep.
> +
> +The default value of this tunable is @samp{1000}.
> +@end deftp

I think we need to expand it to specify:

  * spinning is only used for a mutex initialized with the
    PTHREAD_MUTEX_ADAPTIVE_NP GNU extension (the current wording is
    ambiguous).

  * It affects both pthread_mutex_lock and pthread_mutex_timedlock.

  * The spinning is done by first issuing an atomic operation similar to
    trylock, followed by an arch-specific no-operation (not sure if we
    need to expand on how the spinning backoff value is incremented).

  * Not sure 'sleep' is the right terminology here, since on Linux the
    futex call can return right away when the maximum spin_count is
    reached.

> +
>  @node Hardware Capability Tunables
>  @section Hardware Capability Tunables
>  @cindex hardware capability tunables
> diff --git a/nptl/Makefile b/nptl/Makefile
> index 94be92c..5edacea 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -139,7 +139,8 @@ libpthread-routines = nptl-init vars events version pt-interp \
>  		      pthread_mutex_getprioceiling \
>  		      pthread_mutex_setprioceiling \
>  		      pthread_setname pthread_getname \
> -		      pthread_setattr_default_np pthread_getattr_default_np
> +		      pthread_setattr_default_np pthread_getattr_default_np \
> +		      mutex-conf
>  #		      pthread_setuid pthread_seteuid pthread_setreuid \
>  #		      pthread_setresuid \
>  #		      pthread_setgid pthread_setegid pthread_setregid \

I think a better name for the file would be pthread_mutex_conf.c.

> diff --git a/nptl/mutex-conf.c b/nptl/mutex-conf.c
> new file mode 100644
> index 0000000..f4ffd6d
> --- /dev/null
> +++ b/nptl/mutex-conf.c
> @@ -0,0 +1,78 @@
> +/* mutex-conf.c: Pthread mutex tunable parameters.
> +   Copyright (C) 2013-2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +

Copyright should start at 2018 for newer implementations.

> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include <pthreadP.h>
> +#include <init-arch.h>
> +#include <mutex-conf.h>
> +#include <unistd.h>
> +
> +#if HAVE_TUNABLES
> +# define TUNABLE_NAMESPACE mutex
> +#endif
> +#include <elf/dl-tunables.h>
> +
> +
> +struct mutex_config __mutex_aconf =
> +  {
> +	  /* The maximum times a thread spin on the lock before
> +	   * going to sleep */
> +	  .spin_count = 1000,
> +  };
> +

Indentation seems off here and in other places in this file.

> +#if HAVE_TUNABLES
> +#define TUNABLE_CALLBACK_FNDECL(__name, __type)			\

Indentation for nested preprocessor is to just one space each level:

#if HAVE_TUNABLES
# define ... 

> +static inline void						\
> +__always_inline							\
> +do_set_mutex_ ## __name (__type value)			\
> +{								\
> +  __mutex_aconf.__name = value;				\
> +}								\
> +void								\
> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
> +{								\
> +  __type value = (__type) (valp)->numval;			\
> +  do_set_mutex_ ## __name (value);				\
> +}
> +
> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
> +#endif
> +
> +static void
> +mutex_tunables_init (int argc __attribute__ ((unused)),
> +			      char **argv  __attribute__ ((unused)),
> +					      char **environ)
> +{
> +#if HAVE_TUNABLES
> +
> +	TUNABLE_GET (spin_count, int32_t,
> +			TUNABLE_CALLBACK (set_mutex_spin_count));
> +#endif
> +}
> +
> +#ifdef SHARED
> +# define INIT_SECTION ".init_array"
> +#else
> +# define INIT_SECTION ".preinit_array"
> +#endif
> +
> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
> +{
> +  &mutex_tunables_init
> +};
> diff --git a/nptl/mutex-conf.h b/nptl/mutex-conf.h
> new file mode 100644
> index 0000000..babefe3
> --- /dev/null
> +++ b/nptl/mutex-conf.h
> @@ -0,0 +1,31 @@
> +/* mutex-conf.h: Pthread mutex tunable parameters.
> +   Copyright (C) 2013-2018 Free Software Foundation, Inc.

Copyright should start at 2018 for newer implementations.

> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +#ifndef _MUTEX_CONF_H
> +#define _MUTEX_CONF_H 1
> +
> +#include <pthread.h>
> +#include <time.h>
> +
> +struct mutex_config
> +{
> +  int spin_count;
> +};
> +
> +extern struct mutex_config __mutex_aconf attribute_hidden;
> +
> +#endif
> 

* Re: [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex
  2018-03-30  7:17 [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Kemi Wang
                   ` (2 preceding siblings ...)
  2018-04-02 15:19 ` [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Adhemerval Zanella
@ 2018-04-04 10:27 ` kemi
  2018-04-04 17:17   ` Adhemerval Zanella
  3 siblings, 1 reply; 14+ messages in thread
From: kemi @ 2018-04-04 10:27 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Glibc alpha, aubrey

Hi, Adhemerval
   Thanks for your review. Could you please also help to review the other two patches in this
series when you have time? Also, please keep these people in the Cc list, since some of them
may not subscribe to the glibc mailing list. Thanks:)

See my reply below.

On 2018-03-30 15:14, Kemi Wang wrote:
> On 30/03/2018 04:14, Kemi Wang wrote:
>> This patch does not change any functionality; it only adds a spin count
>> tunable for the pthread adaptive spin mutex. The tunable
>> glibc.mutex.spin_count can be used by a system administrator to tune
>> system performance according to different hardware capabilities and
>> workload models.
>>
>> This is preparation work for the next patch, in which the adaptive
>> spinning is changed from an expensive cmpxchg loop to a plain read while
>> spinning.
>>
>>    * elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
>>    * manual/tunables.texi: Add glibc.mutex.spin_count description.
>>    * nptl/Makefile: Add mutex-conf.c for compilation.
>>    * nptl/mutex-conf.h: New file.
>>    * nptl/mutex-conf.c: New file.
>>
>> Suggested-by: Andi Kleen <andi.kleen@intel.com>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> 
>> ---
>>  ChangeLog            | 10 ++++++-
>>  elf/dl-tunables.list | 10 +++++++
>>  manual/tunables.texi | 17 ++++++++++++
>>  nptl/Makefile        |  3 +-
>>  nptl/mutex-conf.c    | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  nptl/mutex-conf.h    | 31 +++++++++++++++++++++
>>  6 files changed, 147 insertions(+), 2 deletions(-)
>>  create mode 100644 nptl/mutex-conf.c
>>  create mode 100644 nptl/mutex-conf.h
>>
>> diff --git a/ChangeLog b/ChangeLog
>> index 1f98425..472657c 100644
>> --- a/ChangeLog
>> +++ b/ChangeLog
>> @@ -1,5 +1,13 @@
>> -2018-03-29  Florian Weimer  <fweimer@redhat.com>
>> +2018-03-30 Kemi Wang <kemi.wang@intel.com>
>> +
>> +	* elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
>> +	* manual/tunables.texi: Add glibc.mutex.spin_count description.
>> +	* nptl/Makefile: Add mutex-conf.c for compilation.
>> +	* nptl/mutex-conf.h: New file.
>> +	* nptl/mutex-conf.c: New file.
>> +	* nptl/Makefile: Add new file compilation.
>>  
>> +2018-03-29  Florian Weimer  <fweimer@redhat.com>
>>  	* sysdeps/unix/sysv/linux/i386/tst-bz21269.c (do_test): Also
>>  	capture SIGBUS.
>>  
>> diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
>> index 1f8ecb8..0c27c14 100644
>> --- a/elf/dl-tunables.list
>> +++ b/elf/dl-tunables.list
>> @@ -121,4 +121,14 @@ glibc {
>>        default: 3
>>      }
>>    }
>> +
>> +  mutex {
>> +	  spin_count {
>> +		  type: INT_32
>> +		  minval: 0
>> +		  maxval: 30000
>> +		  env_alias: LD_SPIN_COUNT
>> +		  default: 1000
>> +	  }
>> +  }
> 
> Indentation seems off here; the rest of the file uses double spaces while
> your patch uses tabs.
> 

Thanks for catching it.

> Also I am not sure it is worth adding an environment variable for this
> tunable; I would rather avoid adding new ones (and the naming seems off,
> since afaik LD_* is meant for parameters that affect the loader).
> 

AFAIK, an environment variable is all I can think of to give people the possibility of
tuning this from the shell; maybe you have a better idea?

Yes, we could use another name like MUTEX_SPIN_COUNT to avoid confusion, agree?

>>  }
>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>> index be33c9f..9c6a9f1 100644
>> --- a/manual/tunables.texi
>> +++ b/manual/tunables.texi
>> @@ -281,6 +281,23 @@ of try lock attempts.
>>  The default value of this tunable is @samp{3}.
>>  @end deftp
>>  
>> +@node Pthread Mutex Tunables
>> +@section Pthread Mutex Tunables
>> +@cindex pthread mutex tunables
>> +
>> +@deftp {Tunable namespace} glibc.mutex
>> +Behavior of ptherad mutex can be tuned to acquire performance improvement
>> +according to specific hardware capablity and workload character by setting
>> +the following tunables in the @code{mutex} namespace.
>> +@end deftp
> 
> There is a typo (s/ptherad/pthread) 

thanks for catching it.

> and I think 'acquire' is not the best
> word here, I would use 'increase' (I am not a native speaker, so someone
> might suggest a better wording here).
> 

All right, I will ask some native speakers to help check it.

>> +
>> +@deftp Tunable glibc.mutex.spin_count
>> +The @code{glibc.mutex.spin_count} tunable set the maximum times the thread
>> +should spin on the lock before going to sleep.
>> +
>> +The default value of this tunable is @samp{1000}.
>> +@end deftp
> 
> I think we need to expand it to specify:
> 
>   * spinning is only used for a mutex initialized with the
>     PTHREAD_MUTEX_ADAPTIVE_NP GNU extension (the current wording is
>     ambiguous).
> 
>   * It affects both pthread_mutex_lock and pthread_mutex_timedlock.
> 
>   * The spinning is done by first issuing an atomic operation similar to
>     trylock, followed by an arch-specific no-operation (not sure if we
>     need to expand on how the spinning backoff value is incremented).
> 	
      How about this?
      The spinning stops when either the maximum spin count
      is reached or the lock is acquired during spinning.

>   * Not sure 'sleep' is the right terminology here, since on Linux the
>     futex call can return right away when the maximum spin_count is
>     reached.
>
    'block' here may be more precise, agree?
    e.g. use "before calling into the kernel to block" or something like that

>> +
>>  @node Hardware Capability Tunables
>>  @section Hardware Capability Tunables
>>  @cindex hardware capability tunables
>> diff --git a/nptl/Makefile b/nptl/Makefile
>> index 94be92c..5edacea 100644
>> --- a/nptl/Makefile
>> +++ b/nptl/Makefile
>> @@ -139,7 +139,8 @@ libpthread-routines = nptl-init vars events version pt-interp \
>>  		      pthread_mutex_getprioceiling \
>>  		      pthread_mutex_setprioceiling \
>>  		      pthread_setname pthread_getname \
>> -		      pthread_setattr_default_np pthread_getattr_default_np
>> +		      pthread_setattr_default_np pthread_getattr_default_np \
>> +		      mutex-conf
>>  #		      pthread_setuid pthread_seteuid pthread_setreuid \
>>  #		      pthread_setresuid \
>>  #		      pthread_setgid pthread_setegid pthread_setregid \
> 
> I think a better name for the file would be pthread_mutex_conf.c.
> 

Agree, will send V2 to fix it.

>> diff --git a/nptl/mutex-conf.c b/nptl/mutex-conf.c
>> new file mode 100644
>> index 0000000..f4ffd6d
>> --- /dev/null
>> +++ b/nptl/mutex-conf.c
>> @@ -0,0 +1,78 @@
>> +/* mutex-conf.c: Pthread mutex tunable parameters.
>> +   Copyright (C) 2013-2018 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
> 
> Copyright should start at 2018 for newer implementations.
> 

I was not aware of that before; do you mean "Copyright (C) 2018-2023"?

>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include "config.h"
>> +#include <pthreadP.h>
>> +#include <init-arch.h>
>> +#include <mutex-conf.h>
>> +#include <unistd.h>
>> +
>> +#if HAVE_TUNABLES
>> +# define TUNABLE_NAMESPACE mutex
>> +#endif
>> +#include <elf/dl-tunables.h>
>> +
>> +
>> +struct mutex_config __mutex_aconf =
>> +  {
>> +	  /* The maximum times a thread spin on the lock before
>> +	   * going to sleep */
>> +	  .spin_count = 1000,
>> +  };
>> +
> 
> Indentation seems off here and in other places in this file.
> 

Will fix it in V2, thanks for pointing it out.

>> +#if HAVE_TUNABLES
>> +#define TUNABLE_CALLBACK_FNDECL(__name, __type)			\
> 
> Indentation for nested preprocessor is to just one space each level:
> 
> #if HAVE_TUNABLES
> # define ... 
> 
>> +static inline void						\
>> +__always_inline							\
>> +do_set_mutex_ ## __name (__type value)			\
>> +{								\
>> +  __mutex_aconf.__name = value;				\
>> +}								\
>> +void								\
>> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
>> +{								\
>> +  __type value = (__type) (valp)->numval;			\
>> +  do_set_mutex_ ## __name (value);				\
>> +}
>> +
>> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
>> +#endif
>> +
>> +static void
>> +mutex_tunables_init (int argc __attribute__ ((unused)),
>> +			      char **argv  __attribute__ ((unused)),
>> +					      char **environ)
>> +{
>> +#if HAVE_TUNABLES
>> +
>> +	TUNABLE_GET (spin_count, int32_t,
>> +			TUNABLE_CALLBACK (set_mutex_spin_count));
>> +#endif
>> +}
>> +
>> +#ifdef SHARED
>> +# define INIT_SECTION ".init_array"
>> +#else
>> +# define INIT_SECTION ".preinit_array"
>> +#endif
>> +
>> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
>> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
>> +{
>> +  &mutex_tunables_init
>> +};> diff --git a/nptl/mutex-conf.h b/nptl/mutex-conf.h
>> new file mode 100644
>> index 0000000..babefe3
>> --- /dev/null
>> +++ b/nptl/mutex-conf.h
>> @@ -0,0 +1,31 @@
>> +/* mutex-conf.h: Pthread mutex tunable parameters.
>> +   Copyright (C) 2013-2018 Free Software Foundation, Inc.
> 
> Copyright should start at 2018 for newer implementations.
> 

Sure.

>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +#ifndef _MUTEX_CONF_H
>> +#define _MUTEX_CONF_H 1
>> +
>> +#include <pthread.h>
>> +#include <time.h>
>> +
>> +struct mutex_config
>> +{
>> +  int spin_count;
>> +};
>> +
>> +extern struct mutex_config __mutex_aconf attribute_hidden;
>> +
>> +#endif
>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex
  2018-04-04 10:27 ` kemi
@ 2018-04-04 17:17   ` Adhemerval Zanella
  2018-04-05  1:11     ` Carlos O'Donell
  0 siblings, 1 reply; 14+ messages in thread
From: Adhemerval Zanella @ 2018-04-04 17:17 UTC (permalink / raw)
  To: kemi
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Glibc alpha, aubrey



On 04/04/2018 07:25, kemi wrote:
> Hi, Adhemerval
>    Thanks for your review. Could you please also review the other two patches in this
> series if you have time? Also, please keep everyone on the Cc list, since some of them
> may not be subscribed to the glibc mailing list. Thanks :)

Yeah, I am checking the patch along with the referenced benchmark on an aarch64
machine as well.

> 
> See my reply below.
> 
> On 2018年03月30日 15:14, Kemi Wang wrote:
>> On 30/03/2018 04:14, Kemi Wang wrote:
>>> This patch does not have any functionality change, we only provide a spin
>>> count tunes for pthread adaptive spin mutex. The tunable
>>> glibc.mutex.spin_count tunes can be used by system adminstrator to squeeze
>>> system performance according to different hardware capability and workload
>>> model.
>>>
>>> This is the preparation work for the next patch, in which the way of
>>> adaptive spin would be changed from an expensive cmpxchg to read while
>>> spinning.
>>>
>>>    * elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
>>>    * manual/tunables.texi: Add glibc.mutex.spin_count description.
>>>    * nptl/Makefile: Add mutex-conf.c for compilation.
>>>    * nptl/mutex-conf.h: New file.
>>>    * nptl/mutex-conf.c: New file.
>>>
>>> Suggested-by: Andi Kleen <andi.kleen@intel.com>
>>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>>
>>> ---
>>>  ChangeLog            | 10 ++++++-
>>>  elf/dl-tunables.list | 10 +++++++
>>>  manual/tunables.texi | 17 ++++++++++++
>>>  nptl/Makefile        |  3 +-
>>>  nptl/mutex-conf.c    | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  nptl/mutex-conf.h    | 31 +++++++++++++++++++++
>>>  6 files changed, 147 insertions(+), 2 deletions(-)
>>>  create mode 100644 nptl/mutex-conf.c
>>>  create mode 100644 nptl/mutex-conf.h
>>>
>>> diff --git a/ChangeLog b/ChangeLog
>>> index 1f98425..472657c 100644
>>> --- a/ChangeLog
>>> +++ b/ChangeLog
>>> @@ -1,5 +1,13 @@
>>> -2018-03-29  Florian Weimer  <fweimer@redhat.com>
>>> +2018-03-30 Kemi Wang <kemi.wang@intel.com>
>>> +
>>> +	* elf/dl-tunables.list: Add glibc.mutex.spin_count entry.
>>> +	* manual/tunables.texi: Add glibc.mutex.spin_count description.
>>> +	* nptl/Makefile: Add mutex-conf.c for compilation.
>>> +	* nptl/mutex-conf.h: New file.
>>> +	* nptl/mutex-conf.c: New file.
>>> +	* nptl/Makefile: Add new file compilation.
>>>  
>>> +2018-03-29  Florian Weimer  <fweimer@redhat.com>
>>>  	* sysdeps/unix/sysv/linux/i386/tst-bz21269.c (do_test): Also
>>>  	capture SIGBUS.
>>>  
>>> diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
>>> index 1f8ecb8..0c27c14 100644
>>> --- a/elf/dl-tunables.list
>>> +++ b/elf/dl-tunables.list
>>> @@ -121,4 +121,14 @@ glibc {
>>>        default: 3
>>>      }
>>>    }
>>> +
>>> +  mutex {
>>> +	  spin_count {
>>> +		  type: INT_32
>>> +		  minval: 0
>>> +		  maxval: 30000
>>> +		  env_alias: LD_SPIN_COUNT
>>> +		  default: 1000
>>> +	  }
>>> +  }
>>
>> Indentation seems off here, rest the file uses double space while your
>> patch uses tabs.
>>
> 
> Thanks for catching it.
> 
>> Also I am not sure it is worth adding an environment variable for this tunable;
>> I would rather avoid adding new ones (and the naming seems off, since AFAIK
>> LD_* names parameters that affect the loader).
>>
> 
> AFAIK, an environment variable is the only way I can think of to let people tune
> this from the shell; maybe you have a better idea?
> 
> Yes, we could probably use another name like MUTEX_SPIN_COUNT to avoid confusion, agreed?

The tunables framework already provides an environment variable for this [1];
the 'env_alias' is mainly there to provide compatibility and to use the same logic
internally.

So I think it is better to use the default tunable env var.

[1] https://www.gnu.org/software/libc/manual/html_node/Tunables.html

> 
>>>  }
>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>>> index be33c9f..9c6a9f1 100644
>>> --- a/manual/tunables.texi
>>> +++ b/manual/tunables.texi
>>> @@ -281,6 +281,23 @@ of try lock attempts.
>>>  The default value of this tunable is @samp{3}.
>>>  @end deftp
>>>  
>>> +@node Pthread Mutex Tunables
>>> +@section Pthread Mutex Tunables
>>> +@cindex pthread mutex tunables
>>> +
>>> +@deftp {Tunable namespace} glibc.mutex
>>> +Behavior of ptherad mutex can be tuned to acquire performance improvement
>>> +according to specific hardware capablity and workload character by setting
>>> +the following tunables in the @code{mutex} namespace.
>>> +@end deftp
>>
>> There is a typo (s/ptherad/pthread) 
> 
> thanks for catching it.
> 
>> and I think 'acquire' is not the best
>> word here, I would use 'increase' (I am not a native speaker, so someone
>> might suggest a better wording here).
>>
> 
> All right, I will ask for some native speakers to help check it.
> 
>>> +
>>> +@deftp Tunable glibc.mutex.spin_count
>>> +The @code{glibc.mutex.spin_count} tunable set the maximum times the thread
>>> +should spin on the lock before going to sleep.
>>> +
>>> +The default value of this tunable is @samp{1000}.
>>> +@end deftp
>>
>> I think we need to expand it to specify:
>>
>>   * spinning is only used for the mutex initialized with PTHREAD_MUTEX_ADAPTIVE_NP
>>     GNU extension(current wording is ambiguous).
>>>   * It affects both pthread_mutex_lock and pthread_mutex_timedlock.
>>
>>   * The spinning is done by first issuing an atomic operation similar to trylock
>>     followed by an arch-specific no-op (not sure if we need to expand on
>>     how the spinning backoff value is incremented).
>> 	
>       How about this?
>       The spinning stops when either the maximum spin count
>       is reached or the lock is acquired during spinning.

It is better for this specific part.

> 
>>   * Not sure 'sleep' is the right terminology here, since on Linux 'futex' can
>>     return right away when the maximum spin_count is reached.
>>
>     'block' may be more precise here, agreed?
>     E.g. use "before calling into the kernel to block" or something like that.
> 
>>> +
>>>  @node Hardware Capability Tunables
>>>  @section Hardware Capability Tunables
>>>  @cindex hardware capability tunables
>>> diff --git a/nptl/Makefile b/nptl/Makefile
>>> index 94be92c..5edacea 100644
>>> --- a/nptl/Makefile
>>> +++ b/nptl/Makefile
>>> @@ -139,7 +139,8 @@ libpthread-routines = nptl-init vars events version pt-interp \
>>>  		      pthread_mutex_getprioceiling \
>>>  		      pthread_mutex_setprioceiling \
>>>  		      pthread_setname pthread_getname \
>>> -		      pthread_setattr_default_np pthread_getattr_default_np
>>> +		      pthread_setattr_default_np pthread_getattr_default_np \
>>> +		      mutex-conf
>>>  #		      pthread_setuid pthread_seteuid pthread_setreuid \
>>>  #		      pthread_setresuid \
>>>  #		      pthread_setgid pthread_setegid pthread_setregid \
>>
>> I think a better name for the file would be pthread_mutex_conf.c.
>>
> 
> Agree, will send V2 to fix it.
> 
>>> diff --git a/nptl/mutex-conf.c b/nptl/mutex-conf.c
>>> new file mode 100644
>>> index 0000000..f4ffd6d
>>> --- /dev/null
>>> +++ b/nptl/mutex-conf.c
>>> @@ -0,0 +1,78 @@
>>> +/* mutex-conf.c: Pthread mutex tunable parameters.
>>> +   Copyright (C) 2013-2018 Free Software Foundation, Inc.
>>> +   This file is part of the GNU C Library.
>>> +
>>
>> Copyright should start at 2018 for newer implementations.
>>
> 
> I was not aware of that before; do you mean "Copyright (C) 2018-2023"?

No, just "Copyright (C) 2018".

> 
>>> +   The GNU C Library is free software; you can redistribute it and/or
>>> +   modify it under the terms of the GNU Lesser General Public
>>> +   License as published by the Free Software Foundation; either
>>> +   version 2.1 of the License, or (at your option) any later version.
>>> +
>>> +   The GNU C Library is distributed in the hope that it will be useful,
>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> +   Lesser General Public License for more details.
>>> +
>>> +   You should have received a copy of the GNU Lesser General Public
>>> +   License along with the GNU C Library; if not, see
>>> +   <http://www.gnu.org/licenses/>.  */
>>> +
>>> +#include "config.h"
>>> +#include <pthreadP.h>
>>> +#include <init-arch.h>
>>> +#include <mutex-conf.h>
>>> +#include <unistd.h>
>>> +
>>> +#if HAVE_TUNABLES
>>> +# define TUNABLE_NAMESPACE mutex
>>> +#endif
>>> +#include <elf/dl-tunables.h>
>>> +
>>> +
>>> +struct mutex_config __mutex_aconf =
>>> +  {
>>> +	  /* The maximum times a thread spin on the lock before
>>> +	   * going to sleep */
>>> +	  .spin_count = 1000,
>>> +  };
>>> +
>>
>> Indentation seems off here and in other places in this file.
>>
> 
> Will fix it in V2, thanks for pointing it out.
> 
>>> +#if HAVE_TUNABLES
>>> +#define TUNABLE_CALLBACK_FNDECL(__name, __type)			\
>>
>> Indentation for nested preprocessor is to just one space each level:
>>
>> #if HAVE_TUNABLES
>> # define ... 
>>
>>> +static inline void						\
>>> +__always_inline							\
>>> +do_set_mutex_ ## __name (__type value)			\
>>> +{								\
>>> +  __mutex_aconf.__name = value;				\
>>> +}								\
>>> +void								\
>>> +TUNABLE_CALLBACK (set_mutex_ ## __name) (tunable_val_t *valp) \
>>> +{								\
>>> +  __type value = (__type) (valp)->numval;			\
>>> +  do_set_mutex_ ## __name (value);				\
>>> +}
>>> +
>>> +TUNABLE_CALLBACK_FNDECL (spin_count, int32_t);
>>> +#endif
>>> +
>>> +static void
>>> +mutex_tunables_init (int argc __attribute__ ((unused)),
>>> +			      char **argv  __attribute__ ((unused)),
>>> +					      char **environ)
>>> +{
>>> +#if HAVE_TUNABLES
>>> +
>>> +	TUNABLE_GET (spin_count, int32_t,
>>> +			TUNABLE_CALLBACK (set_mutex_spin_count));
>>> +#endif
>>> +}
>>> +
>>> +#ifdef SHARED
>>> +# define INIT_SECTION ".init_array"
>>> +#else
>>> +# define INIT_SECTION ".preinit_array"
>>> +#endif
>>> +
>>> +void (*const __pthread_mutex_tunables_init_array []) (int, char **, char **)
>>> +  __attribute__ ((section (INIT_SECTION), aligned (sizeof (void *)))) =
>>> +{
>>> +  &mutex_tunables_init
>>> +};> diff --git a/nptl/mutex-conf.h b/nptl/mutex-conf.h
>>> new file mode 100644
>>> index 0000000..babefe3
>>> --- /dev/null
>>> +++ b/nptl/mutex-conf.h
>>> @@ -0,0 +1,31 @@
>>> +/* mutex-conf.h: Pthread mutex tunable parameters.
>>> +   Copyright (C) 2013-2018 Free Software Foundation, Inc.
>>
>> Copyright should start at 2018 for newer implementations.
>>
> 
> Sure.
> 
>>> +   This file is part of the GNU C Library.
>>> +
>>> +   The GNU C Library is free software; you can redistribute it and/or
>>> +   modify it under the terms of the GNU Lesser General Public
>>> +   License as published by the Free Software Foundation; either
>>> +   version 2.1 of the License, or (at your option) any later version.
>>> +
>>> +   The GNU C Library is distributed in the hope that it will be useful,
>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> +   Lesser General Public License for more details.
>>> +
>>> +   You should have received a copy of the GNU Lesser General Public
>>> +   License along with the GNU C Library; if not, see
>>> +   <http://www.gnu.org/licenses/>.  */
>>> +#ifndef _MUTEX_CONF_H
>>> +#define _MUTEX_CONF_H 1
>>> +
>>> +#include <pthread.h>
>>> +#include <time.h>
>>> +
>>> +struct mutex_config
>>> +{
>>> +  int spin_count;
>>> +};
>>> +
>>> +extern struct mutex_config __mutex_aconf attribute_hidden;
>>> +
>>> +#endif
>>>
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex
  2018-04-04 17:17   ` Adhemerval Zanella
@ 2018-04-05  1:11     ` Carlos O'Donell
  0 siblings, 0 replies; 14+ messages in thread
From: Carlos O'Donell @ 2018-04-05  1:11 UTC (permalink / raw)
  To: Adhemerval Zanella, kemi
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Glibc alpha, aubrey

On 04/04/2018 12:16 PM, Adhemerval Zanella wrote:
>>> Also I am not sure it is worth adding an environment variable for this tunable;
>>> I would rather avoid adding new ones (and the naming seems off, since AFAIK
>>> LD_* names parameters that affect the loader).
>>>
>>
>> AFAIK, an environment variable is the only way I can think of to let people tune
>> this from the shell; maybe you have a better idea?
>>
>> Yes, we could probably use another name like MUTEX_SPIN_COUNT to avoid confusion, agreed?
> 
> The tunables framework already provides an environment variable for this [1];
> the 'env_alias' is mainly there to provide compatibility and to use the same logic
> internally.
> 
> So I think it is better to use the default tunable env var.
> 
> [1] https://www.gnu.org/software/libc/manual/html_node/Tunables.html

Fully agree. We should never add any new environment variables; they should all be
handled automatically through the top-level tunable environment variable.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] Mutex: Only read while spinning
  2018-03-30  7:17 ` [PATCH 2/3] Mutex: Only read while spinning Kemi Wang
@ 2018-04-05 20:55   ` Adhemerval Zanella
  2018-04-08  8:30     ` kemi
  0 siblings, 1 reply; 14+ messages in thread
From: Adhemerval Zanella @ 2018-04-05 20:55 UTC (permalink / raw)
  To: libc-alpha



On 30/03/2018 04:14, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before going
> to a sleep. While the lock is contended and we need to wait, going straight
> back to LLL_MUTEX_TRYLOCK(cmpxchg) is not a good idea on many targets as
> that will force expensive memory synchronization among processors and
> penalize other running threads. For example, it constantly floods the
> system with "read for ownership" requests, which are much more expensive to
> process than a single read. Thus, we only use MO read until we observe the
> lock to not be acquired anymore, as suggusted by Andi Kleen.
> 
> Test machine:
> 2-sockets Skylake paltform, 112 cores with 62G RAM
> 
> Test case: Contended pthread adaptive spin mutex with global update
> each thread of the workload does:
> a) Lock the mutex (adaptive spin type)
> b) Globle variable increment
> c) Unlock the mutex
> in a loop until timeout, and the main thread reports the total iteration
> number of all the threads in one second.
> 
> This test case is as same as Will-it-scale.pthread_mutex3 except mutex type is
> modified to PTHREAD_MUTEX_ADAPTIVE_NP.
> github: https://github.com/antonblanchard/will-it-scale.git
> 
> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
> 28              962001          2886885(+200.1%)   681965(-29.1%)
> 56              883770          2740755(+210.1%)   364879(-58.7%)
> 112             1150589         2707089(+135.3%)   415261(-63.9%)

In pthread_mutex3 it is basically more updates of a global variable synchronized
with a mutex, so if I am reading the benchmark correctly, a higher value means
less contention. I also assume you used the 'threads' value in this table.

I checked on a 64-core aarch64 machine to see what kind of improvement, if
any, one would get with this change:

nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
1               27566206        28778254 (4.211680)   28778467 (4.212389)
2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
56              4020956         3898386 (-3.144122)   4038505 (0.434542)

So I think this change should be platform-specific.

> 
> Suggested-by: Andi Kleen <andi.kleen@intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 1519c14..c3aca93 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -26,6 +26,7 @@
>  #include <atomic.h>
>  #include <lowlevellock.h>
>  #include <stap-probe.h>
> +#include <mutex-conf.h>
>  
>  #ifndef lll_lock_elision
>  #define lll_lock_elision(lock, try_lock, private)	({ \
> @@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>        if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>  	{
>  	  int cnt = 0;
> -	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
> -			     mutex->__data.__spins * 2 + 10);
> +	  int max_cnt = MIN (__mutex_aconf.spin_count,
> +			mutex->__data.__spins * 2 + 100);
>  	  do
>  	    {
> -	      if (cnt++ >= max_cnt)
> -		{
> -		  LLL_MUTEX_LOCK (mutex);
> -		  break;
> -		}
> -	      atomic_spin_nop ();
> +		if (cnt >= max_cnt)
> +		  {
> +		    LLL_MUTEX_LOCK (mutex);
> +		    break;
> +		  }
> +		/* MO read while spinning */
> +		do
> +		  {
> +		    atomic_spin_nop ();
> +		  }
> +		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
> +			++cnt < max_cnt);
>  	    }
>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>  
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] Mutex: Avoid useless spinning
  2018-03-30  7:17 ` [PATCH 3/3] Mutex: Avoid useless spinning Kemi Wang
@ 2018-04-05 20:59   ` Adhemerval Zanella
  2018-04-08  8:33     ` kemi
  0 siblings, 1 reply; 14+ messages in thread
From: Adhemerval Zanella @ 2018-04-05 20:59 UTC (permalink / raw)
  To: libc-alpha



On 30/03/2018 04:14, Kemi Wang wrote:
> Usually, we can't set too short time out while spinning on the lock, that
> probably makes a thread which is trying to acquire the lock go to sleep
> quickly, thus weakens the benefit of pthread adaptive spin lock.
> 
> However, there is also a problem if we set the time out large in
> case of protecting a small critical section with severe lock contention.
> As we can see the test result in the last patch, the performance is highly
> effected by the spin count tunables, smaller spin count, better performance
> improvement. This is because the thread probably spins on the lock until
> timeout in severe lock contention before going to sleep.
> 
> In this patch, we avoid the useless spin by making the spinner sleep
> if it fails to acquire the lock when the lock is available, as suggested
> by Tim Chen.
> 
> nr_threads    base       COUNT=1000(head~1)   COUNT=1000(head)
> 1           51644585      51323778(-0.6%)	     51378551(-0.5%)
> 2           7914789       9867343(+24.7%)	     11503559(+45.3%)
> 7           1687620       3430504(+103.3%)	     7817383(+363.2%)
> 14          1026555       1843458(+79.6%)	     7360883(+617.0%)
> 28          962001        681965(-29.1%)	     5681945(+490.6%)
> 56          883770        364879(-58.7%)	     3416068(+286.5%)
> 112         1150589       415261(-63.9%)	     3255700(+183.0%)

As before [2], I checked the change on a 64-core aarch64 machine; unlike the
previous patch, this one seems to show improvements:

nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
1               27566206        28776779 (4.206770)  28778073 (4.211078)
2               8498813         9129102 (6.904173)   7042975 (-20.670782)
7               5019434         5832195 (13.935765)  5098511 (1.550982)
14              4379155         6507212 (32.703053)  5200018 (15.785772)
28              4397464         4584480 (4.079329)   4456767 (1.330628)
56              4020956         3534899 (-13.750237) 4096197 (1.836850)

I would suggest squashing both patches into one for version 2.

> 
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  nptl/pthread_mutex_lock.c | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index c3aca93..0faee1a 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -127,22 +127,19 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>  	  int cnt = 0;
>  	  int max_cnt = MIN (__mutex_aconf.spin_count,
>  			mutex->__data.__spins * 2 + 100);
> +
> +     	/* MO read while spinning */
>  	  do
> -	    {
> -		if (cnt >= max_cnt)
> -		  {
> -		    LLL_MUTEX_LOCK (mutex);
> -		    break;
> -		  }
> -		/* MO read while spinning */
> -		do
> -		  {
> -		    atomic_spin_nop ();
> -		  }
> -		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
> +    	{
> +		  atomic_spin_nop ();
> +		}
> +	  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>  			++cnt < max_cnt);
> -	    }
> -	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> +	    /* Try to acquire the lock if lock is available or the spin count
> +	     * is run out, go to sleep if fails
> +	     */
> +	  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
> +		  LLL_MUTEX_LOCK (mutex);
>  
>  	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>  	}
> 

Please fix the format issue here.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] Mutex: Only read while spinning
  2018-04-05 20:55   ` Adhemerval Zanella
@ 2018-04-08  8:30     ` kemi
  2018-04-09 20:52       ` Adhemerval Zanella
  0 siblings, 1 reply; 14+ messages in thread
From: kemi @ 2018-04-08  8:30 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha
  Cc: Andi Kleen, Chen, Tim C, Dave, ying.huang, aaron, aubrey



On 2018年04月06日 04:55, Adhemerval Zanella wrote:
> 
> 
> On 30/03/2018 04:14, Kemi Wang wrote:
>> The pthread adaptive spin mutex spins on the lock for a while before going
>> to a sleep. While the lock is contended and we need to wait, going straight
>> back to LLL_MUTEX_TRYLOCK(cmpxchg) is not a good idea on many targets as
>> that will force expensive memory synchronization among processors and
>> penalize other running threads. For example, it constantly floods the
>> system with "read for ownership" requests, which are much more expensive to
>> process than a single read. Thus, we only use MO read until we observe the
>> lock to not be acquired anymore, as suggusted by Andi Kleen.
>>
>> Test machine:
>> 2-sockets Skylake paltform, 112 cores with 62G RAM
>>
>> Test case: Contended pthread adaptive spin mutex with global update
>> each thread of the workload does:
>> a) Lock the mutex (adaptive spin type)
>> b) Globle variable increment
>> c) Unlock the mutex
>> in a loop until timeout, and the main thread reports the total iteration
>> number of all the threads in one second.
>>
>> This test case is as same as Will-it-scale.pthread_mutex3 except mutex type is
>> modified to PTHREAD_MUTEX_ADAPTIVE_NP.
>> github: https://github.com/antonblanchard/will-it-scale.git
>>
>> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
>> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
>> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
>> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
>> 28              962001          2886885(+200.1%)   681965(-29.1%)
>> 56              883770          2740755(+210.1%)   364879(-58.7%)
>> 112             1150589         2707089(+135.3%)   415261(-63.9%)
> 
> In pthread_mutex3 it is basically more updates of a global variable synchronized
> with a mutex, so if I am reading the benchmark correctly, a higher value means
> less contention. I also assume you used the 'threads' value in this table.
> 

Yes. I used multi-thread mode for testing.

> I checked on a 64-core aarch64 machine to see what kind of improvement, if
> any, one would get with this change:
> 
> nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
> 1               27566206        28778254 (4.211680)   28778467 (4.212389)
> 2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
> 7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
> 14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
> 28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
> 56              4020956         3898386 (-3.144122)   4038505 (0.434542)
> 

Thanks for testing it on an aarch64 machine; that is a great supplement for us.

> So I think this change should be platform-specific.
> 

It may depend on the platform, especially on the cost of cmpxchg, MO reads, and pause instructions.

Well, I think the performance also depends on the spin count. A simple explanation:
with a large number of threads under severe lock contention, spinning almost always
fails (this fits case b) below), so the overhead grows with the spin count: the larger
the spin count, the more overhead. A performance regression with SPIN_COUNT=1000 does
not surprise me, but I am surprised that performance does not improve with the smaller
SPIN_COUNT compared to the larger one.
Did you run "export LD_SPIN_COUNT=10" before the second round of testing?

Analysis:
Let's assume the time spent in the critical section is *c*, and the time spent spinning
on the lock is *s*. Then s = k*c, where *k* is a factor.

a) If *k* < 1, i.e. the spinning time is shorter than the time a task spends in the
critical section, the overhead of acquiring the lock is spin+wait+wake, because the
spin always fails due to timeout (the large-critical-section case).

b) If *k* > 1 and the number of threads *n* >= *k* (small critical section case):
   In our case the threads do nothing but compete for the lock to enter the critical
section in a loop, so we can simply take the arrival rate at the critical section to be
1 for each thread. All the threads start at the same time, so we can assume they all
compete for the lock simultaneously at time frame T(0):

 T(0)   task(0) <-- task(1) <-- task(2)  ...   task(n)     // original status
 |
 T(1)   task(1) <-- task(2)    ...   task(n) <-- task(0)   // after time c, task(0) releases the lock and competes for it again
 .
 .
 .
 T(k)   task(k) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1)

After time k*c (at T(k)), task(k) gets the lock, and task(k+1)...task(n) call
futex_wait() to block because of the timeout, while task(0) has been spinning for time
(k-1)*c, task(1) for time (k-2)*c, ..., and task(k-1) for time c.

 T(k+1)   task(k+1) <-- task(k+2)  ...   task(n) <-- task(0)  ...  task(k-1)
          sleep         sleep

task(k) then takes some time to wake up task(k+1), which spends time c in the critical
section; this repeats again and again until the threads exit. Therefore, the spin count
affects system performance significantly!

c) If *k* > 1 and the number of threads *n* < *k* (small critical section case):
  Theoretically, each thread can get the lock by spinning without ever blocking. No
wait/wake is needed, so only the spin overhead matters.

>>
>> Suggested-by: Andi Kleen <andi.kleen@intel.com>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> ---
>>  nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>> index 1519c14..c3aca93 100644
>> --- a/nptl/pthread_mutex_lock.c
>> +++ b/nptl/pthread_mutex_lock.c
>> @@ -26,6 +26,7 @@
>>  #include <atomic.h>
>>  #include <lowlevellock.h>
>>  #include <stap-probe.h>
>> +#include <mutex-conf.h>
>>  
>>  #ifndef lll_lock_elision
>>  #define lll_lock_elision(lock, try_lock, private)	({ \
>> @@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>        if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>  	{
>>  	  int cnt = 0;
>> -	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>> -			     mutex->__data.__spins * 2 + 10);
>> +	  int max_cnt = MIN (__mutex_aconf.spin_count,
>> +			mutex->__data.__spins * 2 + 100);
>>  	  do
>>  	    {
>> -	      if (cnt++ >= max_cnt)
>> -		{
>> -		  LLL_MUTEX_LOCK (mutex);
>> -		  break;
>> -		}
>> -	      atomic_spin_nop ();
>> +		if (cnt >= max_cnt)
>> +		  {
>> +		    LLL_MUTEX_LOCK (mutex);
>> +		    break;
>> +		  }
>> +		/* MO read while spinning */
>> +		do
>> +		  {
>> +		    atomic_spin_nop ();
>> +		  }
>> +		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>> +			++cnt < max_cnt);
>>  	    }
>>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>>  
>>
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] Mutex: Avoid useless spinning
  2018-04-05 20:59   ` Adhemerval Zanella
@ 2018-04-08  8:33     ` kemi
  0 siblings, 0 replies; 14+ messages in thread
From: kemi @ 2018-04-08  8:33 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha
  Cc: Andi Kleen, Chen, Tim C, Dave, ying.huang, aaron, aubrey



On 2018-04-06 04:59, Adhemerval Zanella wrote:
> 
> 
> On 30/03/2018 04:14, Kemi Wang wrote:
>> Usually, we can't set too short a timeout while spinning on the lock; that
>> would probably make a thread that is trying to acquire the lock go to sleep
>> quickly, weakening the benefit of the pthread adaptive spin lock.
>>
>> However, there is also a problem if we set the timeout too large when
>> protecting a small critical section with severe lock contention.
>> As the test results in the last patch show, the performance is highly
>> affected by the spin count tunable: the smaller the spin count, the better
>> the performance improvement. This is because, under severe lock contention,
>> the thread probably spins on the lock until timeout before going to sleep.
>>
>> In this patch, we avoid useless spinning by making the spinner sleep
>> if it fails to acquire the lock when the lock is available, as suggested
>> by Tim Chen.
>>
>> nr_threads    base       COUNT=1000(head~1)   COUNT=1000(head)
>> 1           51644585      51323778(-0.6%)	     51378551(-0.5%)
>> 2           7914789       9867343(+24.7%)	     11503559(+45.3%)
>> 7           1687620       3430504(+103.3%)	     7817383(+363.2%)
>> 14          1026555       1843458(+79.6%)	     7360883(+617.0%)
>> 28          962001        681965(-29.1%)	     5681945(+490.6%)
>> 56          883770        364879(-58.7%)	     3416068(+286.5%)
>> 112         1150589       415261(-63.9%)	     3255700(+183.0%)
> 
> As before [2], I checked the change on a 64 cores aarch64 machine, but
> differently than previous patch this one seems to show improvements:
> 
> nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
> 1               27566206        28776779 (4.206770)  28778073 (4.211078)
> 2               8498813         9129102 (6.904173)   7042975 (-20.670782)
> 7               5019434         5832195 (13.935765)  5098511 (1.550982)
> 14              4379155         6507212 (32.703053)  5200018 (15.785772)
> 28              4397464         4584480 (4.079329)   4456767 (1.330628)
> 56              4020956         3534899 (-13.750237) 4096197 (1.836850)
> 
> I would suggest you to squash both patch in only one for version 2.
> 

OK, the separation here was meant to make review easier.

>>
>> Suggested-by: Tim Chen <tim.c.chen@intel.com>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> ---
>>  nptl/pthread_mutex_lock.c | 25 +++++++++++--------------
>>  1 file changed, 11 insertions(+), 14 deletions(-)
>>
>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>> index c3aca93..0faee1a 100644
>> --- a/nptl/pthread_mutex_lock.c
>> +++ b/nptl/pthread_mutex_lock.c
>> @@ -127,22 +127,19 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>  	  int cnt = 0;
>>  	  int max_cnt = MIN (__mutex_aconf.spin_count,
>>  			mutex->__data.__spins * 2 + 100);
>> +
>> +     	/* MO read while spinning */
>>  	  do
>> -	    {
>> -		if (cnt >= max_cnt)
>> -		  {
>> -		    LLL_MUTEX_LOCK (mutex);
>> -		    break;
>> -		  }
>> -		/* MO read while spinning */
>> -		do
>> -		  {
>> -		    atomic_spin_nop ();
>> -		  }
>> -		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>> +    	{
>> +		  atomic_spin_nop ();
>> +		}
>> +	  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>  			++cnt < max_cnt);
>> -	    }
>> -	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> +	    /* Try to acquire the lock if lock is available or the spin count
>> +	     * is run out, go to sleep if fails
>> +	     */
>> +	  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>> +		  LLL_MUTEX_LOCK (mutex);
>>  
>>  	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>  	}
>>
> 
> Please fix the format issue here.
> 


* Re: [PATCH 2/3] Mutex: Only read while spinning
  2018-04-08  8:30     ` kemi
@ 2018-04-09 20:52       ` Adhemerval Zanella
  2018-04-10  1:49         ` kemi
  0 siblings, 1 reply; 14+ messages in thread
From: Adhemerval Zanella @ 2018-04-09 20:52 UTC (permalink / raw)
  To: kemi, libc-alpha; +Cc: Andi Kleen, Chen, Tim C, Dave, ying.huang, aaron, aubrey



On 08/04/2018 05:28, kemi wrote:
> 
> 
>> On 2018-04-06 04:55, Adhemerval Zanella wrote:
>>
>>
>> On 30/03/2018 04:14, Kemi Wang wrote:
>>> The pthread adaptive spin mutex spins on the lock for a while before going
>>> to a sleep. While the lock is contended and we need to wait, going straight
>>> back to LLL_MUTEX_TRYLOCK(cmpxchg) is not a good idea on many targets as
>>> that will force expensive memory synchronization among processors and
>>> penalize other running threads. For example, it constantly floods the
>>> system with "read for ownership" requests, which are much more expensive to
>>> process than a single read. Thus, we only use MO read until we observe the
>>> lock not being acquired anymore, as suggested by Andi Kleen.
>>>
>>> Test machine:
>>> 2-socket Skylake platform, 112 cores with 62G RAM
>>>
>>> Test case: Contended pthread adaptive spin mutex with global update
>>> each thread of the workload does:
>>> a) Lock the mutex (adaptive spin type)
>>> b) Global variable increment
>>> c) Unlock the mutex
>>> in a loop until timeout, and the main thread reports the total iteration
>>> number of all the threads in one second.
>>>
>>> This test case is the same as Will-it-scale.pthread_mutex3 except that the mutex type is
>>> modified to PTHREAD_MUTEX_ADAPTIVE_NP.
>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>
>>> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>>> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
>>> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
>>> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
>>> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
>>> 28              962001          2886885(+200.1%)   681965(-29.1%)
>>> 56              883770          2740755(+210.1%)   364879(-58.7%)
>>> 112             1150589         2707089(+135.3%)   415261(-63.9%)
>>
>> In pthread_mutex3 it is basically more updates in a global variable synchronized
>> with a mutex, so if I am reading correct the benchmark, a higher value means
>> less contention. I also assume you use the 'threads' value in this table.
>>
> 
> Yes. I used multi-threads mode for testing.
> 
>> I checked on a 64 cores aarch64 machine to see what kind of improvement, if
>> any; one would get with this change:
>>
>> nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
>> 1               27566206        28778254 (4.211680)   28778467 (4.212389)
>> 2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
>> 7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
>> 14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
>> 28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
>> 56              4020956         3898386 (-3.144122)   4038505 (0.434542)
>>
> 
> Thanks for taking the time to test it on an aarch64 machine. That's a great supplement for us.
> 
>> So I think this change should be platform-specific.
>>
> 
> It may depend on platform, especially for cmpxchg, MO read and pause instructions.

Yes, that's why for this patch in isolation it would be better to make the adaptive algorithm
more platform-specific.

> 
> Well, I think the performance also depends on the spin count. A simple explanation: under severe
> lock contention with a large number of threads (case b) below), spinning always fails, so the
> overhead grows with the spin count: the larger the spin count, the more overhead.
> It does not surprise me to see a performance regression with SPIN_COUNT=1000, but I am surprised
> that the performance does not improve with the smaller SPIN_COUNT compared to the larger one.
> Did you run "export LD_SPIN_COUNT=10" before the second round of tests?

I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
current patch set version is equivalent).  I redid the numbers to check for any
mishandling on my part, but the numbers still do not look ok for setting this as the
default (even more so since 1000 is the default value for spin_count).

nr_threads  base      head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
1           27563021  28775107 (4.212273)   28775127 (4.212339)
2           8261443   9044577 (8.658603)    7792347 (-6.019958)
7           4697529   4623052 (-1.610992)   3241736 (-44.907821)
14          4951300   5348550 (7.427247)    2854946 (-73.428849)
28          4296920   4551240 (5.587928)    3231684 (-32.962257)
56          4182013   3420937 (-22.247589)  3769944 (-10.930375)
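For reference, the tunable is passed through the environment to the process being
measured; a usage sketch (the benchmark binary name is illustrative, and unrecognized
tunables are ignored by a glibc without this patch set):

```shell
# Request a spin count of 10 for PTHREAD_MUTEX_ADAPTIVE_NP mutexes
# in the child process via the glibc.mutex.spin_count tunable.
GLIBC_TUNABLES="glibc.mutex.spin_count=10" ./pthread_mutex3
```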


> 
> Analysis:
> let's assume the time for critical section is *c*, and the time for spinning the lock is *s*.
> Then s = k*c, *k* is factor.  
> 
> a) If *k* < 1, which means the spinning time is less than the time for the task to consume 
> in the critical section, it is easy to understand that the overhead for acquiring the lock 
> equals to spin+wait+wake because spin always fails due to time out (the case of large critical section). 
> 
> b) If *k* > 1 && threads number *n* >= *k* (small critical section case).
>    In our case, the threads do nothing but compete for the lock to enter critical section in a loop, so we
> can simply think that the arrival rate of critical section for each thread is 1. And, all the threads start
> at the same time, we can assume all the threads competes for the lock simultaneously at T(0) time frame
> 
>  T(0)   task(0) <-- task(1) <-- task(2)  ...   task(n)     // original status
>  |
>  T(1)   task(1) <-- task(2)    ...   task(n) <-- task(0)   // after time c, task 0 release the lock and compete to lock again
>  .
>  .
>  .
>  T(k)   task(k) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1)  
> 
> after k*c time(Time T(k)), from task(k) would get the lock and task(k+1)...task(n) would call 
> futex_wait() to block due to timeout, and task(0) has been spinning for time (k-1)*c, task(1)
> for time (k-2)*c,... task(k-1) for time c.
> 
>  T(k+1)   task(k+1) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1) 
>           sleep         sleep   
> 
>  task(k) would take some time to wakeup task(k+1) which takes c time in critical section, this would repeat again and again
> until threads exist. Therefore, the spin count effects the system performance largely!
> 
> c) If *k* > 1 && threads number *n* < *k* (small critical section case)
>   Theoretically, each thread can get the lock without going to block via spinning. No need wait/wake, so we can
> only consider spin overhead.

I do agree that the spin count plays a role here, but on the machine
I am testing, I think what is actually happening is that this patch adds an extra
branch in the critical spin loop, and that hurts performance:

Current spin loop version:
---
  do
    { 
      if (cnt++ >= max_cnt)
        { 
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      atomic_spin_nop ();
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
---

This patch changes to:
---
  do
    {   
      if (cnt >= max_cnt)
        {
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      do { 
        atomic_spin_nop ();
      }
      while (atomic_load_relaxed (&mutex->__data.__lock) != 0
             && ++cnt < max_cnt);
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
---

While 3 patch in set changes to:
---
  do
    {
      atomic_spin_nop ();
     }
  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
         ++cnt < max_cnt);

  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
    LLL_MUTEX_LOCK (mutex);
---

That's why I suggested squashing the second and third patches into a single one,
aimed at optimizing the adaptive spinning algorithm, for the next version.

Also, it would be better if you could provide a benchmark that stresses this point
without resorting to an external one.

> 
>>>
>>> Suggested-by: Andi Kleen <andi.kleen@intel.com>
>>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>>> ---
>>>  nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
>>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>> index 1519c14..c3aca93 100644
>>> --- a/nptl/pthread_mutex_lock.c
>>> +++ b/nptl/pthread_mutex_lock.c
>>> @@ -26,6 +26,7 @@
>>>  #include <atomic.h>
>>>  #include <lowlevellock.h>
>>>  #include <stap-probe.h>
>>> +#include <mutex-conf.h>
>>>  
>>>  #ifndef lll_lock_elision
>>>  #define lll_lock_elision(lock, try_lock, private)	({ \
>>> @@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>        if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>  	{
>>>  	  int cnt = 0;
>>> -	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>>> -			     mutex->__data.__spins * 2 + 10);
>>> +	  int max_cnt = MIN (__mutex_aconf.spin_count,
>>> +			mutex->__data.__spins * 2 + 100);
>>>  	  do
>>>  	    {
>>> -	      if (cnt++ >= max_cnt)
>>> -		{
>>> -		  LLL_MUTEX_LOCK (mutex);
>>> -		  break;
>>> -		}
>>> -	      atomic_spin_nop ();
>>> +		if (cnt >= max_cnt)
>>> +		  {
>>> +		    LLL_MUTEX_LOCK (mutex);
>>> +		    break;
>>> +		  }
>>> +		/* MO read while spinning */
>>> +		do
>>> +		  {
>>> +		    atomic_spin_nop ();
>>> +		  }
>>> +		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>> +			++cnt < max_cnt);
>>>  	    }
>>>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>>>  
>>>
>>
>>


* Re: [PATCH 2/3] Mutex: Only read while spinning
  2018-04-09 20:52       ` Adhemerval Zanella
@ 2018-04-10  1:49         ` kemi
  2018-04-11 13:28           ` Adhemerval Zanella
  0 siblings, 1 reply; 14+ messages in thread
From: kemi @ 2018-04-10  1:49 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha
  Cc: Andi Kleen, Chen, Tim C, Dave, ying.huang, aaron, aubrey



On 2018-04-10 04:52, Adhemerval Zanella wrote:
> 
> 
> On 08/04/2018 05:28, kemi wrote:
>>
>>
>> On 2018-04-06 04:55, Adhemerval Zanella wrote:
>>>
>>>
>>> On 30/03/2018 04:14, Kemi Wang wrote:
>>>> The pthread adaptive spin mutex spins on the lock for a while before going
>>>> to a sleep. While the lock is contended and we need to wait, going straight
>>>> back to LLL_MUTEX_TRYLOCK(cmpxchg) is not a good idea on many targets as
>>>> that will force expensive memory synchronization among processors and
>>>> penalize other running threads. For example, it constantly floods the
>>>> system with "read for ownership" requests, which are much more expensive to
>>>> process than a single read. Thus, we only use MO read until we observe the
>>>> lock to not be acquired anymore, as suggusted by Andi Kleen.
>>>>
>>>> Test machine:
>>>> 2-sockets Skylake paltform, 112 cores with 62G RAM
>>>>
>>>> Test case: Contended pthread adaptive spin mutex with global update
>>>> each thread of the workload does:
>>>> a) Lock the mutex (adaptive spin type)
>>>> b) Globle variable increment
>>>> c) Unlock the mutex
>>>> in a loop until timeout, and the main thread reports the total iteration
>>>> number of all the threads in one second.
>>>>
>>>> This test case is as same as Will-it-scale.pthread_mutex3 except mutex type is
>>>> modified to PTHREAD_MUTEX_ADAPTIVE_NP.
>>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>>
>>>> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>>>> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
>>>> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
>>>> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
>>>> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
>>>> 28              962001          2886885(+200.1%)   681965(-29.1%)
>>>> 56              883770          2740755(+210.1%)   364879(-58.7%)
>>>> 112             1150589         2707089(+135.3%)   415261(-63.9%)
>>>
>>> In pthread_mutex3 it is basically more updates in a global variable synchronized
>>> with a mutex, so if I am reading correct the benchmark, a higher value means
>>> less contention. I also assume you use the 'threads' value in this table.
>>>
>>
>> Yes. I used multi-threads mode for testing.
>>
>>> I checked on a 64 cores aarch64 machine to see what kind of improvement, if
>>> any; one would get with this change:
>>>
>>> nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
>>> 1               27566206        28778254 (4.211680)   28778467 (4.212389)
>>> 2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
>>> 7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
>>> 14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
>>> 28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
>>> 56              4020956         3898386 (-3.144122)   4038505 (0.434542)
>>>
>>
>> Thanks for your job to test it on aarch64 machine. That's great supplement for us.
>>
>>> So I think this change should be platform-specific.
>>>
>>
>> It may depend on platform, especially for cmpxchg, MO read and pause instructions.
> 
> Yes, that's why for this patch isolated it would be better to make the adaptive algorithm
> more platform specific.
> 
>>
>> Well, I think the performance also depends on spin count. A simple explanation is that:
>> spin always fails in severe lock contention with large thread numbers (fit case b) below),
>> thus, the overhead increases due to large spin count, larger spin count, more overhead.
>> It does not surprise me to see performance regression with SPIN_COUNT=1000, but I am surprised 
>> that the performance does not increase with smaller SPIN_COUNT compare to larger one. 
>> Did you run "export LD_SPIN_COUNT=10" before the second round test?
> 
> I used the preferred way GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
> current patch set version is the same).  I redid the number to check for some
> mishandled from my part, but still the number are somewhat not ok to set as
> default (even more that 1000 is the default value for spin_counts).
> 
> nr_threads	base		head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
> 1		27563021		28775107 (4.212273)		28775127 (4.212339)
> 2		8261443		9044577 (8.658603)		7792347 (-6.019958)
> 7		4697529		4623052 (-1.610992)		3241736 (-44.907821)
> 14		4951300		5348550 (7.427247)		2854946 (-73.428849)
> 28		4296920		4551240 (5.587928)		3231684 (-32.962257)
> 56		4182013		3420937 (-22.247589)		3769944 (-10.930375)
> 
> 
>>
>> Analysis:
>> let's assume the time for critical section is *c*, and the time for spinning the lock is *s*.
>> Then s = k*c, *k* is factor.  
>>
>> a) If *k* < 1, which means the spinning time is less than the time for the task to consume 
>> in the critical section, it is easy to understand that the overhead for acquiring the lock 
>> equals to spin+wait+wake because spin always fails due to time out (the case of large critical section). 
>>
>> b) If *k* > 1 && threads number *n* >= *k* (small critical section case).
>>    In our case, the threads do nothing but compete for the lock to enter critical section in a loop, so we
>> can simply think that the arrival rate of critical section for each thread is 1. And, all the threads start
>> at the same time, we can assume all the threads competes for the lock simultaneously at T(0) time frame
>>
>>  T(0)   task(0) <-- task(1) <-- task(2)  ...   task(n)     // original status
>>  |
>>  T(1)   task(1) <-- task(2)    ...   task(n) <-- task(0)   // after time c, task 0 release the lock and compete to lock again
>>  .
>>  .
>>  .
>>  T(k)   task(k) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1)  
>>
>> after k*c time(Time T(k)), from task(k) would get the lock and task(k+1)...task(n) would call 
>> futex_wait() to block due to timeout, and task(0) has been spinning for time (k-1)*c, task(1)
>> for time (k-2)*c,... task(k-1) for time c.
>>
>>  T(k+1)   task(k+1) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1) 
>>           sleep         sleep   
>>
>>  task(k) would take some time to wakeup task(k+1) which takes c time in critical section, this would repeat again and again
>> until threads exist. Therefore, the spin count effects the system performance largely!
>>
>> c) If *k* > 1 && threads number *n* < *k* (small critical section case)
>>   Theoretically, each thread can get the lock without going to block via spinning. No need wait/wake, so we can
>> only consider spin overhead.
> 
> I do agree that the sping count does play a role here, but for the machine
> I am testing I think what is actually happening is this patch is adding more
> branch on critical loop section, and does hurt performance:
> 
> Current spin loop version:
> ---
>   do
>     { 
>       if (cnt++ >= max_cnt)
>         { 
>           LLL_MUTEX_LOCK (mutex);
>           break;
>         }
>       atomic_spin_nop ();
>     }
>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> ---
> 
> This patch changes to:
> ---
>   do
>     {   
>       if (cnt >= max_cnt)
>         {
>           LLL_MUTEX_LOCK (mutex);
>           break;
>         }
>       do { 
>         atomic_spin_nop ();
>       }
>       while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>              && ++cnt < max_cnt);
>     }
>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> ---
> 
> While 3 patch in set changes to:
> ---
>   do
>     {
>       atomic_spin_nop ();
>      }
>   while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>          ++cnt < max_cnt);
> 
>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>     LLL_MUTEX_LOCK (mutex);
> ---
> 
> That's why I suggested for next version to squash second and third patch on only
> one aimed to optimize the adapting spinning algorithm. 
> 

Agreed, I will combine the 2nd and 3rd patches in V2.

> Also, if you could provide a benchmark to stress this point without resorting in
> an external one it would be better.
> 

I will consider doing that. Preliminarily, we can provide several test cases, each with a
different critical-section size, that can be run in multi-threaded mode.
What do you think?

>>
>>>>
>>>> Suggested-by: Andi Kleen <andi.kleen@intel.com>
>>>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>>>> ---
>>>>  nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
>>>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>>> index 1519c14..c3aca93 100644
>>>> --- a/nptl/pthread_mutex_lock.c
>>>> +++ b/nptl/pthread_mutex_lock.c
>>>> @@ -26,6 +26,7 @@
>>>>  #include <atomic.h>
>>>>  #include <lowlevellock.h>
>>>>  #include <stap-probe.h>
>>>> +#include <mutex-conf.h>
>>>>  
>>>>  #ifndef lll_lock_elision
>>>>  #define lll_lock_elision(lock, try_lock, private)	({ \
>>>> @@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>>        if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>>  	{
>>>>  	  int cnt = 0;
>>>> -	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>>>> -			     mutex->__data.__spins * 2 + 10);
>>>> +	  int max_cnt = MIN (__mutex_aconf.spin_count,
>>>> +			mutex->__data.__spins * 2 + 100);
>>>>  	  do
>>>>  	    {
>>>> -	      if (cnt++ >= max_cnt)
>>>> -		{
>>>> -		  LLL_MUTEX_LOCK (mutex);
>>>> -		  break;
>>>> -		}
>>>> -	      atomic_spin_nop ();
>>>> +		if (cnt >= max_cnt)
>>>> +		  {
>>>> +		    LLL_MUTEX_LOCK (mutex);
>>>> +		    break;
>>>> +		  }
>>>> +		/* MO read while spinning */
>>>> +		do
>>>> +		  {
>>>> +		    atomic_spin_nop ();
>>>> +		  }
>>>> +		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>>> +			++cnt < max_cnt);
>>>>  	    }
>>>>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>>>>  
>>>>
>>>
>>>


* Re: [PATCH 2/3] Mutex: Only read while spinning
  2018-04-10  1:49         ` kemi
@ 2018-04-11 13:28           ` Adhemerval Zanella
  0 siblings, 0 replies; 14+ messages in thread
From: Adhemerval Zanella @ 2018-04-11 13:28 UTC (permalink / raw)
  To: kemi, libc-alpha; +Cc: Andi Kleen, Chen, Tim C, Dave, ying.huang, aaron, aubrey



On 09/04/2018 22:46, kemi wrote:
> 
> 
>> On 2018-04-10 04:52, Adhemerval Zanella wrote:
>>
>>
>> On 08/04/2018 05:28, kemi wrote:
>>>
>>>
>>> On 2018-04-06 04:55, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 30/03/2018 04:14, Kemi Wang wrote:
>>>>> The pthread adaptive spin mutex spins on the lock for a while before going
>>>>> to a sleep. While the lock is contended and we need to wait, going straight
>>>>> back to LLL_MUTEX_TRYLOCK(cmpxchg) is not a good idea on many targets as
>>>>> that will force expensive memory synchronization among processors and
>>>>> penalize other running threads. For example, it constantly floods the
>>>>> system with "read for ownership" requests, which are much more expensive to
>>>>> process than a single read. Thus, we only use MO read until we observe the
>>>>> lock to not be acquired anymore, as suggusted by Andi Kleen.
>>>>>
>>>>> Test machine:
>>>>> 2-sockets Skylake paltform, 112 cores with 62G RAM
>>>>>
>>>>> Test case: Contended pthread adaptive spin mutex with global update
>>>>> each thread of the workload does:
>>>>> a) Lock the mutex (adaptive spin type)
>>>>> b) Globle variable increment
>>>>> c) Unlock the mutex
>>>>> in a loop until timeout, and the main thread reports the total iteration
>>>>> number of all the threads in one second.
>>>>>
>>>>> This test case is as same as Will-it-scale.pthread_mutex3 except mutex type is
>>>>> modified to PTHREAD_MUTEX_ADAPTIVE_NP.
>>>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>>>
>>>>> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>>>>> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
>>>>> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
>>>>> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
>>>>> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
>>>>> 28              962001          2886885(+200.1%)   681965(-29.1%)
>>>>> 56              883770          2740755(+210.1%)   364879(-58.7%)
>>>>> 112             1150589         2707089(+135.3%)   415261(-63.9%)
>>>>
>>>> In pthread_mutex3 it is basically more updates in a global variable synchronized
>>>> with a mutex, so if I am reading correct the benchmark, a higher value means
>>>> less contention. I also assume you use the 'threads' value in this table.
>>>>
>>>
>>> Yes. I used multi-threads mode for testing.
>>>
>>>> I checked on a 64 cores aarch64 machine to see what kind of improvement, if
>>>> any; one would get with this change:
>>>>
>>>> nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
>>>> 1               27566206        28778254 (4.211680)   28778467 (4.212389)
>>>> 2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
>>>> 7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
>>>> 14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
>>>> 28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
>>>> 56              4020956         3898386 (-3.144122)   4038505 (0.434542)
>>>>
>>>
>>> Thanks for your job to test it on aarch64 machine. That's great supplement for us.
>>>
>>>> So I think this change should be platform-specific.
>>>>
>>>
>>> It may depend on platform, especially for cmpxchg, MO read and pause instructions.
>>
>> Yes, that's why for this patch isolated it would be better to make the adaptive algorithm
>> more platform specific.
>>
>>>
>>> Well, I think the performance also depends on spin count. A simple explanation is that:
>>> spin always fails in severe lock contention with large thread numbers (fit case b) below),
>>> thus, the overhead increases due to large spin count, larger spin count, more overhead.
>>> It does not surprise me to see performance regression with SPIN_COUNT=1000, but I am surprised 
>>> that the performance does not increase with smaller SPIN_COUNT compare to larger one. 
>>> Did you run "export LD_SPIN_COUNT=10" before the second round test?
>>
>> I used the preferred way GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
>> current patch set version is the same).  I redid the number to check for some
>> mishandled from my part, but still the number are somewhat not ok to set as
>> default (even more that 1000 is the default value for spin_counts).
>>
>> nr_threads	base		head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>> 1		27563021		28775107 (4.212273)		28775127 (4.212339)
>> 2		8261443		9044577 (8.658603)		7792347 (-6.019958)
>> 7		4697529		4623052 (-1.610992)		3241736 (-44.907821)
>> 14		4951300		5348550 (7.427247)		2854946 (-73.428849)
>> 28		4296920		4551240 (5.587928)		3231684 (-32.962257)
>> 56		4182013		3420937 (-22.247589)		3769944 (-10.930375)
>>
>>
>>>
>>> Analysis:
>>> let's assume the time for critical section is *c*, and the time for spinning the lock is *s*.
>>> Then s = k*c, *k* is factor.  
>>>
>>> a) If *k* < 1, which means the spinning time is less than the time for the task to consume 
>>> in the critical section, it is easy to understand that the overhead for acquiring the lock 
>>> equals to spin+wait+wake because spin always fails due to time out (the case of large critical section). 
>>>
>>> b) If *k* > 1 && threads number *n* >= *k* (small critical section case).
>>>    In our case, the threads do nothing but compete for the lock to enter critical section in a loop, so we
>>> can simply think that the arrival rate of critical section for each thread is 1. And, all the threads start
>>> at the same time, we can assume all the threads competes for the lock simultaneously at T(0) time frame
>>>
>>>  T(0)   task(0) <-- task(1) <-- task(2)  ...   task(n)     // original status
>>>  |
>>>  T(1)   task(1) <-- task(2)    ...   task(n) <-- task(0)   // after time c, task 0 release the lock and compete to lock again
>>>  .
>>>  .
>>>  .
>>>  T(k)   task(k) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1)  
>>>
>>> after k*c time(Time T(k)), from task(k) would get the lock and task(k+1)...task(n) would call 
>>> futex_wait() to block due to timeout, and task(0) has been spinning for time (k-1)*c, task(1)
>>> for time (k-2)*c,... task(k-1) for time c.
>>>
>>>  T(k+1)   task(k+1) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1) 
>>>           sleep         sleep   
>>>
>>>  task(k) would take some time to wakeup task(k+1) which takes c time in critical section, this would repeat again and again
>>> until threads exist. Therefore, the spin count effects the system performance largely!
>>>
>>> c) If *k* > 1 && threads number *n* < *k* (small critical section case)
>>>   Theoretically, each thread can get the lock without going to block via spinning. No need wait/wake, so we can
>>> only consider spin overhead.
>>
>> I do agree that the spin count does play a role here, but for the machine
>> I am testing on, I think what is actually happening is that this patch adds more
>> branches to the critical loop section, and that hurts performance:
>>
>> Current spin loop version:
>> ---
>>   do
>>     { 
>>       if (cnt++ >= max_cnt)
>>         { 
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       atomic_spin_nop ();
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> This patch changes to:
>> ---
>>   do
>>     {   
>>       if (cnt >= max_cnt)
>>         {
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       do { 
>>         atomic_spin_nop ();
>>       }
>>       while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>>              && ++cnt < max_cnt);
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> While the third patch in the set changes it to:
>> ---
>>   do
>>     {
>>       atomic_spin_nop ();
>>      }
>>   while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>          ++cnt < max_cnt);
>>
>>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>     LLL_MUTEX_LOCK (mutex);
>> ---
>>
>> That's why I suggested that for the next version you squash the second and third patches into
>> a single one aimed at optimizing the adaptive spinning algorithm.
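The read-while-spin idea of the third patch can be sketched as a standalone helper with C11 atomics. `spin_then_trylock` is a hypothetical name for illustration, not the glibc-internal code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the read-while-spin technique (illustrative, not the actual
   glibc internals): spin on cheap relaxed loads, which do not bounce the
   cache line between cores the way cmpxchg does, and issue the expensive
   atomic operation only once the lock looks free or the spin budget is
   exhausted.  */
static bool spin_then_trylock (atomic_int *lock, int max_cnt)
{
  int cnt = 0;

  /* Read-only spin; a real implementation would also put a pause hint
     (like atomic_spin_nop) in the loop body.  */
  while (atomic_load_explicit (lock, memory_order_relaxed) != 0
         && ++cnt < max_cnt)
    ;

  /* One expensive attempt: a 0 -> 1 exchange means we took the lock.  */
  return atomic_exchange_explicit (lock, 1, memory_order_acquire) == 0;
}
```

Single-threaded use is deterministic: the first call on a free lock succeeds, and a second call spins out its budget and fails.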
>>
> 
> Agree, I will combine the 2nd and 3rd patches in V2.
> 
>> Also, if you could provide a benchmark to stress this point without resorting to
>> an external one, that would be better.
>>
> 
> I will consider doing that. Preliminarily, we can provide several test cases, each of which has
> a different critical section size and can be run in multi-threaded mode.
> What do you think?

I would use the pthread_mutex3/threads benchmark from the project you used as a base; it is
simple enough and gives us a directly comparable metric. I would run it with
thread counts of 1, nproc/2, and nproc.

Thread overview: 14+ messages
2018-03-30  7:17 [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Kemi Wang
2018-03-30  7:17 ` [PATCH 2/3] Mutex: Only read while spinning Kemi Wang
2018-04-05 20:55   ` Adhemerval Zanella
2018-04-08  8:30     ` kemi
2018-04-09 20:52       ` Adhemerval Zanella
2018-04-10  1:49         ` kemi
2018-04-11 13:28           ` Adhemerval Zanella
2018-03-30  7:17 ` [PATCH 3/3] Mutex: Avoid useless spinning Kemi Wang
2018-04-05 20:59   ` Adhemerval Zanella
2018-04-08  8:33     ` kemi
2018-04-02 15:19 ` [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Adhemerval Zanella
2018-04-04 10:27 ` kemi
2018-04-04 17:17   ` Adhemerval Zanella
2018-04-05  1:11     ` Carlos O'Donell