[PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
@ 2018-07-13  6:55 Kemi Wang
  2018-07-13  6:55 ` [PATCH v2 2/5] MCS lock: Enhance legacy MCS lock algorithm Kemi Wang
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Kemi Wang @ 2018-07-13  6:55 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

The pthread adaptive mutex is designed based-on the truth that the lock
probably would be released after a few of CPU cycles in most cases.
Especially for the case: the applications code is well-behaved with a short
critical section that is highly contended. Thus, spinning on the lock for
a while instead of calling into the kernel to block immediately can help
to improve the system performance.

But there are two main problems in current adaptive mutex. The first one is
fairness, multiple spinners contend for the lock simultaneously and there
is no any guarantee for a spinner to acquire the lock no matter how long it
has been waiting for. The other is the heavy cache line bouncing. Since the
cache line including the lock is shared among all of the spinners, when
lock is released, each spinner will try to acquire the lock via cmpxchg
instruction which constantly floods the system via "read-for-ownership"
requests. As a result, there will be a lot of cache line bouncing in a
large system with a lots of CPUs.

To address the problems mentioned above, the idea for queuing mutex
spinners with MCS lock[1] referred to the implementation of mutex[2] in
kernel is proposed and a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP is
introduced. Compare to adaptive mutex (read only while spinning), the test
result on Intel 2 sockets Skylake platform has showns significant
performance improvement (See the first patch for details).

Though, queuing spinner with mcs lock can help to improve the performance
of adaptive mutex when multiple threads contending for a lock, people
probably want to know how queue spinner mutex performs when compare to
other lock discipline widely knowns as pthread spin lock and pthread mutex.
We can see the details of the test result in the first patch. Simply, some
conclusion is summarized as below:
a) In case of little lock contention, spin lock performs best, queue
spinner mutex performs similar to adaptive mutex, and both perform a
little better than pthread mutex.
b) In the case of severe lock contention with large number of CPUs when
protecting a small critical section (less than 1000ns). Most of lock
acquisition is got via spinning. Queue spinner mutex.
performs much better than spin lock and adaptive mutex. This is because the
overhead of heavy cache line bouncing plays a big role on lock performance.
c) With the increase of the size of a critical section, the advantage of
queue spinner mutex on performance in reduced gradually. This is because
the overhead of cache line bouncing will not become the bottleneck of lock
performance, instead, the overhead of futex_wait and futex_wake
plays a big role. When the critical section is increased to 1ms, even the
latency of futex syscall would be ignorable compared to the total time of
lock acquisition.

As we can see above, queue spinner mutex performs well in kinds of workload,
but there would be a potential risk to use this type of mutex. When the
lock holder is transformed to the next spinner in the queue, but it is not
running (the CPU is scheduled to run other task). Thus, the other spinners
have to wait in the queue, this would probably collapse the lock performance.
To emulate this case, we run two same processes simultaneously, the
process has 28 threads each of which sets CPU affinity to an individual CPU
according to the thread id. Thus, CPU [0~27] are subscribed by two threads.
In the worst case (s=1000ns, t=6000ns), the lock performance is reduced by
58.1% (2205245->924263).
Therefore, queue spinner mutex would be carefully used for applications to
pursue fairness and performance without oversubscribing CPU resource. E.g.
Containers in public cloud infrastructure.

To overcome the disadvantage of lock performance collapse in the legacy MCS
lock when CPUs are oversubscribed, we proposed an enhanced MCS lock
algorithm in the second patch which allows a spinner to spin on mutex lock
without having to be waken up by its previous spinner. In such case, this
new mutex type would be widely and safely used in kinds of usage scenarios.

ChangeLog:
   V1->V2
   a) Propose an enhanced MCS lock algrithm to avoid potential lock
   performance collapse.
   b) Add a link of the paper which introduces original MCS lock.
   c) Add the commit id for mutex implementation with MCS lock in kernel.

[1] Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for scalable
synchronization on shared-memory multiprocessors." ACM Transactions on
Computer Systems (TOCS) 9.1 (1991): 21-65.
https://www.cos.ufrj.br/~ines/courses/progpar01/papers/p21-mellor-
crummey.pdf
[2] Commit id for mutex with MCS lock in kernel:
2bd2c92cf07cc4a373bf316c75b78ac465fefd35

Kemi Wang (5):
  Mutex: Queue spinners to reduce cache line bouncing and ensure
    fairness
  MCS lock: Enhance legacy MCS lock algorithm
  Mutex: add unit tests for new type
  BENCHMARK: add a benchmark for testing new type of mutex
  Manual: Add manual for pthread mutex

 benchtests/Makefile                          |  4 +-
 benchtests/bench-mutex-adaptive-thread.c     |  8 ++-
 benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++
 manual/Makefile                              |  2 +-
 manual/mutex.texi                            | 68 +++++++++++++++++++++
 nptl/Makefile                                |  8 +--
 nptl/allocatestack.c                         |  2 +-
 nptl/descr.h                                 | 26 ++++----
 nptl/mcs_lock.c                              | 88 ++++++++++++++++++++++++++++
 nptl/mcs_lock.h                              | 21 +++++++
 nptl/nptl-init.c                             |  2 +-
 nptl/pthreadP.h                              |  2 +-
 nptl/pthread_mutex_init.c                    |  3 +-
 nptl/pthread_mutex_lock.c                    | 40 ++++++++++++-
 nptl/pthread_mutex_timedlock.c               | 35 ++++++++++-
 nptl/pthread_mutex_trylock.c                 |  5 +-
 nptl/pthread_mutex_unlock.c                  |  7 ++-
 nptl/pthread_mutexattr_settype.c             |  2 +-
 nptl/tst-initializers1.c                     | 11 ++--
 nptl/tst-mutex5b.c                           |  2 +
 nptl/tst-mutex7b.c                           |  2 +
 sysdeps/nptl/bits/thread-shared-types.h      | 22 +++++--
 sysdeps/nptl/pthread.h                       | 15 +++--
 sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++
 24 files changed, 350 insertions(+), 50 deletions(-)
 create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
 create mode 100644 manual/mutex.texi
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h
 create mode 100644 nptl/tst-mutex5b.c
 create mode 100644 nptl/tst-mutex7b.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 2/5] MCS lock: Enhance legacy MCS lock algorithm
  2018-07-13  6:55 [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP Kemi Wang
@ 2018-07-13  6:55 ` Kemi Wang
  2018-07-13  6:55 ` [PATCH v2 1/5] Mutex: Queue spinners to reduce cache line bouncing and ensure fairness Kemi Wang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Kemi Wang @ 2018-07-13  6:55 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

As we have discussed before, lock holder preemption would collapse lock
performance when CPUs are oversubscribed. To address this issue, we propose
a new evolved MCS lock algorithm which allows a long-waited spinner in the
queue to spin on the mutex lock without waking up. Thus, active spinners
still have the chance to spin on the mutex lock even if their fore spinners
are not running.

Compare to legacy MCS lock, the difference of the principles of this
enhanced MCS lock includes:
a) A spinner represented by node data structure has to be marked *black*
before being allowed to spin on the mutex lock due to timeout;
b) A black node can only be changed to a *white* node (normal node, how to
deal with a normal node follows the principle of legacy MCS lock) when its
fore spinner wakes it up;
c) A black node is handled specially for mcs lock acquisition and mcs lock
release. For mcs lock acquisition (a spinner calls mcs_lock), the spinner
can acquire the permission to spin on mutex lock immediately without being
added into the queue, and the token which identifies this behavior is set
accordingly (node->tag = 1). For mcs lock release, nothing needs to do if
the token has been set (node->tag == 1). If the token has not been set, it
is dealt with like a normal node.
d) The nodes which represent spinners are allocated with Thread Local
Storage (TLS), thus, we don't need to care about their life cycles.

Implementation details (Focus on difference):
a) White node to black node
The requirement of node data structure
node->locked: 0   A spinner is waiting in the queue
node->locked: 1   A spinner is waken up by its fore spinner
We name this type of spinner *white* node.

To mark a spinner black, we introduce the third state of a spinner:
node->locked: 2   A spinner jumps out the queue due to timeout
We name this type of spinner *black* node.

When a spinner jumps out of the queue due to timeout, we use
atomic_compare_and_exchange_val_acq (&node->locked, 2, 0) to ensure the
atomic of the transition of a white node to a black node.

b) Black node to white node
When a spinner jumps out the queue due to timeout to spin on mutex lock,
the next pointer of its fore spinner still points to it. Thus, this black
node would be finally "waken" up by its fore spinner via next->locked = 1.
In such case, the black node is changed to white node because its
node->locked equals to 1.

c) mcs lock acquisition of a black node
The requirement of node data structure(we introduce another field)
node->tag: 1   A black node calls mcs_lock to acquire the permission to
spin on mutex lock.
The token (node->tag) is set to 1 to identify mcs lock acquisition.

d) mcs lock release of a black node
There are two different usage scenarios for a black node to release mcs
lock.
  d1) Case 1:
  A white node jumps out of the queue to spin on mutex lock, after spinning
  for a while, it calls mcs_lock to release mcs lock.
  In this case, the mcs lock release of a black node is dealt with as same
  as a normal node.

  d2) Case 2:
  A black node calls mcs_lock to acquire the permmision to spin on
  mutex lock, after spinning for a while, it calls mcs_unlock to release
  mcs lock.
  In this case(we know this case by checking node->tag == 1), nothing needs
  to do.

e) Spinning timeout
Spinning timeout in the queue can be implemented via either 1) setting a
threshold of spin count, or 2) a timer.
In this patch, we use the MAX_ADAPTIVE_COUNT which can be tuned by system
administrators. We may need different max count for different architecture
due to *pause* lasting for very different amount of time.

f) The management of node data structure
node data structure which represents the state of a spinner is allocated
with the attribute *__thread* (Thread Local Storage), thus, we don't need
to care about its life cycle.

Test result:
To emulate lock holder preemption, we run two same processes simultaneously,
each process has 28 threads running in parallel, and each thread sets CPU
affinity to an individual CPU and does the following:
1) lock
2) spend *s* nanoseconds in the critical section
3) unlock
4) spend *t* nanoseconds in the non-critical section
in a loop until 5 seconds, and the lock performance is measured by the total
iterations.
Thus, CPU [0~27] are oversubscribed by two threads at same time.

The rightmost column is base data without CPU oversubscribe(run a single
process with legacy MCS lock).

lock_delay   unlock_delay   Legacy MCS       Enhanced MCS     Base
100ns         600ns         pid1: 7282215    pid1: 6959243    7174841
                            pid2: 1828699    pid2: 7115025

1000ns        6000ns        pid1: 437412     pid1: 1796011    2266682
                            pid2: 2238150    pid2: 1774810

10000ns       60000ns       pid1: 177121     pid1: 178121     203068
                            pid2: 178577     pid2: 178110

From the test result, compare to legacy MCS lock, our enhanced MCS lock
a) performs more balance among processes when CPUs are oversubscribed;
b) does not have obvious performance collapse.

Alternative solutions (Info provided by Tim Chen and Andi Kleen):
a)One other possibility is to set a quota ( > 1) on the numbers of
spinners. So even if one of the spinner is pre-empted, you still have other
spinners available to grab the lock to prevent the preemption performance
collapse.
b) Using hints to the scheduler that the lock Holder wouldn't be preempted
for user programs.

This is what I proposed here, I hope smart guys in community can help to
improve this idea (maybe drop this idea:() or they may have better idea.
Thanks for comments.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/mcs_lock.c                         | 28 ++++++++++++++++++++++++----
 nptl/pthread_mutex_lock.c               |  7 ++++++-
 sysdeps/nptl/bits/thread-shared-types.h |  1 +
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
index 21d20cf..e0d6a05 100644
--- a/nptl/mcs_lock.c
+++ b/nptl/mcs_lock.c
@@ -22,10 +22,19 @@
 void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
 {
   mcs_lock_t *prev;
+  int cnt = 0;

-  /* Initalize node.  */
+  /* Black node is allowed to spin on mutex immediately */
+  if (node->locked == 2)
+    {
+      node->tag = 1;
+      node->next = NULL;
+      return;
+    }
+  /* Initialize node.  */
   node->next = NULL;
   node->locked = 0;
+  node->tag = 0;

   prev = atomic_exchange_acquire(lock, node);

@@ -39,13 +48,25 @@ void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
   /* Add current spinner into the queue.  */
   atomic_store_release (&prev->next, node);
   atomic_full_barrier ();
-  /* Waiting until waken up by the previous spinner.  */
+  /* Waiting unless waken up by the previous spinner or timeout.  */
   while (!atomic_load_relaxed (&node->locked))
-    atomic_spin_nop ();
+    {
+	  atomic_spin_nop ();
+	  /* If timeout, mark this node black before get the permission.  */
+	  if (++cnt >= MAX_ADAPTIVE_COUNT)
+        {
+          /* Make node->locked = 2 to mark a node black */
+          atomic_compare_and_exchange_val_acq (&node->locked, 2, 0);
+          return;
+        }
+    }
 }

 void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
 {
+  if (node->tag == 1)
+	  return;
+
   mcs_lock_t *next = node->next;

   if (next == NULL)
@@ -61,7 +82,6 @@ void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
       while (! (next = atomic_load_relaxed (&node->next)))
         atomic_spin_nop ();
     }
-
   /* Wake up the next spinner.  */
   atomic_store_release (&next->locked, 1);
   atomic_full_barrier ();
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 5fe6038..fe9e6ed 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -57,6 +57,12 @@
 #define FORCE_ELISION(m, s)
 #endif

+static __thread mcs_lock_t node = {
+  NULL,
+  0,
+  0
+};
+
 static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
      __attribute_noinline__;

@@ -128,7 +134,6 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
           int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
                             mutex->__data.__spins * 2 + 10);
           int val = 0;
-          mcs_lock_t node;

           mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);

diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index 9d3c4de..4ec17b6 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -153,6 +153,7 @@ struct mcs_lock
 {
   struct mcs_lock *next;
   int locked;
+  int tag;
 };

 typedef struct mcs_lock mcs_lock_t;
-- 
2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/5] Mutex: Queue spinners to reduce cache line bouncing and ensure fairness
  2018-07-13  6:55 [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP Kemi Wang
  2018-07-13  6:55 ` [PATCH v2 2/5] MCS lock: Enhance legacy MCS lock algorithm Kemi Wang
@ 2018-07-13  6:55 ` Kemi Wang
  2018-07-13  6:55 ` [PATCH v2 5/5] Manual: Add manual for pthread mutex Kemi Wang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Kemi Wang @ 2018-07-13  6:55 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

There are two main problems in current adaptive mutex. The first one is
fairness, multiple spinners contend for the lock simultaneously and there
is no any guarantee for a spinner to acquire the lock no matter how long it
has been waiting for. The other is the heavy cache line bouncing. Since the
cache line including the mutex is shared among all of the spinners, when
lock is released, each spinner will try to acquire the lock via cmpxchg
instruction which constantly floods the system via "read-for-ownership"
requests. As a result, there will be a lot of cache line bouncing in a
large system with a lots of CPUs.

This patch introduces a new type of mutex called
PTHREAD_MUTEX_QUEUESPINNER_NP which puts mutex spinners into a queue before
spinning on the mutex lock and only allows the first spinner to spin on the
mutex lock. This reduces the overhead of cache line bouncing when lock
holder is transferred and allow the tasks to move forward faster, because
there is only one spinner and the cache line of mutex lock is contended
between lock holder and current spinner. At the same time, lock fairness
is also guaranteed in some degree. However, there may be a potential issue
in this proposal if CPUs are oversubscribed. When the lock holder is
transferred to the next spinner in the queue, but it may not be running at
that moment (CPU is scheduled to run other task), the lock performance
would collapse (see more details at the end of commit log), this is why we
introduced a new type of mutex rather than do the optimization based-on
the existed mutex discipline.

The implementation of queuing spinner mutex is based-on MCS lock and an
additional pointer is required to hold MCS lock. To keep the size of mutex
data structure consistent and maintain the user space ABI unchanged, the
__list field which is originally for implementing robust futex will be
reused for that purpose. Therefore, the queue spinner mutex with robust
futex would not be supported.

Pass the ABI compatibility test by running "make check-abi"

Test machine:
2-sockets Skylake platform, 112 cores with 62G RAM

Test case: mutex-adaptive-thread/mutex-queuespinner-thread

Usage: make bench BENCHSET="mutex-adaptive-thread mutex-queuespinner-thread"

Test result:
+----------------+-----------------+-----------------+------------+
|  Configuration |      Base       |      Head       | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 1x           |
+----------------+-----------------+-----------------+------------+
|   1 thread     |     9304170     |     9318160     |    0.2%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |    14718000     |    14947600     |    1.6%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |    21436800     |    20249800     |   -5.5%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |    16657600     |    15656500     |   -6.0%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |     4020620     |    14757000     |  267.0%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |     3489400     |     8996000     |  157.8%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |     3102040     |     9106490     |  193.6%    |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 10x          |
+----------------+-----------------+-----------------+------------+
|   1 thread     |     5226360     |     5228880     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |     6875240     |     7016720     |    2.1%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |     6323230     |     6053060     |   -4.3%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |     6215860     |     6388180     |    2.8%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |     3921620     |     5249650     |   33.9%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |     2855460     |     4308940     |   50.9%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |     2572420     |     4166650     |   62.0%    |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 100x         |
+----------------+-----------------+-----------------+------------+
|   1 thread     |      968946     |      969081     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |      772844     |      776187     |    0.4%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |      808041     |      812314     |    0.5%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |      802213     |      794792     |   -0.9%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |      338170     |      339024     |    0.3%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |      339900     |      339932     |    0.0%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |      331791     |      335243     |    1.0%    |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 1000x        |
+----------------+-----------------+-----------------+------------+
|   1 thread     |      106082     |      106102     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |      100833     |      100823     |   -0.0%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |      100965     |      100842     |   -0.1%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |       96813     |       96846     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |       52230     |       52024     |   -0.4%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |       48298     |       46427     |   -3.9%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |       45865     |       44405     |   -3.2%    |
+----------------+-----------------+-----------------+------------+

Though, the queue spinner mutex performs better than adaptive mutex, people
probably ask why we need this new type of mutex since we have already had
pthread spin lock and pthread mutex. Therefore, we designed several test
cases below and explore how they perform.
Let's assume the size of critical section is represented by *s*, the size
of non-critical section is represented by *t*, and let t = k*s. Then, on a
single thread, the arrival rate at which a single core will try to acquire
the lock, in the absence of contention, is 1/(k+1). We also assume there
are *n* threads contending for a lock, each thread binds to an individual
CPU core, and does the following:
1) lock
2) spend *s* nanoseconds in the critical section
3) unlock
4) spend *t* nanoseconds in the non-critical section
in a loop until 5 seconds, and the lock performance is measured by the total
throughput.

To emulate different usage scenarios, we let k=6, s=100ns and run this
workload using different lock disciplines. Then increase *s* and repeat.
In our workload, 4 threads contending for a lock emulates little lock
contention, and 28 threads contending for a lock emulates severe lock
contention within a socket, and 56 threads contending for a lock emulates
severe lock contention across sockets.

The benchmark is provided by Andi Kleen.

+-------+-------------+--------------+----------------+---------------+
|  Num  |  Spin Lock  | Normal Mutex | Adaptive Mutex | Queue Spinner |
+-------+-------------+--------------+------------- --+---------------+
|                     |    s=100ns t=600ns                            |
+-------+-------------+--------------+----------------+---------------+
|   4   |  12117662   |   7124320    |    10372184    |     9557689   |
+-------+-------------+--------------+----------------+---------------+
|  28   |   2695783   |   6385815    |     3927942    |     7182092   |
+-------+-------------+--------------+----------------+---------------+
|  56   |   2203519   |   4555164    |     3143599    |     4690016   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=1000ns t=6000ns                          |
+-------+-------------+--------------+----------------+---------------+
|   4   |   1529542   |   1380643    |     1495118    |     1503344   |
+-------+-------------+--------------+----------------+---------------+
|  28   |   2063929   |   1695128    |     2064940    |     2205245   |
+-------+-------------+--------------+----------------+---------------+
|  56   |   1507764   |   1427931    |     1704105    |     1720832   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=10000ns t=60000ns                        |
+-------+-------------+--------------+----------------+---------------+
|   4   |    159407   |    159213    |      159213    |      159215   |
+-------+-------------+--------------+----------------+---------------+
|  28   |    272062   |    153567    |      223229    |      224948   |
+-------+-------------+--------------+----------------+---------------+
|  56   |    269920   |    157287    |      239814    |      239887   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=100000ns t=600000ns                      |
+-------+-------------+--------------+----------------+---------------+
|   4   |     16024   |     16023    |       16021    |       16021   |
+-------+-------------+--------------+----------------+---------------+
|  28   |     27990   |     20421    |       20372    |       20378   |
+-------+-------------+--------------+----------------+---------------+
|  56   |     27987   |     20395    |       20322    |       20348   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=1000000ns t=6000000ns                    |
+-------+-------------+--------------+----------------+---------------+
|   4   |      1604   |      1604    |        1604    |        1604   |
+-------+-------------+--------------+----------------+---------------+
|  28   |      2826   |      2748    |        2748    |        2748   |
+-------+-------------+--------------+----------------+---------------+
|  56   |      2853   |      2773    |        2774    |        2773   |
+-------+-------------+--------------+----------------+---------------+

Generally, we can get some conclusion from the rest result above:
a) In case of little lock contention, spin lock performs best, queue
spinner mutex performs similar to adaptive mutex, and both perform a
little better than pthread mutex;
b) In the case of severe lock contention with large number of CPUs when
protecting a small critical section (less than 1000ns). Most of lock
acquisition is got via spinning. Queue spinner mutex
performs much better than spin lock and adaptive mutex. This is because the
overhead of heavy cache line bouncing plays a big role on lock performance.
c) With the increase of the size of a critical section, the advantage of
queue spinner mutex on performance in reduced gradually. This is because
the overhead of cache line bouncing will not become the bottleneck of lock
performance, instead, the overhead of futex_wait and futex_wake
plays a big role. When the critical section is increased to 1ms, even the
latency of futex syscall would be ignorable compared to the total time of
lock acquisition.

As we can see above, queue spinner mutex performs well in kinds of workload,
but there would be a potential risk to use this type of mutex. When the
lock holder is transformed to the next spinner in the queue, but it is not
running (the CPU is scheduled to run other task). And, the other spinners
have to wait in the queue, this would probably collapse the lock
performance. To emulate this case, we run two same processes simultaneously,
there are 28 threads running in parallel in the process, each thread sets
CPU affinity to an individual CPU according to the thread id. Thus,
CPU [0~27] are subscribed by two threads at the same time.
We run this test with different workload mentioned above, in the worst case
(s=1000ns, t=6000ns), the lock performance is reduced by 58.1%
(2205245->924263).

Therefore, queue spinner mutex would be carefully used for applications to
pursue fairness and performance without oversubscribing CPU resource. E.g.
Run a application within a container in public cloud infrastructure.

May to do list:
a) Tuning the threshold of spin count

At last, I would like to extend my appreciation sincerely to Andi Kleen for
his guidance and support during the development.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/Makefile                           |  2 +-
 nptl/allocatestack.c                    |  2 +-
 nptl/descr.h                            | 26 ++++++-------
 nptl/mcs_lock.c                         | 68 +++++++++++++++++++++++++++++++++
 nptl/mcs_lock.h                         | 21 ++++++++++
 nptl/nptl-init.c                        |  2 +-
 nptl/pthreadP.h                         |  2 +-
 nptl/pthread_mutex_init.c               |  3 +-
 nptl/pthread_mutex_lock.c               | 35 ++++++++++++++++-
 nptl/pthread_mutex_timedlock.c          | 35 +++++++++++++++--
 nptl/pthread_mutex_trylock.c            |  5 ++-
 nptl/pthread_mutex_unlock.c             |  7 +++-
 nptl/pthread_mutexattr_settype.c        |  2 +-
 sysdeps/nptl/bits/thread-shared-types.h | 21 +++++++---
 sysdeps/nptl/pthread.h                  | 15 +++++---
 sysdeps/unix/sysv/linux/hppa/pthread.h  |  4 ++
 16 files changed, 212 insertions(+), 38 deletions(-)
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h

diff --git a/nptl/Makefile b/nptl/Makefile
index bd1096f..7559907 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -140,7 +140,7 @@ libpthread-routines = nptl-init vars events version pt-interp \
 		      pthread_mutex_setprioceiling \
 		      pthread_setname pthread_getname \
 		      pthread_setattr_default_np pthread_getattr_default_np \
-		      pthread_mutex_conf
+		      pthread_mutex_conf mcs_lock
 #		      pthread_setuid pthread_seteuid pthread_setreuid \
 #		      pthread_setresuid \
 #		      pthread_setgid pthread_setegid pthread_setregid \
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 9c10b99..cb0fad7 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -743,7 +743,7 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
      might have happened in the kernel.  */
   pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				  - offsetof (pthread_mutex_t,
-					      __data.__list.__next));
+					      __data.__list.__list_t.__next));
   pd->robust_head.list_op_pending = NULL;
 #if __PTHREAD_MUTEX_HAVE_PREV
   pd->robust_prev = &pd->robust_head;
diff --git a/nptl/descr.h b/nptl/descr.h
index 0a0abb4..ddad47c 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -184,38 +184,38 @@ struct pthread
      FIXME We should use relaxed MO atomic operations here and signal fences
      because this kind of concurrency is similar to synchronizing with a
      signal handler.  */
-# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __next))
+# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __list_t.__next))
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
       ((((uintptr_t) THREAD_GETMEM (THREAD_SELF, robust_head.list)) & ~1ul)   \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = (void *) &mutex->__data.__list.__next;		      \
-    mutex->__data.__list.__next = THREAD_GETMEM (THREAD_SELF,		      \
+    next->__list_t.__prev = (void *) &mutex->__data.__list.__list_t.__next;		      \
+    mutex->__data.__list.__list_t.__next = THREAD_GETMEM (THREAD_SELF,		      \
 						 robust_head.list);	      \
-    mutex->__data.__list.__prev = (void *) &THREAD_SELF->robust_head;	      \
+    mutex->__data.__list.__list_t.__prev = (void *) &THREAD_SELF->robust_head;	      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
     THREAD_SETMEM (THREAD_SELF, robust_head.list,			      \
-		   (void *) (((uintptr_t) &mutex->__data.__list.__next)	      \
+		   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)	      \
 			     | val));					      \
   } while (0)
 # define DEQUEUE_MUTEX(mutex) \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__next) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__next) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = mutex->__data.__list.__prev;				      \
+    next->__list_t.__prev = mutex->__data.__list.__list_t.__prev;				      \
     __pthread_list_t *prev = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__prev) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__prev) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    prev->__next = mutex->__data.__list.__next;				      \
+    prev->__list_t.__next = mutex->__data.__list.__list_t.__next;				      \
     /* Ensure that we remove the entry from the list before we change the     \
        __next pointer of the entry, which is read by the kernel.  */	      \
     __asm ("" ::: "memory");						      \
-    mutex->__data.__list.__prev = NULL;					      \
-    mutex->__data.__list.__next = NULL;					      \
+    mutex->__data.__list.__list_t.__prev = NULL;					      \
+    mutex->__data.__list.__list_t.__next = NULL;					      \
   } while (0)
 #else
   union
@@ -226,7 +226,7 @@ struct pthread
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
-    mutex->__data.__list.__next						      \
+    mutex->__data.__list.__list_t.__next						      \
       = THREAD_GETMEM (THREAD_SELF, robust_list.__next);		      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
@@ -253,7 +253,7 @@ struct pthread
 	/* Ensure that we remove the entry from the list before we change the \
 	   __next pointer of the entry, which is read by the kernel.  */      \
 	    __asm ("" ::: "memory");					      \
-	mutex->__data.__list.__next = NULL;				      \
+	mutex->__data.__list.__list_t.__next = NULL;				      \
       }									      \
   } while (0)
 #endif
diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
new file mode 100644
index 0000000..21d20cf
--- /dev/null
+++ b/nptl/mcs_lock.c
@@ -0,0 +1,68 @@
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <atomic.h>
+
+void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
+{
+  mcs_lock_t *prev;
+
+  /* Initalize node.  */
+  node->next = NULL;
+  node->locked = 0;
+
+  prev = atomic_exchange_acquire(lock, node);
+
+  /* No spinners waiting in the queue, lock is acquired immediately.  */
+  if (prev == NULL)
+    {
+      node->locked = 1;
+      return;
+    }
+
+  /* Add current spinner into the queue.  */
+  atomic_store_release (&prev->next, node);
+  atomic_full_barrier ();
+  /* Waiting until waken up by the previous spinner.  */
+  while (!atomic_load_relaxed (&node->locked))
+    atomic_spin_nop ();
+}
+
+void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
+{
+  mcs_lock_t *next = node->next;
+
+  if (next == NULL)
+    {
+      /* Check the tail of the queue:
+       * a) Release the lock and return if current node is the tail
+       * (lock == node).  */
+      if (atomic_compare_and_exchange_val_acq(lock, NULL, node) == node)
+        return;
+
+      /* b) Waiting until new node is added to the queue if current node is
+       * not the tail (lock != node).  */
+      while (! (next = atomic_load_relaxed (&node->next)))
+        atomic_spin_nop ();
+    }
+
+  /* Wake up the next spinner.  */
+  atomic_store_release (&next->locked, 1);
+  atomic_full_barrier ();
+}
diff --git a/nptl/mcs_lock.h b/nptl/mcs_lock.h
new file mode 100644
index 0000000..b779824
--- /dev/null
+++ b/nptl/mcs_lock.h
@@ -0,0 +1,21 @@
+/* Copyright (C) 2002-2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node);
+
+void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node);
diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
index 1d3790f..5e8643f 100644
--- a/nptl/nptl-init.c
+++ b/nptl/nptl-init.c
@@ -303,7 +303,7 @@ __pthread_initialize_minimal_internal (void)
 #ifdef __NR_set_robust_list
     pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				    - offsetof (pthread_mutex_t,
-						__data.__list.__next));
+						__data.__list.__list_t.__next));
     INTERNAL_SYSCALL_DECL (err);
     int res = INTERNAL_SYSCALL (set_robust_list, err, 2, &pd->robust_head,
 				sizeof (struct robust_list_head));
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 075530c..0dad1dc 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -62,7 +62,7 @@
 /* Internal mutex type value.  */
 enum
 {
-  PTHREAD_MUTEX_KIND_MASK_NP = 3,
+  PTHREAD_MUTEX_KIND_MASK_NP = 7,
 
   PTHREAD_MUTEX_ELISION_NP    = 256,
   PTHREAD_MUTEX_NO_ELISION_NP = 512,
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index d8fe473..e682de3 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -110,6 +110,8 @@ __pthread_mutex_init (pthread_mutex_t *mutex,
 	  && __set_robust_list_avail < 0)
 	return ENOTSUP;
 #endif
+      if ((imutexattr->mutexkind & PTHREAD_MUTEX_QUEUESPINNER_NP) != 0)
+        return ENOTSUP;
 
       mutex->__data.__kind |= PTHREAD_MUTEX_ROBUST_NORMAL_NP;
     }
@@ -146,7 +148,6 @@ __pthread_mutex_init (pthread_mutex_t *mutex,
   if ((imutexattr->mutexkind & (PTHREAD_MUTEXATTR_FLAG_PSHARED
 				| PTHREAD_MUTEXATTR_FLAG_ROBUST)) != 0)
     mutex->__data.__kind |= PTHREAD_MUTEX_PSHARED_BIT;
-
   /* Default values: mutex not used yet.  */
   // mutex->__count = 0;	already done by memset
   // mutex->__owner = 0;	already done by memset
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 26bcebf..5fe6038 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include "mcs_lock.h"
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -116,6 +117,36 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       mutex->__data.__count = 1;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			 == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    {
+      if (! __is_smp)
+        goto simple;
+
+      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+          mcs_lock_t node;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+          LLL_MUTEX_LOCK (mutex);
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      assert (mutex->__data.__owner == 0);
+    }
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			  == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     {
       if (! __is_smp)
@@ -192,7 +223,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -372,7 +403,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 66efd39..07fa52a 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,6 +25,8 @@
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <not-cancel.h>
+#include "mcs_lock.h"
+#include "pthread_mutex_conf.h"
 
 #include <stap-probe.h>
 
@@ -133,13 +135,40 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
       break;
-
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
+      if (! __is_smp)
+        goto simple;
+
+      if (lll_trylock (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+          mcs_lock_t node;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+          result = lll_timedlock(mutex->__data.__lock, abstime,
+                      PTHREAD_MUTEX_PSHARED (mutex));
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      break;
     case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
     case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -345,7 +374,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
index 7de61f4..6979abd 100644
--- a/nptl/pthread_mutex_trylock.c
+++ b/nptl/pthread_mutex_trylock.c
@@ -76,6 +76,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
       FORCE_ELISION (mutex, goto elision);
       /*FALL THROUGH*/
     case PTHREAD_MUTEX_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_ERRORCHECK_NP:
       if (lll_trylock (mutex->__data.__lock) != 0)
 	break;
@@ -91,7 +92,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
 
       oldval = mutex->__data.__lock;
       do
@@ -205,7 +206,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	if (robust)
 	  /* Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 
 	oldval = mutex->__data.__lock;
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index 9ea6294..5a5c511 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -76,6 +76,9 @@ __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
       goto normal;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			      == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    goto normal;
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			      == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     goto normal;
   else
@@ -140,7 +143,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
     robust:
       /* Remove mutex from the list.  */
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We must set op_pending before we dequeue the mutex.  Also see
 	 comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -234,7 +237,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 	  /* Remove mutex from the list.
 	     Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 	  /* We must set op_pending before we dequeue the mutex.  Also see
 	     comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutexattr_settype.c b/nptl/pthread_mutexattr_settype.c
index 7d36cc3..c2382b4 100644
--- a/nptl/pthread_mutexattr_settype.c
+++ b/nptl/pthread_mutexattr_settype.c
@@ -25,7 +25,7 @@ __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind)
 {
   struct pthread_mutexattr *iattr;
 
-  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_ADAPTIVE_NP)
+  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_QUEUESPINNER_NP)
     return EINVAL;
 
   /* Cannot distinguish between DEFAULT and NORMAL. So any settype
diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index 1e2092a..9d3c4de 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -79,15 +79,19 @@
 /* Common definition of pthread_mutex_t. */
 
 #if !__PTHREAD_MUTEX_USE_UNION
-typedef struct __pthread_internal_list
+typedef union __pthread_internal_list
 {
-  struct __pthread_internal_list *__prev;
-  struct __pthread_internal_list *__next;
+  struct {
+    union __pthread_internal_list *__prev;
+    union __pthread_internal_list *__next;
+  }__list_t;
+  void *mcs_lock;
 } __pthread_list_t;
 #else
-typedef struct __pthread_internal_slist
+typedef union __pthread_internal_slist
 {
-  struct __pthread_internal_slist *__next;
+  union __pthread_internal_slist *__next;
+  void *mcs_lock;
 } __pthread_slist_t;
 #endif
 
@@ -145,6 +149,13 @@ struct __pthread_mutex_s
   __PTHREAD_COMPAT_PADDING_END
 };
 
+struct mcs_lock
+{
+  struct mcs_lock *next;
+  int locked;
+};
+
+typedef struct mcs_lock mcs_lock_t;
 
 /* Common definition of pthread_cond_t. */
 
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index df049ab..4b4b80a 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -45,7 +45,8 @@ enum
   PTHREAD_MUTEX_TIMED_NP,
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
-  PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_ADAPTIVE_NP,
+  PTHREAD_MUTEX_QUEUESPINNER_NP
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -85,14 +86,16 @@ enum
 
 #if __PTHREAD_MUTEX_HAVE_PREV
 # define PTHREAD_MUTEX_INITIALIZER \
-  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { { 0, 0 } } } }
 # ifdef __USE_GNU
 #  define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 
 # endif
 #else
@@ -105,6 +108,8 @@ enum
   { { 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, 0, { __PTHREAD_SPINS } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, 0, { __PTHREAD_SPINS } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, 0, { __PTHREAD_SPINS } } }
 
 # endif
 #endif
diff --git a/sysdeps/unix/sysv/linux/hppa/pthread.h b/sysdeps/unix/sysv/linux/hppa/pthread.h
index 11a024d..57c101c 100644
--- a/sysdeps/unix/sysv/linux/hppa/pthread.h
+++ b/sysdeps/unix/sysv/linux/hppa/pthread.h
@@ -46,6 +46,7 @@ enum
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
   PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_QUEUESPINNER_NP,
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -95,6 +96,9 @@ enum
 # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, { 0, 0, 0, 0 }, 0, \
       { __PTHREAD_SPINS }, { 0, 0 } } }
+# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, { 0, 0, 0, 0 }, 0, \
+      { __PTHREAD_SPINS }, { 0, 0 } } }
 #endif
 
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 5/5] Manual: Add manual for pthread mutex
  2018-07-13  6:55 [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP Kemi Wang
  2018-07-13  6:55 ` [PATCH v2 2/5] MCS lock: Enhance legacy MCS lock algorithm Kemi Wang
  2018-07-13  6:55 ` [PATCH v2 1/5] Mutex: Queue spinners to reduce cache line bouncing and ensure fairness Kemi Wang
@ 2018-07-13  6:55 ` Kemi Wang
  2018-07-18  7:34   ` Rical Jasan
  2018-07-13  6:55 ` [PATCH v2 4/5] BENCHMARK: add a benchmark for testing new type of mutex Kemi Wang
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 9+ messages in thread
From: Kemi Wang @ 2018-07-13  6:55 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

Pthread mutex is not described in the documentation, so I started to document
pthread mutex with PTHREAD_MUTEX_QUEUESPINNER type here at least.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 manual/Makefile   |  2 +-
 manual/mutex.texi | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+), 1 deletion(-)
 create mode 100644 manual/mutex.texi

diff --git a/manual/Makefile b/manual/Makefile
index c275664..e56880a 100644
--- a/manual/Makefile
+++ b/manual/Makefile
@@ -39,7 +39,7 @@ chapters = $(addsuffix .texi, \
 		       pipe socket terminal syslog math arith time	\
 		       resource setjmp signal startup process ipc job	\
 		       nss users sysinfo conf crypt debug threads	\
-		       probes tunables)
+		       probes tunables mutex)
 appendices = lang.texi header.texi install.texi maint.texi platform.texi \
 	     contrib.texi
 licenses = freemanuals.texi lgpl-2.1.texi fdl-1.3.texi
diff --git a/manual/mutex.texi b/manual/mutex.texi
new file mode 100644
index 0000000..1520a8c
--- /dev/null
+++ b/manual/mutex.texi
@@ -0,0 +1,68 @@
+@node Pthread mutex
+@c %MENU% mutex
+
+This chapter describes the usage and implmentation of POSIX Pthreads Mutex.
+
+@menu
+* Mutex introduction:: What is mutex?
+* Mutex type:: The capability for each type of mutex
+* Mutex usage:: How to use mutex?
+* Usage scenarios and limitation::
+@end menu
+
+@node Mutex introduction
+@section Mutex introduction
+
+Mutex is used to protect the data structure shared among threads/processes.
+
+@node Mutex type
+@section Mutex type
+
+@deftp Type PTHREAD_MUTEX_QUEUESPINNER_NP
+Queue spinner mutex can reduce the overhead of lock holder transition and
+make mutex scalable in a large system with more and more CPUs (E.g. NUMA
+architecture) by queuing spinners. It puts mutex spinners into a queue
+before spinning on the mutex lock and only allows one spinner spinning on
+mutex lock. Thus, when lock is released, the current spinner can acquire
+the lock immediately because the cache line including mutex lock is only
+contended between previous lock holder and current spinner, and the
+overhead of lock acquisition via spinning is always O(1) no matter how
+severe the lock is contended.
+@end deftp
+
+@node Mutex usage
+@section Mutex usage
+
+@deftp Type PTHREAD_MUTEX_QUEUESPINNER_NP
+Queue spinner mutex can be initialized simply by either using the macro
+definition @code{PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP} or dynamically
+calling @code{pthread_mutex_init}.
+
+Static initialization:
+@smallexample
+mutex = PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP
+@end smallexample
+
+Dynamic initialization:
+@smallexample
+pthread_mutexattr_init(&attr)
+pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_QUEUESPINNER_NP)
+pthread_mutex_init(&mutex, &attr)
+@end smallexample
+@end deftp
+
+@node Usage scenarios and limitation
+@section Usage scenarios and limitation
+
+@deftp TYPE PTHREAD_MUTEX_QUEUESPINNER_NP
+There could be a potential risk to use mutex initialized with type
+@code{PTHREAD_MUTEX_QUEUESPINNER_NP} if CPU resource is oversubscribed. E.g.
+when a lock holder is transferred to the next spinner in the queue. but it
+is not running (the CPU is scheduled to run other task at that moment).
+Thus, the other spinners have to wait and it may lead to lock performance
+collapse. Therefore, queue spinner mutex would be carefully used for
+applications to pursue performance and fairness without oversubsribing CPU
+resource. E.g. Run a application within a container in private or public
+cloud infrastructure or a application running on the CPUs without subscribed
+by other tasks at the same time.
+@end deftp
-- 
2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 5/5] Manual: Add manual for pthread mutex
  2018-07-13  6:55 ` [PATCH v2 5/5] Manual: Add manual for pthread mutex Kemi Wang
@ 2018-07-18  7:34   ` Rical Jasan
  2018-07-18  7:56     ` Wang, Kemi
  0 siblings, 1 reply; 9+ messages in thread
From: Rical Jasan @ 2018-07-18  7:34 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Adhemerval Zanella, Florian Weimer, Carlos Donell, Glibc alpha,
	Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey

On 07/12/2018 11:52 PM, Kemi Wang wrote:
> Pthread mutex is not described in the documentation, so I started to document
> pthread mutex with PTHREAD_MUTEX_QUEUESPINNER type here at least.
> 
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  manual/Makefile   |  2 +-
>  manual/mutex.texi | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+), 1 deletion(-)
>  create mode 100644 manual/mutex.texi

I don't think this patch should create a new chapter.  There is already
threads.texi---woefully incomplete as it may be (one and a half pages in
the PDF and over 100 functions in the .texi file documented in comments
as undocumented).  There are currently two sections in threads.texi:
Thread-specific Data and Non-POSIX Extensions.  I think this patch
should introduce a new section in that chapter; something like Mutexes.

This issue is independent of the content, which I'll review separately.

Rical

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH v2 5/5] Manual: Add manual for pthread mutex
  2018-07-18  7:34   ` Rical Jasan
@ 2018-07-18  7:56     ` Wang, Kemi
  0 siblings, 0 replies; 9+ messages in thread
From: Wang, Kemi @ 2018-07-18  7:56 UTC (permalink / raw)
  To: Rical Jasan
  Cc: Adhemerval Zanella, Florian Weimer, Carlos Donell, Glibc alpha,
	Dave Hansen, Chen, Tim C, Kleen, Andi, Huang, Ying, Lu, Aaron,
	Li, Aubrey

OK. I will take a look at it.

-----Original Message-----
From: Rical Jasan [mailto:rj@2c3t.io] 
Sent: Wednesday, July 18, 2018 3:34 PM
To: Wang, Kemi <kemi.wang@intel.com>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>; Florian Weimer <fweimer@redhat.com>; Carlos Donell <carlos@redhat.com>; Glibc alpha <libc-alpha@sourceware.org>; Dave Hansen <dave.hansen@linux.intel.com>; Chen, Tim C <tim.c.chen@intel.com>; Kleen, Andi <andi.kleen@intel.com>; Huang, Ying <ying.huang@intel.com>; Lu, Aaron <aaron.lu@intel.com>; Li, Aubrey <aubrey.li@intel.com>
Subject: Re: [PATCH v2 5/5] Manual: Add manual for pthread mutex

On 07/12/2018 11:52 PM, Kemi Wang wrote:
> Pthread mutex is not described in the documentation, so I started to 
> document pthread mutex with PTHREAD_MUTEX_QUEUESPINNER type here at least.
> 
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  manual/Makefile   |  2 +-
>  manual/mutex.texi | 68 
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+), 1 deletion(-)  create mode 100644 
> manual/mutex.texi

I don't think this patch should create a new chapter.  There is already threads.texi---woefully incomplete as it may be (one and a half pages in the PDF and over 100 functions in the .texi file documented in comments as undocumented).  There are currently two sections in threads.texi:
Thread-specific Data and Non-POSIX Extensions.  I think this patch should introduce a new section in that chapter; something like Mutexes.

This issue is independent of the content, which I'll review separately.

Rical

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 4/5] BENCHMARK: add a benchmark for testing new type of mutex
  2018-07-13  6:55 [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP Kemi Wang
                   ` (2 preceding siblings ...)
  2018-07-13  6:55 ` [PATCH v2 5/5] Manual: Add manual for pthread mutex Kemi Wang
@ 2018-07-13  6:55 ` Kemi Wang
  2018-07-13  6:55 ` [PATCH v2 3/5] Mutex: add unit tests for new type Kemi Wang
  2018-08-02  0:38 ` [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP kemi
  5 siblings, 0 replies; 9+ messages in thread
From: Kemi Wang @ 2018-07-13  6:55 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

Add the benchmark for testing the mutex with type
PTHREAD_MUTEX_QUEUESPINNER_NP.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 benchtests/Makefile                          |  4 ++--
 benchtests/bench-mutex-adaptive-thread.c     |  8 ++++++--
 benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++++++++++++++++
 3 files changed, 29 insertions(+), 4 deletions(-)
 create mode 100644 benchtests/bench-mutex-queuespinner-thread.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 47b587f..31bef2e 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -96,7 +96,7 @@ bench-malloc := $(filter malloc-%,${BENCHSET})
 endif
 
 ifeq (${BENCHSET},)
-bench-mutex := mutex-adaptive-thread
+bench-mutex := mutex-adaptive-thread mutex-queuespinner-thread
 else
 bench-mutex := $(filter mutex-%,${BENCHSET})
 endif
@@ -174,7 +174,7 @@ bench-clean:
 ifneq ($(strip ${BENCHSET}),)
 VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \
    wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \
-   malloc-thread mutex-adaptive-thread
+   malloc-thread mutex-adaptive-thread mutex-queuespinner-thread
 INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET})
 ifneq (${INVALIDBENCHSETNAMES},)
 $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES})
diff --git a/benchtests/bench-mutex-adaptive-thread.c b/benchtests/bench-mutex-adaptive-thread.c
index ce9c40e..0badb93 100644
--- a/benchtests/bench-mutex-adaptive-thread.c
+++ b/benchtests/bench-mutex-adaptive-thread.c
@@ -30,8 +30,12 @@
 #include "json-lib.h"
 
 /* Benchmark duration in seconds.  */
-#define BENCHMARK_DURATION	15
-#define TYPE PTHREAD_MUTEX_ADAPTIVE_NP
+#define BENCHMARK_DURATION	5
+
+#ifndef TYPE
+# define TYPE PTHREAD_MUTEX_ADAPTIVE_NP
+#endif
+
 #define mb() asm ("" ::: "memory")
 #define UNLOCK_DELAY 10
 
diff --git a/benchtests/bench-mutex-queuespinner-thread.c b/benchtests/bench-mutex-queuespinner-thread.c
new file mode 100644
index 0000000..0a1693e
--- /dev/null
+++ b/benchtests/bench-mutex-queuespinner-thread.c
@@ -0,0 +1,21 @@
+/* Benchmark pthread adaptive spin mutex lock and unlock functions.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define TYPE PTHREAD_MUTEX_QUEUESPINNER_NP
+
+#include "bench-mutex-adaptive-thread.c"
-- 
2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 3/5] Mutex: add unit tests for new type
  2018-07-13  6:55 [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP Kemi Wang
                   ` (3 preceding siblings ...)
  2018-07-13  6:55 ` [PATCH v2 4/5] BENCHMARK: add a benchmark for testing new type of mutex Kemi Wang
@ 2018-07-13  6:55 ` Kemi Wang
  2018-08-02  0:38 ` [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP kemi
  5 siblings, 0 replies; 9+ messages in thread
From: Kemi Wang @ 2018-07-13  6:55 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/Makefile            |  6 +++---
 nptl/tst-initializers1.c | 11 +++++++----
 nptl/tst-mutex5b.c       |  2 ++
 nptl/tst-mutex7b.c       |  2 ++
 4 files changed, 14 insertions(+), 7 deletions(-)
 create mode 100644 nptl/tst-mutex5b.c
 create mode 100644 nptl/tst-mutex7b.c

diff --git a/nptl/Makefile b/nptl/Makefile
index 7559907..92ed53c 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -235,9 +235,9 @@ LDLIBS-tst-minstack-throw = -lstdc++
 
 tests = tst-attr1 tst-attr2 tst-attr3 tst-default-attr \
 	tst-mutex1 tst-mutex2 tst-mutex3 tst-mutex4 tst-mutex5 tst-mutex6 \
-	tst-mutex7 tst-mutex9 tst-mutex5a tst-mutex7a tst-mutex7robust \
-	tst-mutexpi1 tst-mutexpi2 tst-mutexpi3 tst-mutexpi4 tst-mutexpi5 \
-	tst-mutexpi5a tst-mutexpi6 tst-mutexpi7 tst-mutexpi7a \
+	tst-mutex7 tst-mutex9 tst-mutex5a tst-mutex5b tst-mutex7a tst-mutex7b \
+	tst-mutex7robust tst-mutexpi1 tst-mutexpi2 tst-mutexpi3 tst-mutexpi4 \
+	tst-mutexpi5 tst-mutexpi5a tst-mutexpi6 tst-mutexpi7 tst-mutexpi7a \
 	tst-mutexpi9 \
 	tst-spin1 tst-spin2 tst-spin3 tst-spin4 \
 	tst-cond1 tst-cond2 tst-cond3 tst-cond4 tst-cond5 tst-cond6 tst-cond7 \
diff --git a/nptl/tst-initializers1.c b/nptl/tst-initializers1.c
index a8fdac1..3677cc0 100644
--- a/nptl/tst-initializers1.c
+++ b/nptl/tst-initializers1.c
@@ -24,6 +24,7 @@
 pthread_mutex_t mtx_normal = PTHREAD_MUTEX_INITIALIZER;
 pthread_mutex_t mtx_recursive = PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP;
 pthread_mutex_t mtx_errorchk = PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP;
+pthread_mutex_t mtx_queuespinner = PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP;
 pthread_mutex_t mtx_adaptive = PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP;
 pthread_rwlock_t rwl_normal = PTHREAD_RWLOCK_INITIALIZER;
 pthread_rwlock_t rwl_writer
@@ -39,20 +40,22 @@ do_test (void)
     return 2;
   if (mtx_errorchk.__data.__kind != PTHREAD_MUTEX_ERRORCHECK_NP)
     return 3;
-  if (mtx_adaptive.__data.__kind != PTHREAD_MUTEX_ADAPTIVE_NP)
+  if (mtx_queuespinner.__data.__kind != PTHREAD_MUTEX_QUEUESPINNER_NP)
     return 4;
-  if (rwl_normal.__data.__flags != PTHREAD_RWLOCK_PREFER_READER_NP)
+  if (mtx_adaptive.__data.__kind != PTHREAD_MUTEX_ADAPTIVE_NP)
     return 5;
+  if (rwl_normal.__data.__flags != PTHREAD_RWLOCK_PREFER_READER_NP)
+    return 6;
   if (rwl_writer.__data.__flags
       != PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP)
-    return 6;
+    return 7;
   /* <libc-lock.h> __libc_rwlock_init definition for libc.so
      relies on PTHREAD_RWLOCK_INITIALIZER being all zeros.  If
      that ever changes, <libc-lock.h> needs updating.  */
   size_t i;
   for (i = 0; i < sizeof (rwl_normal); i++)
     if (((char *) &rwl_normal)[i] != '\0')
-      return 7;
+      return 8;
   return 0;
 }
 
diff --git a/nptl/tst-mutex5b.c b/nptl/tst-mutex5b.c
new file mode 100644
index 0000000..9da54ea
--- /dev/null
+++ b/nptl/tst-mutex5b.c
@@ -0,0 +1,2 @@
+#define TYPE PTHREAD_MUTEX_QUEUESPINNER_NP
+#include "tst-mutex5.c"
diff --git a/nptl/tst-mutex7b.c b/nptl/tst-mutex7b.c
new file mode 100644
index 0000000..c2c1410
--- /dev/null
+++ b/nptl/tst-mutex7b.c
@@ -0,0 +1,2 @@
+#define TYPE PTHREAD_MUTEX_QUEUESPINNER_NP
+#include "tst-mutex7.c"
-- 
2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
  2018-07-13  6:55 [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP Kemi Wang
                   ` (4 preceding siblings ...)
  2018-07-13  6:55 ` [PATCH v2 3/5] Mutex: add unit tests for new type Kemi Wang
@ 2018-08-02  0:38 ` kemi
  5 siblings, 0 replies; 9+ messages in thread
From: kemi @ 2018-08-02  0:38 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
  Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey

Hi, Gentle maintainers
  Could we get this patch series reviewed next? Thanks

https://sourceware.org/ml/libc-alpha/2018-07/msg00354.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00357.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00355.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00358.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00356.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00359.html

On 2018å¹´07æœˆ13æ—¥ 14:52, Kemi Wang wrote:
> The pthread adaptive mutex is designed based-on the truth that the lock
> probably would be released after a few of CPU cycles in most cases.
> Especially for the case: the applications code is well-behaved with a short
> critical section that is highly contended. Thus, spinning on the lock for
> a while instead of calling into the kernel to block immediately can help
> to improve the system performance.
> 
> But there are two main problems in current adaptive mutex. The first one is
> fairness, multiple spinners contend for the lock simultaneously and there
> is no any guarantee for a spinner to acquire the lock no matter how long it
> has been waiting for. The other is the heavy cache line bouncing. Since the
> cache line including the lock is shared among all of the spinners, when
> lock is released, each spinner will try to acquire the lock via cmpxchg
> instruction which constantly floods the system via "read-for-ownership"
> requests. As a result, there will be a lot of cache line bouncing in a
> large system with a lots of CPUs.
> 
> To address the problems mentioned above, the idea for queuing mutex
> spinners with MCS lock[1] referred to the implementation of mutex[2] in
> kernel is proposed and a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP is
> introduced. Compare to adaptive mutex (read only while spinning), the test
> result on Intel 2 sockets Skylake platform has showns significant
> performance improvement (See the first patch for details).
> 
> Though, queuing spinner with mcs lock can help to improve the performance
> of adaptive mutex when multiple threads contending for a lock, people
> probably want to know how queue spinner mutex performs when compare to
> other lock discipline widely knowns as pthread spin lock and pthread mutex.
> We can see the details of the test result in the first patch. Simply, some
> conclusion is summarized as below:
> a) In case of little lock contention, spin lock performs best, queue
> spinner mutex performs similar to adaptive mutex, and both perform a
> little better than pthread mutex.
> b) In the case of severe lock contention with large number of CPUs when
> protecting a small critical section (less than 1000ns). Most of lock
> acquisition is got via spinning. Queue spinner mutex.
> performs much better than spin lock and adaptive mutex. This is because the
> overhead of heavy cache line bouncing plays a big role on lock performance.
> c) With the increase of the size of a critical section, the advantage of
> queue spinner mutex on performance in reduced gradually. This is because
> the overhead of cache line bouncing will not become the bottleneck of lock
> performance, instead, the overhead of futex_wait and futex_wake
> plays a big role. When the critical section is increased to 1ms, even the
> latency of futex syscall would be ignorable compared to the total time of
> lock acquisition.
> 
> As we can see above, queue spinner mutex performs well in kinds of workload,
> but there would be a potential risk to use this type of mutex. When the
> lock holder is transformed to the next spinner in the queue, but it is not
> running (the CPU is scheduled to run other task). Thus, the other spinners
> have to wait in the queue, this would probably collapse the lock performance.
> To emulate this case, we run two same processes simultaneously, the
> process has 28 threads each of which sets CPU affinity to an individual CPU
> according to the thread id. Thus, CPU [0~27] are subscribed by two threads.
> In the worst case (s=1000ns, t=6000ns), the lock performance is reduced by
> 58.1% (2205245->924263).
> Therefore, queue spinner mutex would be carefully used for applications to
> pursue fairness and performance without oversubscribing CPU resource. E.g.
> Containers in public cloud infrastructure.
> 
> To overcome the disadvantage of lock performance collapse in the legacy MCS
> lock when CPUs are oversubscribed, we proposed an enhanced MCS lock
> algorithm in the second patch which allows a spinner to spin on mutex lock
> without having to be waken up by its previous spinner. In such case, this
> new mutex type would be widely and safely used in kinds of usage scenarios.
> 
> ChangeLog:
>    V1->V2
>    a) Propose an enhanced MCS lock algrithm to avoid potential lock
>    performance collapse.
>    b) Add a link of the paper which introduces original MCS lock.
>    c) Add the commit id for mutex implementation with MCS lock in kernel.
> 
> [1] Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for scalable
> synchronization on shared-memory multiprocessors." ACM Transactions on
> Computer Systems (TOCS) 9.1 (1991): 21-65.
> https://www.cos.ufrj.br/~ines/courses/progpar01/papers/p21-mellor-
> crummey.pdf
> [2] Commit id for mutex with MCS lock in kernel:
> 2bd2c92cf07cc4a373bf316c75b78ac465fefd35
> 
> Kemi Wang (5):
>   Mutex: Queue spinners to reduce cache line bouncing and ensure
>     fairness
>   MCS lock: Enhance legacy MCS lock algorithm
>   Mutex: add unit tests for new type
>   BENCHMARK: add a benchmark for testing new type of mutex
>   Manual: Add manual for pthread mutex
> 
>  benchtests/Makefile                          |  4 +-
>  benchtests/bench-mutex-adaptive-thread.c     |  8 ++-
>  benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++
>  manual/Makefile                              |  2 +-
>  manual/mutex.texi                            | 68 +++++++++++++++++++++
>  nptl/Makefile                                |  8 +--
>  nptl/allocatestack.c                         |  2 +-
>  nptl/descr.h                                 | 26 ++++----
>  nptl/mcs_lock.c                              | 88 ++++++++++++++++++++++++++++
>  nptl/mcs_lock.h                              | 21 +++++++
>  nptl/nptl-init.c                             |  2 +-
>  nptl/pthreadP.h                              |  2 +-
>  nptl/pthread_mutex_init.c                    |  3 +-
>  nptl/pthread_mutex_lock.c                    | 40 ++++++++++++-
>  nptl/pthread_mutex_timedlock.c               | 35 ++++++++++-
>  nptl/pthread_mutex_trylock.c                 |  5 +-
>  nptl/pthread_mutex_unlock.c                  |  7 ++-
>  nptl/pthread_mutexattr_settype.c             |  2 +-
>  nptl/tst-initializers1.c                     | 11 ++--
>  nptl/tst-mutex5b.c                           |  2 +
>  nptl/tst-mutex7b.c                           |  2 +
>  sysdeps/nptl/bits/thread-shared-types.h      | 22 +++++--
>  sysdeps/nptl/pthread.h                       | 15 +++--
>  sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++
>  24 files changed, 350 insertions(+), 50 deletions(-)
>  create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
>  create mode 100644 manual/mutex.texi
>  create mode 100644 nptl/mcs_lock.c
>  create mode 100644 nptl/mcs_lock.h
>  create mode 100644 nptl/tst-mutex5b.c
>  create mode 100644 nptl/tst-mutex7b.c
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-08-02  0:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-13  6:55 [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP Kemi Wang
2018-07-13  6:55 ` [PATCH v2 2/5] MCS lock: Enhance legacy MCS lock algorithm Kemi Wang
2018-07-13  6:55 ` [PATCH v2 1/5] Mutex: Queue spinners to reduce cache line bouncing and ensure fairness Kemi Wang
2018-07-13  6:55 ` [PATCH v2 5/5] Manual: Add manual for pthread mutex Kemi Wang
2018-07-18  7:34   ` Rical Jasan
2018-07-18  7:56     ` Wang, Kemi
2018-07-13  6:55 ` [PATCH v2 4/5] BENCHMARK: add a benchmark for testing new type of mutex Kemi Wang
2018-07-13  6:55 ` [PATCH v2 3/5] Mutex: add unit tests for new type Kemi Wang
2018-08-02  0:38 ` [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP kemi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).