From: Kemi Wang <kemi.wang@intel.com>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>,
	Florian Weimer <fweimer@redhat.com>, Rical Jason <rj@2c3t.io>,
	Carlos Donell <carlos@redhat.com>,
	Glibc alpha <libc-alpha@sourceware.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Andi Kleen <andi.kleen@intel.com>,
	Ying Huang <ying.huang@intel.com>, Aaron Lu <aaron.lu@intel.com>,
	Lu Aubrey <aubrey.li@intel.com>, Kemi Wang <kemi.wang@intel.com>
Subject: [PATCH v3 3/3] Mutex: Replace trylock by read only while spinning
Date: Wed, 23 May 2018 09:25:00 -0000
Message-ID: <1527067354-13333-3-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1527067354-13333-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block. But in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (cmpxchg)
while the lock is still contended. That is not a good idea on many
targets, because it forces expensive memory synchronization among
processors and penalizes other running threads: the spinners constantly
flood the system with "read for ownership" requests, which are much more
expensive to process than a single read. Thus, spin using a relaxed MO
read until the lock is observed to be free, and only then try to acquire
it, as suggested by Andi Kleen.
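
As an illustration of the idea (a minimal sketch using GCC's __atomic
builtins on a plain int lock word; the names are illustrative and not
the glibc-internal LLL_MUTEX_* macros):

    /* "Test and test-and-set": wait with relaxed loads, which only read
       the cache line, and issue the expensive cmpxchg (a "read for
       ownership" request) only once the lock word has been observed to
       be 0.  */
    static void
    spin_lock_sketch (int *lock)
    {
      int expected = 0;
      while (!__atomic_compare_exchange_n (lock, &expected, 1, 0,
                                           __ATOMIC_ACQUIRE,
                                           __ATOMIC_RELAXED))
        {
          /* Wait read-only until the lock appears free.  */
          while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
            continue;  /* A CPU pause/yield hint belongs here.  */
          expected = 0;  /* The failed CAS stored the old value here.  */
        }
    }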

Performance impact:
No significant mutex performance improvement is expected from this patch
on its own. However, it should bring some benefit in scenarios with
severe lock contention on many architectures: overall system performance
can improve because the unnecessary "read for ownership" requests, which
stress the cache system via read-and-invalidate broadcasts, are
eliminated during spinning.
Meanwhile, there may be a tiny performance regression in lock-holder
handover when the lock is acquired via spinning, because the lock state
is now checked before the lock is acquired via trylock. In the worst
case, when the lock is available, the extra latency of a read and a
pause is added on top of the trylock.

A similar mechanism has already been implemented for the pthread spin lock.
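
(For reference, the spin lock variant follows roughly the condensed shape
below; this is a paraphrase using glibc's internal atomic wrappers, see
nptl/pthread_spin_lock.c for the actual code.)

    int
    pthread_spin_lock (pthread_spinlock_t *lock)
    {
      int val = 0;
      /* Fast path: a single atomic exchange grabs an uncontended lock.  */
      if (atomic_exchange_acquire (lock, 1) == 0)
        return 0;
      do
        {
          /* Wait with relaxed MO loads until the lock appears free.  */
          do
            {
              atomic_spin_nop ();
              val = atomic_load_relaxed (lock);
            }
          while (val != 0);
          /* VAL is 0 here, so the CAS expects an unlocked state.  */
        }
      while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));
      return 0;
    }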

Test machine:
2-socket Skylake platform, 112 cores, 62 GB RAM

Test case: mutex-adaptive-thread (Contended pthread adaptive spin mutex
with global update)
Usage: make bench BENCHSET=mutex-adaptive-thread
Test result:
+----------------+-----------------+-----------------+------------+
|  Configuration |      Base       |      Head       | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |           Critical section size: 1x            |
+----------------+------------------------------------------------+
|1 thread        |  2.76681e+07    |  2.7965e+07     |   +1.1%    |
|2 threads       |  3.29905e+07    |  3.55279e+07    |   +7.7%    |
|3 threads       |  4.38102e+07    |  3.98567e+07    |   -9.0%    |
|4 threads       |  1.72172e+07    |  2.09498e+07    |   +21.7%   |
|28 threads      |  1.03732e+07    |  1.05133e+07    |   +1.4%    |
|56 threads      |  1.06308e+07    |  1.21829e+07    |   +14.6%   |
|112 threads     |  8.55177e+06    |  1.02954e+07    |   +20.4%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 10x           |
+----------------+------------------------------------------------+
|1 thread        |  1.57006e+07    |  1.54727e+07    |   -1.5%    |
|2 threads       |  1.8044e+07     |  1.75601e+07    |   -2.7%    |
|3 threads       |  1.35634e+07    |  1.46384e+07    |   +7.9%    |
|4 threads       |  1.21257e+07    |  1.32046e+07    |   +8.9%    |
|28 threads      |  8.09593e+06    |  1.02713e+07    |   +26.9%   |
|56 threads      |  9.09907e+06    |  1.05913e+07    |   +16.4%   |
|112 threads     |  7.09731e+06    |  8.62406e+06    |   +21.5%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 100x          |
+----------------+------------------------------------------------+
|1 thread        |  2.87116e+06    | 2.89188e+06     |   +0.7%    |
|2 threads       |  2.23409e+06    | 2.24216e+06     |   +0.4%    |
|3 threads       |  2.29888e+06    | 2.29964e+06     |   +0.0%    |
|4 threads       |  2.26898e+06    | 2.21394e+06     |   -2.4%    |
|28 threads      |  1.03228e+06    | 1.0051e+06      |   -2.6%    |
|56 threads      |  1.02953e+06    | 1.00585e+06     |   -2.3%    |
|112 threads     |  1.01615e+06    | 1.00134e+06     |   -1.5%    |
+----------------+------------------------------------------------+
|                |           Critical section size: 1000x         |
+----------------+------------------------------------------------+
|1 thread        |  316392         |  315635         |   -0.2%    |
|2 threads       |  302806         |  303469         |   +0.2%    |
|3 threads       |  298506         |  294281         |   -1.4%    |
|4 threads       |  292037         |  289945         |   -0.7%    |
|28 threads      |  155188         |  155250         |   +0.0%    |
|56 threads      |  190657         |  183106         |   -4.0%    |
|112 threads     |  210818         |  220342         |   +4.5%    |
+----------------+-----------------+-----------------+------------+

    * nptl/pthread_mutex_lock.c (__pthread_mutex_lock): Replace trylock
      with a relaxed MO read while spinning.

ChangeLog:
    v2->v3:
    a) Drop the idea of blocking spinners when they fail to acquire the
       lock, since it would not be a universal winner.  E.g., when
       several threads contend for a lock that protects a small critical
       section, each of them can probably acquire the lock via spinning.
    b) Fix the formatting issues, as far as I can tell.

    v1->v2: Fix formatting issues.

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 ChangeLog                 |  5 +++++
 nptl/pthread_mutex_lock.c | 41 +++++++++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index e2991e9..3bafb0e 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,10 @@
 2018-05-23  Kemi Wang <kemi.wang@intel.com>
 
+	* nptl/pthread_mutex_lock.c: Replace trylock by read only while
+	spinning.
+
+2018-05-23  Kemi Wang <kemi.wang@intel.com>
+
 	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for adaptive
 	spin mutex.
 	* benchmark/Makefile: Add adaptive spin mutex benchmark.
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..7ce50f6 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <pthread_mutex_conf.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -123,22 +124,30 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
 
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
-	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
-	  do
-	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
-	    }
-	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
-
-	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
-	}
+	  int val = 0;
+	  int cnt = 0;
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			     mutex->__data.__spins * 2 + 10);
+
+	  do
+	    {
+	      if (cnt >= max_cnt)
+		{
+		  LLL_MUTEX_LOCK (mutex);
+		  break;
+		}
+	      /* Read only while spinning until the lock is available.  */
+	      do
+		{
+		  atomic_spin_nop ();
+		  val = atomic_load_relaxed (&mutex->__data.__lock);
+		}
+	      while (val != 0 && ++cnt < max_cnt);
+	    }
+	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
+
+	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+	}
       assert (mutex->__data.__owner == 0);
     }
   else
-- 
2.7.4

