From: "H.J. Lu"
Date: Tue, 03 Jul 2018 12:46:00 -0000
Subject: Re: [PATCH v6 3/3] Mutex: Replace trylock by read only while spinning
To: Kemi Wang
Cc: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
 Glibc alpha, Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
 Lu Aubrey

On Mon, Jul 2, 2018 at 1:27 AM, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before
> calling into the kernel to block.  But in the current implementation of
> spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (a cmpxchg)
> while the lock is contended.  This is not a good idea on many targets,
> because it forces expensive memory synchronization among processors and
> penalizes other running threads.  For example, it constantly floods the
> system with "read for ownership" requests, which are much more expensive
> to process than a single read.  Thus, we use only a relaxed-MO read until
> we observe that the lock is no longer held, as suggested by Andi Kleen.
>
> Performance impact:
> No significant mutex performance improvement is expected from this patch,
> though it probably brings some benefit in scenarios with severe lock
> contention on many architectures.  Whole-system performance can benefit
> from this modification, because it eliminates the unnecessary "read for
> ownership" requests that stress the cache system by broadcasting cache
> line invalidations during spinning.
> Meanwhile, it may cause a tiny performance regression in lock holder
> handover when the lock is acquired via spinning, because the lock state
> is checked before acquiring the lock via trylock.
>
> A similar mechanism has already been implemented for pthread spin locks.
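In other words, a minimal stand-alone sketch of the read-only spin idea, using C11 atomics in place of glibc's internal LLL_MUTEX_* and atomic_* macros; the function name, the spin budget, and the keep-spinning fallback below are invented for illustration and are not part of the patch:

#include <stdatomic.h>

/* Stand-in for the glibc.mutex.spin_count tunable.  */
#define MAX_SPIN_COUNT 100

static void
adaptive_spin_lock (atomic_int *lock)
{
  int cnt = 0;

  for (;;)
    {
      /* Pay for the read-for-ownership cmpxchg only on the first attempt
         or after the lock was last observed to be free.  */
      int expected = 0;
      if (atomic_compare_exchange_weak (lock, &expected, 1))
        return;                 /* Lock acquired.  */

      if (cnt >= MAX_SPIN_COUNT)
        cnt = 0;                /* glibc would block in the kernel here.  */

      /* Read-only spinning: relaxed loads keep the cache line in shared
         state instead of bouncing it between cores on every cmpxchg.  */
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0
             && ++cnt < MAX_SPIN_COUNT)
        ;                       /* A pause hint (atomic_spin_nop) belongs here.  */
    }
}

Only the thread that actually observes the lock as free goes back to the cmpxchg, which is the cache-traffic reduction the quoted description is after.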
>
> Test machine:
> 2-socket Skylake platform, 112 cores with 62G RAM
>
> Test case: mutex-adaptive-thread (contended pthread adaptive spin mutex
> with global update)
> Usage: make bench BENCHSET=mutex-adaptive-thread
> Test result:
> +----------------+-----------------+-----------------+------------+
> | Configuration  |      Base       |      Head       |  % Change  |
> |                | Total iteration | Total iteration | base->head |
> +----------------+-----------------+-----------------+------------+
> |                |           Critical section size: 1x            |
> +----------------+------------------------------------------------+
> |1 thread        | 2.76681e+07     | 2.7965e+07      |  +1.1%     |
> |2 threads       | 3.29905e+07     | 3.55279e+07     |  +7.7%     |
> |3 threads       | 4.38102e+07     | 3.98567e+07     |  -9.0%     |
> |4 threads       | 1.72172e+07     | 2.09498e+07     |  +21.7%    |
> |28 threads      | 1.03732e+07     | 1.05133e+07     |  +1.4%     |
> |56 threads      | 1.06308e+07     | 5.06522e+07     |  +14.6%    |
> |112 threads     | 8.55177e+06     | 1.02954e+07     |  +20.4%    |
> +----------------+------------------------------------------------+
> |                |          Critical section size: 10x            |
> +----------------+------------------------------------------------+
> |1 thread        | 1.57006e+07     | 1.54727e+07     |  -1.5%     |
> |2 threads       | 1.8044e+07      | 1.75601e+07     |  -2.7%     |
> |3 threads       | 1.35634e+07     | 1.46384e+07     |  +7.9%     |
> |4 threads       | 1.21257e+07     | 1.32046e+07     |  +8.9%     |
> |28 threads      | 8.09593e+06     | 1.02713e+07     |  +26.9%    |
> |56 threads      | 9.09907e+06     | 4.16203e+07     |  +16.4%    |
> |112 threads     | 7.09731e+06     | 8.62406e+06     |  +21.5%    |
> +----------------+------------------------------------------------+
> |                |          Critical section size: 100x           |
> +----------------+------------------------------------------------+
> |1 thread        | 2.87116e+06     | 2.89188e+06     |  +0.7%     |
> |2 threads       | 2.23409e+06     | 2.24216e+06     |  +0.4%     |
> |3 threads       | 2.29888e+06     | 2.29964e+06     |  +0.0%     |
> |4 threads       | 2.26898e+06     | 2.21394e+06     |  -2.4%     |
> |28 threads      | 1.03228e+06     | 1.0051e+06      |  -2.6%     |
> |56 threads      | 1.02953e+06     | 1.6344e+07      |  -2.3%     |
> |112 threads     | 1.01615e+06     | 1.00134e+06     |  -1.5%     |
> +----------------+------------------------------------------------+
> |                |         Critical section size: 1000x           |
> +----------------+------------------------------------------------+
> |1 thread        | 316392          | 315635          |  -0.2%     |
> |2 threads       | 302806          | 303469          |  +0.2%     |
> |3 threads       | 298506          | 294281          |  -1.4%     |
> |4 threads       | 292037          | 289945          |  -0.7%     |
> |28 threads      | 155188          | 155250          |  +0.0%     |
> |56 threads      | 190657          | 183106          |  -4.0%     |
> |112 threads     | 210818          | 220342          |  +4.5%     |
> +----------------+-----------------+-----------------+------------+
>
> * nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex
> * nptl/pthread_mutex_timedlock.c: Likewise
> * sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c: Enable read only
>   while spinning for the x86 architecture
> * sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c: Likewise
> * sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c: Likewise
>
> ChangeLog:
> V5->V6:
>  no change
>
> V4->V5:
>  a) Make the optimization work for pthread_mutex_timedlock() on the x86
>     architecture.
>  b) Move the READ_ONLY_SPIN macro definition from this patch to the
>     first patch, which adds the glibc.mutex.spin_count tunable entry.
>
> V3->V4:
>  a) Make the optimization opt-in, and enable it for the x86 architecture
>     by default, as suggested by Florian Weimer.
>
> V2->V3:
>  a) Drop the idea of blocking spinners if they fail to acquire the lock,
>     since this idea would not be a universal winner.  E.g. when
>     several threads contend for a lock that protects a small critical
>     section, probably any thread can acquire the lock via spinning.
>  b) Fix the format issue AFAIC
>
> V1->V2: fix format issue
>
> Suggested-by: Andi Kleen
> Signed-off-by: Kemi Wang
> ---
>  nptl/pthread_mutex_lock.c                             | 15 +++++++++++++++
>  nptl/pthread_mutex_timedlock.c                        | 15 +++++++++++++++
>  sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c |  1 +
>  sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c      |  1 +
>  sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c |  1 +
>  5 files changed, 33 insertions(+)
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 1519c14..9a7b5c2 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -133,7 +139,16 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>                    LLL_MUTEX_LOCK (mutex);
>                    break;
>                  }
> +#ifdef READ_ONLY_SPIN
> +              do
> +                {
> +                  atomic_spin_nop ();
> +                  val = atomic_load_relaxed (&mutex->__data.__lock);
> +                }
> +              while (val != 0 && ++cnt < max_cnt);
> +#else
>                atomic_spin_nop ();
> +#endif
>              }
>            while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>
> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> index 66efd39..dcaeca8 100644
> --- a/nptl/pthread_mutex_timedlock.c
> +++ b/nptl/pthread_mutex_timedlock.c
> @@ -126,7 +132,16 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>                                        PTHREAD_MUTEX_PSHARED (mutex));
>                    break;
>                  }
> +#ifdef READ_ONLY_SPIN
> +              do
> +                {
> +                  atomic_spin_nop ();
> +                  val = atomic_load_relaxed (&mutex->__data.__lock);
> +                }
> +              while (val != 0 && ++cnt < max_cnt);
> +#else
>                atomic_spin_nop ();
> +#endif
>              }
>            while (lll_trylock (mutex->__data.__lock) != 0);

Please define a generic version:

static inline void __always_inline
atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
{
  atomic_spin_nop ();
}

and an x86 version:

static inline void __always_inline
atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
{
  int val;

  do
    {
      atomic_spin_nop ();
      val = atomic_load_relaxed (&mutex->__data.__lock);
    }
  while (val != 0 && ++cnt < max_cnt);
}

in a standalone patch.

-- 
H.J.
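For concreteness, a stand-alone sketch of that generic/x86 split, with C11 atomics standing in for glibc's internal macros. The helper here returns the updated spin count so the caller's counter advances across iterations (a small deviation from the signature sketched above, where cnt is passed by value); all names are invented for this example rather than taken from the patch:

#include <stdatomic.h>

/* Generic variant: a single pause per attempt (the pause itself, glibc's
   atomic_spin_nop, is elided in this sketch).  */
static inline int
spin_step_generic (int cnt, int max_cnt, atomic_int *lock)
{
  (void) max_cnt;
  (void) lock;
  return cnt + 1;
}

/* x86-style variant: re-read the lock word with relaxed loads until it
   looks free or the spin budget runs out.  */
static inline int
spin_step_read_only (int cnt, int max_cnt, atomic_int *lock)
{
  int val;

  do
    val = atomic_load_explicit (lock, memory_order_relaxed);
  while (val != 0 && ++cnt < max_cnt);
  return cnt;
}

A per-architecture header would then select which variant backs the helper, and the adaptive loops in __pthread_mutex_lock and __pthread_mutex_timedlock could call it unconditionally in place of the #ifdef READ_ONLY_SPIN block.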