Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
To: Ma Ling , libc-alpha@sourceware.org
Cc: hongjiu.lu@intel.com, "ling.ma" , Wei Xiao
References: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local>
In-Reply-To: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local>
From: kemi
Date: Tue, 15 Jan 2019 02:56:00 -0000

On 2018/12/26 10:50 AM, Ma Ling wrote:
> From: "ling.ma"
>
> On multi-socket systems, memory is shared across the entire system.
> Data access to the local socket is much faster than the remote socket
> and data access to the local core is faster than sibling cores on the
> same socket. For serialized workloads with conventional spinlock,
> when there is high spinlock contention between threads, lock ping-pong
> among sockets becomes the bottleneck and threads spend majority of
> their time in spinlock overhead.
>
> On multi-socket systems, the keys to our NUMA spinlock performance
> are to minimize cross-socket traffic as well as localize the serialized
> workload to one core for execution. The basic principles of NUMA
> spinlock are mainly consisted of following approaches, which reduce
> data movement and accelerate critical section, eventually give us
> significant performance improvement.
>
> 1. MCS spinlock
> MCS spinlock help us to reduce the useless lock movement in the
> spinning state. This paper provides a good description for this
> kind of lock:

That's not accurate. Both the generic spinlock and the x86 version
already use the test-and-test_and_set technique to avoid useless lock
movement in the spinning state. See
glibc/nptl/pthread_spin_lock.c
glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S

What the MCS spinlock really helps with is speeding up lock release and
lock acquisition by eliminating most of the cache line bouncing among
waiters (a rough sketch contrasting the two is appended at the end of
this mail).

> NUMA spinlock can greatly speed up critical section on multi-socket
> systems. It should improve spinlock performance on all multi-socket
> systems.
>

It is beyond question that the NUMA spinlock helps a lot under heavy
lock contention. But we should also present data for the uncontended
and lightly contended cases. The extra code complexity is expected to
degrade lock performance a bit when contention is light, and I would
like to see the numbers for that.

Also, lock starvation would be possible if the running core is always
busy under heavy lock contention. More explanation of that case is
expected.
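
To make the distinction above concrete, here is a minimal,
self-contained sketch using C11 atomics. It is only an illustration of
the two techniques, not the actual glibc code or the code in this
patch; the ttas_*/mcs_* names and the qnode layout are mine. The TTAS
lock spins on a plain load so waiters stay quiet while the lock is
held, but every release still drags the lock's cache line through all
waiters; the MCS lock hands the lock to exactly one successor, which
spins on its own node.

/* Illustrative sketch only -- not the glibc implementation.  */

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Test-and-test_and_set (TTAS) spinlock, roughly the idea already
   used by pthread_spin_lock.  */

typedef atomic_int ttas_lock_t;

void
ttas_lock (ttas_lock_t *lock)
{
  while (atomic_exchange_explicit (lock, 1, memory_order_acquire))
    /* Spin on a read-only load ("test") until the lock looks free,
       then retry the atomic exchange ("test_and_set").  */
    while (atomic_load_explicit (lock, memory_order_relaxed))
      ;
}

void
ttas_unlock (ttas_lock_t *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
}

/* MCS queue lock: one qnode per waiting thread, each waiter spins
   only on its own node.  */

struct mcs_node
{
  _Atomic (struct mcs_node *) next;
  atomic_bool locked;
};

typedef _Atomic (struct mcs_node *) mcs_lock_t;  /* Tail of the queue.  */

void
mcs_lock (mcs_lock_t *lock, struct mcs_node *self)
{
  atomic_store_explicit (&self->next, NULL, memory_order_relaxed);
  atomic_store_explicit (&self->locked, true, memory_order_relaxed);

  struct mcs_node *prev
    = atomic_exchange_explicit (lock, self, memory_order_acq_rel);
  if (prev == NULL)
    return;  /* Lock was free; we own it.  */

  /* Enqueue behind PREV and spin only on our own node.  */
  atomic_store_explicit (&prev->next, self, memory_order_release);
  while (atomic_load_explicit (&self->locked, memory_order_acquire))
    ;
}

void
mcs_unlock (mcs_lock_t *lock, struct mcs_node *self)
{
  struct mcs_node *next
    = atomic_load_explicit (&self->next, memory_order_acquire);
  if (next == NULL)
    {
      /* No visible successor: try to swing the tail back to NULL.  */
      struct mcs_node *expected = self;
      if (atomic_compare_exchange_strong_explicit
          (lock, &expected, NULL,
           memory_order_acq_rel, memory_order_acquire))
        return;
      /* A successor is enqueueing; wait until it links itself in.  */
      while ((next = atomic_load_explicit (&self->next,
                                           memory_order_acquire)) == NULL)
        ;
    }
  /* Hand the lock directly to the successor: only that waiter's
     cache line is touched on release.  */
  atomic_store_explicit (&next->locked, false, memory_order_release);
}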