Message-ID: <6eaaec4d0ae349eaf31de1239f27c01dc1f5b6a8.camel@redhat.com>
Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
From: Torvald Riegel
To: kemi, Rich Felker, "H.J. Lu"
Cc: Ma Ling, GNU C Library, "Lu, Hongjiu", "ling.ma", Wei Xiao
Date: Tue, 15 Jan 2019 12:37:00 -0000
In-Reply-To: <0b4620c1-a9c5-061e-9636-65d80655a6fd@intel.com>
References: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local>
 <20190103204338.GU23599@brightrain.aerifal.cx>
 <20190103212113.GV23599@brightrain.aerifal.cx>
 <5c2bf8859a412759aba26a21b317ea98f6ff8eaf.camel@redhat.com>
 <0b4620c1-a9c5-061e-9636-65d80655a6fd@intel.com>

On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
> > > "Scalable spinlock" is something of an oxymoron.
> > 
> > No, that's not true at all.  Most high-performance shared-memory
> > synchronization constructs (on typical HW we have today) will do some
> > kind of spinning (and back-off), and there's nothing wrong about it.
> > This can scale very well.
> > 
> > > Spinlocks are for situations where contention is extremely rare,
> > 
> > No, the question is rather whether the program needs blocking through
> > the OS (for performance, or for semantics such as PI) or not.  Energy
> > may be another factor.
> > For example, glibc's current mutexes don't scale well on short
> > critical sections because there's not enough spinning being done.
> 
> yes. That's why we need pthread.mutex.spin_count tunable interface before.

I don't think we need the tunable interface before that.  Where we need to
improve performance most is for applications that don't want to bother
tuning their mutexes -- that's where the broadest gains are overall, I
think.  In turn, that means that we need spinning and back-off that give
good average-case performance -- whether that's through automatic tuning of
those two things at runtime, or through static default values that we do
regular performance checks for in the glibc community.

From that perspective, the tunable interface is a nice addition that can
allow users to fine-tune the setting, but it's not how users would enable
it.

> But, that's not enough. When the tunable is not the bottleneck, the
> simple busy-waiting algorithm of the current adaptive mutex is the major
> negative factor that degrades mutex performance.

Note that I'm not advocating for focusing on just the adaptive mutex type.
IMO, adding this type was a mistake because whether to spin or not does not
affect the semantics of the mutexes.  Performance hints shouldn't be done
via a mutex's type, and all mutex implementations should consider spinning
at least a little.

If we just do something about the adaptive mutexes, then I guess this will
reach few users.  I believe most applications just don't use them, and the
current implementation of adaptive mutexes is so simplistic that there's
not much performance to be had by changing to adaptive mutexes (which is
another reason for it having few users).

> That's why I proposed to use an MCS-based spin-waiting algorithm for the
> adaptive mutex.
MCS-style spinning (ie, spinning on memory local to the spinning thread) is
helpful, but I think we should tackle spinning on global memory first (ie,
on a location in the mutex, which is shared by all the threads trying to
acquire it).  Of course, always including back-off.

> https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
> 
> Also, with a very small critical section in the workload, this new type
> of mutex with the GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like
> an MCS spinlock, and performs much better than the original spinlock.

I don't think we want to have a new type for that.  It may be useful for
experimenting with it, but it shouldn't be exposed to users as a stable
interface.

Also, have you experimented with different kinds/settings of exponential
back-off?  I just saw normal spinning in your implementation, no varying
amounts of back-off.  The performance comparison should include back-off
though, as that's one way to work around the contention problems (with a
bigger hammer than local spinning of course, but it can be effective
nonetheless, and faster in low-contention cases).  My guess is that a mix
of local spinning on memory shared by a few threads running on cores that
are close to each other would perform best (eg, similar to what's done in
flat combining).

> So, some day, if the adaptive mutex is tuned well enough, it should act
> like an MCS spinlock (or NUMA spinlock) if the workload has a small
> critical section, and perform like a normal mutex if the critical section
> is too big for spin-waiting.

I agree in some way, but I think that the adaptive mutex type should just
be an alias of the normal mutex type (for API compatibility reasons only).
And there could be other reasons than just critical-section size that
determine whether a thread should block using futexes or not.