Message-ID: <6eaaec4d0ae349eaf31de1239f27c01dc1f5b6a8.camel@redhat.com>
Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
From: Torvald Riegel
To: kemi, Rich Felker, "H.J. Lu"
Cc: Ma Ling, GNU C Library, "Lu, Hongjiu", "ling.ma", Wei Xiao
Date: Tue, 15 Jan 2019 12:37:00 -0000
In-Reply-To: <0b4620c1-a9c5-061e-9636-65d80655a6fd@intel.com>
References: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local>
 <20190103204338.GU23599@brightrain.aerifal.cx>
 <20190103212113.GV23599@brightrain.aerifal.cx>
 <5c2bf8859a412759aba26a21b317ea98f6ff8eaf.camel@redhat.com>
 <0b4620c1-a9c5-061e-9636-65d80655a6fd@intel.com>

On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
> > > "Scalable spinlock" is something of an oxymoron.
> > 
> > No, that's not true at all.  Most high-performance shared-memory
> > synchronization constructs (on typical HW we have today) will do some
> > kind of spinning (and back-off), and there's nothing wrong about it.
> > This can scale very well.
> > 
> > > Spinlocks are for situations where contention is extremely rare,
> > 
> > No, the question is rather whether the program needs blocking through
> > the OS (for performance, or for semantics such as PI) or not.  Energy
> > may be another factor.
> > For example, glibc's current mutexes don't scale well on short
> > critical sections because there's not enough spinning being done.
> 
> yes. That's why we need pthread.mutex.spin_count tunable interface before.

I don't think we need the tunable interface before that.  Where we need to
improve performance most is for applications that don't want to bother
tuning their mutexes -- that's where the broadest gains are overall, I
think.  In turn, that means that we need spinning and back-off that give
good average-case performance -- whether that's through automatic tuning of
those two things at runtime, or through static default values that we do
regular performance checks for in the glibc community.

From that perspective, the tunable interface is a nice addition that can
allow users to fine-tune the setting, but it's not how users would enable
it.

> But, that's not enough. When the tunable is not the bottleneck, the
> simple busy-waiting algorithm of the current adaptive mutex is the major
> negative factor that degrades mutex performance.

Note that I'm not advocating for focusing on just the adaptive mutex type.
IMO, adding this type was a mistake because whether to spin or not does not
affect the semantics of the mutexes.  Performance hints shouldn't be done
via a mutex's type, and all mutex implementations should consider spinning
at least a little.

If we just do something about the adaptive mutexes, then I guess this will
reach few users.  I believe most applications just don't use them, and the
current implementation of adaptive mutexes is so simplistic that there's
not much performance to be had by changing to adaptive mutexes (which is
another reason for it having few users).

> That's why I proposed to use an MCS-based spin-waiting algorithm for the
> adaptive mutex.
MCS-style spinning (ie, spinning on memory local to the spinning thread) is
helpful, but I think we should tackle spinning on global memory first (ie,
on a location in the mutex, which is shared by all the threads trying to
acquire it).  Of course, always including back-off.

> https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
> 
> Also, with a very small critical section in the workload, this new type
> of mutex with the GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like
> an MCS spinlock, and performs much better than the original spinlock.

I don't think we want to have a new type for that.  It may be useful for
experimenting with it, but it shouldn't be exposed to users as a stable
interface.

Also, have you experimented with different kinds/settings of exponential
back-off?  I just saw normal spinning in your implementation, no varying
amounts of back-off.  The performance comparison should include back-off
though, as that's one way to work around the contention problems (with a
bigger hammer than local spinning of course, but it can be effective
nonetheless, and faster in low-contention cases).  My guess is that a mix
of local spinning on memory shared by a few threads running on cores that
are close to each other would perform best (eg, similar to what's done in
flat combining).

> So, some day, if the adaptive mutex is tuned well enough, it should act
> like an MCS spinlock (or NUMA spinlock) if the workload has a small
> critical section, and perform like a normal mutex if the critical section
> is too big for spin-waiting.

I agree in some way, but I think that the adaptive mutex type should just
be an alias of the normal mutex type (for API compatibility reasons only).
And there could be other reasons than just critical-section size that
determine whether a thread should block using futexes or not.