From: "H.J. Lu"
Date: Thu, 03 Jan 2019 21:28:00 -0000
Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
To: Rich Felker
Cc: Ma Ling, GNU C Library, "Lu, Hongjiu", "ling.ma", Wei Xiao

On Thu, Jan 3, 2019 at 1:21 PM Rich Felker wrote:
>
> On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker wrote:
> > >
> > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > > From: "ling.ma"
> > > >
> > > > On multi-socket systems, memory is shared across the entire
> > > > system.  Data access to the local socket is much faster than to a
> > > > remote socket, and data access to the local core is faster than
> > > > to sibling cores on the same socket.  For serialized workloads
> > > > with a conventional spinlock, when there is high spinlock
> > > > contention between threads, lock ping-pong among sockets becomes
> > > > the bottleneck and threads spend the majority of their time in
> > > > spinlock overhead.
> > > >
> > > > On multi-socket systems, the keys to our NUMA spinlock
> > > > performance are to minimize cross-socket traffic and to localize
> > > > the serialized workload to one core for execution.  The NUMA
> > > > spinlock rests mainly on the following approaches, which reduce
> > > > data movement and accelerate the critical section, ultimately
> > > > giving a significant performance improvement.
> > >
> > > I question whether this belongs in glibc.  It seems highly
> > > application- and kernel-specific.  Is there a reason you wouldn't
> > > prefer to implement and maintain it in a library for use in the
> > > kind of application that needs it?
> >
> > This is a good question.  On the other hand, the current spinlock in
> > glibc hasn't been changed for many years.  It doesn't scale on
> > today's hardware.  Having a scalable spinlock in glibc is desirable.
>
> "Scalable spinlock" is something of an oxymoron.  Spinlocks are for
> situations where contention is extremely rare, since they inherently
> blow up badly under contention.  If this is happening it means you
> wanted a mutex, not a spinlock.

The critical region serialized by the spinlock is very small.  The
overhead of a mutex is much bigger than the critical region itself.
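To illustrate the kind of design under discussion, here is a minimal
sketch of a generic two-level (hierarchical) lock: threads first win a
per-socket lock, and only one winner per socket contends for the global
lock, so the global cache line ping-pongs between at most one thread
per socket.  All names here are hypothetical; this shows the general
technique only, not the patch's actual implementation (the patch
additionally localizes the serialized workload to one core):

#include <stdatomic.h>
#include <sched.h>

#define MAX_SOCKETS 8  /* Assumed upper bound for this sketch.  */

struct numa_lock_sketch
{
  /* Real code would pad each flag to its own cache line to avoid
     false sharing; initialize every flag with ATOMIC_FLAG_INIT.  */
  atomic_flag global;
  atomic_flag local[MAX_SOCKETS];
};

static void
spin_acquire (atomic_flag *f)
{
  while (atomic_flag_test_and_set_explicit (f, memory_order_acquire))
    sched_yield ();  /* Real code would spin with a pause hint.  */
}

static void
numa_lock_acquire (struct numa_lock_sketch *l, unsigned int socket)
{
  spin_acquire (&l->local[socket]);  /* Intra-socket contention only.  */
  spin_acquire (&l->global);         /* One contender per socket.  */
}

static void
numa_lock_release (struct numa_lock_sketch *l, unsigned int socket)
{
  atomic_flag_clear_explicit (&l->global, memory_order_release);
  atomic_flag_clear_explicit (&l->local[socket], memory_order_release);
}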
> > > This looks like fragile duplication of stdio-like buffering logic
> > > that's not at all specific to this file.  Does glibc have a policy
> > > on whether things needing this should use stdio or some other
> > > shared code rather than open-coding it like this?
> >
> > This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c.  Should
> > it be exported in the GLIBC_PRIVATE namespace?
>
> Possibly.  I don't have a strong opinion here.  At least it's good to
> know it's copied from code elsewhere that's believed to be
> correct/safe.
>
> > > [...]
> > >
> > > > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller
> > > > +   should call numa_spinlock_free on the NUMA spinlock pointer
> > > > +   to free the memory when it is no longer needed.  */
> > > > +
> > > > +struct numa_spinlock *
> > > > +numa_spinlock_alloc (void)
> > > > +{
> > > > +  const size_t buffer_size = 1024;
> > > > +  char buffer[buffer_size];
> > > > +  char *buffer_end = buffer + buffer_size;
> > > > +  char *cp = buffer_end;
> > > > +  char *re = buffer_end;
> > > > +
> > > > +  const int flags = O_RDONLY | O_CLOEXEC;
> > > > +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > > > +  char *l;
> > > > +  unsigned int max_node = 0;
> > > > +  unsigned int node_count = 0;
> > > > +  if (fd != -1)
> > > > +    {
> > > > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > > > +      if (l != NULL)
> > > > +        do
> > > > +          {
> > > > +            char *endp;
> > > > +            unsigned long int n = strtoul (l, &endp, 10);
> > > > +            if (l == endp)
> > > > +              {
> > > > +                node_count = 1;
> > > > +                break;
> > > > +              }
> > > > +
> > > > +            unsigned long int m = n;
> > > > +            if (*endp == '-')
> > > > +              {
> > > > +                l = endp + 1;
> > > > +                m = strtoul (l, &endp, 10);
> > > > +                if (l == endp)
> > > > +                  {
> > > > +                    node_count = 1;
> > > > +                    break;
> > > > +                  }
> > > > +              }
> > > > +
> > > > +            node_count += m - n + 1;
> > > > +
> > > > +            if (m >= max_node)
> > > > +              max_node = m;
> > > > +
> > > > +            l = endp;
> > > > +            while (l < re && isspace (*l))
> > > > +              ++l;
> > > > +          }
> > > > +        while (l < re);
> > > > +
> > > > +      __close_nocancel_nostatus (fd);
> > > > +    }
> > > > +
> > > > +  /* NB: Some NUMA nodes may not be available; if the number of
> > > > +     NUMA nodes is 1, set the maximum NUMA node number to 0.  */
> > > > +  if (node_count == 1)
> > > > +    max_node = 0;
> > > > +
> > > > +  unsigned int max_cpu = 0;
> > > > +  unsigned int *physical_package_id_p = NULL;
> > > > +
> > > > +  if (max_node == 0)
> > > > +    {
> > > > +      /* There is at least 1 node.  */
> > > > +      node_count = 1;
> > > > +
> > > > +      /* If NUMA is disabled, use physical_package_id instead.  */
> > > > +      struct dirent **cpu_list;
> > > > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > > > +                            select_cpu, NULL);
> > > > +      if (nprocs > 0)
> > > > +        {
> > > > +          int i;
> > > > +          unsigned int *cpu_id_p = NULL;
> > > > +
> > > > +          /* Find the maximum CPU number.  */
> > > > +          if (posix_memalign ((void **) &cpu_id_p,
> > > > +                              __alignof__ (void *),
> > > > +                              nprocs * sizeof (unsigned int)) == 0)
> > >
> > > Using posix_memalign to get memory with the alignment of
> > > __alignof__ (void *) makes no sense.  All allocations via malloc
> > > are suitably aligned for any standard type.
> >
> > Does glibc prefer malloc over posix_memalign?
>
> If not, I think it *should*.  Use of posix_memalign suggests to the
> reader that some nonstandard alignment is needed, prompting one to
> ask "what?" only to figure out it was gratuitous.
>
> What's worse -- and I missed this on the first pass --
> posix_memalign's signature is broken by design and almost universally
> gets the programmer to invoke undefined behavior, which has happened
> here.  cpu_id_p has type unsigned int *, but due to the cast,
> posix_memalign accesses it as if it had type void *, which is an
> aliasing violation.  For this reason alone, I would go so far as to
> say the function should never be used unless there's no way to avoid
> it, and even then it should be wrapped with a function that *returns*
> the pointer as void * so that this error can't be made.

We will change it to malloc.
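For reference, a minimal sketch of the wrapper Rich describes, which
returns the pointer as void * so the caller can never pass a mistyped
pointer into posix_memalign (the name xposix_memalign is made up for
this example):

#include <errno.h>
#include <stdlib.h>

static void *
xposix_memalign (size_t alignment, size_t size)
{
  void *p = NULL;
  int err = posix_memalign (&p, alignment, size);
  if (err != 0)
    {
      errno = err;  /* posix_memalign reports errors via its return
                       value and does not set errno itself.  */
      return NULL;
    }
  return p;
}

A caller can then write "unsigned int *cpu_id_p
= xposix_memalign (alignment, size);" and rely on the implicit
conversion from void *, with no cast and no aliasing violation.  For
the patch itself, plain malloc is enough, since malloc already returns
memory suitably aligned for any standard type.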
> > > > +          if (snprintf (path, sizeof (path),
> > > > +                        "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > > > +                        d->d_name) > 0)
> > >
> > > Are these sysfs pathnames documented as stable/permanent by the
> > > kernel?
> >
> > I believe so.
>
> OK.  It is worth checking, though, since otherwise you have an
> interface that will suddenly break when things change on the kernel
> side.

Documentation/cputopology.txt has

===========================================
How CPU topology info is exported via sysfs
===========================================

Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures:

1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:

        physical package id of cpuX.  Typically corresponds to a
        physical socket number, but the actual value is architecture
        and platform dependent.

....
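For illustration, a minimal sketch of reading this attribute for one
CPU (the helper name is made up and error handling is minimal):

#include <stdio.h>

static int
read_physical_package_id (unsigned int cpu)
{
  char path[80];
  int id = -1;
  snprintf (path, sizeof (path),
            "/sys/devices/system/cpu/cpu%u/topology/physical_package_id",
            cpu);
  FILE *f = fopen (path, "r");
  if (f != NULL)
    {
      if (fscanf (f, "%d", &id) != 1)
        id = -1;
      fclose (f);
    }
  /* -1 if unavailable; the value itself is architecture and platform
     dependent, per the documentation above.  */
  return id;
}

--
H.J.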