From: "H.J. Lu" <hjl.tools@gmail.com>
To: Rich Felker <dalias@libc.org>
Cc: Ma Ling <ling.ma.program@gmail.com>,
GNU C Library <libc-alpha@sourceware.org>,
"Lu, Hongjiu" <hongjiu.lu@intel.com>,
"ling.ma" <ling.ml@antfin.com>, Wei Xiao <wei3.xiao@intel.com>
Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
Date: Thu, 03 Jan 2019 21:28:00 -0000
Message-ID: <CAMe9rOrCYqQpJ=2cNpQudO0gN3ocn1pKLd7q7tLcu4=xO+C+dw@mail.gmail.com>
In-Reply-To: <20190103212113.GV23599@brightrain.aerifal.cx>
On Thu, Jan 3, 2019 at 1:21 PM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > > From: "ling.ma" <ling.ml@antfin.com>
> > > >
> > > > On multi-socket systems, memory is shared across the entire system.
> > > > Data access to the local socket is much faster than the remote socket
> > > > and data access to the local core is faster than sibling cores on the
> > > > same socket. For serialized workloads with a conventional spinlock,
> > > > when there is high spinlock contention between threads, lock ping-pong
> > > > among sockets becomes the bottleneck and threads spend the majority
> > > > of their time in spinlock overhead.
> > > >
> > > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > > are to minimize cross-socket traffic as well as localize the serialized
> > > > workload to one core for execution. The NUMA spinlock rests on the
> > > > following approaches, which reduce data movement and accelerate the
> > > > critical section, ultimately giving a significant performance
> > > > improvement.
> > >
> > > I question whether this belongs in glibc. It seems highly application-
> > > and kernel-specific. Is there a reason you wouldn't prefer to
> > > implement and maintain it in a library for use in the kind of
> > > application that needs it?
> >
> > This is a good question. On the other hand, the current spinlock
> > in glibc hasn't been changed for many years. It doesn't scale for
> > today's hardware. Having a scalable spinlock in glibc is desirable.
>
> "Scalable spinlock" is something of an oxymoron. Spinlocks are for
> situations where contention is extremely rare, since they inherently
> blow up badly under contention. If this is happening it means you
> wanted a mutex not a spinlock.
The critical region serialized by the spinlock is very small. The
overhead of a mutex is much bigger than the critical region itself.
> > > This looks like fragile duplication of stdio-like buffering logic
> > > that's not at all specific to this file. Does glibc have a policy on
> > > whether things needing this should use stdio or some other shared code
> > > rather than open-coding it like this?
> >
> > This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c. Should it
> > be exported in GLIBC_PRIVATE name space?
>
> Possibly. I don't have a strong opinion here. At least it's good to
> know it's copied from code elsewhere that's believed to be
> correct/safe.
>
> >
> > > > [...]
> > >
> > > > +/* Allocate a NUMA spinlock and return a pointer to it. Caller should
> > > > + call numa_spinlock_free on the NUMA spinlock pointer to free the
> > > > + memory when it is no longer needed. */
> > > > +
> > > > +struct numa_spinlock *
> > > > +numa_spinlock_alloc (void)
> > > > +{
> > > > + const size_t buffer_size = 1024;
> > > > + char buffer[buffer_size];
> > > > + char *buffer_end = buffer + buffer_size;
> > > > + char *cp = buffer_end;
> > > > + char *re = buffer_end;
> > > > +
> > > > + const int flags = O_RDONLY | O_CLOEXEC;
> > > > + int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > > > + char *l;
> > > > + unsigned int max_node = 0;
> > > > + unsigned int node_count = 0;
> > > > + if (fd != -1)
> > > > + {
> > > > + l = next_line (fd, buffer, &cp, &re, buffer_end);
> > > > + if (l != NULL)
> > > > + do
> > > > + {
> > > > + char *endp;
> > > > + unsigned long int n = strtoul (l, &endp, 10);
> > > > + if (l == endp)
> > > > + {
> > > > + node_count = 1;
> > > > + break;
> > > > + }
> > > > +
> > > > + unsigned long int m = n;
> > > > + if (*endp == '-')
> > > > + {
> > > > + l = endp + 1;
> > > > + m = strtoul (l, &endp, 10);
> > > > + if (l == endp)
> > > > + {
> > > > + node_count = 1;
> > > > + break;
> > > > + }
> > > > + }
> > > > +
> > > > + node_count += m - n + 1;
> > > > +
> > > > + if (m >= max_node)
> > > > + max_node = m;
> > > > +
> > > > + l = endp;
> > > > + while (l < re && isspace (*l))
> > > > + ++l;
> > > > + }
> > > > + while (l < re);
> > > > +
> > > > + __close_nocancel_nostatus (fd);
> > > > + }
> > > > +
> > > > + /* NB: Some NUMA nodes may not be available. If the number of NUMA
> > > > + nodes is 1, set the maximum NUMA node number to 0. */
> > > > + if (node_count == 1)
> > > > + max_node = 0;
> > > > +
> > > > + unsigned int max_cpu = 0;
> > > > + unsigned int *physical_package_id_p = NULL;
> > > > +
> > > > + if (max_node == 0)
> > > > + {
> > > > + /* There is at least 1 node. */
> > > > + node_count = 1;
> > > > +
> > > > + /* If NUMA is disabled, use physical_package_id instead. */
> > > > + struct dirent **cpu_list;
> > > > + int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > > > + select_cpu, NULL);
> > > > + if (nprocs > 0)
> > > > + {
> > > > + int i;
> > > > + unsigned int *cpu_id_p = NULL;
> > > > +
> > > > + /* Find the maximum CPU number. */
> > > > + if (posix_memalign ((void **) &cpu_id_p,
> > > > + __alignof__ (void *),
> > > > + nprocs * sizeof (unsigned int)) == 0)
> > >
> > > Using posix_memalign to get memory with the alignment of
> > > __alignof__(void*) makes no sense. All allocations via malloc are
> > > suitably aligned for any standard type.
> >
> > Does glibc prefer malloc over posix_memalign?
>
> If not, I think it *should*. Use of posix_memalign suggests to the
> reader that some nonstandard alignment is needed, prompting one to ask
> "what?" only to figure out it was gratuitous.
>
> What's worse -- and I missed this on the first pass --
> posix_memalign's signature is broken by design and almost universally
> gets the programmer to invoke undefined behavior, which has happened
> here. cpu_id_p has type unsigned int *, but due to the cast,
> posix_memalign accesses it as if it had type void *, which is an
> aliasing violation. For this reason alone, I would go so far as to say
> the function should never be used unless there's no way to avoid it,
> and even then it should be wrapped with a function that *returns* the
> pointer as void* so that this error can't be made.
We will change it to malloc.
> > > > + if (snprintf (path, sizeof (path),
> > > > + "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > > > + d->d_name) > 0)
> > >
> > > Are these sysfs pathnames documented as stable/permanent by the
> > > kernel?
> >
> > I believe so.
>
> OK. It is worth checking though since otherwise you have an interface
> that will suddenly break when things change on the kernel side.
>
Documentation/cputopology.txt has
===========================================
How CPU topology info is exported via sysfs
===========================================
Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures:
1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:
physical package id of cpuX. Typically corresponds to a physical
socket number, but the actual value is architecture and platform
dependent.
....
--
H.J.