From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Date: Thu, 03 Jan 2019 21:21:00 -0000
From: Rich Felker
To: "H.J. Lu"
Cc: Ma Ling, GNU C Library, "Lu, Hongjiu", "ling.ma", Wei Xiao
Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
Message-ID: <20190103212113.GV23599@brightrain.aerifal.cx>
References: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local>
 <20190103204338.GU23599@brightrain.aerifal.cx>
X-SW-Source: 2019-01/txt/msg00095.txt.bz2

On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> On Thu, Jan 3, 2019 at 12:43 PM Rich Felker wrote:
> >
> > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > From: "ling.ma"
> > >
> > > On multi-socket systems, memory is shared across the entire system.
> > > Data access to the local socket is much faster than to a remote
> > > socket, and data access to the local core is faster than to sibling
> > > cores on the same socket.  For serialized workloads with a
> > > conventional spinlock, when there is high spinlock contention
> > > between threads, lock ping-pong among sockets becomes the
> > > bottleneck and threads spend the majority of their time in
> > > spinlock overhead.
> > >
> > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > are to minimize cross-socket traffic as well as to localize the
> > > serialized workload to one core for execution.  The basic
> > > principles of the NUMA spinlock mainly consist of the following
> > > approaches, which reduce data movement and accelerate the critical
> > > section, eventually giving us a significant performance
> > > improvement.
> >
> > I question whether this belongs in glibc. It seems highly
> > application- and kernel-specific. Is there a reason you wouldn't
> > prefer to implement and maintain it in a library for use in the
> > kind of application that needs it?
>
> This is a good question. On the other hand, the current spinlock
> in glibc hasn't been changed for many years. It doesn't scale for
> today's hardware. Having a scalable spinlock in glibc is desirable.

"Scalable spinlock" is something of an oxymoron. Spinlocks are for
situations where contention is extremely rare, since they inherently
blow up badly under contention. If this is happening it means you
wanted a mutex, not a spinlock.

> > This looks like fragile duplication of stdio-like buffering logic
> > that's not at all specific to this file. Does glibc have a policy on
> > whether things needing this should use stdio or some other shared
> > code rather than open-coding it like this?
>
> This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c. Should it
> be exported in the GLIBC_PRIVATE namespace?

Possibly. I don't have a strong opinion here. At least it's good to
know it's copied from code elsewhere that's believed to be
correct/safe.

> > [...]
> >
> > > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller
> > > +   should call numa_spinlock_free on the NUMA spinlock pointer to
> > > +   free the memory when it is no longer needed.
> > > +   */
> > > +
> > > +struct numa_spinlock *
> > > +numa_spinlock_alloc (void)
> > > +{
> > > +  const size_t buffer_size = 1024;
> > > +  char buffer[buffer_size];
> > > +  char *buffer_end = buffer + buffer_size;
> > > +  char *cp = buffer_end;
> > > +  char *re = buffer_end;
> > > +
> > > +  const int flags = O_RDONLY | O_CLOEXEC;
> > > +  int fd = __open_nocancel ("/sys/devices/system/node/online",
> > > +                            flags);
> > > +  char *l;
> > > +  unsigned int max_node = 0;
> > > +  unsigned int node_count = 0;
> > > +  if (fd != -1)
> > > +    {
> > > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > > +      if (l != NULL)
> > > +        do
> > > +          {
> > > +            char *endp;
> > > +            unsigned long int n = strtoul (l, &endp, 10);
> > > +            if (l == endp)
> > > +              {
> > > +                node_count = 1;
> > > +                break;
> > > +              }
> > > +
> > > +            unsigned long int m = n;
> > > +            if (*endp == '-')
> > > +              {
> > > +                l = endp + 1;
> > > +                m = strtoul (l, &endp, 10);
> > > +                if (l == endp)
> > > +                  {
> > > +                    node_count = 1;
> > > +                    break;
> > > +                  }
> > > +              }
> > > +
> > > +            node_count += m - n + 1;
> > > +
> > > +            if (m >= max_node)
> > > +              max_node = m;
> > > +
> > > +            l = endp;
> > > +            while (l < re && isspace (*l))
> > > +              ++l;
> > > +          }
> > > +        while (l < re);
> > > +
> > > +      __close_nocancel_nostatus (fd);
> > > +    }
> > > +
> > > +  /* NB: Some NUMA nodes may not be available.  If the number of
> > > +     NUMA nodes is 1, set the maximum NUMA node number to 0.  */
> > > +  if (node_count == 1)
> > > +    max_node = 0;
> > > +
> > > +  unsigned int max_cpu = 0;
> > > +  unsigned int *physical_package_id_p = NULL;
> > > +
> > > +  if (max_node == 0)
> > > +    {
> > > +      /* There is at least 1 node.  */
> > > +      node_count = 1;
> > > +
> > > +      /* If NUMA is disabled, use physical_package_id instead.
> > > +         */
> > > +      struct dirent **cpu_list;
> > > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > > +                            select_cpu, NULL);
> > > +      if (nprocs > 0)
> > > +        {
> > > +          int i;
> > > +          unsigned int *cpu_id_p = NULL;
> > > +
> > > +          /* Find the maximum CPU number.  */
> > > +          if (posix_memalign ((void **) &cpu_id_p,
> > > +                              __alignof__ (void *),
> > > +                              nprocs * sizeof (unsigned int)) == 0)
> >
> > Using posix_memalign to get memory with the alignment of
> > __alignof__(void *) makes no sense. All allocations via malloc are
> > suitably aligned for any standard type.
>
> Does glibc prefer malloc over posix_memalign?

If not, I think it *should*. Use of posix_memalign suggests to the
reader that some nonstandard alignment is needed, prompting one to ask
"what?" only to figure out it was gratuitous.

What's worse -- and I missed this on the first pass -- posix_memalign's
signature is broken by design and almost universally gets the
programmer to invoke undefined behavior, which has happened here.
cpu_id_p has type unsigned int *, but due to the cast, posix_memalign
accesses it as if it had type void *, which is an aliasing violation.
For this reason alone, I would go so far as to say the function should
never be used unless there's no way to avoid it, and even then it
should be wrapped with a function that *returns* the pointer as void *
so that this error can't be made.

> > > +          if (snprintf (path, sizeof (path),
> > > +                        "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > > +                        d->d_name) > 0)
> >
> > Are these sysfs pathnames documented as stable/permanent by the
> > kernel?
>
> I believe so.

OK. It is worth checking though since otherwise you have an interface
that will suddenly break when things change on the kernel side.

Rich