Re: [PATCH] NUMA spinlock [BZ #23962]

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Rich Felker <dalias@libc.org>
To: Ma Ling <ling.ma.program@gmail.com>
Cc: libc-alpha@sourceware.org, hongjiu.lu@intel.com,
	"ling.ma" <ling.ml@antfin.com>, Wei Xiao <wei3.xiao@intel.com>
Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
Date: Thu, 03 Jan 2019 20:43:00 -0000	[thread overview]
Message-ID: <20190103204338.GU23599@brightrain.aerifal.cx> (raw)
In-Reply-To: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local>

On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> From: "ling.ma" <ling.ml@antfin.com>
> 
> On multi-socket systems, memory is shared across the entire system.
> Data access to the local socket is much faster than the remote socket
> and data access to the local core is faster than sibling cores on the
> same socket.  For serialized workloads with conventional spinlock,
> when there is high spinlock contention between threads, lock ping-pong
> among sockets becomes the bottleneck and threads spend majority of
> their time in spinlock overhead.
> 
> On multi-socket systems, the keys to our NUMA spinlock performance
> are to minimize cross-socket traffic as well as localize the serialized
> workload to one core for execution.  The basic principles of NUMA
> spinlock are mainly consisted of following approaches, which reduce
> data movement and accelerate critical section, eventually give us
> significant performance improvement.

I question whether this belongs in glibc. It seems highly application-
and kernel-specific. Is there a reason you wouldn't prefer to
implement and maintain it in a library for use in the kind of
application that needs it?

Some specific review inline below:

> [...]
> diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> new file mode 100644
> index 0000000..8ff4e1a
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> @@ -0,0 +1,304 @@
> +/* Initialization of NUMA spinlock.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <assert.h>
> +#include <ctype.h>
> +#include <string.h>
> +#include <dirent.h>
> +#include <stdio.h>
> +#include <limits.h>
> +#ifdef _LIBC
> +# include <not-cancel.h>
> +#else
> +# include <stdlib.h>
> +# include <unistd.h>
> +# include <fcntl.h>
> +# define __open_nocancel		open
> +# define __close_nocancel_nostatus	close
> +# define __read_nocancel		read
> +#endif
> +
> +#include "numa-spinlock-private.h"
> +
> +static char *
> +next_line (int fd, char *const buffer, char **cp, char **re,
> +	   char *const buffer_end)
> +{
> +  char *res = *cp;
> +  char *nl = memchr (*cp, '\n', *re - *cp);
> +  if (nl == NULL)
> +    {
> +      if (*cp != buffer)
> +	{
> +	  if (*re == buffer_end)
> +	    {
> +	      memmove (buffer, *cp, *re - *cp);
> +	      *re = buffer + (*re - *cp);
> +	      *cp = buffer;
> +
> +	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
> +	      if (n < 0)
> +		return NULL;
> +
> +	      *re += n;
> +
> +	      nl = memchr (*cp, '\n', *re - *cp);
> +	      while (nl == NULL && *re == buffer_end)
> +		{
> +		  /* Truncate too long lines.  */
> +		  *re = buffer + 3 * (buffer_end - buffer) / 4;
> +		  n = __read_nocancel (fd, *re, buffer_end - *re);
> +		  if (n < 0)
> +		    return NULL;
> +
> +		  nl = memchr (*re, '\n', n);
> +		  **re = '\n';
> +		  *re += n;
> +		}
> +	    }
> +	  else
> +	    nl = memchr (*cp, '\n', *re - *cp);
> +
> +	  res = *cp;
> +	}
> +
> +      if (nl == NULL)
> +	nl = *re - 1;
> +    }
> +
> +  *cp = nl + 1;
> +  assert (*cp <= *re);
> +
> +  return res == *re ? NULL : res;
> +}

This looks like fragile duplication of stdio-like buffering logic
that's not at all specific to this file. Does glibc have a policy on
whether things needing this should use stdio or some other shared code
rather than open-coding it like this?

> [...]

> +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> +   memory when it is no longer needed.  */
> +
> +struct numa_spinlock *
> +numa_spinlock_alloc (void)
> +{
> +  const size_t buffer_size = 1024;
> +  char buffer[buffer_size];
> +  char *buffer_end = buffer + buffer_size;
> +  char *cp = buffer_end;
> +  char *re = buffer_end;
> +
> +  const int flags = O_RDONLY | O_CLOEXEC;
> +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> +  char *l;
> +  unsigned int max_node = 0;
> +  unsigned int node_count = 0;
> +  if (fd != -1)
> +    {
> +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> +      if (l != NULL)
> +	do
> +	  {
> +	    char *endp;
> +	    unsigned long int n = strtoul (l, &endp, 10);
> +	    if (l == endp)
> +	      {
> +		node_count = 1;
> +		break;
> +	      }
> +
> +	    unsigned long int m = n;
> +	    if (*endp == '-')
> +	      {
> +		l = endp + 1;
> +		m = strtoul (l, &endp, 10);
> +		if (l == endp)
> +		  {
> +		    node_count = 1;
> +		    break;
> +		  }
> +	      }
> +
> +	    node_count += m - n + 1;
> +
> +	    if (m >= max_node)
> +	      max_node = m;
> +
> +	    l = endp;
> +	    while (l < re && isspace (*l))
> +	      ++l;
> +	  }
> +	while (l < re);
> +
> +      __close_nocancel_nostatus (fd);
> +    }
> +
> +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> +     nodes is 1, set the maximium NUMA node number to 0.  */
> +  if (node_count == 1)
> +    max_node = 0;
> +
> +  unsigned int max_cpu = 0;
> +  unsigned int *physical_package_id_p = NULL;
> +
> +  if (max_node == 0)
> +    {
> +      /* There is at least 1 node.  */
> +      node_count = 1;
> +
> +      /* If NUMA is disabled, use physical_package_id instead.  */
> +      struct dirent **cpu_list;
> +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> +			    select_cpu, NULL);
> +      if (nprocs > 0)
> +	{
> +	  int i;
> +	  unsigned int *cpu_id_p = NULL;
> +
> +	  /* Find the maximum CPU number.  */
> +	  if (posix_memalign ((void **) &cpu_id_p,
> +			      __alignof__ (void *),
> +			      nprocs * sizeof (unsigned int)) == 0)

Using posix_memalign to get memory with the alignment of
__alignof__(void*) makes no sense. All allocations via malloc are
suitably aligned for any standard type.

> +	    {
> +	      for (i = 0; i < nprocs; i++)
> +		{
> +		  unsigned int cpu_id
> +		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
> +		  cpu_id_p[i] = cpu_id;
> +		  if (cpu_id > max_cpu)
> +		    max_cpu = cpu_id;
> +		}
> +
> +	      if (posix_memalign ((void **) &physical_package_id_p,
> +				  __alignof__ (void *),
> +				  ((max_cpu + 1)
> +				   * sizeof (unsigned int))) == 0)

Again.

> +		{
> +		  memset (physical_package_id_p, 0,
> +			  ((max_cpu + 1) * sizeof (unsigned int)));
> +
> +		  max_node = UINT_MAX;
> +
> +		  /* Get physical_package_id.  */
> +		  char path[(sizeof ("/sys/devices/system/cpu")
> +			     + 3 * sizeof (unsigned long int)
> +			     + sizeof ("/topology/physical_package_id"))];
> +		  for (i = 0; i < nprocs; i++)
> +		    {
> +		      struct dirent *d = cpu_list[i];
> +		      if (snprintf (path, sizeof (path),
> +				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
> +				    d->d_name) > 0)

Are these sysfs pathnames documented as stable/permanent by the
kernel?

> +			{
> +			  fd = __open_nocancel (path, flags);
> +			  if (fd != -1)
> +			    {
> +			      if (__read_nocancel (fd, buffer,
> +						   buffer_size) > 0)
> +				{
> +				  char *endp;
> +				  unsigned long int package_id
> +				    = strtoul (buffer, &endp, 10);
> +				  if (package_id != ULONG_MAX
> +				      && *buffer != '\0'
> +				      && (*endp == '\0' || *endp == '\n'))
> +				    {
> +				      physical_package_id_p[cpu_id_p[i]]
> +					= package_id;
> +				      if (max_node == UINT_MAX)
> +					{
> +					  /* This is the first node.  */
> +					  max_node = package_id;
> +					}
> +				      else if (package_id != max_node)
> +					{
> +					  /* NB: We only need to know if
> +					     NODE_COUNT > 1.  */
> +					  node_count = 2;
> +					  if (package_id > max_node)
> +					    max_node = package_id;
> +					}
> +				    }
> +				}
> +			      __close_nocancel_nostatus (fd);
> +			    }
> +			}
> +
> +		      free (d);
> +		    }
> +		}
> +
> +	      free (cpu_id_p);
> +	    }
> +	  else
> +	    {
> +	      for (i = 0; i < nprocs; i++)
> +		free (cpu_list[i]);
> +	    }
> +
> +	  free (cpu_list);
> +	}
> +    }
> +
> +  if (physical_package_id_p != NULL && node_count == 1)
> +    {
> +      /* There is only one node.  No need for physical_package_id_p.  */
> +      free (physical_package_id_p);
> +      physical_package_id_p = NULL;
> +      max_cpu = 0;
> +    }
> +
> +  /* Allocate an array of struct numa_spinlock_info pointers to hold info
> +     for all NUMA nodes with NUMA node number from getcpu () as index.  */
> +  size_t size = (sizeof (struct numa_spinlock)
> +		 + ((max_node + 1)
> +		    * sizeof (struct numa_spinlock_info *)));
> +  struct numa_spinlock *lock;
> +  if (posix_memalign ((void **) &lock,
> +		      __alignof__ (struct numa_spinlock_info *), size))

Another gratuitous posix_memalign.

next prev parent reply	other threads:[~2019-01-03 20:43 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-26  9:51 Ma Ling
2019-01-03  4:05 ` 马凌(彦军)
     [not found]   ` <0a474516-b8c8-48cf-aeea-e57c77b78cbd.ling.ml@antfin.com>
2019-01-03  5:35     ` 转发：[PATCH] " 马凌(彦军)
2019-01-03 14:52       ` Szabolcs Nagy
2019-01-03 19:59         ` H.J. Lu
2019-01-05 12:34           ` [PATCH] " Carlos O'Donell
2019-01-05 16:36             ` H.J. Lu
2019-01-07 19:12               ` Florian Weimer
2019-01-07 19:49                 ` H.J. Lu
2019-01-10 16:31                   ` Carlos O'Donell
2019-01-10 16:32                     ` Florian Weimer
2019-01-10 16:41                       ` Carlos O'Donell
2019-01-10 17:52                         ` Szabolcs Nagy
2019-01-10 19:24                           ` Carlos O'Donell
2019-01-11 12:01                             ` kemi
2019-01-14 22:45                         ` Torvald Riegel
2019-01-15  9:32                           ` Florian Weimer
2019-01-15 12:01                             ` Torvald Riegel
2019-01-15 12:17                               ` Florian Weimer
2019-01-15 12:31                                 ` Torvald Riegel
2019-01-11 16:24                       ` H.J. Lu
2019-01-14 23:03             ` Torvald Riegel
2019-01-04  4:13         ` 转发：[PATCH] " 马凌(彦军)
2019-01-03 20:43 ` Rich Felker [this message]
2019-01-03 20:55   ` [PATCH] " H.J. Lu
2019-01-03 21:21     ` Rich Felker
2019-01-03 21:28       ` H.J. Lu
2019-01-14 23:18       ` Torvald Riegel
2019-01-15  2:33         ` kemi
2019-01-15 12:37           ` Torvald Riegel
2019-01-15 16:44             ` Rich Felker
2019-01-17  3:10             ` kemi
2019-02-04 17:23               ` Torvald Riegel
2019-01-14 22:40     ` Torvald Riegel
2019-01-14 23:26 ` Torvald Riegel
2019-01-15  4:47   ` 马凌(彦军)
2019-01-15  2:56 ` kemi
2019-01-15  4:27   ` 马凌(彦军)
2019-01-10 13:18 马凌(彦军)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190103204338.GU23599@brightrain.aerifal.cx \
    --to=dalias@libc.org \
    --cc=hongjiu.lu@intel.com \
    --cc=libc-alpha@sourceware.org \
    --cc=ling.ma.program@gmail.com \
    --cc=ling.ml@antfin.com \
    --cc=wei3.xiao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).