[PATCH] libdw: add thread-safety to dwarf

public inbox for elfutils@sourceware.org
 help / color / mirror / Atom feed

* [PATCH] libdw: add thread-safety to dwarf_getabbrev()
@ 2019-08-16 19:24 Jonathon Anderson
  2019-08-21 11:16 ` Mark Wielaard
                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-16 19:24 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Srdan Milakovic

For parallel applications that need the information in the DIEs, the
Dwarf_Abbrev hash table et al. become a massive data race. This fixes 
that by:

1. Adding atomics & locks to the hash table to manage concurrency
   (lib/dynamicsizehash_concurrent.{c,h})
2. Adding a lock & array structure to the memory manager (pseudo-TLS)
   (libdwP.h, libdw_alloc.c)
3. Adding extra configure options for Helgrind/DRD annotations
   (configure.ac)
4. Including "stdatomic.h" from FreeBSD, to support C11-style atomics.
   (lib/stdatomic.h)

Signed-off-by: Jonathon Anderson <jma14@rice.edu>
Signed-off-by: Srđan Milaković <sm108@rice.edu>
---

Notes:
 - GCC >= 4.9 provides <stdatomic.h> natively; for those versions
   lib/stdatomic.h could be removed or disabled. We can also rewrite the
   file if the copyright becomes an issue.
 - Currently the concurrent hash table is always enabled, 
performance-wise
   there is no known difference between it and the non-concurrent 
version.
   This can be changed to toggle with --enable-thread-safety if 
preferred.
 - Another implementation of #2 above might use dynamic TLS 
(pthread_key_*),
   we chose this implementation to reduce the overall complexity.
   This can also be bound to --enable-thread-safety if preferred.

 ChangeLog                        |   5 +
 configure.ac                     |  30 ++
 lib/ChangeLog                    |   6 +
 lib/Makefile.am                  |   5 +-
 lib/dynamicsizehash_concurrent.c | 522 +++++++++++++++++++++++++++++++
 lib/dynamicsizehash_concurrent.h | 118 +++++++
 lib/stdatomic.h                  | 442 ++++++++++++++++++++++++++
 libdw/ChangeLog                  |   9 +
 libdw/Makefile.am                |   2 +-
 libdw/dwarf_abbrev_hash.c        |   2 +-
 libdw/dwarf_abbrev_hash.h        |   2 +-
 libdw/dwarf_begin_elf.c          |  13 +-
 libdw/dwarf_end.c                |  24 +-
 libdw/libdwP.h                   |  13 +-
 libdw/libdw_alloc.c              |  70 ++++-
 15 files changed, 1240 insertions(+), 23 deletions(-)
 create mode 100644 lib/dynamicsizehash_concurrent.c
 create mode 100644 lib/dynamicsizehash_concurrent.h
 create mode 100644 lib/stdatomic.h

diff --git a/ChangeLog b/ChangeLog
index bed3999f..93907ddd 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2019-08-15  Jonathon Anderson <jma14@rice.edu>
+
+	* configure.ac: Add new --enable-valgrind-annotations
+	* configure.ac: Add new --with-valgrind (headers only)
+
 2019-08-13  Mark Wielaard  <mark@klomp.org>

 	* configure.ac: Set version to 0.177.
diff --git a/configure.ac b/configure.ac
index c443fa3b..c5406b44 100644
--- a/configure.ac
+++ b/configure.ac
@@ -323,6 +323,35 @@ if test "$use_valgrind" = yes; then
 fi
 AM_CONDITIONAL(USE_VALGRIND, test "$use_valgrind" = yes)

+AC_ARG_WITH([valgrind],
+AS_HELP_STRING([--with-valgrind],[include directory for Valgrind 
headers]),
+[with_valgrind_headers=$withval], [with_valgrind_headers=no])
+if test "x$with_valgrind_headers" != xno; then
+    save_CFLAGS="$CFLAGS"
+    CFLAGS="$CFLAGS -I$with_valgrind_headers"
+    AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
+      #include <valgrind/valgrind.h>
+      int main() { return 0; }
+    ]])], [ HAVE_VALGRIND_HEADERS="yes"
+            CFLAGS="$save_CFLAGS -I$with_valgrind_headers" ],
+          [ AC_MSG_ERROR([invalid valgrind include directory: 
$with_valgrind_headers]) ])
+fi
+
+AC_ARG_ENABLE([valgrind-annotations],
+AS_HELP_STRING([--enable-valgrind-annotations],[insert extra 
annotations for better valgrind support]),
+[use_vg_annotations=$enableval], [use_vg_annotations=no])
+if test "$use_vg_annotations" = yes; then
+    if test "x$HAVE_VALGRIND_HEADERS" != "xyes"; then
+      AC_MSG_CHECKING([whether Valgrind headers are available])
+      AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
+        #include <valgrind/valgrind.h>
+        int main() { return 0; }
+      ]])], [ AC_MSG_RESULT([yes]) ],
+            [ AC_MSG_ERROR([valgrind annotations requested but no 
headers are available]) ])
+    fi
+fi
+AM_CONDITIONAL(USE_VG_ANNOTATIONS, test "$use_vg_annotations" = yes)
+
 AC_ARG_ENABLE([install-elfh],
 AS_HELP_STRING([--enable-install-elfh],[install elf.h in include dir]),
                [install_elfh=$enableval], [install_elfh=no])
@@ -668,6 +697,7 @@ AC_MSG_NOTICE([
   OTHER FEATURES
     Deterministic archives by default  : ${default_ar_deterministic}
     Native language support            : ${USE_NLS}
+    Extra Valgrind annotations         : ${use_vg_annotations}

   EXTRA TEST FEATURES (used with make check)
     have bunzip2 installed (required)  : ${HAVE_BUNZIP2}
diff --git a/lib/ChangeLog b/lib/ChangeLog
index 7381860c..e6d08509 100644
--- a/lib/ChangeLog
+++ b/lib/ChangeLog
@@ -1,3 +1,9 @@
+2019-08-08  Jonathon Anderson  <jma14@rice.edu>
+
+	* dynamicsizehash_concurrent.{c,h}: New files.
+	* stdatomic.h: New file, taken from FreeBSD.
+	* Makefile.am (noinst_HEADERS): Added *.h above.
+
 2019-05-03  Rosen Penev  <rosenp@gmail.com>

 	* color.c (parse_opt): Cast program_invocation_short_name to char *.
diff --git a/lib/Makefile.am b/lib/Makefile.am
index 36d21a07..af7228b9 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -38,8 +38,9 @@ libeu_a_SOURCES = xstrdup.c xstrndup.c xmalloc.c 
next_prime.c \
 		  color.c printversion.c

 noinst_HEADERS = fixedsizehash.h libeu.h system.h dynamicsizehash.h 
list.h \
-		 eu-config.h color.h printversion.h bpf.h
-EXTRA_DIST = dynamicsizehash.c
+		 eu-config.h color.h printversion.h bpf.h \
+		 dynamicsizehash_concurrent.h stdatomic.h
+EXTRA_DIST = dynamicsizehash.c dynamicsizehash_concurrent.c

 if !GPROF
 xmalloc_CFLAGS = -ffunction-sections
diff --git a/lib/dynamicsizehash_concurrent.c 
b/lib/dynamicsizehash_concurrent.c
new file mode 100644
index 00000000..d645b143
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.c
@@ -0,0 +1,522 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License 
and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <stdlib.h>
+#include <system.h>
+#include <pthread.h>
+
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+   COMPARE   comparison function taking two pointers to TYPE objects
+
+   The following macros if present select features:
+
+   ITERATE   iterating over the table entries is possible
+   REVERSE   iterate in reverse order of insert
+ */
+
+
+static size_t
+lookup (NAME *htab, HASHTYPE hval, TYPE val __attribute__ ((unused)))
+{
+  /* First hash function: simply take the modul but prevent zero.  
Small values
+      can skip the division, which helps performance when this is 
common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+#if COMPARE != 0  /* A handful of tables don't actually compare the 
entries in
+                    the table, they instead rely on the hash.  In that 
case, we
+                    can skip parts that relate to the value. */
+  TYPE val_ptr;
+#endif
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    {
+#if COMPARE == 0
+      return idx;
+#else
+      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                            memory_order_acquire);
+      if (COMPARE(val_ptr, val) == 0)
+          return idx;
+#endif
+    }
+  else if (hash == 0)
+    {
+      return 0;
+    }
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+        {
+#if COMPARE == 0
+          return idx;
+#else
+          val_ptr = (TYPE) 
atomic_load_explicit(&htab->table[idx].val_ptr,
+                                                memory_order_acquire);
+          if (COMPARE(val_ptr, val) == 0)
+              return idx;
+#endif
+        }
+      else if (hash == 0)
+        {
+          return 0;
+        }
+    }
+}
+
+static int
+insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
+{
+  /* First hash function: simply take the modul but prevent zero.  
Small values
+      can skip the division, which helps performance when this is 
common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+  TYPE val_ptr;
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    {
+      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                            memory_order_acquire);
+      if (COMPARE(val_ptr, val) != 0)
+          return -1;
+    }
+  else if (hash == 0)
+    {
+      val_ptr = NULL;
+      
atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                              (uintptr_t *) &val_ptr,
+                                              (uintptr_t) val,
+                                              memory_order_acquire,
+                                              memory_order_acquire);
+
+      if (val_ptr == NULL)
+        {
+          atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                memory_order_release);
+          return 0;
+        }
+      else
+        {
+          do
+            {
+              hash = atomic_load_explicit(&htab->table[idx].val_ptr,
+                                          memory_order_acquire);
+            }
+          while (hash == 0);
+        }
+    }
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+        {
+          val_ptr = (TYPE) 
atomic_load_explicit(&htab->table[idx].val_ptr,
+                                                memory_order_acquire);
+          if (COMPARE(val_ptr, val) != 0)
+              return -1;
+        }
+      else if (hash == 0)
+        {
+          val_ptr = NULL;
+          
atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                                  (uintptr_t *) 
&val_ptr,
+                                                  (uintptr_t) val,
+                                                  memory_order_acquire,
+                                                  
memory_order_acquire);
+
+          if (val_ptr == NULL)
+            {
+              atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                    memory_order_release);
+              return 0;
+            }
+          else
+            {
+              do
+                {
+                  hash = 
atomic_load_explicit(&htab->table[idx].val_ptr,
+                                              memory_order_acquire);
+                }
+              while (hash == 0);
+            }
+        }
+    }
+}
+
+#define NO_RESIZING 0u
+#define ALLOCATING_MEMORY 1u
+#define MOVING_DATA 3u
+#define CLEANING 2u
+
+#define STATE_BITS 2u
+#define STATE_INCREMENT (1u << STATE_BITS)
+#define STATE_MASK (STATE_INCREMENT - 1)
+#define GET_STATE(A) ((A) & STATE_MASK)
+
+#define IS_NO_RESIZE_OR_CLEANING(A) (((A) & 0x1u) == 0)
+
+#define GET_ACTIVE_WORKERS(A) ((A) >> STATE_BITS)
+
+#define INITIALIZATION_BLOCK_SIZE 256
+#define MOVE_BLOCK_SIZE 256
+#define CEIL(A, B) (((A) + (B) - 1) / (B))
+
+/* Initializes records and copies the data from the old table.
+   It can share work with other threads */
+static void resize_helper(NAME *htab, int blocking)
+{
+  size_t num_old_blocks = CEIL(htab->old_size, MOVE_BLOCK_SIZE);
+  size_t num_new_blocks = CEIL(htab->size, INITIALIZATION_BLOCK_SIZE);
+
+  size_t my_block;
+  size_t num_finished_blocks = 0;
+
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_init_block, 
1,
+                                                memory_order_acquire))
+                                                    < num_new_blocks)
+    {
+      size_t record_it = my_block * INITIALIZATION_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * INITIALIZATION_BLOCK_SIZE;
+      if (record_end > htab->size)
+          record_end = htab->size;
+
+      while (record_it++ != record_end)
+        {
+          atomic_init(&htab->table[record_it].hashval, (uintptr_t) 
NULL);
+          atomic_init(&htab->table[record_it].val_ptr, (uintptr_t) 
NULL);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_initialized_blocks,
+                            num_finished_blocks, memory_order_release);
+  while (atomic_load_explicit(&htab->num_initialized_blocks,
+                              memory_order_acquire) != num_new_blocks);
+
+  /* All block are initialized, start moving */
+  num_finished_blocks = 0;
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_move_block, 
1,
+                                                memory_order_acquire))
+                                                    < num_old_blocks)
+    {
+      size_t record_it = my_block * MOVE_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * MOVE_BLOCK_SIZE;
+      if (record_end > htab->old_size)
+          record_end = htab->old_size;
+
+      while (record_it++ != record_end)
+        {
+          TYPE val_ptr = (TYPE) atomic_load_explicit(
+              &htab->old_table[record_it].val_ptr,
+              memory_order_acquire);
+          if (val_ptr == NULL)
+              continue;
+
+          HASHTYPE hashval = atomic_load_explicit(
+              &htab->old_table[record_it].hashval,
+              memory_order_acquire);
+          assert(hashval);
+
+          insert_helper(htab, hashval, val_ptr);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_moved_blocks, 
num_finished_blocks,
+                            memory_order_release);
+
+  if (blocking)
+      while (atomic_load_explicit(&htab->num_moved_blocks,
+                                  memory_order_acquire) != 
num_old_blocks);
+}
+
+static void
+resize_master(NAME *htab)
+{
+  htab->old_size = htab->size;
+  htab->old_table = htab->table;
+
+  htab->size = next_prime(htab->size * 2);
+  htab->table = malloc((1 + htab->size) * sizeof(htab->table[0]));
+  assert(htab->table);
+
+  /* Change state from ALLOCATING_MEMORY to MOVING_DATA */
+  atomic_fetch_xor_explicit(&htab->resizing_state,
+                            ALLOCATING_MEMORY ^ MOVING_DATA,
+                            memory_order_release);
+
+  resize_helper(htab, 1);
+
+  /* Change state from MOVING_DATA to CLEANING */
+  size_t resize_state = 
atomic_fetch_xor_explicit(&htab->resizing_state,
+                                                  MOVING_DATA ^ 
CLEANING,
+                                                  
memory_order_acq_rel);
+  while (GET_ACTIVE_WORKERS(resize_state) != 0)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* There are no more active workers */
+  atomic_store_explicit(&htab->next_init_block, 0, 
memory_order_relaxed);
+  atomic_store_explicit(&htab->num_initialized_blocks, 0,
+                        memory_order_relaxed);
+
+  atomic_store_explicit(&htab->next_move_block, 0, 
memory_order_relaxed);
+  atomic_store_explicit(&htab->num_moved_blocks, 0, 
memory_order_relaxed);
+
+  free(htab->old_table);
+
+  /* Change state to NO_RESIZING */
+  atomic_fetch_xor_explicit(&htab->resizing_state, CLEANING ^ 
NO_RESIZING,
+                            memory_order_relaxed);
+
+}
+
+static void
+resize_worker(NAME *htab)
+{
+  size_t resize_state = atomic_load_explicit(&htab->resizing_state,
+                                              memory_order_acquire);
+
+  /* If the resize has finished */
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+      return;
+
+  /* Register as worker and check if the resize has finished in the 
meantime*/
+  resize_state = atomic_fetch_add_explicit(&htab->resizing_state,
+                                            STATE_INCREMENT,
+                                            memory_order_acquire);
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  /* Wait while the new table is being allocated. */
+  while (GET_STATE(resize_state) == ALLOCATING_MEMORY)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* Check if the resize is done */
+  assert(GET_STATE(resize_state) != NO_RESIZING);
+  if (GET_STATE(resize_state) == CLEANING)
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  resize_helper(htab, 0);
+
+  /* Deregister worker */
+  atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                            memory_order_release);
+}
+
+
+int
+#define INIT(name) _INIT (name)
+#define _INIT(name) \
+  name##_init
+INIT(NAME) (NAME *htab, size_t init_size)
+{
+  /* We need the size to be a prime.  */
+  init_size = next_prime (init_size);
+
+  /* Initialize the data structure.  */
+  htab->size = init_size;
+  atomic_init(&htab->filled, 0);
+  atomic_init(&htab->resizing_state, 0);
+
+  atomic_init(&htab->next_init_block, 0);
+  atomic_init(&htab->num_initialized_blocks, 0);
+
+  atomic_init(&htab->next_move_block, 0);
+  atomic_init(&htab->num_moved_blocks, 0);
+
+  pthread_rwlock_init(&htab->resize_rwl, NULL);
+
+  htab->table = (void *) malloc ((init_size + 1) * sizeof 
(htab->table[0]));
+  if (htab->table == NULL)
+      return -1;
+
+  for (size_t i = 0; i <= init_size; i++)
+    {
+      atomic_init(&htab->table[i].hashval, (uintptr_t) NULL);
+      atomic_init(&htab->table[i].val_ptr, (uintptr_t) NULL);
+    }
+
+  return 0;
+}
+
+
+int
+#define FREE(name) _FREE (name)
+#define _FREE(name) \
+name##_free
+FREE(NAME) (NAME *htab)
+{
+  pthread_rwlock_destroy(&htab->resize_rwl);
+  free (htab->table);
+  return 0;
+}
+
+
+int
+#define INSERT(name) _INSERT (name)
+#define _INSERT(name) \
+name##_insert
+INSERT(NAME) (NAME *htab, HASHTYPE hval, TYPE data)
+{
+  int incremented = 0;
+
+  for(;;)
+    {
+      while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+          resize_worker(htab);
+
+      size_t filled;
+      if (!incremented)
+        {
+          filled = atomic_fetch_add_explicit(&htab->filled, 1,
+                                              memory_order_acquire);
+          incremented = 1;
+        }
+      else
+        {
+          filled = atomic_load_explicit(&htab->filled,
+                                        memory_order_acquire);
+        }
+
+
+      if (100 * filled > 90 * htab->size)
+        {
+          /* Table is filled more than 90%.  Resize the table.  */
+
+          size_t resizing_state = 
atomic_load_explicit(&htab->resizing_state,
+                                                        
memory_order_acquire);
+          if (resizing_state == 0 &&
+              
atomic_compare_exchange_strong_explicit(&htab->resizing_state,
+                                                      &resizing_state,
+                                                      
ALLOCATING_MEMORY,
+                                                      
memory_order_acquire,
+                                                      
memory_order_acquire))
+            {
+              /* Master thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+              pthread_rwlock_wrlock(&htab->resize_rwl);
+              resize_master(htab);
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+            }
+          else
+            {
+              /* Worker thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+              resize_worker(htab);
+            }
+        }
+      else
+        {
+          /* Lock acquired, no need for resize*/
+          break;
+        }
+    }
+
+  int ret_val = insert_helper(htab, hval, data);
+  if (ret_val == -1)
+      atomic_fetch_sub_explicit(&htab->filled, 1, 
memory_order_relaxed);
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
+
+
+
+TYPE
+#define FIND(name) _FIND (name)
+#define _FIND(name) \
+  name##_find
+FIND(NAME) (NAME *htab, HASHTYPE hval, TYPE val)
+{
+  while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+      resize_worker(htab);
+
+  size_t idx;
+
+  /* Make the hash data nonzero.  */
+  hval = hval ?: 1;
+  idx = lookup(htab, hval, val);
+
+  if (idx == 0)
+    {
+      pthread_rwlock_unlock(&htab->resize_rwl);
+      return NULL;
+    }
+
+  /* get a copy before unlocking the lock */
+  TYPE ret_val = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                             memory_order_relaxed);
+
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
+
diff --git a/lib/dynamicsizehash_concurrent.h 
b/lib/dynamicsizehash_concurrent.h
new file mode 100644
index 00000000..eb06ac9e
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.h
@@ -0,0 +1,118 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License 
and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <stddef.h>
+#include <pthread.h>
+#include "stdatomic.h"
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+
+   The following macros if present select features:
+
+   ITERATE   iterating over the table entries is possible
+   HASHTYPE  integer type for hash values, default unsigned long int
+ */
+
+
+
+#ifndef HASHTYPE
+# define HASHTYPE unsigned long int
+#endif
+
+#ifndef RESIZE_BLOCK_SIZE
+# define RESIZE_BLOCK_SIZE 256
+#endif
+
+/* Defined separately.  */
+extern size_t next_prime (size_t seed);
+
+
+/* Table entry type.  */
+#define _DYNHASHCONENTTYPE(name)       \
+  typedef struct name##_ent         \
+  {                                 \
+    _Atomic(HASHTYPE) hashval;      \
+    atomic_uintptr_t val_ptr;       \
+  } name##_ent
+#define DYNHASHENTTYPE(name) _DYNHASHCONENTTYPE (name)
+DYNHASHENTTYPE (NAME);
+
+/* Type of the dynamic hash table data structure.  */
+#define _DYNHASHCONTYPE(name) \
+typedef struct                                     \
+{                                                  \
+  size_t size;                                     \
+  size_t old_size;                                 \
+  atomic_size_t filled;                            \
+  name##_ent *table;                               \
+  name##_ent *old_table;                           \
+  atomic_size_t resizing_state;                    \
+  atomic_size_t next_init_block;                   \
+  atomic_size_t num_initialized_blocks;            \
+  atomic_size_t next_move_block;                   \
+  atomic_size_t num_moved_blocks;                  \
+  pthread_rwlock_t resize_rwl;                     \
+} name
+#define DYNHASHTYPE(name) _DYNHASHCONTYPE (name)
+DYNHASHTYPE (NAME);
+
+
+
+#define _FUNCTIONS(name)                                            \
+/* Initialize the hash table.  */                                   \
+extern int name##_init (name *htab, size_t init_size);              \
+                                                                    \
+/* Free resources allocated for hash table.  */                     \
+extern int name##_free (name *htab);                                \
+                                                                    \
+/* Insert new entry.  */                                            \
+extern int name##_insert (name *htab, HASHTYPE hval, TYPE data);    \
+                                                                    \
+/* Find entry in hash table.  */                                    \
+extern TYPE name##_find (name *htab, HASHTYPE hval, TYPE val);
+#define FUNCTIONS(name) _FUNCTIONS (name)
+FUNCTIONS (NAME)
+
+
+#ifndef NO_UNDEF
+# undef DYNHASHENTTYPE
+# undef DYNHASHTYPE
+# undef FUNCTIONS
+# undef _FUNCTIONS
+# undef XFUNCTIONS
+# undef _XFUNCTIONS
+# undef NAME
+# undef TYPE
+# undef ITERATE
+# undef COMPARE
+# undef FIRST
+# undef NEXT
+#endif
diff --git a/lib/stdatomic.h b/lib/stdatomic.h
new file mode 100644
index 00000000..49626662
--- /dev/null
+++ b/lib/stdatomic.h
@@ -0,0 +1,442 @@
+/*-
+ * Copyright (c) 2011 Ed Schouten <ed@FreeBSD.org>
+ *                    David Chisnall <theraven@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in 
the
+ *    documentation and/or other materials provided with the 
distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' 
AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, 
THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 
PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE 
LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 
CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE 
GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 
INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, 
STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 
ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY 
OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef _STDATOMIC_H_
+#define	_STDATOMIC_H_
+
+#include <stddef.h>
+#include <stdint.h>
+
+#if !defined(__has_feature)
+#define __has_feature(x) 0
+#endif
+#if !defined(__has_builtin)
+#define __has_builtin(x) 0
+#endif
+#if !defined(__GNUC_PREREQ__)
+#if defined(__GNUC__) && defined(__GNUC_MINOR__)
+#define __GNUC_PREREQ__(maj, min)					\
+	((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
+#else
+#define __GNUC_PREREQ__(maj, min) 0
+#endif
+#endif
+
+#if !defined(__CLANG_ATOMICS) && !defined(__GNUC_ATOMICS)
+#if __has_feature(c_atomic)
+#define	__CLANG_ATOMICS
+#elif __GNUC_PREREQ__(4, 7)
+#define	__GNUC_ATOMICS
+#elif !defined(__GNUC__)
+#error "stdatomic.h does not support your compiler"
+#endif
+#endif
+
+/*
+ * language independent type to represent a Boolean value
+ */
+
+typedef int __Bool;
+
+/*
+ * 7.17.1 Atomic lock-free macros.
+ */
+
+#ifdef __GCC_ATOMIC_BOOL_LOCK_FREE
+#define	ATOMIC_BOOL_LOCK_FREE		__GCC_ATOMIC_BOOL_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR_LOCK_FREE
+#define	ATOMIC_CHAR_LOCK_FREE		__GCC_ATOMIC_CHAR_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR16_T_LOCK_FREE
+#define	ATOMIC_CHAR16_T_LOCK_FREE	__GCC_ATOMIC_CHAR16_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR32_T_LOCK_FREE
+#define	ATOMIC_CHAR32_T_LOCK_FREE	__GCC_ATOMIC_CHAR32_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_WCHAR_T_LOCK_FREE
+#define	ATOMIC_WCHAR_T_LOCK_FREE	__GCC_ATOMIC_WCHAR_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_SHORT_LOCK_FREE
+#define	ATOMIC_SHORT_LOCK_FREE		__GCC_ATOMIC_SHORT_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_INT_LOCK_FREE
+#define	ATOMIC_INT_LOCK_FREE		__GCC_ATOMIC_INT_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_LONG_LOCK_FREE
+#define	ATOMIC_LONG_LOCK_FREE		__GCC_ATOMIC_LONG_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_LLONG_LOCK_FREE
+#define	ATOMIC_LLONG_LOCK_FREE		__GCC_ATOMIC_LLONG_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_POINTER_LOCK_FREE
+#define	ATOMIC_POINTER_LOCK_FREE	__GCC_ATOMIC_POINTER_LOCK_FREE
+#endif
+
+#if !defined(__CLANG_ATOMICS)
+#define	_Atomic(T)			struct { volatile __typeof__(T) __val; }
+#endif
+
+/*
+ * 7.17.2 Initialization.
+ */
+
+#if defined(__CLANG_ATOMICS)
+#define	ATOMIC_VAR_INIT(value)		(value)
+#define	atomic_init(obj, value)		__c11_atomic_init(obj, value)
+#else
+#define	ATOMIC_VAR_INIT(value)		{ .__val = (value) }
+#define	atomic_init(obj, value)		((void)((obj)->__val = (value)))
+#endif
+
+/*
+ * Clang and recent GCC both provide predefined macros for the memory
+ * orderings.  If we are using a compiler that doesn't define them, 
use the
+ * clang values - these will be ignored in the fallback path.
+ */
+
+#ifndef __ATOMIC_RELAXED
+#define __ATOMIC_RELAXED		0
+#endif
+#ifndef __ATOMIC_CONSUME
+#define __ATOMIC_CONSUME		1
+#endif
+#ifndef __ATOMIC_ACQUIRE
+#define __ATOMIC_ACQUIRE		2
+#endif
+#ifndef __ATOMIC_RELEASE
+#define __ATOMIC_RELEASE		3
+#endif
+#ifndef __ATOMIC_ACQ_REL
+#define __ATOMIC_ACQ_REL		4
+#endif
+#ifndef __ATOMIC_SEQ_CST
+#define __ATOMIC_SEQ_CST		5
+#endif
+
+/*
+ * 7.17.3 Order and consistency.
+ *
+ * The memory_order_* constants that denote the barrier behaviour of 
the
+ * atomic operations.
+ */
+
+typedef enum {
+    memory_order_relaxed = __ATOMIC_RELAXED,
+    memory_order_consume = __ATOMIC_CONSUME,
+    memory_order_acquire = __ATOMIC_ACQUIRE,
+    memory_order_release = __ATOMIC_RELEASE,
+    memory_order_acq_rel = __ATOMIC_ACQ_REL,
+    memory_order_seq_cst = __ATOMIC_SEQ_CST
+} memory_order;
+
+/*
+ * 7.17.4 Fences.
+ */
+
+//#define __unused
+
+//static __inline void
+//atomic_thread_fence(memory_order __order __unused)
+//{
+//
+//#ifdef __CLANG_ATOMICS
+//    __c11_atomic_thread_fence(__order);
+//#elif defined(__GNUC_ATOMICS)
+//    __atomic_thread_fence(__order);
+//#else
+//    __sync_synchronize();
+//#endif
+//}
+//
+//static __inline void
+//atomic_signal_fence(memory_order __order __unused)
+//{
+//
+//#ifdef __CLANG_ATOMICS
+//    __c11_atomic_signal_fence(__order);
+//#elif defined(__GNUC_ATOMICS)
+//    __atomic_signal_fence(__order);
+//#else
+//    __asm volatile ("" ::: "memory");
+//#endif
+//}
+
+//#undef __unused
+
+/*
+ * 7.17.5 Lock-free property.
+ */
+
+#if defined(_KERNEL)
+/* Atomics in kernelspace are always lock-free. */
+#define	atomic_is_lock_free(obj) \
+	((void)(obj), (__Bool)1)
+#elif defined(__CLANG_ATOMICS)
+#define	atomic_is_lock_free(obj) \
+	__atomic_is_lock_free(sizeof(*(obj)), obj)
+#elif defined(__GNUC_ATOMICS)
+#define	atomic_is_lock_free(obj) \
+	__atomic_is_lock_free(sizeof((obj)->__val), &(obj)->__val)
+#else
+#define	atomic_is_lock_free(obj) \
+	((void)(obj), sizeof((obj)->__val) <= sizeof(void *))
+#endif
+
+/*
+ * 7.17.6 Atomic integer types.
+ */
+
+typedef _Atomic(__Bool)			atomic_bool;
+typedef _Atomic(char)			atomic_char;
+typedef _Atomic(signed char)		atomic_schar;
+typedef _Atomic(unsigned char)		atomic_uchar;
+typedef _Atomic(short)			atomic_short;
+typedef _Atomic(unsigned short)		atomic_ushort;
+typedef _Atomic(int)			atomic_int;
+typedef _Atomic(unsigned int)		atomic_uint;
+typedef _Atomic(long)			atomic_long;
+typedef _Atomic(unsigned long)		atomic_ulong;
+typedef _Atomic(long long)		atomic_llong;
+typedef _Atomic(unsigned long long)	atomic_ullong;
+#if 0
+typedef _Atomic(char16_t)		atomic_char16_t;
+typedef _Atomic(char32_t)		atomic_char32_t;
+#endif
+typedef _Atomic(wchar_t)		atomic_wchar_t;
+typedef _Atomic(int_least8_t)		atomic_int_least8_t;
+typedef _Atomic(uint_least8_t)		atomic_uint_least8_t;
+typedef _Atomic(int_least16_t)		atomic_int_least16_t;
+typedef _Atomic(uint_least16_t)		atomic_uint_least16_t;
+typedef _Atomic(int_least32_t)		atomic_int_least32_t;
+typedef _Atomic(uint_least32_t)		atomic_uint_least32_t;
+typedef _Atomic(int_least64_t)		atomic_int_least64_t;
+typedef _Atomic(uint_least64_t)		atomic_uint_least64_t;
+typedef _Atomic(int_fast8_t)		atomic_int_fast8_t;
+typedef _Atomic(uint_fast8_t)		atomic_uint_fast8_t;
+typedef _Atomic(int_fast16_t)		atomic_int_fast16_t;
+typedef _Atomic(uint_fast16_t)		atomic_uint_fast16_t;
+typedef _Atomic(int_fast32_t)		atomic_int_fast32_t;
+typedef _Atomic(uint_fast32_t)		atomic_uint_fast32_t;
+typedef _Atomic(int_fast64_t)		atomic_int_fast64_t;
+typedef _Atomic(uint_fast64_t)		atomic_uint_fast64_t;
+typedef _Atomic(intptr_t)		atomic_intptr_t;
+typedef _Atomic(uintptr_t)		atomic_uintptr_t;
+typedef _Atomic(size_t)			atomic_size_t;
+typedef _Atomic(ptrdiff_t)		atomic_ptrdiff_t;
+typedef _Atomic(intmax_t)		atomic_intmax_t;
+typedef _Atomic(uintmax_t)		atomic_uintmax_t;
+
+/*
+ * 7.17.7 Operations on atomic types.
+ */
+
+/*
+ * Compiler-specific operations.
+ */
+
+#if defined(__CLANG_ATOMICS)
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)						\
+	__c11_atomic_compare_exchange_strong(object, expected, desired,	\
+	    success, failure)
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	__c11_atomic_compare_exchange_weak(object, expected, desired,	\
+	    success, failure)
+#define	atomic_exchange_explicit(object, desired, order)		\
+	__c11_atomic_exchange(object, desired, order)
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	__c11_atomic_fetch_add(object, operand, order)
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	__c11_atomic_fetch_and(object, operand, order)
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	__c11_atomic_fetch_or(object, operand, order)
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	__c11_atomic_fetch_sub(object, operand, order)
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	__c11_atomic_fetch_xor(object, operand, order)
+#define	atomic_load_explicit(object, order)				\
+	__c11_atomic_load(object, order)
+#define	atomic_store_explicit(object, desired, order)			\
+	__c11_atomic_store(object, desired, order)
+#elif defined(__GNUC_ATOMICS)
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)						\
+	__atomic_compare_exchange_n(&(object)->__val, expected,		\
+	    desired, 0, success, failure)
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	__atomic_compare_exchange_n(&(object)->__val, expected,		\
+	    desired, 1, success, failure)
+#define	atomic_exchange_explicit(object, desired, order)		\
+	__atomic_exchange_n(&(object)->__val, desired, order)
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	__atomic_fetch_add(&(object)->__val, operand, order)
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	__atomic_fetch_and(&(object)->__val, operand, order)
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	__atomic_fetch_or(&(object)->__val, operand, order)
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	__atomic_fetch_sub(&(object)->__val, operand, order)
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	__atomic_fetch_xor(&(object)->__val, operand, order)
+#define	atomic_load_explicit(object, order)				\
+	__atomic_load_n(&(object)->__val, order)
+#define	atomic_store_explicit(object, desired, order)			\
+	__atomic_store_n(&(object)->__val, desired, order)
+#else
+#define	__atomic_apply_stride(object, operand) \
+	(((__typeof__((object)->__val))0) + (operand))
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)	__extension__ ({			\
+	__typeof__(expected) __ep = (expected);				\
+	__typeof__(*__ep) __e = *__ep;					\
+	(void)(success); (void)(failure);				\
+	(__Bool)((*__ep = __sync_val_compare_and_swap(&(object)->__val,	\
+	    __e, desired)) == __e);					\
+})
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	atomic_compare_exchange_strong_explicit(object, expected,	\
+		desired, success, failure)
+#if __has_builtin(__sync_swap)
+/* Clang provides a full-barrier atomic exchange - use it if 
available. */
+#define	atomic_exchange_explicit(object, desired, order)		\
+	((void)(order), __sync_swap(&(object)->__val, desired))
+#else
+/*
+ * __sync_lock_test_and_set() is only an acquire barrier in theory 
(although in
+ * practice it is usually a full barrier) so we need an explicit 
barrier before
+ * it.
+ */
+#define	atomic_exchange_explicit(object, desired, order)		\
+__extension__ ({							\
+	__typeof__(object) __o = (object);				\
+	__typeof__(desired) __d = (desired);				\
+	(void)(order);							\
+	__sync_synchronize();						\
+	__sync_lock_test_and_set(&(__o)->__val, __d);			\
+})
+#endif
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_add(&(object)->__val,		\
+	    __atomic_apply_stride(object, operand)))
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_and(&(object)->__val, operand))
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_or(&(object)->__val, operand))
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_sub(&(object)->__val,		\
+	    __atomic_apply_stride(object, operand)))
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_xor(&(object)->__val, operand))
+#define	atomic_load_explicit(object, order)				\
+	((void)(order), __sync_fetch_and_add(&(object)->__val, 0))
+#define	atomic_store_explicit(object, desired, order)			\
+	((void)atomic_exchange_explicit(object, desired, order))
+#endif
+
+/*
+ * Convenience functions.
+ *
+ * Don't provide these in kernel space. In kernel space, we should be
+ * disciplined enough to always provide explicit barriers.
+ */
+
+#ifndef _KERNEL
+#define	atomic_compare_exchange_strong(object, expected, desired)	\
+	atomic_compare_exchange_strong_explicit(object, expected,	\
+	    desired, memory_order_seq_cst, memory_order_seq_cst)
+#define	atomic_compare_exchange_weak(object, expected, desired)		\
+	atomic_compare_exchange_weak_explicit(object, expected,		\
+	    desired, memory_order_seq_cst, memory_order_seq_cst)
+#define	atomic_exchange(object, desired)				\
+	atomic_exchange_explicit(object, desired, memory_order_seq_cst)
+#define	atomic_fetch_add(object, operand)				\
+	atomic_fetch_add_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_and(object, operand)				\
+	atomic_fetch_and_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_or(object, operand)				\
+	atomic_fetch_or_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_sub(object, operand)				\
+	atomic_fetch_sub_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_xor(object, operand)				\
+	atomic_fetch_xor_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_load(object)						\
+	atomic_load_explicit(object, memory_order_seq_cst)
+#define	atomic_store(object, desired)					\
+	atomic_store_explicit(object, desired, memory_order_seq_cst)
+#endif /* !_KERNEL */
+
+/*
+ * 7.17.8 Atomic flag type and operations.
+ *
+ * XXX: Assume atomic_bool can be used as an atomic_flag. Is there some
+ * kind of compiler built-in type we could use?
+ */
+
+typedef struct {
+    atomic_bool	__flag;
+} atomic_flag;
+
+#define	ATOMIC_FLAG_INIT		{ ATOMIC_VAR_INIT(0) }
+
+static __inline __Bool
+atomic_flag_test_and_set_explicit(volatile atomic_flag *__object,
+                                  memory_order __order)
+{
+    return (atomic_exchange_explicit(&__object->__flag, 1, __order));
+}
+
+static __inline void
+atomic_flag_clear_explicit(volatile atomic_flag *__object, 
memory_order __order)
+{
+
+    atomic_store_explicit(&__object->__flag, 0, __order);
+}
+
+#ifndef _KERNEL
+static __inline __Bool
+atomic_flag_test_and_set(volatile atomic_flag *__object)
+{
+
+    return (atomic_flag_test_and_set_explicit(__object,
+                                              memory_order_seq_cst));
+}
+
+static __inline void
+atomic_flag_clear(volatile atomic_flag *__object)
+{
+
+    atomic_flag_clear_explicit(__object, memory_order_seq_cst);
+}
+#endif /* !_KERNEL */
+
+#endif /* !_STDATOMIC_H_ */
\ No newline at end of file
diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index bf1f4857..87abf7a7 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,12 @@
+2019-08-15  Jonathon Anderson  <jma14@rice.edu>
+
+	* libdw_alloc.c (__libdw_allocate): Added thread-safe stack allocator.
+	* libdwP.h (Dwarf): Likewise.
+	* dwarf_begin_elf.c (dwarf_begin_elf): Support for above.
+	* dwarf_end.c (dwarf_end): Likewise.
+	* dwarf_abbrev_hash.{c,h}: Use the *_concurrent hash table.
+	* Makefile.am: Link -lpthread to provide rwlocks.
+
 2019-08-12  Mark Wielaard  <mark@klomp.org>

 	* libdw.map (ELFUTILS_0.177): Add new version of dwelf_elf_begin.
diff --git a/libdw/Makefile.am b/libdw/Makefile.am
index 7a3d5322..6d0a0187 100644
--- a/libdw/Makefile.am
+++ b/libdw/Makefile.am
@@ -108,7 +108,7 @@ am_libdw_pic_a_OBJECTS = $(libdw_a_SOURCES:.c=.os)
 libdw_so_LIBS = libdw_pic.a ../libdwelf/libdwelf_pic.a \
 	  ../libdwfl/libdwfl_pic.a ../libebl/libebl.a
 libdw_so_DEPS = ../lib/libeu.a ../libelf/libelf.so
-libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS)
+libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS) 
-lpthread
 libdw_so_SOURCES =
 libdw.so$(EXEEXT): $(srcdir)/libdw.map $(libdw_so_LIBS) 
$(libdw_so_DEPS)
 # The rpath is necessary for libebl because its $ORIGIN use will
diff --git a/libdw/dwarf_abbrev_hash.c b/libdw/dwarf_abbrev_hash.c
index f52f5ad5..c2548140 100644
--- a/libdw/dwarf_abbrev_hash.c
+++ b/libdw/dwarf_abbrev_hash.c
@@ -38,7 +38,7 @@
 #define next_prime __libdwarf_next_prime
 extern size_t next_prime (size_t) attribute_hidden;

-#include <dynamicsizehash.c>
+#include <dynamicsizehash_concurrent.c>

 #undef next_prime
 #define next_prime attribute_hidden __libdwarf_next_prime
diff --git a/libdw/dwarf_abbrev_hash.h b/libdw/dwarf_abbrev_hash.h
index d2f02ccc..bc3d62c7 100644
--- a/libdw/dwarf_abbrev_hash.h
+++ b/libdw/dwarf_abbrev_hash.h
@@ -34,6 +34,6 @@
 #define TYPE Dwarf_Abbrev *
 #define COMPARE(a, b) (0)

-#include <dynamicsizehash.h>
+#include <dynamicsizehash_concurrent.h>

 #endif	/* dwarf_abbrev_hash.h */
diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
index 38c8f5c6..b3885bb5 100644
--- a/libdw/dwarf_begin_elf.c
+++ b/libdw/dwarf_begin_elf.c
@@ -417,11 +417,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn 
*scngrp)
   /* Initialize the memory handling.  */
   result->mem_default_size = mem_default_size;
   result->oom_handler = __libdw_oom;
-  result->mem_tail = (struct libdw_memblock *) (result + 1);
-  result->mem_tail->size = (result->mem_default_size
-			    - offsetof (struct libdw_memblock, mem));
-  result->mem_tail->remaining = result->mem_tail->size;
-  result->mem_tail->prev = NULL;
+  pthread_rwlock_init(&result->mem_rwl, NULL);
+  result->mem_stacks = 1;
+  result->mem_tails = malloc (sizeof (struct libdw_memblock *));
+  result->mem_tails[0] = (struct libdw_memblock *) (result + 1);
+  result->mem_tails[0]->size = (result->mem_default_size
+			       - offsetof (struct libdw_memblock, mem));
+  result->mem_tails[0]->remaining = result->mem_tails[0]->size;
+  result->mem_tails[0]->prev = NULL;

   if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
     {
diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
index 29795c10..6317bcda 100644
--- a/libdw/dwarf_end.c
+++ b/libdw/dwarf_end.c
@@ -94,14 +94,22 @@ dwarf_end (Dwarf *dwarf)
       /* And the split Dwarf.  */
       tdestroy (dwarf->split_tree, noop_free);

-      struct libdw_memblock *memp = dwarf->mem_tail;
-      /* The first block is allocated together with the Dwarf object.  
*/
-      while (memp->prev != NULL)
-	{
-	  struct libdw_memblock *prevp = memp->prev;
-	  free (memp);
-	  memp = prevp;
-	}
+      for (size_t i = 0; i < dwarf->mem_stacks; i++)
+        {
+          struct libdw_memblock *memp = dwarf->mem_tails[i];
+          /* The first block is allocated together with the Dwarf 
object.  */
+          while (memp != NULL && memp->prev != NULL)
+	    {
+	      struct libdw_memblock *prevp = memp->prev;
+	      free (memp);
+	      memp = prevp;
+	    }
+          /* Only for stack 0 though, the others are allocated 
individually.  */
+          if (memp != NULL && i > 0)
+            free (memp);
+        }
+      free (dwarf->mem_tails);
+      pthread_rwlock_destroy (&dwarf->mem_rwl);

       /* Free the pubnames helper structure.  */
       free (dwarf->pubnames_sets);
diff --git a/libdw/libdwP.h b/libdw/libdwP.h
index eebb7d12..442d493d 100644
--- a/libdw/libdwP.h
+++ b/libdw/libdwP.h
@@ -31,6 +31,7 @@

 #include <libintl.h>
 #include <stdbool.h>
+#include <pthread.h>

 #include <libdw.h>
 #include <dwarf.h>
@@ -218,16 +219,18 @@ struct Dwarf
   /* Similar for addrx/constx, which will come from .debug_addr 
section.  */
   struct Dwarf_CU *fake_addr_cu;

-  /* Internal memory handling.  This is basically a simplified
+  /* Internal memory handling.  This is basically a simplified 
thread-local
      reimplementation of obstacks.  Unfortunately the standard obstack
      implementation is not usable in libraries.  */
+  pthread_rwlock_t mem_rwl;
+  size_t mem_stacks;
   struct libdw_memblock
   {
     size_t size;
     size_t remaining;
     struct libdw_memblock *prev;
     char mem[0];
-  } *mem_tail;
+  } **mem_tails;

   /* Default size of allocated memory blocks.  */
   size_t mem_default_size;
@@ -572,7 +575,7 @@ extern void __libdw_seterrno (int value) 
internal_function;

 /* Memory handling, the easy parts.  This macro does not do any 
locking.  */
 #define libdw_alloc(dbg, type, tsize, cnt) \
-  ({ struct libdw_memblock *_tail = (dbg)->mem_tail;			      \
+  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg);		      \
      size_t _required = (tsize) * (cnt);				      \
      type *_result = (type *) (_tail->mem + (_tail->size - 
_tail->remaining));\
      size_t _padding = ((__alignof (type)				      \
@@ -591,6 +594,10 @@ extern void __libdw_seterrno (int value) 
internal_function;
 #define libdw_typed_alloc(dbg, type) \
   libdw_alloc (dbg, type, sizeof (type), 1)

+/* Callback to choose a thread-local memory allocation stack.  */
+extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
+     __nonnull_attribute__ (1);
+
 /* Callback to allocate more.  */
 extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t 
align)
      __attribute__ ((__malloc__)) __nonnull_attribute__ (1);
diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
index f1e08714..c3c5e8a7 100644
--- a/libdw/libdw_alloc.c
+++ b/libdw/libdw_alloc.c
@@ -33,9 +33,73 @@

 #include <errno.h>
 #include <stdlib.h>
+#include <assert.h>
 #include "libdwP.h"
 #include "system.h"
+#include "stdatomic.h"
+#if USE_VG_ANNOTATIONS == 1
+#include <helgrind.h>
+#include <drd.h>
+#else
+#define ANNOTATE_HAPPENS_BEFORE(X)
+#define ANNOTATE_HAPPENS_AFTER(X)
+#endif
+
+
+#define thread_local __thread

+static thread_local int initialized = 0;
+static thread_local size_t thread_id;
+static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
+
+struct libdw_memblock *
+__libdw_alloc_tail (Dwarf *dbg)
+{
+  if (!initialized)
+    {
+      thread_id = atomic_fetch_add (&next_id, 1);
+      initialized = 1;
+    }
+
+  pthread_rwlock_rdlock (&dbg->mem_rwl);
+  if (thread_id >= dbg->mem_stacks)
+    {
+      pthread_rwlock_unlock (&dbg->mem_rwl);
+      pthread_rwlock_wrlock (&dbg->mem_rwl);
+
+      /* Another thread may have already reallocated. In theory using 
an
+         atomic would be faster, but given that this only happens once 
per
+         thread per Dwarf, some minor slowdown should be fine.  */
+      if (thread_id >= dbg->mem_stacks)
+        {
+          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
+                                    * sizeof (struct libdw_memblock 
*));
+          assert(dbg->mem_tails);
+          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
+            dbg->mem_tails[i] = NULL;
+          dbg->mem_stacks = thread_id + 1;
+          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
+        }
+
+      pthread_rwlock_unlock (&dbg->mem_rwl);
+      pthread_rwlock_rdlock (&dbg->mem_rwl);
+    }
+
+  /* At this point, we have an entry in the tail array.  */
+  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
+  struct libdw_memblock *result = dbg->mem_tails[thread_id];
+  if (result == NULL)
+    {
+      result = malloc (dbg->mem_default_size);
+      result->size = dbg->mem_default_size
+                     - offsetof (struct libdw_memblock, mem);
+      result->remaining = result->size;
+      result->prev = NULL;
+      dbg->mem_tails[thread_id] = result;
+    }
+  pthread_rwlock_unlock (&dbg->mem_rwl);
+  return result;
+}

 void *
 __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
@@ -52,8 +116,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, 
size_t align)
   newp->size = size - offsetof (struct libdw_memblock, mem);
   newp->remaining = (uintptr_t) newp + size - (result + minsize);

-  newp->prev = dbg->mem_tail;
-  dbg->mem_tail = newp;
+  pthread_rwlock_rdlock (&dbg->mem_rwl);
+  newp->prev = dbg->mem_tails[thread_id];
+  dbg->mem_tails[thread_id] = newp;
+  pthread_rwlock_unlock (&dbg->mem_rwl);

   return (void *) result;
 }
-- 
2.23.0.rc1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-16 19:24 [PATCH] libdw: add thread-safety to dwarf_getabbrev() Jonathon Anderson
@ 2019-08-21 11:16 ` Mark Wielaard
  2019-08-21 14:21   ` Jonathon Anderson
       [not found]   ` <1566396518.5389.0@smtp.mail.rice.edu>
  2019-08-21 21:50 ` Mark Wielaard
  2019-08-24 23:24 ` Mark Wielaard
  2 siblings, 2 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-08-21 11:16 UTC (permalink / raw)
  To: Jonathon Anderson, elfutils-devel; +Cc: Srdan Milakovic

Hi Jonathon and Srđan,

On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
> For parallel applications that need the information in the DIEs, the
> Dwarf_Abbrev hash table et al. become a massive data race. This fixes 
> that by:
> 
> 1. Adding atomics & locks to the hash table to manage concurrency
>    (lib/dynamicsizehash_concurrent.{c,h})
> 2. Adding a lock & array structure to the memory manager (pseudo-TLS)
>    (libdwP.h, libdw_alloc.c)
> 3. Adding extra configure options for Helgrind/DRD annotations
>    (configure.ac)
> 4. Including "stdatomic.h" from FreeBSD, to support C11-style   atomics.
>    (lib/stdatomic.h)

This looks like really nice work. Thanks!

I am splitting review in some smaller parts if you don't mind.
Simply because it is large and I cannot keep everything in my head at
once :) But here some initial overall comments.

> Notes:
>  - GCC >= 4.9 provides <stdatomic.h> natively; for those versions
>    lib/stdatomic.h could be removed or disabled. We can also rewrite the
>    file if the copyright becomes an issue.

If the compiler provides stdatomic.h then I think it would be good to
use that instead of our own implementation. The copyright isn't a
problem. But do you have a reference/URL to the upstream version? I
like to add that somewhere, so we can sync with it in the future. I see
various commented out parts. Was that already upstream? Should we just
remove those parts?

>  - Currently the concurrent hash table is always enabled, 
>    performance-wise there is no known difference between it  
>    and the non-concurrent  version.
>    This can be changed to toggle with --enable-thread-safety
>    if preferred.

I would prefer it always enabled, unless there is a massive slowdown of
the single-threaded case. The problem with --enable-thread-safety is
that it is a) known broken (sigh) and b) it basically introduces two
separate libraries that behave subtly differently. I would very much
like to get rid of --enable-thread-safety by fixing the broken locking
and simply making it the default.

>  - Another implementation of #2 above might use dynamic TLS 
>    (pthread_key_*),
>    we chose this implementation to reduce the overall complexity.

Are there any other trade-offs?

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-21 11:16 ` Mark Wielaard
@ 2019-08-21 14:21   ` Jonathon Anderson
  2019-08-23 21:22     ` Mark Wielaard
       [not found]   ` <1566396518.5389.0@smtp.mail.rice.edu>
  1 sibling, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-21 14:21 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

First message failed to send, hopefully this one works...

On Wed, Aug 21, 2019 at 6:16 AM, Mark Wielaard <mark@klomp.org> wrote:
> Hi Jonathon and Srđan,
> 
> On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
>>  For parallel applications that need the information in the DIEs, the
>>  Dwarf_Abbrev hash table et al. become a massive data race. This 
>> fixes
>>  that by:
>> 
>>  1. Adding atomics & locks to the hash table to manage concurrency
>>     (lib/dynamicsizehash_concurrent.{c,h})
>>  2. Adding a lock & array structure to the memory manager 
>> (pseudo-TLS)
>>     (libdwP.h, libdw_alloc.c)
>>  3. Adding extra configure options for Helgrind/DRD annotations
>>     (configure.ac)
>>  4. Including "stdatomic.h" from FreeBSD, to support C11-style   
>> atomics.
>>     (lib/stdatomic.h)
> 
> This looks like really nice work. Thanks!
> 
> I am splitting review in some smaller parts if you don't mind.
> Simply because it is large and I cannot keep everything in my head at
> once :) But here some initial overall comments.

Thank you for looking over it! Comments on comments below.

> 
>>  Notes:
>>   - GCC >= 4.9 provides <stdatomic.h> natively; for those versions
>>     lib/stdatomic.h could be removed or disabled. We can also 
>> rewrite the
>>     file if the copyright becomes an issue.
> 
> If the compiler provides stdatomic.h then I think it would be good to
> use that instead of our own implementation. The copyright isn't a
> problem. But do you have a reference/URL to the upstream version? I
> like to add that somewhere, so we can sync with it in the future. I 
> see
> various commented out parts. Was that already upstream? Should we just
> remove those parts?

It would definitely be preferable to use the compiler's implementation
if possible, we used this in case GCC 4.7 and 4.8 (RHEL7) compatibility
was needed. If those versions are old enough the file can be removed
entirely.

The upstream is at
https://github.com/freebsd/freebsd/blob/master/sys/sys/stdatomic.h,
although the version here appears to be slightly modified (we used the
version that HPCToolkit ships). The components we use don't seem
affected, so a resync shouldn't make a difference.

> 
>>   - Currently the concurrent hash table is always enabled,
>>     performance-wise there is no known difference between it
>>     and the non-concurrent  version.
>>     This can be changed to toggle with --enable-thread-safety
>>     if preferred.
> 
> I would prefer it always enabled, unless there is a massive slowdown 
> of
> the single-threaded case. The problem with --enable-thread-safety is
> that it is a) known broken (sigh) and b) it basically introduces two
> separate libraries that behave subtly differently. I would very much
> like to get rid of --enable-thread-safety by fixing the broken locking
> and simply making it the default.

I haven't noticed any slowdown in the single-threaded case, although I
haven't stressed it hard enough to find out for certain. From a
theoretical standpoint it shouldn't, atomics (with the proper memory
orders) are usually (on x86 at least) as cheap as normal accesses when
used by a single thread, and the algorithm is otherwise effectively the
same as the original hash table.

How difficult would it be to fix the locking (or rather, what's
"broken")? We would definitely benefit from having thread-safety at
least for getters, which would only need locks around the internal
management.

> 
>>   - Another implementation of #2 above might use dynamic TLS
>>     (pthread_key_*),
>>     we chose this implementation to reduce the overall complexity.
> 
> Are there any other trade-offs?

If the application spawns N threads that all use a Dwarf object (same
or different) enough to cause an allocation, and then joins them all,
any Dwarf objects allocated after will allocate N unusable slots in the
mem_tails array. Thus long-running applications (that don't use thread
pools) would start experiencing effects similar to a memory leak, of 1
pointer's worth (8 bytes on 64-bit) per dead thread.

The alternative is to use dynamic TLS so that only threads that are
currently active use the extra memory, assuming libpthread is
sufficiently proactive about reclaiming unused key values. I think if
we assume `dwarf_end` happens-after any memory management (which would
make sense for a destructor), there should be a simple atomic pattern
to handle the freeing, but I would need to sit down for a while to work
out a sufficient proof.

I was also under the impression that dynamic TLS was particularly
expensive performance-wise, although I haven't experimented with it
myself enough to know. The compiler can be a little smarter about
static TLS, and the result is easier to reason about, which is why we
chose this implementation for initial patch. If the alternative would
be preferable we can change it.

> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-21 14:21   ` Jonathon Anderson
@ 2019-08-23 21:22     ` Mark Wielaard
  0 siblings, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-08-23 21:22 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

On Wed, 2019-08-21 at 09:20 -0500, Jonathon Anderson wrote:
> First message failed to send, hopefully this one works...

Just for the record, the mailinglist did reject HTML
messages/attachments. It should have been changed now to simply strip
the HTML.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

[parent not found: <1566396518.5389.0@smtp.mail.rice.edu>]

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
       [not found]   ` <1566396518.5389.0@smtp.mail.rice.edu>
@ 2019-08-23 18:25     ` Mark Wielaard
  2019-08-23 22:36       ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-08-23 18:25 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Wed, 2019-08-21 at 09:08 -0500, Jonathon Anderson wrote:
> On Wed, Aug 21, 2019 at 6:16 AM, Mark Wielaard <mark@klomp.org>
> wrote:On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
> > >  For parallel applications that need the information in the DIEs, the
> > >  Dwarf_Abbrev hash table et al. become a massive data race. This 
> > > fixes
> > >  that by:
> > > 
> > >  1. Adding atomics & locks to the hash table to manage concurrency
> > >     (lib/dynamicsizehash_concurrent.{c,h})
> > >  2. Adding a lock & array structure to the memory manager 
> > > (pseudo-TLS)
> > >     (libdwP.h, libdw_alloc.c)
> > >  3. Adding extra configure options for Helgrind/DRD annotations
> > >     (configure.ac)
> > >  4. Including "stdatomic.h" from FreeBSD, to support C11-style   
> > > atomics.
> > >     (lib/stdatomic.h)
> > 
> > This looks like really nice work. Thanks!
> > 
> > I am splitting review in some smaller parts if you don't mind.
> > Simply because it is large and I cannot keep everything in my head at
> > once :)

BTW. I would prefer to handle this as 4 separate additions, probably in
this order:

1) configure stuff for valgrind annotations.
2) add support for stdatomic.h functions.
3) thread-safe obstack memory handling
4) concurrent dynamic hash table.

> > If the compiler provides stdatomic.h then I think it would be good
> > to
> > use that instead of our own implementation. The copyright isn't a
> > problem. But do you have a reference/URL to the upstream version? I
> > like to add that somewhere, so we can sync with it in the future. I 
> > see
> > various commented out parts. Was that already upstream? Should we just
> > remove those parts?
> 
> It would definitely be preferable to use the compiler's implementation 
> if possible, we used this in case GCC 4.7 and 4.8 (RHEL7) compatibility 
> was needed. If those versions are old enough the file can be removed 
> entirely.
> 
> The upstream is at 
> https://github.com/freebsd/freebsd/blob/master/sys/sys/stdatomic.h, 
> although the version here appears to be slightly modified (we used the 
> version that HPCToolkit ships). The components we use don't seem 
> affected, so a resync shouldn't make a difference.

OK, then we should come up with some kind of configure test to see if
we can use the standard stdatomic.h and otherwise use our own. I am
surprised I cannot find other projects doing this. Would be nice to
"steal" something standard for this.

> > >   - Currently the concurrent hash table is always enabled,
> > >     performance-wise there is no known difference between it
> > >     and the non-concurrent  version.
> > >     This can be changed to toggle with --enable-thread-safety
> > >     if preferred.
> > 
> > I would prefer it always enabled, unless there is a massive slowdown 
> > of
> > the single-threaded case. The problem with --enable-thread-safety is
> > that it is a) known broken (sigh) and b) it basically introduces two
> > separate libraries that behave subtly differently. I would very much
> > like to get rid of --enable-thread-safety by fixing the broken locking
> > and simply making it the default.
> 
> I haven't noticed any slowdown in the single-threaded case, although I 
> haven't stressed it hard enough to find out for certain. From a 
> theoretical standpoint it shouldn't, atomics (with the proper memory 
> orders) are usually (on x86 at least) as cheap as normal accesses when 
> used by a single thread, and the algorithm is otherwise effectively the 
> same as the original hash table.
> 
> How difficult would it be to fix the locking (or rather, what's 
> "broken")? We would definitely benefit from having thread-safety at 
> least for getters, which would only need locks around the internal 
> management.

To be honest I don't know how badly it is broken.
It is only implemented for libelf.
If you configure --enable-thread-safety and make check you will see
several tests fail because they abort with Unexpected error: Resource
deadlock avoided.

I think it is mainly that nobody maintained the locks and now some are
just wrongly placed. Ideally we switch --enable-thread-safety on by
default, identify which locks are wrongly placed, run all tests with
valgrind/hellgrind and fix any issues found.

It really has not been a priority. Sorry.

> > >   - Another implementation of #2 above might use dynamic TLS
> > >     (pthread_key_*),
> > >     we chose this implementation to reduce the overall complexity.
> > 
> > Are there any other trade-offs?
> 
> If the application spawns N threads that all use a Dwarf object (same 
> or different) enough to cause an allocation, and then joins them all, 
> any Dwarf objects allocated after will allocate N unusable slots in the 
> mem_tails array. Thus long-running applications (that don't use thread 
> pools) would start experiencing effects similar to a memory leak, of 1 
> pointer's worth (8 bytes on 64-bit) per dead thread.
> 
> The alternative is to use dynamic TLS so that only threads that are 
> currently active use the extra memory, assuming libpthread is 
> sufficiently proactive about reclaiming unused key values. I think if 
> we assume `dwarf_end` happens-after any memory management (which would 
> make sense for a destructor), there should be a simple atomic pattern 
> to handle the freeing, but I would need to sit down for a while to work 
> out a sufficient proof.
> 
> I was also under the impression that dynamic TLS was particularly 
> expensive performance-wise, although I haven't experimented with it 
> myself enough to know. The compiler can be a little smarter about 
> static TLS, and the result is easier to reason about, which is why we 
> chose this implementation for initial patch. If the alternative would 
> be preferable we can change it.

I thought a bit about this one and although I am slightly worried about
the possible indefinite growing of the thread_ids, I don't think the
"memory leak" is an issue. Concurrent usage of the Dwarf object already
costs a bit more memory (since each thread gets its own memory block if
they allocate at the same time) which is probably larger than any extra
created by reserving space for all possible thread_ids. This is only
really a problem for a program that doesn't use thread pools and keeps
opening and concurrently accessing new Dwarf objects (because at a
certain point the whole Dwarf will have been read and no new
allocations happen). Although it would be nice if we could somehow
reset the next_id to zero in dwarf_end (), when this is the last thread
or Dwarf object.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-23 18:25     ` Mark Wielaard
@ 2019-08-23 22:36       ` Jonathon Anderson
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-23 22:36 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic



On Fri, Aug 23, 2019 at 1:25 PM, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> On Wed, 2019-08-21 at 09:08 -0500, Jonathon Anderson wrote:
>>  On Wed, Aug 21, 2019 at 6:16 AM, Mark Wielaard <mark@klomp.org>
>>  wrote:On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
>>  > >  For parallel applications that need the information in the 
>> DIEs, the
>>  > >  Dwarf_Abbrev hash table et al. become a massive data race. This
>>  > > fixes
>>  > >  that by:
>>  > >
>>  > >  1. Adding atomics & locks to the hash table to manage 
>> concurrency
>>  > >     (lib/dynamicsizehash_concurrent.{c,h})
>>  > >  2. Adding a lock & array structure to the memory manager
>>  > > (pseudo-TLS)
>>  > >     (libdwP.h, libdw_alloc.c)
>>  > >  3. Adding extra configure options for Helgrind/DRD annotations
>>  > >     (configure.ac)
>>  > >  4. Including "stdatomic.h" from FreeBSD, to support C11-style
>>  > > atomics.
>>  > >     (lib/stdatomic.h)
>>  >
>>  > This looks like really nice work. Thanks!
>>  >
>>  > I am splitting review in some smaller parts if you don't mind.
>>  > Simply because it is large and I cannot keep everything in my 
>> head at
>>  > once :)
> 
> BTW. I would prefer to handle this as 4 separate additions, probably 
> in
> this order:
> 
> 1) configure stuff for valgrind annotations.
> 2) add support for stdatomic.h functions.
> 3) thread-safe obstack memory handling
> 4) concurrent dynamic hash table.

Sure thing, I can split the patch into bits over the weekend. I may 
take your advice and just use git request-pull though.

> 
>>  > If the compiler provides stdatomic.h then I think it would be good
>>  > to
>>  > use that instead of our own implementation. The copyright isn't a
>>  > problem. But do you have a reference/URL to the upstream version? 
>> I
>>  > like to add that somewhere, so we can sync with it in the future. 
>> I
>>  > see
>>  > various commented out parts. Was that already upstream? Should we 
>> just
>>  > remove those parts?
>> 
>>  It would definitely be preferable to use the compiler's 
>> implementation
>>  if possible, we used this in case GCC 4.7 and 4.8 (RHEL7) 
>> compatibility
>>  was needed. If those versions are old enough the file can be removed
>>  entirely.
>> 
>>  The upstream is at
>>  https://github.com/freebsd/freebsd/blob/master/sys/sys/stdatomic.h,
>>  although the version here appears to be slightly modified (we used 
>> the
>>  version that HPCToolkit ships). The components we use don't seem
>>  affected, so a resync shouldn't make a difference.
> 
> OK, then we should come up with some kind of configure test to see if
> we can use the standard stdatomic.h and otherwise use our own. I am
> surprised I cannot find other projects doing this. Would be nice to
> "steal" something standard for this.

At least OpenSSL does it: 
https://github.com/openssl/openssl/blob/master/include/internal/refcount.h, 
the interesting note being that it has a series of fallbacks (various 
compiler builtins and then locks). The other projects I skimmed just 
have the fallbacks and don't check for C11, given that Elfutils only 
supports GCC that might be a valid (and more compact) approach.

> 
>>  > >   - Currently the concurrent hash table is always enabled,
>>  > >     performance-wise there is no known difference between it
>>  > >     and the non-concurrent  version.
>>  > >     This can be changed to toggle with --enable-thread-safety
>>  > >     if preferred.
>>  >
>>  > I would prefer it always enabled, unless there is a massive 
>> slowdown
>>  > of
>>  > the single-threaded case. The problem with --enable-thread-safety 
>> is
>>  > that it is a) known broken (sigh) and b) it basically introduces 
>> two
>>  > separate libraries that behave subtly differently. I would very 
>> much
>>  > like to get rid of --enable-thread-safety by fixing the broken 
>> locking
>>  > and simply making it the default.
>> 
>>  I haven't noticed any slowdown in the single-threaded case, 
>> although I
>>  haven't stressed it hard enough to find out for certain. From a
>>  theoretical standpoint it shouldn't, atomics (with the proper memory
>>  orders) are usually (on x86 at least) as cheap as normal accesses 
>> when
>>  used by a single thread, and the algorithm is otherwise effectively 
>> the
>>  same as the original hash table.
>> 
>>  How difficult would it be to fix the locking (or rather, what's
>>  "broken")? We would definitely benefit from having thread-safety at
>>  least for getters, which would only need locks around the internal
>>  management.
> 
> To be honest I don't know how badly it is broken.
> It is only implemented for libelf.
> If you configure --enable-thread-safety and make check you will see
> several tests fail because they abort with Unexpected error: Resource
> deadlock avoided.
> 
> I think it is mainly that nobody maintained the locks and now some are
> just wrongly placed. Ideally we switch --enable-thread-safety on by
> default, identify which locks are wrongly placed, run all tests with
> valgrind/hellgrind and fix any issues found.
> 
> It really has not been a priority. Sorry.

No worries, its not a priority on our end either. Elfutils' codebase is 
significantly simpler (IMHO) than ours, so if it ever comes up we'll 
just submit another patch.

> 
>>  > >   - Another implementation of #2 above might use dynamic TLS
>>  > >     (pthread_key_*),
>>  > >     we chose this implementation to reduce the overall 
>> complexity.
>>  >
>>  > Are there any other trade-offs?
>> 
>>  If the application spawns N threads that all use a Dwarf object 
>> (same
>>  or different) enough to cause an allocation, and then joins them 
>> all,
>>  any Dwarf objects allocated after will allocate N unusable slots in 
>> the
>>  mem_tails array. Thus long-running applications (that don't use 
>> thread
>>  pools) would start experiencing effects similar to a memory leak, 
>> of 1
>>  pointer's worth (8 bytes on 64-bit) per dead thread.
>> 
>>  The alternative is to use dynamic TLS so that only threads that are
>>  currently active use the extra memory, assuming libpthread is
>>  sufficiently proactive about reclaiming unused key values. I think 
>> if
>>  we assume `dwarf_end` happens-after any memory management (which 
>> would
>>  make sense for a destructor), there should be a simple atomic 
>> pattern
>>  to handle the freeing, but I would need to sit down for a while to 
>> work
>>  out a sufficient proof.
>> 
>>  I was also under the impression that dynamic TLS was particularly
>>  expensive performance-wise, although I haven't experimented with it
>>  myself enough to know. The compiler can be a little smarter about
>>  static TLS, and the result is easier to reason about, which is why 
>> we
>>  chose this implementation for initial patch. If the alternative 
>> would
>>  be preferable we can change it.
> 
> I thought a bit about this one and although I am slightly worried 
> about
> the possible indefinite growing of the thread_ids, I don't think the
> "memory leak" is an issue. Concurrent usage of the Dwarf object 
> already
> costs a bit more memory (since each thread gets its own memory block 
> if
> they allocate at the same time) which is probably larger than any 
> extra
> created by reserving space for all possible thread_ids. This is only
> really a problem for a program that doesn't use thread pools and keeps
> opening and concurrently accessing new Dwarf objects (because at a
> certain point the whole Dwarf will have been read and no new
> allocations happen). Although it would be nice if we could somehow
> reset the next_id to zero in dwarf_end (), when this is the last 
> thread
> or Dwarf object.

The memory overhead is a little worse than that (each thread allocates 
its own memory blocks, period), but that would be present in both 
implementations. I can't think of a simple way to increase the memory 
efficiency past that (although I can think of some ridiculously complex 
ways).

I suppose it would be possible to use a sort of free-list for the IDs, 
although that requires a hook at thread exit (doable with PThreads, not 
with C11) and cleaning up would be a bit of a nightmare. At some point 
dynamic TLS is more robust against weird situations (and, if my 
thought-proof is correct, simpler).

> 
> Cheers,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-16 19:24 [PATCH] libdw: add thread-safety to dwarf_getabbrev() Jonathon Anderson
  2019-08-21 11:16 ` Mark Wielaard
@ 2019-08-21 21:50 ` Mark Wielaard
  2019-08-21 22:01   ` Mark Wielaard
  2019-08-21 22:21   ` Jonathon Anderson
  2019-08-24 23:24 ` Mark Wielaard
  2 siblings, 2 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-08-21 21:50 UTC (permalink / raw)
  To: Jonathon Anderson, elfutils-devel; +Cc: Srdan Milakovic

On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
> diff --git a/ChangeLog b/ChangeLog
> index bed3999f..93907ddd 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,3 +1,8 @@
> +2019-08-15  Jonathon Anderson <jma14@rice.edu>
> +
> +	* configure.ac: Add new --enable-valgrind-annotations
> +	* configure.ac: Add new --with-valgrind (headers only)
> +
>  2019-08-13  Mark Wielaard  <mark@klomp.org>
> 
>  	* configure.ac: Set version to 0.177.
> diff --git a/configure.ac b/configure.ac
> index c443fa3b..c5406b44 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -323,6 +323,35 @@ if test "$use_valgrind" = yes; then
>  fi
>  AM_CONDITIONAL(USE_VALGRIND, test "$use_valgrind" = yes)
> 
> +AC_ARG_WITH([valgrind],
> +AS_HELP_STRING([--with-valgrind],[include directory for Valgrind 
> headers]),
> +[with_valgrind_headers=$withval], [with_valgrind_headers=no])
> +if test "x$with_valgrind_headers" != xno; then
> +    save_CFLAGS="$CFLAGS"
> +    CFLAGS="$CFLAGS -I$with_valgrind_headers"
> +    AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
> +      #include <valgrind/valgrind.h>
> +      int main() { return 0; }
> +    ]])], [ HAVE_VALGRIND_HEADERS="yes"
> +            CFLAGS="$save_CFLAGS -I$with_valgrind_headers" ],
> +          [ AC_MSG_ERROR([invalid valgrind include directory: 
> $with_valgrind_headers]) ])
> +fi
> +
> +AC_ARG_ENABLE([valgrind-annotations],
> +AS_HELP_STRING([--enable-valgrind-annotations],[insert extra 
> annotations for better valgrind support]),
> +[use_vg_annotations=$enableval], [use_vg_annotations=no])
> +if test "$use_vg_annotations" = yes; then
> +    if test "x$HAVE_VALGRIND_HEADERS" != "xyes"; then
> +      AC_MSG_CHECKING([whether Valgrind headers are available])
> +      AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
> +        #include <valgrind/valgrind.h>
> +        int main() { return 0; }
> +      ]])], [ AC_MSG_RESULT([yes]) ],
> +            [ AC_MSG_ERROR([valgrind annotations requested but no 
> headers are available]) ])
> +    fi
> +fi
> +AM_CONDITIONAL(USE_VG_ANNOTATIONS, test "$use_vg_annotations" = yes)
> +
>  AC_ARG_ENABLE([install-elfh],
>  AS_HELP_STRING([--enable-install-elfh],[install elf.h in include dir]),
>                 [install_elfh=$enableval], [install_elfh=no])
> @@ -668,6 +697,7 @@ AC_MSG_NOTICE([
>    OTHER FEATURES
>      Deterministic archives by default  : ${default_ar_deterministic}
>      Native language support            : ${USE_NLS}
> +    Extra Valgrind annotations         : ${use_vg_annotations}
> 
>    EXTRA TEST FEATURES (used with make check)
>      have bunzip2 installed (required)  : ${HAVE_BUNZIP2}

This part sets up things so we can include extra valgrind annotations,
but then doesn't seem to be used. It sounds useful though, because
valgrind/helgrind won't know about any of the atomics. Is this
something you added, but then removed?

Thanks,

Mark

P.S. It looks like something decided to add some line breaks in the
patch so that it doesn't easily apply. It isn't hard to fixup, but you
might want to consider submitting using git send-email or attaching the
result of git format-patch instead of putting the patch in the message
body.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-21 21:50 ` Mark Wielaard
@ 2019-08-21 22:01   ` Mark Wielaard
  2019-08-21 22:21   ` Jonathon Anderson
  1 sibling, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-08-21 22:01 UTC (permalink / raw)
  To: Jonathon Anderson, elfutils-devel; +Cc: Srdan Milakovic

On Wed, 2019-08-21 at 23:50 +0200, Mark Wielaard wrote:
> On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
> > @@ -668,6 +697,7 @@ AC_MSG_NOTICE([
> >    OTHER FEATURES
> >      Deterministic archives by default  :
> > ${default_ar_deterministic}
> >      Native language support            : ${USE_NLS}
> > +    Extra Valgrind annotations         : ${use_vg_annotations}
> > 
> >    EXTRA TEST FEATURES (used with make check)
> >      have bunzip2 installed (required)  : ${HAVE_BUNZIP2}
> 
> This part sets up things so we can include extra valgrind
> annotations,
> but then doesn't seem to be used. It sounds useful though, because
> valgrind/helgrind won't know about any of the atomics. Is this
> something you added, but then removed?

Ah, I should have read to the end, sorry. I see you then use them in
libdw/libdw_alloc.c as happens before/after annotations.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-21 21:50 ` Mark Wielaard
  2019-08-21 22:01   ` Mark Wielaard
@ 2019-08-21 22:21   ` Jonathon Anderson
  2019-08-23 21:26     ` Mark Wielaard
  1 sibling, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-21 22:21 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

[-- Attachment #1: Type: text/plain, Size: 3603 bytes --]



On Wed, Aug 21, 2019 at 4:50 PM, Mark Wielaard <mark@klomp.org> wrote:
> On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
>>  diff --git a/ChangeLog b/ChangeLog
>>  index bed3999f..93907ddd 100644
>>  --- a/ChangeLog
>>  +++ b/ChangeLog
>>  @@ -1,3 +1,8 @@
>>  +2019-08-15  Jonathon Anderson <jma14@rice.edu>
>>  +
>>  +	* configure.ac: Add new --enable-valgrind-annotations
>>  +	* configure.ac: Add new --with-valgrind (headers only)
>>  +
>>   2019-08-13  Mark Wielaard  <mark@klomp.org>
>> 
>>   	* configure.ac: Set version to 0.177.
>>  diff --git a/configure.ac b/configure.ac
>>  index c443fa3b..c5406b44 100644
>>  --- a/configure.ac
>>  +++ b/configure.ac
>>  @@ -323,6 +323,35 @@ if test "$use_valgrind" = yes; then
>>   fi
>>   AM_CONDITIONAL(USE_VALGRIND, test "$use_valgrind" = yes)
>> 
>>  +AC_ARG_WITH([valgrind],
>>  +AS_HELP_STRING([--with-valgrind],[include directory for Valgrind
>>  headers]),
>>  +[with_valgrind_headers=$withval], [with_valgrind_headers=no])
>>  +if test "x$with_valgrind_headers" != xno; then
>>  +    save_CFLAGS="$CFLAGS"
>>  +    CFLAGS="$CFLAGS -I$with_valgrind_headers"
>>  +    AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
>>  +      #include <valgrind/valgrind.h>
>>  +      int main() { return 0; }
>>  +    ]])], [ HAVE_VALGRIND_HEADERS="yes"
>>  +            CFLAGS="$save_CFLAGS -I$with_valgrind_headers" ],
>>  +          [ AC_MSG_ERROR([invalid valgrind include directory:
>>  $with_valgrind_headers]) ])
>>  +fi
>>  +
>>  +AC_ARG_ENABLE([valgrind-annotations],
>>  +AS_HELP_STRING([--enable-valgrind-annotations],[insert extra
>>  annotations for better valgrind support]),
>>  +[use_vg_annotations=$enableval], [use_vg_annotations=no])
>>  +if test "$use_vg_annotations" = yes; then
>>  +    if test "x$HAVE_VALGRIND_HEADERS" != "xyes"; then
>>  +      AC_MSG_CHECKING([whether Valgrind headers are available])
>>  +      AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
>>  +        #include <valgrind/valgrind.h>
>>  +        int main() { return 0; }
>>  +      ]])], [ AC_MSG_RESULT([yes]) ],
>>  +            [ AC_MSG_ERROR([valgrind annotations requested but no
>>  headers are available]) ])
>>  +    fi
>>  +fi
>>  +AM_CONDITIONAL(USE_VG_ANNOTATIONS, test "$use_vg_annotations" = 
>> yes)
>>  +
>>   AC_ARG_ENABLE([install-elfh],
>>   AS_HELP_STRING([--enable-install-elfh],[install elf.h in include 
>> dir]),
>>                  [install_elfh=$enableval], [install_elfh=no])
>>  @@ -668,6 +697,7 @@ AC_MSG_NOTICE([
>>     OTHER FEATURES
>>       Deterministic archives by default  : 
>> ${default_ar_deterministic}
>>       Native language support            : ${USE_NLS}
>>  +    Extra Valgrind annotations         : ${use_vg_annotations}
>> 
>>     EXTRA TEST FEATURES (used with make check)
>>       have bunzip2 installed (required)  : ${HAVE_BUNZIP2}
> 
> This part sets up things so we can include extra valgrind annotations,
> but then doesn't seem to be used. It sounds useful though, because
> valgrind/helgrind won't know about any of the atomics. Is this
> something you added, but then removed?
> 
> Thanks,
> 
> Mark
> 
> P.S. It looks like something decided to add some line breaks in the
> patch so that it doesn't easily apply. It isn't hard to fixup, but you
> might want to consider submitting using git send-email or attaching 
> the
> result of git format-patch instead of putting the patch in the message
> body.

Originally I had some issues with git send-mail, I usually do PRs 
purely in git so the email side is still a little new. I've attached 
the original patch from git format-patch, sorry for the mess.

-Jonathon

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-libdw-add-thread-safety-to-dwarf_getabbrev.patch --]
[-- Type: text/x-patch, Size: 52969 bytes --]

From 9077c0df713e9adfdee7fe1f66005453316842de Mon Sep 17 00:00:00 2001
From: Jonathon Anderson <jma14@rice.edu>
Date: Thu, 8 Aug 2019 09:01:56 -0500
Subject: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

For applications that need the information in the DIEs, the Dwarf_Abbrev
hash table et al. becomes a massive data race. This fixes that by:

1. Adding atomics & locks to the hash table to manage concurrency
   (lib/dynamicsizehash_concurrent.{c,h})
2. Adding a lock & array structure the the memory manager (pseudo-TLS)
   (libdwP.h, libdw_alloc.c)
3. Adding extra configure options for Helgrind/DRD annotations
   (configure.ac)
4. Including "stdatomic.h" from FreeBSD, to support C11-style atomics.
   (lib/stdatomic.h)

Signed-off-by: Jonathon Anderson <jma14@rice.edu>
Signed-off-by: Srđan Milaković <sm108@rice.edu>
---
 ChangeLog                        |   5 +
 configure.ac                     |  30 ++
 lib/ChangeLog                    |   6 +
 lib/Makefile.am                  |   5 +-
 lib/dynamicsizehash_concurrent.c | 522 +++++++++++++++++++++++++++++++
 lib/dynamicsizehash_concurrent.h | 118 +++++++
 lib/stdatomic.h                  | 442 ++++++++++++++++++++++++++
 libdw/ChangeLog                  |   9 +
 libdw/Makefile.am                |   2 +-
 libdw/dwarf_abbrev_hash.c        |   2 +-
 libdw/dwarf_abbrev_hash.h        |   2 +-
 libdw/dwarf_begin_elf.c          |  13 +-
 libdw/dwarf_end.c                |  24 +-
 libdw/libdwP.h                   |  13 +-
 libdw/libdw_alloc.c              |  70 ++++-
 15 files changed, 1240 insertions(+), 23 deletions(-)
 create mode 100644 lib/dynamicsizehash_concurrent.c
 create mode 100644 lib/dynamicsizehash_concurrent.h
 create mode 100644 lib/stdatomic.h

diff --git a/ChangeLog b/ChangeLog
index bed3999f..93907ddd 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2019-08-15  Jonathon Anderson <jma14@rice.edu>
+
+	* configure.ac: Add new --enable-valgrind-annotations
+	* configure.ac: Add new --with-valgrind (headers only)
+
 2019-08-13  Mark Wielaard  <mark@klomp.org>
 
 	* configure.ac: Set version to 0.177.
diff --git a/configure.ac b/configure.ac
index c443fa3b..c5406b44 100644
--- a/configure.ac
+++ b/configure.ac
@@ -323,6 +323,35 @@ if test "$use_valgrind" = yes; then
 fi
 AM_CONDITIONAL(USE_VALGRIND, test "$use_valgrind" = yes)
 
+AC_ARG_WITH([valgrind],
+AS_HELP_STRING([--with-valgrind],[include directory for Valgrind headers]),
+[with_valgrind_headers=$withval], [with_valgrind_headers=no])
+if test "x$with_valgrind_headers" != xno; then
+    save_CFLAGS="$CFLAGS"
+    CFLAGS="$CFLAGS -I$with_valgrind_headers"
+    AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
+      #include <valgrind/valgrind.h>
+      int main() { return 0; }
+    ]])], [ HAVE_VALGRIND_HEADERS="yes"
+            CFLAGS="$save_CFLAGS -I$with_valgrind_headers" ],
+          [ AC_MSG_ERROR([invalid valgrind include directory: $with_valgrind_headers]) ])
+fi
+
+AC_ARG_ENABLE([valgrind-annotations],
+AS_HELP_STRING([--enable-valgrind-annotations],[insert extra annotations for better valgrind support]),
+[use_vg_annotations=$enableval], [use_vg_annotations=no])
+if test "$use_vg_annotations" = yes; then
+    if test "x$HAVE_VALGRIND_HEADERS" != "xyes"; then
+      AC_MSG_CHECKING([whether Valgrind headers are available])
+      AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
+        #include <valgrind/valgrind.h>
+        int main() { return 0; }
+      ]])], [ AC_MSG_RESULT([yes]) ],
+            [ AC_MSG_ERROR([valgrind annotations requested but no headers are available]) ])
+    fi
+fi
+AM_CONDITIONAL(USE_VG_ANNOTATIONS, test "$use_vg_annotations" = yes)
+
 AC_ARG_ENABLE([install-elfh],
 AS_HELP_STRING([--enable-install-elfh],[install elf.h in include dir]),
                [install_elfh=$enableval], [install_elfh=no])
@@ -668,6 +697,7 @@ AC_MSG_NOTICE([
   OTHER FEATURES
     Deterministic archives by default  : ${default_ar_deterministic}
     Native language support            : ${USE_NLS}
+    Extra Valgrind annotations         : ${use_vg_annotations}
 
   EXTRA TEST FEATURES (used with make check)
     have bunzip2 installed (required)  : ${HAVE_BUNZIP2}
diff --git a/lib/ChangeLog b/lib/ChangeLog
index 7381860c..e6d08509 100644
--- a/lib/ChangeLog
+++ b/lib/ChangeLog
@@ -1,3 +1,9 @@
+2019-08-08  Jonathon Anderson  <jma14@rice.edu>
+
+	* dynamicsizehash_concurrent.{c,h}: New files.
+	* stdatomic.h: New file, taken from FreeBSD.
+	* Makefile.am (noinst_HEADERS): Added *.h above.
+
 2019-05-03  Rosen Penev  <rosenp@gmail.com>
 
 	* color.c (parse_opt): Cast program_invocation_short_name to char *.
diff --git a/lib/Makefile.am b/lib/Makefile.am
index 36d21a07..af7228b9 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -38,8 +38,9 @@ libeu_a_SOURCES = xstrdup.c xstrndup.c xmalloc.c next_prime.c \
 		  color.c printversion.c
 
 noinst_HEADERS = fixedsizehash.h libeu.h system.h dynamicsizehash.h list.h \
-		 eu-config.h color.h printversion.h bpf.h
-EXTRA_DIST = dynamicsizehash.c
+		 eu-config.h color.h printversion.h bpf.h \
+		 dynamicsizehash_concurrent.h stdatomic.h
+EXTRA_DIST = dynamicsizehash.c dynamicsizehash_concurrent.c
 
 if !GPROF
 xmalloc_CFLAGS = -ffunction-sections
diff --git a/lib/dynamicsizehash_concurrent.c b/lib/dynamicsizehash_concurrent.c
new file mode 100644
index 00000000..d645b143
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.c
@@ -0,0 +1,522 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <stdlib.h>
+#include <system.h>
+#include <pthread.h>
+
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+   COMPARE   comparison function taking two pointers to TYPE objects
+
+   The following macros if present select features:
+
+   ITERATE   iterating over the table entries is possible
+   REVERSE   iterate in reverse order of insert
+ */
+
+
+static size_t
+lookup (NAME *htab, HASHTYPE hval, TYPE val __attribute__ ((unused)))
+{
+  /* First hash function: simply take the modul but prevent zero.  Small values
+      can skip the division, which helps performance when this is common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+#if COMPARE != 0  /* A handful of tables don't actually compare the entries in
+                    the table, they instead rely on the hash.  In that case, we
+                    can skip parts that relate to the value. */
+  TYPE val_ptr;
+#endif
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    {
+#if COMPARE == 0
+      return idx;
+#else
+      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                            memory_order_acquire);
+      if (COMPARE(val_ptr, val) == 0)
+          return idx;
+#endif
+    }
+  else if (hash == 0)
+    {
+      return 0;
+    }
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+        {
+#if COMPARE == 0
+          return idx;
+#else
+          val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                                memory_order_acquire);
+          if (COMPARE(val_ptr, val) == 0)
+              return idx;
+#endif
+        }
+      else if (hash == 0)
+        {
+          return 0;
+        }
+    }
+}
+
+static int
+insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
+{
+  /* First hash function: simply take the modul but prevent zero.  Small values
+      can skip the division, which helps performance when this is common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+  TYPE val_ptr;
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    {
+      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                            memory_order_acquire);
+      if (COMPARE(val_ptr, val) != 0)
+          return -1;
+    }
+  else if (hash == 0)
+    {
+      val_ptr = NULL;
+      atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                              (uintptr_t *) &val_ptr,
+                                              (uintptr_t) val,
+                                              memory_order_acquire,
+                                              memory_order_acquire);
+
+      if (val_ptr == NULL)
+        {
+          atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                memory_order_release);
+          return 0;
+        }
+      else
+        {
+          do
+            {
+              hash = atomic_load_explicit(&htab->table[idx].val_ptr,
+                                          memory_order_acquire);
+            }
+          while (hash == 0);
+        }
+    }
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+        {
+          val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                                memory_order_acquire);
+          if (COMPARE(val_ptr, val) != 0)
+              return -1;
+        }
+      else if (hash == 0)
+        {
+          val_ptr = NULL;
+          atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                                  (uintptr_t *) &val_ptr,
+                                                  (uintptr_t) val,
+                                                  memory_order_acquire,
+                                                  memory_order_acquire);
+
+          if (val_ptr == NULL)
+            {
+              atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                    memory_order_release);
+              return 0;
+            }
+          else
+            {
+              do
+                {
+                  hash = atomic_load_explicit(&htab->table[idx].val_ptr,
+                                              memory_order_acquire);
+                }
+              while (hash == 0);
+            }
+        }
+    }
+}
+
+#define NO_RESIZING 0u
+#define ALLOCATING_MEMORY 1u
+#define MOVING_DATA 3u
+#define CLEANING 2u
+
+#define STATE_BITS 2u
+#define STATE_INCREMENT (1u << STATE_BITS)
+#define STATE_MASK (STATE_INCREMENT - 1)
+#define GET_STATE(A) ((A) & STATE_MASK)
+
+#define IS_NO_RESIZE_OR_CLEANING(A) (((A) & 0x1u) == 0)
+
+#define GET_ACTIVE_WORKERS(A) ((A) >> STATE_BITS)
+
+#define INITIALIZATION_BLOCK_SIZE 256
+#define MOVE_BLOCK_SIZE 256
+#define CEIL(A, B) (((A) + (B) - 1) / (B))
+
+/* Initializes records and copies the data from the old table.
+   It can share work with other threads */
+static void resize_helper(NAME *htab, int blocking)
+{
+  size_t num_old_blocks = CEIL(htab->old_size, MOVE_BLOCK_SIZE);
+  size_t num_new_blocks = CEIL(htab->size, INITIALIZATION_BLOCK_SIZE);
+
+  size_t my_block;
+  size_t num_finished_blocks = 0;
+
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_init_block, 1,
+                                                memory_order_acquire))
+                                                    < num_new_blocks)
+    {
+      size_t record_it = my_block * INITIALIZATION_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * INITIALIZATION_BLOCK_SIZE;
+      if (record_end > htab->size)
+          record_end = htab->size;
+
+      while (record_it++ != record_end)
+        {
+          atomic_init(&htab->table[record_it].hashval, (uintptr_t) NULL);
+          atomic_init(&htab->table[record_it].val_ptr, (uintptr_t) NULL);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_initialized_blocks,
+                            num_finished_blocks, memory_order_release);
+  while (atomic_load_explicit(&htab->num_initialized_blocks,
+                              memory_order_acquire) != num_new_blocks);
+
+  /* All block are initialized, start moving */
+  num_finished_blocks = 0;
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_move_block, 1,
+                                                memory_order_acquire))
+                                                    < num_old_blocks)
+    {
+      size_t record_it = my_block * MOVE_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * MOVE_BLOCK_SIZE;
+      if (record_end > htab->old_size)
+          record_end = htab->old_size;
+
+      while (record_it++ != record_end)
+        {
+          TYPE val_ptr = (TYPE) atomic_load_explicit(
+              &htab->old_table[record_it].val_ptr,
+              memory_order_acquire);
+          if (val_ptr == NULL)
+              continue;
+
+          HASHTYPE hashval = atomic_load_explicit(
+              &htab->old_table[record_it].hashval,
+              memory_order_acquire);
+          assert(hashval);
+
+          insert_helper(htab, hashval, val_ptr);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_moved_blocks, num_finished_blocks,
+                            memory_order_release);
+
+  if (blocking)
+      while (atomic_load_explicit(&htab->num_moved_blocks,
+                                  memory_order_acquire) != num_old_blocks);
+}
+
+static void
+resize_master(NAME *htab)
+{
+  htab->old_size = htab->size;
+  htab->old_table = htab->table;
+
+  htab->size = next_prime(htab->size * 2);
+  htab->table = malloc((1 + htab->size) * sizeof(htab->table[0]));
+  assert(htab->table);
+
+  /* Change state from ALLOCATING_MEMORY to MOVING_DATA */
+  atomic_fetch_xor_explicit(&htab->resizing_state,
+                            ALLOCATING_MEMORY ^ MOVING_DATA,
+                            memory_order_release);
+
+  resize_helper(htab, 1);
+
+  /* Change state from MOVING_DATA to CLEANING */
+  size_t resize_state = atomic_fetch_xor_explicit(&htab->resizing_state,
+                                                  MOVING_DATA ^ CLEANING,
+                                                  memory_order_acq_rel);
+  while (GET_ACTIVE_WORKERS(resize_state) != 0)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* There are no more active workers */
+  atomic_store_explicit(&htab->next_init_block, 0, memory_order_relaxed);
+  atomic_store_explicit(&htab->num_initialized_blocks, 0,
+                        memory_order_relaxed);
+
+  atomic_store_explicit(&htab->next_move_block, 0, memory_order_relaxed);
+  atomic_store_explicit(&htab->num_moved_blocks, 0, memory_order_relaxed);
+
+  free(htab->old_table);
+
+  /* Change state to NO_RESIZING */
+  atomic_fetch_xor_explicit(&htab->resizing_state, CLEANING ^ NO_RESIZING,
+                            memory_order_relaxed);
+
+}
+
+static void
+resize_worker(NAME *htab)
+{
+  size_t resize_state = atomic_load_explicit(&htab->resizing_state,
+                                              memory_order_acquire);
+
+  /* If the resize has finished */
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+      return;
+
+  /* Register as worker and check if the resize has finished in the meantime*/
+  resize_state = atomic_fetch_add_explicit(&htab->resizing_state,
+                                            STATE_INCREMENT,
+                                            memory_order_acquire);
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  /* Wait while the new table is being allocated. */
+  while (GET_STATE(resize_state) == ALLOCATING_MEMORY)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* Check if the resize is done */
+  assert(GET_STATE(resize_state) != NO_RESIZING);
+  if (GET_STATE(resize_state) == CLEANING)
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  resize_helper(htab, 0);
+
+  /* Deregister worker */
+  atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                            memory_order_release);
+}
+
+
+int
+#define INIT(name) _INIT (name)
+#define _INIT(name) \
+  name##_init
+INIT(NAME) (NAME *htab, size_t init_size)
+{
+  /* We need the size to be a prime.  */
+  init_size = next_prime (init_size);
+
+  /* Initialize the data structure.  */
+  htab->size = init_size;
+  atomic_init(&htab->filled, 0);
+  atomic_init(&htab->resizing_state, 0);
+
+  atomic_init(&htab->next_init_block, 0);
+  atomic_init(&htab->num_initialized_blocks, 0);
+
+  atomic_init(&htab->next_move_block, 0);
+  atomic_init(&htab->num_moved_blocks, 0);
+
+  pthread_rwlock_init(&htab->resize_rwl, NULL);
+
+  htab->table = (void *) malloc ((init_size + 1) * sizeof (htab->table[0]));
+  if (htab->table == NULL)
+      return -1;
+
+  for (size_t i = 0; i <= init_size; i++)
+    {
+      atomic_init(&htab->table[i].hashval, (uintptr_t) NULL);
+      atomic_init(&htab->table[i].val_ptr, (uintptr_t) NULL);
+    }
+
+  return 0;
+}
+
+
+int
+#define FREE(name) _FREE (name)
+#define _FREE(name) \
+name##_free
+FREE(NAME) (NAME *htab)
+{
+  pthread_rwlock_destroy(&htab->resize_rwl);
+  free (htab->table);
+  return 0;
+}
+
+
+int
+#define INSERT(name) _INSERT (name)
+#define _INSERT(name) \
+name##_insert
+INSERT(NAME) (NAME *htab, HASHTYPE hval, TYPE data)
+{
+  int incremented = 0;
+
+  for(;;)
+    {
+      while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+          resize_worker(htab);
+
+      size_t filled;
+      if (!incremented)
+        {
+          filled = atomic_fetch_add_explicit(&htab->filled, 1,
+                                              memory_order_acquire);
+          incremented = 1;
+        }
+      else
+        {
+          filled = atomic_load_explicit(&htab->filled,
+                                        memory_order_acquire);
+        }
+
+
+      if (100 * filled > 90 * htab->size)
+        {
+          /* Table is filled more than 90%.  Resize the table.  */
+
+          size_t resizing_state = atomic_load_explicit(&htab->resizing_state,
+                                                        memory_order_acquire);
+          if (resizing_state == 0 &&
+              atomic_compare_exchange_strong_explicit(&htab->resizing_state,
+                                                      &resizing_state,
+                                                      ALLOCATING_MEMORY,
+                                                      memory_order_acquire,
+                                                      memory_order_acquire))
+            {
+              /* Master thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+              pthread_rwlock_wrlock(&htab->resize_rwl);
+              resize_master(htab);
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+            }
+          else
+            {
+              /* Worker thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+              resize_worker(htab);
+            }
+        }
+      else
+        {
+          /* Lock acquired, no need for resize*/
+          break;
+        }
+    }
+
+  int ret_val = insert_helper(htab, hval, data);
+  if (ret_val == -1)
+      atomic_fetch_sub_explicit(&htab->filled, 1, memory_order_relaxed);
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
+
+
+
+TYPE
+#define FIND(name) _FIND (name)
+#define _FIND(name) \
+  name##_find
+FIND(NAME) (NAME *htab, HASHTYPE hval, TYPE val)
+{
+  while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+      resize_worker(htab);
+
+  size_t idx;
+
+  /* Make the hash data nonzero.  */
+  hval = hval ?: 1;
+  idx = lookup(htab, hval, val);
+
+  if (idx == 0)
+    {
+      pthread_rwlock_unlock(&htab->resize_rwl);
+      return NULL;
+    }
+
+  /* get a copy before unlocking the lock */
+  TYPE ret_val = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                             memory_order_relaxed);
+
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
+
diff --git a/lib/dynamicsizehash_concurrent.h b/lib/dynamicsizehash_concurrent.h
new file mode 100644
index 00000000..eb06ac9e
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.h
@@ -0,0 +1,118 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <stddef.h>
+#include <pthread.h>
+#include "stdatomic.h"
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+
+   The following macros if present select features:
+
+   ITERATE   iterating over the table entries is possible
+   HASHTYPE  integer type for hash values, default unsigned long int
+ */
+
+
+
+#ifndef HASHTYPE
+# define HASHTYPE unsigned long int
+#endif
+
+#ifndef RESIZE_BLOCK_SIZE
+# define RESIZE_BLOCK_SIZE 256
+#endif
+
+/* Defined separately.  */
+extern size_t next_prime (size_t seed);
+
+
+/* Table entry type.  */
+#define _DYNHASHCONENTTYPE(name)       \
+  typedef struct name##_ent         \
+  {                                 \
+    _Atomic(HASHTYPE) hashval;      \
+    atomic_uintptr_t val_ptr;       \
+  } name##_ent
+#define DYNHASHENTTYPE(name) _DYNHASHCONENTTYPE (name)
+DYNHASHENTTYPE (NAME);
+
+/* Type of the dynamic hash table data structure.  */
+#define _DYNHASHCONTYPE(name) \
+typedef struct                                     \
+{                                                  \
+  size_t size;                                     \
+  size_t old_size;                                 \
+  atomic_size_t filled;                            \
+  name##_ent *table;                               \
+  name##_ent *old_table;                           \
+  atomic_size_t resizing_state;                    \
+  atomic_size_t next_init_block;                   \
+  atomic_size_t num_initialized_blocks;            \
+  atomic_size_t next_move_block;                   \
+  atomic_size_t num_moved_blocks;                  \
+  pthread_rwlock_t resize_rwl;                     \
+} name
+#define DYNHASHTYPE(name) _DYNHASHCONTYPE (name)
+DYNHASHTYPE (NAME);
+
+
+
+#define _FUNCTIONS(name)                                            \
+/* Initialize the hash table.  */                                   \
+extern int name##_init (name *htab, size_t init_size);              \
+                                                                    \
+/* Free resources allocated for hash table.  */                     \
+extern int name##_free (name *htab);                                \
+                                                                    \
+/* Insert new entry.  */                                            \
+extern int name##_insert (name *htab, HASHTYPE hval, TYPE data);    \
+                                                                    \
+/* Find entry in hash table.  */                                    \
+extern TYPE name##_find (name *htab, HASHTYPE hval, TYPE val);
+#define FUNCTIONS(name) _FUNCTIONS (name)
+FUNCTIONS (NAME)
+
+
+#ifndef NO_UNDEF
+# undef DYNHASHENTTYPE
+# undef DYNHASHTYPE
+# undef FUNCTIONS
+# undef _FUNCTIONS
+# undef XFUNCTIONS
+# undef _XFUNCTIONS
+# undef NAME
+# undef TYPE
+# undef ITERATE
+# undef COMPARE
+# undef FIRST
+# undef NEXT
+#endif
diff --git a/lib/stdatomic.h b/lib/stdatomic.h
new file mode 100644
index 00000000..49626662
--- /dev/null
+++ b/lib/stdatomic.h
@@ -0,0 +1,442 @@
+/*-
+ * Copyright (c) 2011 Ed Schouten <ed@FreeBSD.org>
+ *                    David Chisnall <theraven@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef _STDATOMIC_H_
+#define	_STDATOMIC_H_
+
+#include <stddef.h>
+#include <stdint.h>
+
+#if !defined(__has_feature)
+#define __has_feature(x) 0
+#endif
+#if !defined(__has_builtin)
+#define __has_builtin(x) 0
+#endif
+#if !defined(__GNUC_PREREQ__)
+#if defined(__GNUC__) && defined(__GNUC_MINOR__)
+#define __GNUC_PREREQ__(maj, min)					\
+	((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
+#else
+#define __GNUC_PREREQ__(maj, min) 0
+#endif
+#endif
+
+#if !defined(__CLANG_ATOMICS) && !defined(__GNUC_ATOMICS)
+#if __has_feature(c_atomic)
+#define	__CLANG_ATOMICS
+#elif __GNUC_PREREQ__(4, 7)
+#define	__GNUC_ATOMICS
+#elif !defined(__GNUC__)
+#error "stdatomic.h does not support your compiler"
+#endif
+#endif
+
+/*
+ * language independent type to represent a Boolean value
+ */
+
+typedef int __Bool;
+
+/*
+ * 7.17.1 Atomic lock-free macros.
+ */
+
+#ifdef __GCC_ATOMIC_BOOL_LOCK_FREE
+#define	ATOMIC_BOOL_LOCK_FREE		__GCC_ATOMIC_BOOL_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR_LOCK_FREE
+#define	ATOMIC_CHAR_LOCK_FREE		__GCC_ATOMIC_CHAR_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR16_T_LOCK_FREE
+#define	ATOMIC_CHAR16_T_LOCK_FREE	__GCC_ATOMIC_CHAR16_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR32_T_LOCK_FREE
+#define	ATOMIC_CHAR32_T_LOCK_FREE	__GCC_ATOMIC_CHAR32_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_WCHAR_T_LOCK_FREE
+#define	ATOMIC_WCHAR_T_LOCK_FREE	__GCC_ATOMIC_WCHAR_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_SHORT_LOCK_FREE
+#define	ATOMIC_SHORT_LOCK_FREE		__GCC_ATOMIC_SHORT_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_INT_LOCK_FREE
+#define	ATOMIC_INT_LOCK_FREE		__GCC_ATOMIC_INT_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_LONG_LOCK_FREE
+#define	ATOMIC_LONG_LOCK_FREE		__GCC_ATOMIC_LONG_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_LLONG_LOCK_FREE
+#define	ATOMIC_LLONG_LOCK_FREE		__GCC_ATOMIC_LLONG_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_POINTER_LOCK_FREE
+#define	ATOMIC_POINTER_LOCK_FREE	__GCC_ATOMIC_POINTER_LOCK_FREE
+#endif
+
+#if !defined(__CLANG_ATOMICS)
+#define	_Atomic(T)			struct { volatile __typeof__(T) __val; }
+#endif
+
+/*
+ * 7.17.2 Initialization.
+ */
+
+#if defined(__CLANG_ATOMICS)
+#define	ATOMIC_VAR_INIT(value)		(value)
+#define	atomic_init(obj, value)		__c11_atomic_init(obj, value)
+#else
+#define	ATOMIC_VAR_INIT(value)		{ .__val = (value) }
+#define	atomic_init(obj, value)		((void)((obj)->__val = (value)))
+#endif
+
+/*
+ * Clang and recent GCC both provide predefined macros for the memory
+ * orderings.  If we are using a compiler that doesn't define them, use the
+ * clang values - these will be ignored in the fallback path.
+ */
+
+#ifndef __ATOMIC_RELAXED
+#define __ATOMIC_RELAXED		0
+#endif
+#ifndef __ATOMIC_CONSUME
+#define __ATOMIC_CONSUME		1
+#endif
+#ifndef __ATOMIC_ACQUIRE
+#define __ATOMIC_ACQUIRE		2
+#endif
+#ifndef __ATOMIC_RELEASE
+#define __ATOMIC_RELEASE		3
+#endif
+#ifndef __ATOMIC_ACQ_REL
+#define __ATOMIC_ACQ_REL		4
+#endif
+#ifndef __ATOMIC_SEQ_CST
+#define __ATOMIC_SEQ_CST		5
+#endif
+
+/*
+ * 7.17.3 Order and consistency.
+ *
+ * The memory_order_* constants that denote the barrier behaviour of the
+ * atomic operations.
+ */
+
+typedef enum {
+    memory_order_relaxed = __ATOMIC_RELAXED,
+    memory_order_consume = __ATOMIC_CONSUME,
+    memory_order_acquire = __ATOMIC_ACQUIRE,
+    memory_order_release = __ATOMIC_RELEASE,
+    memory_order_acq_rel = __ATOMIC_ACQ_REL,
+    memory_order_seq_cst = __ATOMIC_SEQ_CST
+} memory_order;
+
+/*
+ * 7.17.4 Fences.
+ */
+
+//#define __unused
+
+//static __inline void
+//atomic_thread_fence(memory_order __order __unused)
+//{
+//
+//#ifdef __CLANG_ATOMICS
+//    __c11_atomic_thread_fence(__order);
+//#elif defined(__GNUC_ATOMICS)
+//    __atomic_thread_fence(__order);
+//#else
+//    __sync_synchronize();
+//#endif
+//}
+//
+//static __inline void
+//atomic_signal_fence(memory_order __order __unused)
+//{
+//
+//#ifdef __CLANG_ATOMICS
+//    __c11_atomic_signal_fence(__order);
+//#elif defined(__GNUC_ATOMICS)
+//    __atomic_signal_fence(__order);
+//#else
+//    __asm volatile ("" ::: "memory");
+//#endif
+//}
+
+//#undef __unused
+
+/*
+ * 7.17.5 Lock-free property.
+ */
+
+#if defined(_KERNEL)
+/* Atomics in kernelspace are always lock-free. */
+#define	atomic_is_lock_free(obj) \
+	((void)(obj), (__Bool)1)
+#elif defined(__CLANG_ATOMICS)
+#define	atomic_is_lock_free(obj) \
+	__atomic_is_lock_free(sizeof(*(obj)), obj)
+#elif defined(__GNUC_ATOMICS)
+#define	atomic_is_lock_free(obj) \
+	__atomic_is_lock_free(sizeof((obj)->__val), &(obj)->__val)
+#else
+#define	atomic_is_lock_free(obj) \
+	((void)(obj), sizeof((obj)->__val) <= sizeof(void *))
+#endif
+
+/*
+ * 7.17.6 Atomic integer types.
+ */
+
+typedef _Atomic(__Bool)			atomic_bool;
+typedef _Atomic(char)			atomic_char;
+typedef _Atomic(signed char)		atomic_schar;
+typedef _Atomic(unsigned char)		atomic_uchar;
+typedef _Atomic(short)			atomic_short;
+typedef _Atomic(unsigned short)		atomic_ushort;
+typedef _Atomic(int)			atomic_int;
+typedef _Atomic(unsigned int)		atomic_uint;
+typedef _Atomic(long)			atomic_long;
+typedef _Atomic(unsigned long)		atomic_ulong;
+typedef _Atomic(long long)		atomic_llong;
+typedef _Atomic(unsigned long long)	atomic_ullong;
+#if 0
+typedef _Atomic(char16_t)		atomic_char16_t;
+typedef _Atomic(char32_t)		atomic_char32_t;
+#endif
+typedef _Atomic(wchar_t)		atomic_wchar_t;
+typedef _Atomic(int_least8_t)		atomic_int_least8_t;
+typedef _Atomic(uint_least8_t)		atomic_uint_least8_t;
+typedef _Atomic(int_least16_t)		atomic_int_least16_t;
+typedef _Atomic(uint_least16_t)		atomic_uint_least16_t;
+typedef _Atomic(int_least32_t)		atomic_int_least32_t;
+typedef _Atomic(uint_least32_t)		atomic_uint_least32_t;
+typedef _Atomic(int_least64_t)		atomic_int_least64_t;
+typedef _Atomic(uint_least64_t)		atomic_uint_least64_t;
+typedef _Atomic(int_fast8_t)		atomic_int_fast8_t;
+typedef _Atomic(uint_fast8_t)		atomic_uint_fast8_t;
+typedef _Atomic(int_fast16_t)		atomic_int_fast16_t;
+typedef _Atomic(uint_fast16_t)		atomic_uint_fast16_t;
+typedef _Atomic(int_fast32_t)		atomic_int_fast32_t;
+typedef _Atomic(uint_fast32_t)		atomic_uint_fast32_t;
+typedef _Atomic(int_fast64_t)		atomic_int_fast64_t;
+typedef _Atomic(uint_fast64_t)		atomic_uint_fast64_t;
+typedef _Atomic(intptr_t)		atomic_intptr_t;
+typedef _Atomic(uintptr_t)		atomic_uintptr_t;
+typedef _Atomic(size_t)			atomic_size_t;
+typedef _Atomic(ptrdiff_t)		atomic_ptrdiff_t;
+typedef _Atomic(intmax_t)		atomic_intmax_t;
+typedef _Atomic(uintmax_t)		atomic_uintmax_t;
+
+/*
+ * 7.17.7 Operations on atomic types.
+ */
+
+/*
+ * Compiler-specific operations.
+ */
+
+#if defined(__CLANG_ATOMICS)
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)						\
+	__c11_atomic_compare_exchange_strong(object, expected, desired,	\
+	    success, failure)
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	__c11_atomic_compare_exchange_weak(object, expected, desired,	\
+	    success, failure)
+#define	atomic_exchange_explicit(object, desired, order)		\
+	__c11_atomic_exchange(object, desired, order)
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	__c11_atomic_fetch_add(object, operand, order)
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	__c11_atomic_fetch_and(object, operand, order)
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	__c11_atomic_fetch_or(object, operand, order)
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	__c11_atomic_fetch_sub(object, operand, order)
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	__c11_atomic_fetch_xor(object, operand, order)
+#define	atomic_load_explicit(object, order)				\
+	__c11_atomic_load(object, order)
+#define	atomic_store_explicit(object, desired, order)			\
+	__c11_atomic_store(object, desired, order)
+#elif defined(__GNUC_ATOMICS)
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)						\
+	__atomic_compare_exchange_n(&(object)->__val, expected,		\
+	    desired, 0, success, failure)
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	__atomic_compare_exchange_n(&(object)->__val, expected,		\
+	    desired, 1, success, failure)
+#define	atomic_exchange_explicit(object, desired, order)		\
+	__atomic_exchange_n(&(object)->__val, desired, order)
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	__atomic_fetch_add(&(object)->__val, operand, order)
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	__atomic_fetch_and(&(object)->__val, operand, order)
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	__atomic_fetch_or(&(object)->__val, operand, order)
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	__atomic_fetch_sub(&(object)->__val, operand, order)
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	__atomic_fetch_xor(&(object)->__val, operand, order)
+#define	atomic_load_explicit(object, order)				\
+	__atomic_load_n(&(object)->__val, order)
+#define	atomic_store_explicit(object, desired, order)			\
+	__atomic_store_n(&(object)->__val, desired, order)
+#else
+#define	__atomic_apply_stride(object, operand) \
+	(((__typeof__((object)->__val))0) + (operand))
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)	__extension__ ({			\
+	__typeof__(expected) __ep = (expected);				\
+	__typeof__(*__ep) __e = *__ep;					\
+	(void)(success); (void)(failure);				\
+	(__Bool)((*__ep = __sync_val_compare_and_swap(&(object)->__val,	\
+	    __e, desired)) == __e);					\
+})
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	atomic_compare_exchange_strong_explicit(object, expected,	\
+		desired, success, failure)
+#if __has_builtin(__sync_swap)
+/* Clang provides a full-barrier atomic exchange - use it if available. */
+#define	atomic_exchange_explicit(object, desired, order)		\
+	((void)(order), __sync_swap(&(object)->__val, desired))
+#else
+/*
+ * __sync_lock_test_and_set() is only an acquire barrier in theory (although in
+ * practice it is usually a full barrier) so we need an explicit barrier before
+ * it.
+ */
+#define	atomic_exchange_explicit(object, desired, order)		\
+__extension__ ({							\
+	__typeof__(object) __o = (object);				\
+	__typeof__(desired) __d = (desired);				\
+	(void)(order);							\
+	__sync_synchronize();						\
+	__sync_lock_test_and_set(&(__o)->__val, __d);			\
+})
+#endif
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_add(&(object)->__val,		\
+	    __atomic_apply_stride(object, operand)))
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_and(&(object)->__val, operand))
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_or(&(object)->__val, operand))
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_sub(&(object)->__val,		\
+	    __atomic_apply_stride(object, operand)))
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_xor(&(object)->__val, operand))
+#define	atomic_load_explicit(object, order)				\
+	((void)(order), __sync_fetch_and_add(&(object)->__val, 0))
+#define	atomic_store_explicit(object, desired, order)			\
+	((void)atomic_exchange_explicit(object, desired, order))
+#endif
+
+/*
+ * Convenience functions.
+ *
+ * Don't provide these in kernel space. In kernel space, we should be
+ * disciplined enough to always provide explicit barriers.
+ */
+
+#ifndef _KERNEL
+#define	atomic_compare_exchange_strong(object, expected, desired)	\
+	atomic_compare_exchange_strong_explicit(object, expected,	\
+	    desired, memory_order_seq_cst, memory_order_seq_cst)
+#define	atomic_compare_exchange_weak(object, expected, desired)		\
+	atomic_compare_exchange_weak_explicit(object, expected,		\
+	    desired, memory_order_seq_cst, memory_order_seq_cst)
+#define	atomic_exchange(object, desired)				\
+	atomic_exchange_explicit(object, desired, memory_order_seq_cst)
+#define	atomic_fetch_add(object, operand)				\
+	atomic_fetch_add_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_and(object, operand)				\
+	atomic_fetch_and_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_or(object, operand)				\
+	atomic_fetch_or_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_sub(object, operand)				\
+	atomic_fetch_sub_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_xor(object, operand)				\
+	atomic_fetch_xor_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_load(object)						\
+	atomic_load_explicit(object, memory_order_seq_cst)
+#define	atomic_store(object, desired)					\
+	atomic_store_explicit(object, desired, memory_order_seq_cst)
+#endif /* !_KERNEL */
+
+/*
+ * 7.17.8 Atomic flag type and operations.
+ *
+ * XXX: Assume atomic_bool can be used as an atomic_flag. Is there some
+ * kind of compiler built-in type we could use?
+ */
+
+typedef struct {
+    atomic_bool	__flag;
+} atomic_flag;
+
+#define	ATOMIC_FLAG_INIT		{ ATOMIC_VAR_INIT(0) }
+
+static __inline __Bool
+atomic_flag_test_and_set_explicit(volatile atomic_flag *__object,
+                                  memory_order __order)
+{
+    return (atomic_exchange_explicit(&__object->__flag, 1, __order));
+}
+
+static __inline void
+atomic_flag_clear_explicit(volatile atomic_flag *__object, memory_order __order)
+{
+
+    atomic_store_explicit(&__object->__flag, 0, __order);
+}
+
+#ifndef _KERNEL
+static __inline __Bool
+atomic_flag_test_and_set(volatile atomic_flag *__object)
+{
+
+    return (atomic_flag_test_and_set_explicit(__object,
+                                              memory_order_seq_cst));
+}
+
+static __inline void
+atomic_flag_clear(volatile atomic_flag *__object)
+{
+
+    atomic_flag_clear_explicit(__object, memory_order_seq_cst);
+}
+#endif /* !_KERNEL */
+
+#endif /* !_STDATOMIC_H_ */
\ No newline at end of file
diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index bf1f4857..87abf7a7 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,12 @@
+2019-08-15  Jonathon Anderson  <jma14@rice.edu>
+
+	* libdw_alloc.c (__libdw_allocate): Added thread-safe stack allocator.
+	* libdwP.h (Dwarf): Likewise.
+	* dwarf_begin_elf.c (dwarf_begin_elf): Support for above.
+	* dwarf_end.c (dwarf_end): Likewise.
+	* dwarf_abbrev_hash.{c,h}: Use the *_concurrent hash table.
+	* Makefile.am: Link -lpthread to provide rwlocks.
+
 2019-08-12  Mark Wielaard  <mark@klomp.org>
 
 	* libdw.map (ELFUTILS_0.177): Add new version of dwelf_elf_begin.
diff --git a/libdw/Makefile.am b/libdw/Makefile.am
index 7a3d5322..6d0a0187 100644
--- a/libdw/Makefile.am
+++ b/libdw/Makefile.am
@@ -108,7 +108,7 @@ am_libdw_pic_a_OBJECTS = $(libdw_a_SOURCES:.c=.os)
 libdw_so_LIBS = libdw_pic.a ../libdwelf/libdwelf_pic.a \
 	  ../libdwfl/libdwfl_pic.a ../libebl/libebl.a
 libdw_so_DEPS = ../lib/libeu.a ../libelf/libelf.so
-libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS)
+libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS) -lpthread
 libdw_so_SOURCES =
 libdw.so$(EXEEXT): $(srcdir)/libdw.map $(libdw_so_LIBS) $(libdw_so_DEPS)
 # The rpath is necessary for libebl because its $ORIGIN use will
diff --git a/libdw/dwarf_abbrev_hash.c b/libdw/dwarf_abbrev_hash.c
index f52f5ad5..c2548140 100644
--- a/libdw/dwarf_abbrev_hash.c
+++ b/libdw/dwarf_abbrev_hash.c
@@ -38,7 +38,7 @@
 #define next_prime __libdwarf_next_prime
 extern size_t next_prime (size_t) attribute_hidden;
 
-#include <dynamicsizehash.c>
+#include <dynamicsizehash_concurrent.c>
 
 #undef next_prime
 #define next_prime attribute_hidden __libdwarf_next_prime
diff --git a/libdw/dwarf_abbrev_hash.h b/libdw/dwarf_abbrev_hash.h
index d2f02ccc..bc3d62c7 100644
--- a/libdw/dwarf_abbrev_hash.h
+++ b/libdw/dwarf_abbrev_hash.h
@@ -34,6 +34,6 @@
 #define TYPE Dwarf_Abbrev *
 #define COMPARE(a, b) (0)
 
-#include <dynamicsizehash.h>
+#include <dynamicsizehash_concurrent.h>
 
 #endif	/* dwarf_abbrev_hash.h */
diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
index 38c8f5c6..b3885bb5 100644
--- a/libdw/dwarf_begin_elf.c
+++ b/libdw/dwarf_begin_elf.c
@@ -417,11 +417,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn *scngrp)
   /* Initialize the memory handling.  */
   result->mem_default_size = mem_default_size;
   result->oom_handler = __libdw_oom;
-  result->mem_tail = (struct libdw_memblock *) (result + 1);
-  result->mem_tail->size = (result->mem_default_size
-			    - offsetof (struct libdw_memblock, mem));
-  result->mem_tail->remaining = result->mem_tail->size;
-  result->mem_tail->prev = NULL;
+  pthread_rwlock_init(&result->mem_rwl, NULL);
+  result->mem_stacks = 1;
+  result->mem_tails = malloc (sizeof (struct libdw_memblock *));
+  result->mem_tails[0] = (struct libdw_memblock *) (result + 1);
+  result->mem_tails[0]->size = (result->mem_default_size
+			       - offsetof (struct libdw_memblock, mem));
+  result->mem_tails[0]->remaining = result->mem_tails[0]->size;
+  result->mem_tails[0]->prev = NULL;
 
   if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
     {
diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
index 29795c10..6317bcda 100644
--- a/libdw/dwarf_end.c
+++ b/libdw/dwarf_end.c
@@ -94,14 +94,22 @@ dwarf_end (Dwarf *dwarf)
       /* And the split Dwarf.  */
       tdestroy (dwarf->split_tree, noop_free);
 
-      struct libdw_memblock *memp = dwarf->mem_tail;
-      /* The first block is allocated together with the Dwarf object.  */
-      while (memp->prev != NULL)
-	{
-	  struct libdw_memblock *prevp = memp->prev;
-	  free (memp);
-	  memp = prevp;
-	}
+      for (size_t i = 0; i < dwarf->mem_stacks; i++)
+        {
+          struct libdw_memblock *memp = dwarf->mem_tails[i];
+          /* The first block is allocated together with the Dwarf object.  */
+          while (memp != NULL && memp->prev != NULL)
+	    {
+	      struct libdw_memblock *prevp = memp->prev;
+	      free (memp);
+	      memp = prevp;
+	    }
+          /* Only for stack 0 though, the others are allocated individually.  */
+          if (memp != NULL && i > 0)
+            free (memp);
+        }
+      free (dwarf->mem_tails);
+      pthread_rwlock_destroy (&dwarf->mem_rwl);
 
       /* Free the pubnames helper structure.  */
       free (dwarf->pubnames_sets);
diff --git a/libdw/libdwP.h b/libdw/libdwP.h
index eebb7d12..442d493d 100644
--- a/libdw/libdwP.h
+++ b/libdw/libdwP.h
@@ -31,6 +31,7 @@
 
 #include <libintl.h>
 #include <stdbool.h>
+#include <pthread.h>
 
 #include <libdw.h>
 #include <dwarf.h>
@@ -218,16 +219,18 @@ struct Dwarf
   /* Similar for addrx/constx, which will come from .debug_addr section.  */
   struct Dwarf_CU *fake_addr_cu;
 
-  /* Internal memory handling.  This is basically a simplified
+  /* Internal memory handling.  This is basically a simplified thread-local
      reimplementation of obstacks.  Unfortunately the standard obstack
      implementation is not usable in libraries.  */
+  pthread_rwlock_t mem_rwl;
+  size_t mem_stacks;
   struct libdw_memblock
   {
     size_t size;
     size_t remaining;
     struct libdw_memblock *prev;
     char mem[0];
-  } *mem_tail;
+  } **mem_tails;
 
   /* Default size of allocated memory blocks.  */
   size_t mem_default_size;
@@ -572,7 +575,7 @@ extern void __libdw_seterrno (int value) internal_function;
 
 /* Memory handling, the easy parts.  This macro does not do any locking.  */
 #define libdw_alloc(dbg, type, tsize, cnt) \
-  ({ struct libdw_memblock *_tail = (dbg)->mem_tail;			      \
+  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg);		      \
      size_t _required = (tsize) * (cnt);				      \
      type *_result = (type *) (_tail->mem + (_tail->size - _tail->remaining));\
      size_t _padding = ((__alignof (type)				      \
@@ -591,6 +594,10 @@ extern void __libdw_seterrno (int value) internal_function;
 #define libdw_typed_alloc(dbg, type) \
   libdw_alloc (dbg, type, sizeof (type), 1)
 
+/* Callback to choose a thread-local memory allocation stack.  */
+extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
+     __nonnull_attribute__ (1);
+
 /* Callback to allocate more.  */
 extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
      __attribute__ ((__malloc__)) __nonnull_attribute__ (1);
diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
index f1e08714..c3c5e8a7 100644
--- a/libdw/libdw_alloc.c
+++ b/libdw/libdw_alloc.c
@@ -33,9 +33,73 @@
 
 #include <errno.h>
 #include <stdlib.h>
+#include <assert.h>
 #include "libdwP.h"
 #include "system.h"
+#include "stdatomic.h"
+#if USE_VG_ANNOTATIONS == 1
+#include <helgrind.h>
+#include <drd.h>
+#else
+#define ANNOTATE_HAPPENS_BEFORE(X)
+#define ANNOTATE_HAPPENS_AFTER(X)
+#endif
+
+
+#define thread_local __thread
 
+static thread_local int initialized = 0;
+static thread_local size_t thread_id;
+static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
+
+struct libdw_memblock *
+__libdw_alloc_tail (Dwarf *dbg)
+{
+  if (!initialized)
+    {
+      thread_id = atomic_fetch_add (&next_id, 1);
+      initialized = 1;
+    }
+
+  pthread_rwlock_rdlock (&dbg->mem_rwl);
+  if (thread_id >= dbg->mem_stacks)
+    {
+      pthread_rwlock_unlock (&dbg->mem_rwl);
+      pthread_rwlock_wrlock (&dbg->mem_rwl);
+
+      /* Another thread may have already reallocated. In theory using an
+         atomic would be faster, but given that this only happens once per
+         thread per Dwarf, some minor slowdown should be fine.  */
+      if (thread_id >= dbg->mem_stacks)
+        {
+          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
+                                    * sizeof (struct libdw_memblock *));
+          assert(dbg->mem_tails);
+          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
+            dbg->mem_tails[i] = NULL;
+          dbg->mem_stacks = thread_id + 1;
+          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
+        }
+
+      pthread_rwlock_unlock (&dbg->mem_rwl);
+      pthread_rwlock_rdlock (&dbg->mem_rwl);
+    }
+
+  /* At this point, we have an entry in the tail array.  */
+  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
+  struct libdw_memblock *result = dbg->mem_tails[thread_id];
+  if (result == NULL)
+    {
+      result = malloc (dbg->mem_default_size);
+      result->size = dbg->mem_default_size
+                     - offsetof (struct libdw_memblock, mem);
+      result->remaining = result->size;
+      result->prev = NULL;
+      dbg->mem_tails[thread_id] = result;
+    }
+  pthread_rwlock_unlock (&dbg->mem_rwl);
+  return result;
+}
 
 void *
 __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
@@ -52,8 +116,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
   newp->size = size - offsetof (struct libdw_memblock, mem);
   newp->remaining = (uintptr_t) newp + size - (result + minsize);
 
-  newp->prev = dbg->mem_tail;
-  dbg->mem_tail = newp;
+  pthread_rwlock_rdlock (&dbg->mem_rwl);
+  newp->prev = dbg->mem_tails[thread_id];
+  dbg->mem_tails[thread_id] = newp;
+  pthread_rwlock_unlock (&dbg->mem_rwl);
 
   return (void *) result;
 }
-- 
2.23.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-21 22:21   ` Jonathon Anderson
@ 2019-08-23 21:26     ` Mark Wielaard
  0 siblings, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-08-23 21:26 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

On Wed, 2019-08-21 at 17:21 -0500, Jonathon Anderson wrote:
> > P.S. It looks like something decided to add some line breaks in the
> > patch so that it doesn't easily apply. It isn't hard to fixup, but you
> > might want to consider submitting using git send-email or attaching 
> > the result of git format-patch instead of putting the patch in the
> > message body.
> 
> Originally I had some issues with git send-mail, I usually do PRs 
> purely in git so the email side is still a little new. I've attached 
> the original patch from git format-patch, sorry for the mess.

No worries. The project is a bit old-fashioned. We still do everything
through email. It does make (free form) discussions of patches simpler.
But occasionally an email transport might mangle the actual code.

If it is easier please feel free to include a reference to a
repo/branch from which to pull a patch (or use a git request-pull
message).

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-16 19:24 [PATCH] libdw: add thread-safety to dwarf_getabbrev() Jonathon Anderson
  2019-08-21 11:16 ` Mark Wielaard
  2019-08-21 21:50 ` Mark Wielaard
@ 2019-08-24 23:24 ` Mark Wielaard
  2019-08-25  1:11   ` Jonathon Anderson
  2 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-08-24 23:24 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Fri, Aug 16, 2019 at 02:24:28PM -0500, Jonathon Anderson wrote:
> 2. Adding a lock & array structure to the memory manager (pseudo-TLS)
>   (libdwP.h, libdw_alloc.c)

Specifically concentrating on this part.

> diff --git a/libdw/ChangeLog b/libdw/ChangeLog
> index bf1f4857..87abf7a7 100644
> --- a/libdw/ChangeLog
> +++ b/libdw/ChangeLog
> @@ -1,3 +1,12 @@
> +2019-08-15  Jonathon Anderson  <jma14@rice.edu>
> +
> +	* libdw_alloc.c (__libdw_allocate): Added thread-safe stack allocator.
> +	* libdwP.h (Dwarf): Likewise.
> +	* dwarf_begin_elf.c (dwarf_begin_elf): Support for above.
> +	* dwarf_end.c (dwarf_end): Likewise.
> [...]
> +	* Makefile.am: Link -lpthread to provide rwlocks.

> diff --git a/libdw/Makefile.am b/libdw/Makefile.am
> index 7a3d5322..6d0a0187 100644
> --- a/libdw/Makefile.am
> +++ b/libdw/Makefile.am
> @@ -108,7 +108,7 @@ am_libdw_pic_a_OBJECTS = $(libdw_a_SOURCES:.c=.os)
> libdw_so_LIBS = libdw_pic.a ../libdwelf/libdwelf_pic.a \
> 	  ../libdwfl/libdwfl_pic.a ../libebl/libebl.a
> libdw_so_DEPS = ../lib/libeu.a ../libelf/libelf.so
> -libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS)
> +libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS)
> -lpthread
> libdw_so_SOURCES =
> libdw.so$(EXEEXT): $(srcdir)/libdw.map $(libdw_so_LIBS) $(libdw_so_DEPS)
> # The rpath is necessary for libebl because its $ORIGIN use will

Do we also need -pthread for CFLAGS?

> diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
> index 38c8f5c6..b3885bb5 100644
> --- a/libdw/dwarf_begin_elf.c
> +++ b/libdw/dwarf_begin_elf.c
> @@ -417,11 +417,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn
> *scngrp)
>   /* Initialize the memory handling.  */
>   result->mem_default_size = mem_default_size;
>   result->oom_handler = __libdw_oom;
> -  result->mem_tail = (struct libdw_memblock *) (result + 1);
> -  result->mem_tail->size = (result->mem_default_size
> -			    - offsetof (struct libdw_memblock, mem));
> -  result->mem_tail->remaining = result->mem_tail->size;
> -  result->mem_tail->prev = NULL;
> +  pthread_rwlock_init(&result->mem_rwl, NULL);
> +  result->mem_stacks = 1;
> +  result->mem_tails = malloc (sizeof (struct libdw_memblock *));
> +  result->mem_tails[0] = (struct libdw_memblock *) (result + 1);
> +  result->mem_tails[0]->size = (result->mem_default_size
> +			       - offsetof (struct libdw_memblock, mem));
> +  result->mem_tails[0]->remaining = result->mem_tails[0]->size;
> +  result->mem_tails[0]->prev = NULL;

Can't we just set mem_tails[0] = NULL?
Wouldn't __libdw_alloc_tail () then initialize it?

> diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
> index 29795c10..6317bcda 100644
> --- a/libdw/dwarf_end.c
> +++ b/libdw/dwarf_end.c
> @@ -94,14 +94,22 @@ dwarf_end (Dwarf *dwarf)
>       /* And the split Dwarf.  */
>       tdestroy (dwarf->split_tree, noop_free);
> 
> -      struct libdw_memblock *memp = dwarf->mem_tail;
> -      /* The first block is allocated together with the Dwarf object.  */
> -      while (memp->prev != NULL)
> -	{
> -	  struct libdw_memblock *prevp = memp->prev;
> -	  free (memp);
> -	  memp = prevp;
> -	}
> +      for (size_t i = 0; i < dwarf->mem_stacks; i++)
> +        {
> +          struct libdw_memblock *memp = dwarf->mem_tails[i];
> +          /* The first block is allocated together with the Dwarf object.
> */

Then this wouldn't be true again.

> +          while (memp != NULL && memp->prev != NULL)
> +	    {
> +	      struct libdw_memblock *prevp = memp->prev;
> +	      free (memp);
> +	      memp = prevp;
> +	    }
> +          /* Only for stack 0 though, the others are allocated
> individually.  */
> +          if (memp != NULL && i > 0)
> +            free (memp);
> +        }

And then you don't need this special case.

> +      free (dwarf->mem_tails);
> +      pthread_rwlock_destroy (&dwarf->mem_rwl);
> 
>       /* Free the pubnames helper structure.  */
>       free (dwarf->pubnames_sets);
> diff --git a/libdw/libdwP.h b/libdw/libdwP.h
> index eebb7d12..442d493d 100644
> --- a/libdw/libdwP.h
> +++ b/libdw/libdwP.h
> @@ -31,6 +31,7 @@
> 
> #include <libintl.h>
> #include <stdbool.h>
> +#include <pthread.h>
> 
> #include <libdw.h>
> #include <dwarf.h>
> @@ -218,16 +219,18 @@ struct Dwarf
>   /* Similar for addrx/constx, which will come from .debug_addr section.  */
>   struct Dwarf_CU *fake_addr_cu;
> 
> -  /* Internal memory handling.  This is basically a simplified
> +  /* Internal memory handling.  This is basically a simplified thread-local
>      reimplementation of obstacks.  Unfortunately the standard obstack
>      implementation is not usable in libraries.  */
> +  pthread_rwlock_t mem_rwl;
> +  size_t mem_stacks;
>   struct libdw_memblock
>   {
>     size_t size;
>     size_t remaining;
>     struct libdw_memblock *prev;
>     char mem[0];
> -  } *mem_tail;
> +  } **mem_tails;

Please document what/how exactly mem_rwl protects which other fields.

>   /* Default size of allocated memory blocks.  */
>   size_t mem_default_size;
> @@ -572,7 +575,7 @@ extern void __libdw_seterrno (int value)
> internal_function;
> 
> /* Memory handling, the easy parts.  This macro does not do any locking.  */

This comment isn't true anymore. __libdw_alloc_tail does locking.
Also I think the locking should be extended beyond just that
functions, see below.

> #define libdw_alloc(dbg, type, tsize, cnt) \
> -  ({ struct libdw_memblock *_tail = (dbg)->mem_tail;			      \
> +  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg);		      \
>      size_t _required = (tsize) * (cnt);				      \
>      type *_result = (type *) (_tail->mem + (_tail->size -
> _tail->remaining));\
>      size_t _padding = ((__alignof (type)				      \
> @@ -591,6 +594,10 @@ extern void __libdw_seterrno (int value)
> internal_function;
> #define libdw_typed_alloc(dbg, type) \
>   libdw_alloc (dbg, type, sizeof (type), 1)
> 
> +/* Callback to choose a thread-local memory allocation stack.  */
> +extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
> +     __nonnull_attribute__ (1);
> +
> /* Callback to allocate more.  */
> extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>      __attribute__ ((__malloc__)) __nonnull_attribute__ (1);
> diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
> index f1e08714..c3c5e8a7 100644
> --- a/libdw/libdw_alloc.c
> +++ b/libdw/libdw_alloc.c
> @@ -33,9 +33,73 @@
> 
> #include <errno.h>
> #include <stdlib.h>
> +#include <assert.h>
> #include "libdwP.h"
> #include "system.h"
> +#include "stdatomic.h"
> +#if USE_VG_ANNOTATIONS == 1
> +#include <helgrind.h>
> +#include <drd.h>
> +#else
> +#define ANNOTATE_HAPPENS_BEFORE(X)
> +#define ANNOTATE_HAPPENS_AFTER(X)
> +#endif
> +
> +
> +#define thread_local __thread
> 
> +static thread_local int initialized = 0;
> +static thread_local size_t thread_id;
> +static atomic_size_t next_id = ATOMIC_VAR_INIT(0);

Could you initialize thread_id to (size_t) -1?
Then you don't need another thread_local to indicate not initialized?

> +struct libdw_memblock *
> +__libdw_alloc_tail (Dwarf *dbg)
> +{
> +  if (!initialized)

And change this to thead_id == (size_t) -1?

> +    {
> +      thread_id = atomic_fetch_add (&next_id, 1);
> +      initialized = 1;
> +    }
> +
> +  pthread_rwlock_rdlock (&dbg->mem_rwl);
> +  if (thread_id >= dbg->mem_stacks)
> +    {
> +      pthread_rwlock_unlock (&dbg->mem_rwl);
> +      pthread_rwlock_wrlock (&dbg->mem_rwl);
> +
> +      /* Another thread may have already reallocated. In theory using an
> +         atomic would be faster, but given that this only happens once per
> +         thread per Dwarf, some minor slowdown should be fine.  */
> +      if (thread_id >= dbg->mem_stacks)
> +        {
> +          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
> +                                    * sizeof (struct libdw_memblock *));
> +          assert(dbg->mem_tails);

Don't use assert here, use if (dbg->mem_tails == NULL) dbg->oom_handler ();

But what if realloc moves the block?
Then all dbg->mem_tails[thread_id] pointers become invalid.
And this function drops the lock before returning a libdw_memblock*.

So I think the lock needs to extend beyond this function somehow.  Or
we need to prevent another thread reallocing while we are dealing with
the assigned memblock.

> +          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
> +            dbg->mem_tails[i] = NULL;
> +          dbg->mem_stacks = thread_id + 1;
> +          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
> +        }
> +
> +      pthread_rwlock_unlock (&dbg->mem_rwl);
> +      pthread_rwlock_rdlock (&dbg->mem_rwl);
> +    }
> +
> +  /* At this point, we have an entry in the tail array.  */
> +  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
> +  struct libdw_memblock *result = dbg->mem_tails[thread_id];
> +  if (result == NULL)
> +    {
> +      result = malloc (dbg->mem_default_size);
> +      result->size = dbg->mem_default_size
> +                     - offsetof (struct libdw_memblock, mem);
> +      result->remaining = result->size;
> +      result->prev = NULL;
> +      dbg->mem_tails[thread_id] = result;
> +    }
> +  pthread_rwlock_unlock (&dbg->mem_rwl);
> +  return result;

Here the lock is dropped and result points into dbg->mem_tails
which means it cannot be used because the realloc above might
make it invalid.

> +}
> 
> void *
> __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
> @@ -52,8 +116,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t
> align)
>   newp->size = size - offsetof (struct libdw_memblock, mem);
>   newp->remaining = (uintptr_t) newp + size - (result + minsize);
> 
> -  newp->prev = dbg->mem_tail;
> -  dbg->mem_tail = newp;
> +  pthread_rwlock_rdlock (&dbg->mem_rwl);
> +  newp->prev = dbg->mem_tails[thread_id];
> +  dbg->mem_tails[thread_id] = newp;
> +  pthread_rwlock_unlock (&dbg->mem_rwl);
> 
>   return (void *) result;
> }

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-24 23:24 ` Mark Wielaard
@ 2019-08-25  1:11   ` Jonathon Anderson
  2019-08-25 10:05     ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-25  1:11 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic



On Sat, Aug 24, 2019 at 6:24 PM, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> On Fri, Aug 16, 2019 at 02:24:28PM -0500, Jonathon Anderson wrote:
>>  2. Adding a lock & array structure to the memory manager 
>> (pseudo-TLS)
>>    (libdwP.h, libdw_alloc.c)
> 
> Specifically concentrating on this part.
> 
>>  diff --git a/libdw/ChangeLog b/libdw/ChangeLog
>>  index bf1f4857..87abf7a7 100644
>>  --- a/libdw/ChangeLog
>>  +++ b/libdw/ChangeLog
>>  @@ -1,3 +1,12 @@
>>  +2019-08-15  Jonathon Anderson  <jma14@rice.edu>
>>  +
>>  +	* libdw_alloc.c (__libdw_allocate): Added thread-safe stack 
>> allocator.
>>  +	* libdwP.h (Dwarf): Likewise.
>>  +	* dwarf_begin_elf.c (dwarf_begin_elf): Support for above.
>>  +	* dwarf_end.c (dwarf_end): Likewise.
>>  [...]
>>  +	* Makefile.am: Link -lpthread to provide rwlocks.
> 
>>  diff --git a/libdw/Makefile.am b/libdw/Makefile.am
>>  index 7a3d5322..6d0a0187 100644
>>  --- a/libdw/Makefile.am
>>  +++ b/libdw/Makefile.am
>>  @@ -108,7 +108,7 @@ am_libdw_pic_a_OBJECTS = 
>> $(libdw_a_SOURCES:.c=.os)
>>  libdw_so_LIBS = libdw_pic.a ../libdwelf/libdwelf_pic.a \
>>  	  ../libdwfl/libdwfl_pic.a ../libebl/libebl.a
>>  libdw_so_DEPS = ../lib/libeu.a ../libelf/libelf.so
>>  -libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) 
>> $(zip_LIBS)
>>  +libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) 
>> $(zip_LIBS)
>>  -lpthread
>>  libdw_so_SOURCES =
>>  libdw.so$(EXEEXT): $(srcdir)/libdw.map $(libdw_so_LIBS) 
>> $(libdw_so_DEPS)
>>  # The rpath is necessary for libebl because its $ORIGIN use will
> 
> Do we also need -pthread for CFLAGS?

Yep, that would be the smart thing to do. Will be handled in the next 
version of the patch (once I figure out how to autotools...).

> 
>>  diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
>>  index 38c8f5c6..b3885bb5 100644
>>  --- a/libdw/dwarf_begin_elf.c
>>  +++ b/libdw/dwarf_begin_elf.c
>>  @@ -417,11 +417,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, 
>> Elf_Scn
>>  *scngrp)
>>    /* Initialize the memory handling.  */
>>    result->mem_default_size = mem_default_size;
>>    result->oom_handler = __libdw_oom;
>>  -  result->mem_tail = (struct libdw_memblock *) (result + 1);
>>  -  result->mem_tail->size = (result->mem_default_size
>>  -			    - offsetof (struct libdw_memblock, mem));
>>  -  result->mem_tail->remaining = result->mem_tail->size;
>>  -  result->mem_tail->prev = NULL;
>>  +  pthread_rwlock_init(&result->mem_rwl, NULL);
>>  +  result->mem_stacks = 1;
>>  +  result->mem_tails = malloc (sizeof (struct libdw_memblock *));
>>  +  result->mem_tails[0] = (struct libdw_memblock *) (result + 1);
>>  +  result->mem_tails[0]->size = (result->mem_default_size
>>  +			       - offsetof (struct libdw_memblock, mem));
>>  +  result->mem_tails[0]->remaining = result->mem_tails[0]->size;
>>  +  result->mem_tails[0]->prev = NULL;
> 
> Can't we just set mem_tails[0] = NULL?
> Wouldn't __libdw_alloc_tail () then initialize it?

Mostly an artifact of preserving the previous behavior for 
single-threaded programs, the malloc for the Dwarf then simplifies too 
(and less memory usage for opened but otherwise unused Dwarfs). 
Downside is an extra malloc on first usage (which shouldn't be an issue 
for any real program).

> 
>>  diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
>>  index 29795c10..6317bcda 100644
>>  --- a/libdw/dwarf_end.c
>>  +++ b/libdw/dwarf_end.c
>>  @@ -94,14 +94,22 @@ dwarf_end (Dwarf *dwarf)
>>        /* And the split Dwarf.  */
>>        tdestroy (dwarf->split_tree, noop_free);
>> 
>>  -      struct libdw_memblock *memp = dwarf->mem_tail;
>>  -      /* The first block is allocated together with the Dwarf 
>> object.  */
>>  -      while (memp->prev != NULL)
>>  -	{
>>  -	  struct libdw_memblock *prevp = memp->prev;
>>  -	  free (memp);
>>  -	  memp = prevp;
>>  -	}
>>  +      for (size_t i = 0; i < dwarf->mem_stacks; i++)
>>  +        {
>>  +          struct libdw_memblock *memp = dwarf->mem_tails[i];
>>  +          /* The first block is allocated together with the Dwarf 
>> object.
>>  */
> 
> Then this wouldn't be true again.
> 
>>  +          while (memp != NULL && memp->prev != NULL)
>>  +	    {
>>  +	      struct libdw_memblock *prevp = memp->prev;
>>  +	      free (memp);
>>  +	      memp = prevp;
>>  +	    }
>>  +          /* Only for stack 0 though, the others are allocated
>>  individually.  */
>>  +          if (memp != NULL && i > 0)
>>  +            free (memp);
>>  +        }
> 
> And then you don't need this special case.
> 
>>  +      free (dwarf->mem_tails);
>>  +      pthread_rwlock_destroy (&dwarf->mem_rwl);
>> 
>>        /* Free the pubnames helper structure.  */
>>        free (dwarf->pubnames_sets);
>>  diff --git a/libdw/libdwP.h b/libdw/libdwP.h
>>  index eebb7d12..442d493d 100644
>>  --- a/libdw/libdwP.h
>>  +++ b/libdw/libdwP.h
>>  @@ -31,6 +31,7 @@
>> 
>>  #include <libintl.h>
>>  #include <stdbool.h>
>>  +#include <pthread.h>
>> 
>>  #include <libdw.h>
>>  #include <dwarf.h>
>>  @@ -218,16 +219,18 @@ struct Dwarf
>>    /* Similar for addrx/constx, which will come from .debug_addr 
>> section.  */
>>    struct Dwarf_CU *fake_addr_cu;
>> 
>>  -  /* Internal memory handling.  This is basically a simplified
>>  +  /* Internal memory handling.  This is basically a simplified 
>> thread-local
>>       reimplementation of obstacks.  Unfortunately the standard 
>> obstack
>>       implementation is not usable in libraries.  */
>>  +  pthread_rwlock_t mem_rwl;
>>  +  size_t mem_stacks;
>>    struct libdw_memblock
>>    {
>>      size_t size;
>>      size_t remaining;
>>      struct libdw_memblock *prev;
>>      char mem[0];
>>  -  } *mem_tail;
>>  +  } **mem_tails;
> 
> Please document what/how exactly mem_rwl protects which other fields.

Got it.

> 
>>    /* Default size of allocated memory blocks.  */
>>    size_t mem_default_size;
>>  @@ -572,7 +575,7 @@ extern void __libdw_seterrno (int value)
>>  internal_function;
>> 
>>  /* Memory handling, the easy parts.  This macro does not do any 
>> locking.  */
> 
> This comment isn't true anymore. __libdw_alloc_tail does locking.

Got it.

> Also I think the locking should be extended beyond just that
> functions, see below.
> 
>>  #define libdw_alloc(dbg, type, tsize, cnt) \
>>  -  ({ struct libdw_memblock *_tail = (dbg)->mem_tail;			      \
>>  +  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg);		     
>>  \
>>       size_t _required = (tsize) * (cnt);				      \
>>       type *_result = (type *) (_tail->mem + (_tail->size -
>>  _tail->remaining));\
>>       size_t _padding = ((__alignof (type)				      \
>>  @@ -591,6 +594,10 @@ extern void __libdw_seterrno (int value)
>>  internal_function;
>>  #define libdw_typed_alloc(dbg, type) \
>>    libdw_alloc (dbg, type, sizeof (type), 1)
>> 
>>  +/* Callback to choose a thread-local memory allocation stack.  */
>>  +extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
>>  +     __nonnull_attribute__ (1);
>>  +
>>  /* Callback to allocate more.  */
>>  extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t 
>> align)
>>       __attribute__ ((__malloc__)) __nonnull_attribute__ (1);
>>  diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
>>  index f1e08714..c3c5e8a7 100644
>>  --- a/libdw/libdw_alloc.c
>>  +++ b/libdw/libdw_alloc.c
>>  @@ -33,9 +33,73 @@
>> 
>>  #include <errno.h>
>>  #include <stdlib.h>
>>  +#include <assert.h>
>>  #include "libdwP.h"
>>  #include "system.h"
>>  +#include "stdatomic.h"
>>  +#if USE_VG_ANNOTATIONS == 1
>>  +#include <helgrind.h>
>>  +#include <drd.h>
>>  +#else
>>  +#define ANNOTATE_HAPPENS_BEFORE(X)
>>  +#define ANNOTATE_HAPPENS_AFTER(X)
>>  +#endif
>>  +
>>  +
>>  +#define thread_local __thread
>> 
>>  +static thread_local int initialized = 0;
>>  +static thread_local size_t thread_id;
>>  +static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
> 
> Could you initialize thread_id to (size_t) -1?
> Then you don't need another thread_local to indicate not initialized?

Just a habit, easy enough to change for the next version.

> 
>>  +struct libdw_memblock *
>>  +__libdw_alloc_tail (Dwarf *dbg)
>>  +{
>>  +  if (!initialized)
> 
> And change this to thead_id == (size_t) -1?
> 
>>  +    {
>>  +      thread_id = atomic_fetch_add (&next_id, 1);
>>  +      initialized = 1;
>>  +    }
>>  +
>>  +  pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +  if (thread_id >= dbg->mem_stacks)
>>  +    {
>>  +      pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +      pthread_rwlock_wrlock (&dbg->mem_rwl);
>>  +
>>  +      /* Another thread may have already reallocated. In theory 
>> using an
>>  +         atomic would be faster, but given that this only happens 
>> once per
>>  +         thread per Dwarf, some minor slowdown should be fine.  */
>>  +      if (thread_id >= dbg->mem_stacks)
>>  +        {
>>  +          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
>>  +                                    * sizeof (struct 
>> libdw_memblock *));
>>  +          assert(dbg->mem_tails);
> 
> Don't use assert here, use if (dbg->mem_tails == NULL) 
> dbg->oom_handler ();

Same as __libdw_allocate, got it. Apologies for not noticing sooner.

> 
> But what if realloc moves the block?
> Then all dbg->mem_tails[thread_id] pointers become invalid.
> And this function drops the lock before returning a libdw_memblock*.

Not quite, dbg->mem_tails is an array of pointers (note the ** in its 
definition, and the use of the plural "tails"). So while the 
dbg->mem_tails pointer becomes invalid, the dbg->mem_tails[thread_id] 
pointers don't.

It would be a different story if dbg->mem_tails was an array of 
libdw_memblocks, but its not. That would increase the "memory leak" 
issue mentioned previously (to ~4K per dead thread) and require more 
work on the part of the reallocing thread to initialize the new entries 
(which at the moment should reduce to a memset, assuming the compiler 
is smart enough and NULL == 0).

> 
> So I think the lock needs to extend beyond this function somehow.  Or
> we need to prevent another thread reallocing while we are dealing with
> the assigned memblock.

No need, in fact we want to drop the lock as soon as possible to let 
new threads in for realloc's. Under the assumption that we don't need 
to allocate new blocks (extremely) often, the extra cost to relock when 
we do should be relatively small. Also see __libdw_allocate.

> 
>>  +          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
>>  +            dbg->mem_tails[i] = NULL;
>>  +          dbg->mem_stacks = thread_id + 1;
>>  +          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
>>  +        }
>>  +
>>  +      pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +      pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +    }
>>  +
>>  +  /* At this point, we have an entry in the tail array.  */
>>  +  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
>>  +  struct libdw_memblock *result = dbg->mem_tails[thread_id];
>>  +  if (result == NULL)
>>  +    {
>>  +      result = malloc (dbg->mem_default_size);
>>  +      result->size = dbg->mem_default_size
>>  +                     - offsetof (struct libdw_memblock, mem);
>>  +      result->remaining = result->size;
>>  +      result->prev = NULL;
>>  +      dbg->mem_tails[thread_id] = result;
>>  +    }
>>  +  pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +  return result;
> 
> Here the lock is dropped and result points into dbg->mem_tails
> which means it cannot be used because the realloc above might
> make it invalid.

result points to a (potentially newly created) memory block, accessible 
only by this thread and never realloc'd.

>>  +}
>> 
>>  void *
>>  __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>>  @@ -52,8 +116,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, 
>> size_t
>>  align)
>>    newp->size = size - offsetof (struct libdw_memblock, mem);
>>    newp->remaining = (uintptr_t) newp + size - (result + minsize);
>> 
>>  -  newp->prev = dbg->mem_tail;
>>  -  dbg->mem_tail = newp;
>>  +  pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +  newp->prev = dbg->mem_tails[thread_id];
>>  +  dbg->mem_tails[thread_id] = newp;
>>  +  pthread_rwlock_unlock (&dbg->mem_rwl);
>> 

Here we tinker with dbg->mem_tails under a read lock, which is mutually 
excluded from the write lock held during realloc. The new block belongs 
to this thread alone, so it doesn't need any locks to manage its usage.

>> 
>>    return (void *) result;
>>  }
> 
> Cheers,
> 
> Mark

As mentioned in other mails, this management scheme isn't (always) 
optimally memory efficient, but its speed is good under parallel 
stress. Far better than wrapping the whole thing with a single 
pthread_mutex_t.

-Jonathon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-25  1:11   ` Jonathon Anderson
@ 2019-08-25 10:05     ` Mark Wielaard
  2019-08-26  1:25       ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-08-25 10:05 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi Jonathon,

On Sat, 2019-08-24 at 20:10 -0500, Jonathon Anderson wrote:
> On Sat, Aug 24, 2019 at 6:24 PM, Mark Wielaard <mark@klomp.org> wrote:
> > But what if realloc moves the block?
> > Then all dbg->mem_tails[thread_id] pointers become invalid.
> > And this function drops the lock before returning a libdw_memblock*.
> 
> Not quite, dbg->mem_tails is an array of pointers (note the ** in its 
> definition, and the use of the plural "tails"). So while the 
> dbg->mem_tails pointer becomes invalid, the dbg->mem_tails[thread_id] 
> pointers don't.
> 
> It would be a different story if dbg->mem_tails was an array of 
> libdw_memblocks, but its not.

Aha. Yes, they are pointers to the mem_blocks, not the mem_blocks
themselves. The pointer values are moved, but never changed. So this is
indeed fine. I was confused.

>  That would increase the "memory leak" 
> issue mentioned previously (to ~4K per dead thread) and require more 
> work on the part of the reallocing thread to initialize the new entries 
> (which at the moment should reduce to a memset, assuming the compiler 
> is smart enough and NULL == 0).
> 
> > 
> > So I think the lock needs to extend beyond this function somehow.  Or
> > we need to prevent another thread reallocing while we are dealing with
> > the assigned memblock.
> 
> No need, in fact we want to drop the lock as soon as possible to let 
> new threads in for realloc's. Under the assumption that we don't need 
> to allocate new blocks (extremely) often, the extra cost to relock when 
> we do should be relatively small. Also see __libdw_allocate.

Yes, now that I have a better (correct) mental picture of the data
structure, this makes total sense.

> As mentioned in other mails, this management scheme isn't (always) 
> optimally memory efficient, but its speed is good under parallel 
> stress. Far better than wrapping the whole thing with a single 
> pthread_mutex_t.

I wouldn't tweak it too much if you know it is working correctly now.
We do have to make sure it doesn't slow down the single-threaded use
case (since most programs still are single threaded). There is the new
locking, which slows things down. But the memory use should be about
the same since you don't duplicate the mem_blocks till there is access
from multiple threads.

If there isn't much parallel stress or allocation in practice. That is,
even in multi-threaded programs, there is still just one thread that
does most/all of the allocations. Then you could maybe optimize a bit
by having one "owner" thread for a Dwarf. That would be the first that
hits __libdw_alloc_tail. For that "owner thread" you then setup an
owner thread_id field, and have one special mem_block allocated. You
only expand the mem_blocks once another thread does an allocation. Then
the memory use in multi-threaded programs, where only one thread
handles a Dwarf object (or happens to allocate) would be the same as
for single-threaded programs.

But this depends on the usage pattern. Which might vary very between
programs. So don't try it, till you know it actually helps for a real
world program. And the extra memory use, might not really matter that
much in practice anyway.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-25 10:05     ` Mark Wielaard
@ 2019-08-26  1:25       ` Jonathon Anderson
  2019-08-26 13:18         ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-26  1:25 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

Hello Mark,

Next iteration of the patch(es) are ready, hosted on Github (git 
request-pull output below). I think most of the issues we've discussed 
have been addressed; the setup with stdatomic.h is a little WIP, I 
wasn't certain how it should be done moving forward (and, I'm not all 
that familiar with Autotools). The one extra line in dwarf_getcfi.c is 
to account for a failure in the test suite when Valgrind is enabled 
(didn't catch it the first time around).

I like the idea of "owner" threads (and I think it should be simple 
enough to implement), but I'd like to do some performance comparisons 
against the pthread_key_* option before adding too much more complexity.

-Jonathon

The following changes since commit 
37fa94df1554aca83ec10ce50bc9bcb6957b204e:

  config/elfutils.spec.in: package eu-elfclassify (2019-08-15 09:17:41 
+0200)

are available in the Git repository at:

  https://github.com/blue42u/elfutils.git

for you to fetch changes up to 9e40e0b8bb329692b1140e99896164bcb7f791b8:

  lib + libdw: Add and use a concurrent version of the dynamic-size 
hash table. (2019-08-25 18:36:38 -0500)

----------------------------------------------------------------
Jonathon Anderson (3):
      Add configure options for Valgrind annotations.
      Add some supporting framework for C11-style atomics.
      libdw: Rewrite the memory handler to be thread-safe.

Srđan Milaković (1):
      lib + libdw: Add and use a concurrent version of the dynamic-size 
hash table.

 ChangeLog                        |   5 +
 configure.ac                     |  42 ++++
 lib/ChangeLog                    |  11 +
 lib/Makefile.am                  |   5 +-
 lib/atomics.h                    |  37 +++
 lib/dynamicsizehash_concurrent.c | 522 
+++++++++++++++++++++++++++++++++++++++
 lib/dynamicsizehash_concurrent.h | 118 +++++++++
 lib/stdatomic-fbsd.h             | 442 
+++++++++++++++++++++++++++++++++
 libdw/ChangeLog                  |  12 +
 libdw/Makefile.am                |   4 +-
 libdw/dwarf_abbrev_hash.c        |   2 +-
 libdw/dwarf_abbrev_hash.h        |   2 +-
 libdw/dwarf_begin_elf.c          |  13 +-
 libdw/dwarf_end.c                |  22 +-
 libdw/dwarf_getcfi.c             |   1 +
 libdw/libdwP.h                   |  19 +-
 libdw/libdw_alloc.c              |  70 +++++-
 17 files changed, 1300 insertions(+), 27 deletions(-)
 create mode 100644 lib/atomics.h
 create mode 100644 lib/dynamicsizehash_concurrent.c
 create mode 100644 lib/dynamicsizehash_concurrent.h
 create mode 100644 lib/stdatomic-fbsd.h

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-26  1:25       ` Jonathon Anderson
@ 2019-08-26 13:18         ` Mark Wielaard
  2019-08-26 13:37           ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-08-26 13:18 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

[-- Attachment #1: Type: text/plain, Size: 871 bytes --]

Hi Jonathon,

Thanks for checking our new mailinglist settings :)
You message was accepted now and the HTML attachment stripped.

On Sun, 2019-08-25 at 20:24 -0500, Jonathon Anderson wrote:
> The one extra line in dwarf_getcfi.c
> is to account for a failure in the test suite when Valgrind is
> enabled (didn't catch it the first time around).

That is an interesting catch. And obviously a latent bug. It works if
things are in the first memblock (as part of the larger Dwarf struct),
since that is allocated with calloc, and so zeroed out. But if the cfi
would get into a later memblock it would get a random value, since
those are malloced. In your patch the memblocks are always malloced,
and so the bug shows up immediately.

Lets just cherry-pick this fixup since it is a good fixup to have.

Does the attached look correct?

Thanks,

Mark

[-- Attachment #2: 0001-libdw-fix-latent-bug-in-dwarf_getcfi.c-not-setting-d.patch --]
[-- Type: text/x-patch, Size: 1223 bytes --]

From 4bcc641d362de4236ae8f0f5bc933c6d84b6f105 Mon Sep 17 00:00:00 2001
From: Jonathon Anderson <jma14@rice.edu>
Date: Sun, 25 Aug 2019 10:07:00 -0500
Subject: [PATCH] libdw: fix latent bug in dwarf_getcfi.c not setting
 default_same_value.

Signed-off-by: Jonathon Anderson <jma14@rice.edu>
---
 libdw/ChangeLog      | 4 ++++
 libdw/dwarf_getcfi.c | 1 +
 2 files changed, 5 insertions(+)

diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index bf1f4857..1d3586f0 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,7 @@
+2019-08-25  Jonathon Anderson  <jma14@rice.edu>
+
+	* dwarf_getcfi.c (dwarf_getcfi): Set default_same_value to false.
+
 2019-08-12  Mark Wielaard  <mark@klomp.org>

 	* libdw.map (ELFUTILS_0.177): Add new version of dwelf_elf_begin.
diff --git a/libdw/dwarf_getcfi.c b/libdw/dwarf_getcfi.c
index 9aed403e..51932cd9 100644
--- a/libdw/dwarf_getcfi.c
+++ b/libdw/dwarf_getcfi.c
@@ -58,6 +58,7 @@ dwarf_getcfi (Dwarf *dbg)

       cfi->e_ident = (unsigned char *) elf_getident (dbg->elf, NULL);
       cfi->other_byte_order = dbg->other_byte_order;
+      cfi->default_same_value = false;

       cfi->next_offset = 0;
       cfi->cie_tree = cfi->fde_tree = cfi->expr_tree = NULL;
-- 
2.18.1

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-26 13:18         ` Mark Wielaard
@ 2019-08-26 13:37           ` Jonathon Anderson
  2019-08-27  3:52             ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-26 13:37 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

Looks correct to me (assuming it applies). I think there's another 
latent bug in there somewhere (tests that use libdwfl used to leak 
mem_tails, but now that dwarf_begin_elf doesn't do an initial malloc it 
doesn't trigger), I'll try hunting it down when I have the time.

Glad I could be of help testing the mailinglist :)

-Jonathon

On Mon, Aug 26, 2019 at 8:18 AM, Mark Wielaard <mark@klomp.org> wrote:
> Hi Jonathon,
> 
> Thanks for checking our new mailinglist settings :)
> You message was accepted now and the HTML attachment stripped.
> 
> On Sun, 2019-08-25 at 20:24 -0500, Jonathon Anderson wrote:
>>  The one extra line in dwarf_getcfi.c
>>  is to account for a failure in the test suite when Valgrind is
>>  enabled (didn't catch it the first time around).
> 
> That is an interesting catch. And obviously a latent bug. It works if
> things are in the first memblock (as part of the larger Dwarf struct),
> since that is allocated with calloc, and so zeroed out. But if the cfi
> would get into a later memblock it would get a random value, since
> those are malloced. In your patch the memblocks are always malloced,
> and so the bug shows up immediately.
> 
> Lets just cherry-pick this fixup since it is a good fixup to have.
> 
> Does the attached look correct?
> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-26 13:37           ` Jonathon Anderson
@ 2019-08-27  3:52             ` Jonathon Anderson
  2019-08-29 13:16               ` Mark Wielaard
  2019-10-26 10:54               ` [PATCH] libdw: add thread-safety to dwarf_getabbrev() Florian Weimer
  0 siblings, 2 replies; 55+ messages in thread
From: Jonathon Anderson @ 2019-08-27  3:52 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

Hello Mark,

Just finished some modifications to the patch series, git request-pull 
output below. This rebases onto the latest master and does a little 
diff cleaning, the major change is that I swapped out the memory 
management to use the pthread_key_* alternative mentioned before. I did 
some performance testing and found it to be notably faster in a 
high-stress microbenchmark (and no measurable difference in an 
application), and it makes the result easier to read overall. I've also 
removed the Valgrind configuration options patch since it doesn't add 
anything at the moment, if it becomes useful I'll submit it separately.

Thanks for taking the time to read through everything.

-Jonathon

The following changes since commit 
4bcc641d362de4236ae8f0f5bc933c6d84b6f105:

  libdw: fix latent bug in dwarf_getcfi.c not setting 
default_same_value. (2019-08-26 15:15:34 +0200)

are available in the Git repository at:

  https://github.com/blue42u/elfutils.git parallel-pr-v2

for you to fetch changes up to 1191d9ed292b508d732973a318a01051053e0f61:

  lib + libdw: Add and use a concurrent version of the dynamic-size 
hash table. (2019-08-26 22:29:45 -0500)

----------------------------------------------------------------
Jonathon Anderson (2):
      Add some supporting framework for C11-style atomics.
      libdw: Rewrite the memory handler to be thread-safe.

Srđan Milaković (1):
      lib + libdw: Add and use a concurrent version of the dynamic-size 
hash table.

 configure.ac                     |  12 ++
 lib/ChangeLog                    |  11 ++
 lib/Makefile.am                  |   5 +-
 lib/atomics.h                    |  37 +++++
 lib/dynamicsizehash_concurrent.c | 522 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/dynamicsizehash_concurrent.h | 118 +++++++++++++
 lib/stdatomic-fbsd.h             | 442 
+++++++++++++++++++++++++++++++++++++++++++++++++
 libdw/ChangeLog                  |  12 ++
 libdw/Makefile.am                |   4 +-
 libdw/dwarf_abbrev_hash.c        |   2 +-
 libdw/dwarf_abbrev_hash.h        |   2 +-
 libdw/dwarf_begin_elf.c          |  12 +-
 libdw/dwarf_end.c                |   7 +-
 libdw/libdwP.h                   |  59 ++++---
 libdw/libdw_alloc.c              |   5 +-
 15 files changed, 1210 insertions(+), 40 deletions(-)
 create mode 100644 lib/atomics.h
 create mode 100644 lib/dynamicsizehash_concurrent.c
 create mode 100644 lib/dynamicsizehash_concurrent.h
 create mode 100644 lib/stdatomic-fbsd.h

On Mon, Aug 26, 2019 at 8:37 AM, Jonathon Anderson <jma14@rice.edu> 
wrote:
> Looks correct to me (assuming it applies). I think there's another 
> latent bug in there somewhere (tests that use libdwfl used to leak 
> mem_tails, but now that dwarf_begin_elf doesn't do an initial malloc 
> it doesn't trigger), I'll try hunting it down when I have the time.
> 
> Glad I could be of help testing the mailinglist :)
> 
> -Jonathon
> 
> On Mon, Aug 26, 2019 at 8:18 AM, Mark Wielaard <mark@klomp.org> wrote:
>> Hi Jonathon,
>> 
>> Thanks for checking our new mailinglist settings :)
>> You message was accepted now and the HTML attachment stripped.
>> 
>> On Sun, 2019-08-25 at 20:24 -0500, Jonathon Anderson wrote:
>>>  The one extra line in dwarf_getcfi.c
>>>  is to account for a failure in the test suite when Valgrind is
>>>  enabled (didn't catch it the first time around).
>> 
>> That is an interesting catch. And obviously a latent bug. It works if
>> things are in the first memblock (as part of the larger Dwarf 
>> struct),
>> since that is allocated with calloc, and so zeroed out. But if the 
>> cfi
>> would get into a later memblock it would get a random value, since
>> those are malloced. In your patch the memblocks are always malloced,
>> and so the bug shows up immediately.
>> 
>> Lets just cherry-pick this fixup since it is a good fixup to have.
>> 
>> Does the attached look correct?
>> 
>> Thanks,
>> 
>> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-27  3:52             ` Jonathon Anderson
@ 2019-08-29 13:16               ` Mark Wielaard
  2019-08-29 13:16                 ` [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table Mark Wielaard
                                   ` (2 more replies)
  2019-10-26 10:54               ` [PATCH] libdw: add thread-safety to dwarf_getabbrev() Florian Weimer
  1 sibling, 3 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-08-29 13:16 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Jonathon Anderson, Srdan Milakovic

Hi Jonathan,

On Mon, 2019-08-26 at 22:52 -0500, Jonathon Anderson wrote:
> Just finished some modifications to the patch series, git request-
> pull output below. This rebases onto the latest master and does a
> little diff cleaning, the major change is that I swapped out the
> memory management to use the pthread_key_* alternative mentioned
> before. I did some performance testing and found it to be notably
> faster in a high-stress microbenchmark (and no measurable difference
> in an application), and it makes the result easier to read overall.
> I've also removed the Valgrind configuration options patch since it
> doesn't add anything at the moment, if it becomes useful I'll submit
> it separately.

Thanks for the pull request. Since I find it nice to also see all
patches on the mailinglist, I am reposting them for easier review.
As replies to this message.

> The following changes since commit 
> 4bcc641d362de4236ae8f0f5bc933c6d84b6f105:
> 
>   libdw: fix latent bug in dwarf_getcfi.c not setting 
> default_same_value. (2019-08-26 15:15:34 +0200)
> 
> are available in the Git repository at:
> 
>   https://github.com/blue42u/elfutils.git parallel-pr-v2
> 
> for you to fetch changes up to
> 1191d9ed292b508d732973a318a01051053e0f61:
> 
>   lib + libdw: Add and use a concurrent version of the dynamic-size 
> hash table. (2019-08-26 22:29:45 -0500)
> 
> ----------------------------------------------------------------
> Jonathon Anderson (2):
>       Add some supporting framework for C11-style atomics.
>       libdw: Rewrite the memory handler to be thread-safe.
> 
> SrÄ‘an MilakoviÄ‡ (1):
>       lib + libdw: Add and use a concurrent version of the dynamic-
> size hash table.

[PATCH 1/3] Add some supporting framework for C11-style atomics.
[PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe.
[PATCH 3/3] lib + libdw: Add and use a concurrent version of the

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-08-29 13:16               ` Mark Wielaard
@ 2019-08-29 13:16                 ` Mark Wielaard
  2019-10-25 23:50                   ` Mark Wielaard
  2019-08-29 13:16                 ` [PATCH 1/3] Add some supporting framework for C11-style atomics Mark Wielaard
  2019-08-29 13:16                 ` [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe Mark Wielaard
  2 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-08-29 13:16 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Jonathon Anderson, Srdan Milakovic

From: SrÄ‘an MilakoviÄ‡ <sm108@rice.edu>

Signed-off-by: SrÄ‘an MilakoviÄ‡ <sm108@rice.edu>
---
 lib/ChangeLog                    |   5 +
 lib/Makefile.am                  |   4 +-
 lib/dynamicsizehash_concurrent.c | 522 +++++++++++++++++++++++++++++++
 lib/dynamicsizehash_concurrent.h | 118 +++++++
 libdw/ChangeLog                  |   4 +
 libdw/dwarf_abbrev_hash.c        |   2 +-
 libdw/dwarf_abbrev_hash.h        |   2 +-
 7 files changed, 653 insertions(+), 4 deletions(-)
 create mode 100644 lib/dynamicsizehash_concurrent.c
 create mode 100644 lib/dynamicsizehash_concurrent.h

diff --git a/lib/ChangeLog b/lib/ChangeLog
index 3799c3aa..a209729a 100644
--- a/lib/ChangeLog
+++ b/lib/ChangeLog
@@ -1,3 +1,8 @@
+2019-08-26  SrÄ‘an MilakoviÄ‡  <sm108@rice.edu>
+
+	* dynamicsizehash_concurrent.{c,h}: New files.
+	* Makefile.am (noinst_HEADERS): Added dynamicsizehash_concurrent.h.
+
 2019-08-25  Jonathon Anderson  <jma14@rice.edu>
 
 	* stdatomic-fbsd.h: New file, taken from FreeBSD.
diff --git a/lib/Makefile.am b/lib/Makefile.am
index 3086cf06..97bf7329 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -39,8 +39,8 @@ libeu_a_SOURCES = xstrdup.c xstrndup.c xmalloc.c next_prime.c \
 
 noinst_HEADERS = fixedsizehash.h libeu.h system.h dynamicsizehash.h list.h \
 		 eu-config.h color.h printversion.h bpf.h \
-		 atomics.h stdatomic-fbsd.h
-EXTRA_DIST = dynamicsizehash.c
+		 atomics.h stdatomic-fbsd.h dynamicsizehash_concurrent.h
+EXTRA_DIST = dynamicsizehash.c dynamicsizehash_concurrent.c
 
 if !GPROF
 xmalloc_CFLAGS = -ffunction-sections
diff --git a/lib/dynamicsizehash_concurrent.c b/lib/dynamicsizehash_concurrent.c
new file mode 100644
index 00000000..d645b143
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.c
@@ -0,0 +1,522 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <stdlib.h>
+#include <system.h>
+#include <pthread.h>
+
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+   COMPARE   comparison function taking two pointers to TYPE objects
+
+   The following macros if present select features:
+
+   ITERATE   iterating over the table entries is possible
+   REVERSE   iterate in reverse order of insert
+ */
+
+
+static size_t
+lookup (NAME *htab, HASHTYPE hval, TYPE val __attribute__ ((unused)))
+{
+  /* First hash function: simply take the modul but prevent zero.  Small values
+      can skip the division, which helps performance when this is common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+#if COMPARE != 0  /* A handful of tables don't actually compare the entries in
+                    the table, they instead rely on the hash.  In that case, we
+                    can skip parts that relate to the value. */
+  TYPE val_ptr;
+#endif
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    {
+#if COMPARE == 0
+      return idx;
+#else
+      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                            memory_order_acquire);
+      if (COMPARE(val_ptr, val) == 0)
+          return idx;
+#endif
+    }
+  else if (hash == 0)
+    {
+      return 0;
+    }
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+        {
+#if COMPARE == 0
+          return idx;
+#else
+          val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                                memory_order_acquire);
+          if (COMPARE(val_ptr, val) == 0)
+              return idx;
+#endif
+        }
+      else if (hash == 0)
+        {
+          return 0;
+        }
+    }
+}
+
+static int
+insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
+{
+  /* First hash function: simply take the modul but prevent zero.  Small values
+      can skip the division, which helps performance when this is common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+  TYPE val_ptr;
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    {
+      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                            memory_order_acquire);
+      if (COMPARE(val_ptr, val) != 0)
+          return -1;
+    }
+  else if (hash == 0)
+    {
+      val_ptr = NULL;
+      atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                              (uintptr_t *) &val_ptr,
+                                              (uintptr_t) val,
+                                              memory_order_acquire,
+                                              memory_order_acquire);
+
+      if (val_ptr == NULL)
+        {
+          atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                memory_order_release);
+          return 0;
+        }
+      else
+        {
+          do
+            {
+              hash = atomic_load_explicit(&htab->table[idx].val_ptr,
+                                          memory_order_acquire);
+            }
+          while (hash == 0);
+        }
+    }
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+        {
+          val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                                memory_order_acquire);
+          if (COMPARE(val_ptr, val) != 0)
+              return -1;
+        }
+      else if (hash == 0)
+        {
+          val_ptr = NULL;
+          atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                                  (uintptr_t *) &val_ptr,
+                                                  (uintptr_t) val,
+                                                  memory_order_acquire,
+                                                  memory_order_acquire);
+
+          if (val_ptr == NULL)
+            {
+              atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                    memory_order_release);
+              return 0;
+            }
+          else
+            {
+              do
+                {
+                  hash = atomic_load_explicit(&htab->table[idx].val_ptr,
+                                              memory_order_acquire);
+                }
+              while (hash == 0);
+            }
+        }
+    }
+}
+
+#define NO_RESIZING 0u
+#define ALLOCATING_MEMORY 1u
+#define MOVING_DATA 3u
+#define CLEANING 2u
+
+#define STATE_BITS 2u
+#define STATE_INCREMENT (1u << STATE_BITS)
+#define STATE_MASK (STATE_INCREMENT - 1)
+#define GET_STATE(A) ((A) & STATE_MASK)
+
+#define IS_NO_RESIZE_OR_CLEANING(A) (((A) & 0x1u) == 0)
+
+#define GET_ACTIVE_WORKERS(A) ((A) >> STATE_BITS)
+
+#define INITIALIZATION_BLOCK_SIZE 256
+#define MOVE_BLOCK_SIZE 256
+#define CEIL(A, B) (((A) + (B) - 1) / (B))
+
+/* Initializes records and copies the data from the old table.
+   It can share work with other threads */
+static void resize_helper(NAME *htab, int blocking)
+{
+  size_t num_old_blocks = CEIL(htab->old_size, MOVE_BLOCK_SIZE);
+  size_t num_new_blocks = CEIL(htab->size, INITIALIZATION_BLOCK_SIZE);
+
+  size_t my_block;
+  size_t num_finished_blocks = 0;
+
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_init_block, 1,
+                                                memory_order_acquire))
+                                                    < num_new_blocks)
+    {
+      size_t record_it = my_block * INITIALIZATION_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * INITIALIZATION_BLOCK_SIZE;
+      if (record_end > htab->size)
+          record_end = htab->size;
+
+      while (record_it++ != record_end)
+        {
+          atomic_init(&htab->table[record_it].hashval, (uintptr_t) NULL);
+          atomic_init(&htab->table[record_it].val_ptr, (uintptr_t) NULL);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_initialized_blocks,
+                            num_finished_blocks, memory_order_release);
+  while (atomic_load_explicit(&htab->num_initialized_blocks,
+                              memory_order_acquire) != num_new_blocks);
+
+  /* All block are initialized, start moving */
+  num_finished_blocks = 0;
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_move_block, 1,
+                                                memory_order_acquire))
+                                                    < num_old_blocks)
+    {
+      size_t record_it = my_block * MOVE_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * MOVE_BLOCK_SIZE;
+      if (record_end > htab->old_size)
+          record_end = htab->old_size;
+
+      while (record_it++ != record_end)
+        {
+          TYPE val_ptr = (TYPE) atomic_load_explicit(
+              &htab->old_table[record_it].val_ptr,
+              memory_order_acquire);
+          if (val_ptr == NULL)
+              continue;
+
+          HASHTYPE hashval = atomic_load_explicit(
+              &htab->old_table[record_it].hashval,
+              memory_order_acquire);
+          assert(hashval);
+
+          insert_helper(htab, hashval, val_ptr);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_moved_blocks, num_finished_blocks,
+                            memory_order_release);
+
+  if (blocking)
+      while (atomic_load_explicit(&htab->num_moved_blocks,
+                                  memory_order_acquire) != num_old_blocks);
+}
+
+static void
+resize_master(NAME *htab)
+{
+  htab->old_size = htab->size;
+  htab->old_table = htab->table;
+
+  htab->size = next_prime(htab->size * 2);
+  htab->table = malloc((1 + htab->size) * sizeof(htab->table[0]));
+  assert(htab->table);
+
+  /* Change state from ALLOCATING_MEMORY to MOVING_DATA */
+  atomic_fetch_xor_explicit(&htab->resizing_state,
+                            ALLOCATING_MEMORY ^ MOVING_DATA,
+                            memory_order_release);
+
+  resize_helper(htab, 1);
+
+  /* Change state from MOVING_DATA to CLEANING */
+  size_t resize_state = atomic_fetch_xor_explicit(&htab->resizing_state,
+                                                  MOVING_DATA ^ CLEANING,
+                                                  memory_order_acq_rel);
+  while (GET_ACTIVE_WORKERS(resize_state) != 0)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* There are no more active workers */
+  atomic_store_explicit(&htab->next_init_block, 0, memory_order_relaxed);
+  atomic_store_explicit(&htab->num_initialized_blocks, 0,
+                        memory_order_relaxed);
+
+  atomic_store_explicit(&htab->next_move_block, 0, memory_order_relaxed);
+  atomic_store_explicit(&htab->num_moved_blocks, 0, memory_order_relaxed);
+
+  free(htab->old_table);
+
+  /* Change state to NO_RESIZING */
+  atomic_fetch_xor_explicit(&htab->resizing_state, CLEANING ^ NO_RESIZING,
+                            memory_order_relaxed);
+
+}
+
+static void
+resize_worker(NAME *htab)
+{
+  size_t resize_state = atomic_load_explicit(&htab->resizing_state,
+                                              memory_order_acquire);
+
+  /* If the resize has finished */
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+      return;
+
+  /* Register as worker and check if the resize has finished in the meantime*/
+  resize_state = atomic_fetch_add_explicit(&htab->resizing_state,
+                                            STATE_INCREMENT,
+                                            memory_order_acquire);
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  /* Wait while the new table is being allocated. */
+  while (GET_STATE(resize_state) == ALLOCATING_MEMORY)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* Check if the resize is done */
+  assert(GET_STATE(resize_state) != NO_RESIZING);
+  if (GET_STATE(resize_state) == CLEANING)
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  resize_helper(htab, 0);
+
+  /* Deregister worker */
+  atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                            memory_order_release);
+}
+
+
+int
+#define INIT(name) _INIT (name)
+#define _INIT(name) \
+  name##_init
+INIT(NAME) (NAME *htab, size_t init_size)
+{
+  /* We need the size to be a prime.  */
+  init_size = next_prime (init_size);
+
+  /* Initialize the data structure.  */
+  htab->size = init_size;
+  atomic_init(&htab->filled, 0);
+  atomic_init(&htab->resizing_state, 0);
+
+  atomic_init(&htab->next_init_block, 0);
+  atomic_init(&htab->num_initialized_blocks, 0);
+
+  atomic_init(&htab->next_move_block, 0);
+  atomic_init(&htab->num_moved_blocks, 0);
+
+  pthread_rwlock_init(&htab->resize_rwl, NULL);
+
+  htab->table = (void *) malloc ((init_size + 1) * sizeof (htab->table[0]));
+  if (htab->table == NULL)
+      return -1;
+
+  for (size_t i = 0; i <= init_size; i++)
+    {
+      atomic_init(&htab->table[i].hashval, (uintptr_t) NULL);
+      atomic_init(&htab->table[i].val_ptr, (uintptr_t) NULL);
+    }
+
+  return 0;
+}
+
+
+int
+#define FREE(name) _FREE (name)
+#define _FREE(name) \
+name##_free
+FREE(NAME) (NAME *htab)
+{
+  pthread_rwlock_destroy(&htab->resize_rwl);
+  free (htab->table);
+  return 0;
+}
+
+
+int
+#define INSERT(name) _INSERT (name)
+#define _INSERT(name) \
+name##_insert
+INSERT(NAME) (NAME *htab, HASHTYPE hval, TYPE data)
+{
+  int incremented = 0;
+
+  for(;;)
+    {
+      while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+          resize_worker(htab);
+
+      size_t filled;
+      if (!incremented)
+        {
+          filled = atomic_fetch_add_explicit(&htab->filled, 1,
+                                              memory_order_acquire);
+          incremented = 1;
+        }
+      else
+        {
+          filled = atomic_load_explicit(&htab->filled,
+                                        memory_order_acquire);
+        }
+
+
+      if (100 * filled > 90 * htab->size)
+        {
+          /* Table is filled more than 90%.  Resize the table.  */
+
+          size_t resizing_state = atomic_load_explicit(&htab->resizing_state,
+                                                        memory_order_acquire);
+          if (resizing_state == 0 &&
+              atomic_compare_exchange_strong_explicit(&htab->resizing_state,
+                                                      &resizing_state,
+                                                      ALLOCATING_MEMORY,
+                                                      memory_order_acquire,
+                                                      memory_order_acquire))
+            {
+              /* Master thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+              pthread_rwlock_wrlock(&htab->resize_rwl);
+              resize_master(htab);
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+            }
+          else
+            {
+              /* Worker thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+              resize_worker(htab);
+            }
+        }
+      else
+        {
+          /* Lock acquired, no need for resize*/
+          break;
+        }
+    }
+
+  int ret_val = insert_helper(htab, hval, data);
+  if (ret_val == -1)
+      atomic_fetch_sub_explicit(&htab->filled, 1, memory_order_relaxed);
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
+
+
+
+TYPE
+#define FIND(name) _FIND (name)
+#define _FIND(name) \
+  name##_find
+FIND(NAME) (NAME *htab, HASHTYPE hval, TYPE val)
+{
+  while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+      resize_worker(htab);
+
+  size_t idx;
+
+  /* Make the hash data nonzero.  */
+  hval = hval ?: 1;
+  idx = lookup(htab, hval, val);
+
+  if (idx == 0)
+    {
+      pthread_rwlock_unlock(&htab->resize_rwl);
+      return NULL;
+    }
+
+  /* get a copy before unlocking the lock */
+  TYPE ret_val = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                             memory_order_relaxed);
+
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
+
diff --git a/lib/dynamicsizehash_concurrent.h b/lib/dynamicsizehash_concurrent.h
new file mode 100644
index 00000000..a137cbd0
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.h
@@ -0,0 +1,118 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <stddef.h>
+#include <pthread.h>
+#include "atomics.h"
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+
+   The following macros if present select features:
+
+   ITERATE   iterating over the table entries is possible
+   HASHTYPE  integer type for hash values, default unsigned long int
+ */
+
+
+
+#ifndef HASHTYPE
+# define HASHTYPE unsigned long int
+#endif
+
+#ifndef RESIZE_BLOCK_SIZE
+# define RESIZE_BLOCK_SIZE 256
+#endif
+
+/* Defined separately.  */
+extern size_t next_prime (size_t seed);
+
+
+/* Table entry type.  */
+#define _DYNHASHCONENTTYPE(name)       \
+  typedef struct name##_ent         \
+  {                                 \
+    _Atomic(HASHTYPE) hashval;      \
+    atomic_uintptr_t val_ptr;       \
+  } name##_ent
+#define DYNHASHENTTYPE(name) _DYNHASHCONENTTYPE (name)
+DYNHASHENTTYPE (NAME);
+
+/* Type of the dynamic hash table data structure.  */
+#define _DYNHASHCONTYPE(name) \
+typedef struct                                     \
+{                                                  \
+  size_t size;                                     \
+  size_t old_size;                                 \
+  atomic_size_t filled;                            \
+  name##_ent *table;                               \
+  name##_ent *old_table;                           \
+  atomic_size_t resizing_state;                    \
+  atomic_size_t next_init_block;                   \
+  atomic_size_t num_initialized_blocks;            \
+  atomic_size_t next_move_block;                   \
+  atomic_size_t num_moved_blocks;                  \
+  pthread_rwlock_t resize_rwl;                     \
+} name
+#define DYNHASHTYPE(name) _DYNHASHCONTYPE (name)
+DYNHASHTYPE (NAME);
+
+
+
+#define _FUNCTIONS(name)                                            \
+/* Initialize the hash table.  */                                   \
+extern int name##_init (name *htab, size_t init_size);              \
+                                                                    \
+/* Free resources allocated for hash table.  */                     \
+extern int name##_free (name *htab);                                \
+                                                                    \
+/* Insert new entry.  */                                            \
+extern int name##_insert (name *htab, HASHTYPE hval, TYPE data);    \
+                                                                    \
+/* Find entry in hash table.  */                                    \
+extern TYPE name##_find (name *htab, HASHTYPE hval, TYPE val);
+#define FUNCTIONS(name) _FUNCTIONS (name)
+FUNCTIONS (NAME)
+
+
+#ifndef NO_UNDEF
+# undef DYNHASHENTTYPE
+# undef DYNHASHTYPE
+# undef FUNCTIONS
+# undef _FUNCTIONS
+# undef XFUNCTIONS
+# undef _XFUNCTIONS
+# undef NAME
+# undef TYPE
+# undef ITERATE
+# undef COMPARE
+# undef FIRST
+# undef NEXT
+#endif
diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index ec809070..25616277 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,7 @@
+2019-08-26  SrÄ‘an MilakoviÄ‡  <sm108@rice.edu@rice.edu>
+
+	* dwarf_abbrev_hash.{c,h}: Use the *_concurrent hash table.
+
 2019-08-26  Jonathon Anderson  <jma14@rice.edu>
 
 	* libdw_alloc.c (__libdw_allocate): Added thread-safe stack allocator.
diff --git a/libdw/dwarf_abbrev_hash.c b/libdw/dwarf_abbrev_hash.c
index f52f5ad5..c2548140 100644
--- a/libdw/dwarf_abbrev_hash.c
+++ b/libdw/dwarf_abbrev_hash.c
@@ -38,7 +38,7 @@
 #define next_prime __libdwarf_next_prime
 extern size_t next_prime (size_t) attribute_hidden;
 
-#include <dynamicsizehash.c>
+#include <dynamicsizehash_concurrent.c>
 
 #undef next_prime
 #define next_prime attribute_hidden __libdwarf_next_prime
diff --git a/libdw/dwarf_abbrev_hash.h b/libdw/dwarf_abbrev_hash.h
index d2f02ccc..bc3d62c7 100644
--- a/libdw/dwarf_abbrev_hash.h
+++ b/libdw/dwarf_abbrev_hash.h
@@ -34,6 +34,6 @@
 #define TYPE Dwarf_Abbrev *
 #define COMPARE(a, b) (0)
 
-#include <dynamicsizehash.h>
+#include <dynamicsizehash_concurrent.h>
 
 #endif	/* dwarf_abbrev_hash.h */
-- 
2.18.1

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-08-29 13:16                 ` [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table Mark Wielaard
@ 2019-10-25 23:50                   ` Mark Wielaard
  2019-10-26  4:11                     ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-25 23:50 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Jonathon Anderson, Srdan Milakovic

[-- Attachment #1: Type: text/plain, Size: 2319 bytes --]

Hi,

Sorry this took so long to review. But it is pretty complex code.
I think I got how it works mostly. It is hard to proof correct though.
How did you convince yourself that the code is correct?

For example I am not sure I can proof that in resize_worker() this
always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
In general the handling of the resizing_state field is pretty tricky.
What is the maximum number of concurrent threads it can handle?

I tried to simplify the code a little. You already observed that
COMPARE can be zero. But it always is. We never try comparing values.
So all the COMPARE and value passing to the find functions can simply
be removed. So if you agree I would like to merge the attached
simplification diff into this patch.

The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't use
COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
with the concurrent version, but I think we should. It is only used
when there are debug type units in the Dwarf. Which is not the default
(and some gcc hackers really don't like them) so you will probably not
encounter them normally. But we should probably make looking them up
also concurrent safe as a followup patch.

I only tested the single-threaded case. It is slightly slower, but not
very much. The previous patch made the code slightly faster, that
slight speed increase is gone now. But that also means it is about on
par with the old version without any concurrent improvements.

It does use a bit more memory though. For each CU the abbrev table
structure is 112 bytes larger. Given that there can be several 1000 CUs
in a (large) Dwarf that means a couple of hundred K of extra memory
use. It is probably acceptable since such files contain 100 MBs of DIE
data.

But I was wondering whether next_init_block and num_initialized_blocks
could be shared with next_move_block and num_moved_blocks. If I
understand the code in resize_helper() correctly then all threads are
either initializing or all threads are moving. So can't we just use one
next_block and one num_blocks field?

I do think the code itself is good. The above are mostly just nitpicks.
But if you could reply and give your thoughts on them that would be
appreciated.

Thanks,

Mark

[-- Attachment #2: no-compare-find-val.diff --]
[-- Type: text/x-patch, Size: 5627 bytes --]

diff --git a/lib/dynamicsizehash_concurrent.c b/lib/dynamicsizehash_concurrent.c
index d645b143..5fa38713 100644
--- a/lib/dynamicsizehash_concurrent.c
+++ b/lib/dynamicsizehash_concurrent.c
@@ -36,46 +36,24 @@
 
    NAME      name of the hash table structure.
    TYPE      data type of the hash table entries
-   COMPARE   comparison function taking two pointers to TYPE objects
-
-   The following macros if present select features:
-
-   ITERATE   iterating over the table entries is possible
-   REVERSE   iterate in reverse order of insert
  */
 
 
 static size_t
-lookup (NAME *htab, HASHTYPE hval, TYPE val __attribute__ ((unused)))
+lookup (NAME *htab, HASHTYPE hval)
 {
   /* First hash function: simply take the modul but prevent zero.  Small values
       can skip the division, which helps performance when this is common.  */
   size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
 
-#if COMPARE != 0  /* A handful of tables don't actually compare the entries in
-                    the table, they instead rely on the hash.  In that case, we
-                    can skip parts that relate to the value. */
-  TYPE val_ptr;
-#endif
   HASHTYPE hash;
 
   hash = atomic_load_explicit(&htab->table[idx].hashval,
                               memory_order_acquire);
   if (hash == hval)
-    {
-#if COMPARE == 0
-      return idx;
-#else
-      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
-                                            memory_order_acquire);
-      if (COMPARE(val_ptr, val) == 0)
-          return idx;
-#endif
-    }
+    return idx;
   else if (hash == 0)
-    {
-      return 0;
-    }
+    return 0;
 
   /* Second hash function as suggested in [Knuth].  */
   HASHTYPE second_hash = 1 + hval % (htab->size - 2);
@@ -90,20 +68,9 @@ lookup (NAME *htab, HASHTYPE hval, TYPE val __attribute__ ((unused)))
       hash = atomic_load_explicit(&htab->table[idx].hashval,
                                   memory_order_acquire);
       if (hash == hval)
-        {
-#if COMPARE == 0
-          return idx;
-#else
-          val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
-                                                memory_order_acquire);
-          if (COMPARE(val_ptr, val) == 0)
-              return idx;
-#endif
-        }
+	return idx;
       else if (hash == 0)
-        {
-          return 0;
-        }
+	return 0;
     }
 }
 
@@ -123,8 +90,6 @@ insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
     {
       val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
                                             memory_order_acquire);
-      if (COMPARE(val_ptr, val) != 0)
-          return -1;
     }
   else if (hash == 0)
     {
@@ -168,8 +133,6 @@ insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
         {
           val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
                                                 memory_order_acquire);
-          if (COMPARE(val_ptr, val) != 0)
-              return -1;
         }
       else if (hash == 0)
         {
@@ -495,7 +458,7 @@ TYPE
 #define FIND(name) _FIND (name)
 #define _FIND(name) \
   name##_find
-FIND(NAME) (NAME *htab, HASHTYPE hval, TYPE val)
+FIND(NAME) (NAME *htab, HASHTYPE hval)
 {
   while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
       resize_worker(htab);
@@ -504,7 +467,7 @@ FIND(NAME) (NAME *htab, HASHTYPE hval, TYPE val)
 
   /* Make the hash data nonzero.  */
   hval = hval ?: 1;
-  idx = lookup(htab, hval, val);
+  idx = lookup(htab, hval);
 
   if (idx == 0)
     {
diff --git a/lib/dynamicsizehash_concurrent.h b/lib/dynamicsizehash_concurrent.h
index a137cbd0..73e66e91 100644
--- a/lib/dynamicsizehash_concurrent.h
+++ b/lib/dynamicsizehash_concurrent.h
@@ -97,7 +97,7 @@ extern int name##_free (name *htab);                                \
 extern int name##_insert (name *htab, HASHTYPE hval, TYPE data);    \
                                                                     \
 /* Find entry in hash table.  */                                    \
-extern TYPE name##_find (name *htab, HASHTYPE hval, TYPE val);
+extern TYPE name##_find (name *htab, HASHTYPE hval);
 #define FUNCTIONS(name) _FUNCTIONS (name)
 FUNCTIONS (NAME)
 
diff --git a/libdw/dwarf_abbrev_hash.h b/libdw/dwarf_abbrev_hash.h
index bc3d62c7..a368c598 100644
--- a/libdw/dwarf_abbrev_hash.h
+++ b/libdw/dwarf_abbrev_hash.h
@@ -32,7 +32,6 @@
 
 #define NAME Dwarf_Abbrev_Hash
 #define TYPE Dwarf_Abbrev *
-#define COMPARE(a, b) (0)
 
 #include <dynamicsizehash_concurrent.h>
 
diff --git a/libdw/dwarf_getabbrev.c b/libdw/dwarf_getabbrev.c
index 6a7e981b..7e767fc1 100644
--- a/libdw/dwarf_getabbrev.c
+++ b/libdw/dwarf_getabbrev.c
@@ -83,7 +83,7 @@ __libdw_getabbrev (Dwarf *dbg, struct Dwarf_CU *cu, Dwarf_Off offset,
   bool foundit = false;
   Dwarf_Abbrev *abb = NULL;
   if (cu == NULL
-      || (abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code, NULL)) == NULL)
+      || (abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code)) == NULL)
     {
       if (result == NULL)
 	abb = libdw_typed_alloc (dbg, Dwarf_Abbrev);
diff --git a/libdw/dwarf_tag.c b/libdw/dwarf_tag.c
index 331eaa0d..d784970c 100644
--- a/libdw/dwarf_tag.c
+++ b/libdw/dwarf_tag.c
@@ -45,7 +45,7 @@ __libdw_findabbrev (struct Dwarf_CU *cu, unsigned int code)
     return DWARF_END_ABBREV;
 
   /* See whether the entry is already in the hash table.  */
-  abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code, NULL);
+  abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code);
   if (abb == NULL)
     while (cu->last_abbrev_offset != (size_t) -1l)
       {

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-10-25 23:50                   ` Mark Wielaard
@ 2019-10-26  4:11                     ` Jonathon Anderson
  2019-10-27 16:13                       ` Mark Wielaard
  2019-11-04 16:19                       ` Mark Wielaard
  0 siblings, 2 replies; 55+ messages in thread
From: Jonathon Anderson @ 2019-10-26  4:11 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> Sorry this took so long to review. But it is pretty complex code.
> I think I got how it works mostly. It is hard to proof correct though.
> How did you convince yourself that the code is correct?

No worries, its a complex piece of code.

Srdan designed the hash table portions, so I might have to defer to him 
on some points. That said, I'll try to answer what I can.

I think the core parallel logic can be clarified by noting:
 - The total order for insertions and lookups is based on the val_ptr 
atomic field(s). The hashval field(s) act as a sort of "peek" for 
catching collisions before issuing a full CAS.
 - Wrt resizing, the table is always in one of five phases: 
NO_RESIZING, ALLOCATING_MEMORY, MOVING_DATA (initialization), 
MOVING_DATA (insertion), and CLEANING. Four of those five transitions 
are handled by a "master" thread chosen by a CAS on the resizing_state 
field.

> 
> For example I am not sure I can proof that in resize_worker() this
> always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> In general the handling of the resizing_state field is pretty tricky.
> What is the maximum number of concurrent threads it can handle?

The master thread waits for all workers to deregister before 
transitioning from CLEANING to NO_RESIZING. Since the worker thread by 
that point has registered (and the synchronization is arranged as such) 
it will never see a transition all the way back to NO_RESIZING before 
getting its word in. This way a worker thread doesn't "miss the boat" 
and end up in an later resizing cycle.

The maximum number of threads is a size_t minus 2 bits, so bare minimum 
is 16384 on a 16-bit machine (which standard C still technically 
supports). Any modern system will allow either a billion (30 bits) or 4 
quintillion (62 bits).

> 
> I tried to simplify the code a little. You already observed that
> COMPARE can be zero. But it always is. We never try comparing values.
> So all the COMPARE and value passing to the find functions can simply
> be removed. So if you agree I would like to merge the attached
> simplification diff into this patch.

I'm fine with it, although at a glance it seems that this means 
insert_helper will never return -1. Which doesn't quite sound accurate, 
so I'll have to defer to Srdan on this one.

> 
> The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't use
> COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
> with the concurrent version, but I think we should. It is only used
> when there are debug type units in the Dwarf. Which is not the default
> (and some gcc hackers really don't like them) so you will probably not
> encounter them normally. But we should probably make looking them up
> also concurrent safe as a followup patch.

A quick grep -r shows that ITERATE and REVERSE are used for the 
asm_symbol_tab. If iteration and insertion are not concurrent we can 
easily add bidirectional iteration (using the same Treiber stack-like 
structure as used for the memory management). Also COMPARE is not 
defined to be (0) in this instance.

> 
> I only tested the single-threaded case. It is slightly slower, but not
> very much. The previous patch made the code slightly faster, that
> slight speed increase is gone now. But that also means it is about on
> par with the old version without any concurrent improvements.
> 
> It does use a bit more memory though. For each CU the abbrev table
> structure is 112 bytes larger. Given that there can be several 1000 
> CUs
> in a (large) Dwarf that means a couple of hundred K of extra memory
> use. It is probably acceptable since such files contain 100 MBs of DIE
> data.

That extra memory comes from the state for resizing mostly, so it 
probably could be reclaimed. Maybe it could live on the master thread's 
stack and then be accessed via pointer by the worker threads? If its 
enough of an issue to consider fixing I can work out a simple patch for 
it.

> 
> But I was wondering whether next_init_block and num_initialized_blocks
> could be shared with next_move_block and num_moved_blocks. If I
> understand the code in resize_helper() correctly then all threads are
> either initializing or all threads are moving. So can't we just use 
> one
> next_block and one num_blocks field?

The main issue is that either a) somehow the fields have to be reset 
between phases (which means something like a double barrier, and/or a 
"chosen" thread), or b) the second phase has to go in reverse (which 
has the complication of juggling the difference in ending conditions, 
and dealing with the phase transition properly).

Not impossible, just probably too complex to be worth the 16 byte gain.

> 
> I do think the code itself is good. The above are mostly just 
> nitpicks.
> But if you could reply and give your thoughts on them that would be
> appreciated.
> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-10-26  4:11                     ` Jonathon Anderson
@ 2019-10-27 16:13                       ` Mark Wielaard
  2019-10-27 17:49                         ` Jonathon Anderson
  2019-10-28 20:12                         ` Mark Wielaard
  2019-11-04 16:19                       ` Mark Wielaard
  1 sibling, 2 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-10-27 16:13 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <mark@klomp.org> wrote:
> > For example I am not sure I can proof that in resize_worker() this
> > always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> > In general the handling of the resizing_state field is pretty tricky.
> > What is the maximum number of concurrent threads it can handle?
> 
> The master thread waits for all workers to deregister before 
> transitioning from CLEANING to NO_RESIZING. Since the worker thread by 
> that point has registered (and the synchronization is arranged as such) 
> it will never see a transition all the way back to NO_RESIZING before 
> getting its word in. This way a worker thread doesn't "miss the boat" 
> and end up in an later resizing cycle.

Thanks. I think my confusion came from the fact that these states are
expressed by 2 STATE_BITs. I didn't realize they are distinct states
and cannot hold simultaneously. I somehow had the impression all states
could be "active" at the same time. The bit-fiddling with
atomic_fetch_xor_explicit is tricky.

So the busy loop to check for GET_ACTIVE_WORKERS in resize_master works
because the master itself never does an STATE_INCREMENT itself.

Are these busy loops not expensive? There are a couple in the code
where are thread is just spinning busily till all other threads are
ready and/or the state is changed.

> The maximum number of threads is a size_t minus 2 bits, so bare minimum 
> is 16384 on a 16-bit machine (which standard C still technically 
> supports). Any modern system will allow either a billion (30 bits) or 4 
> quintillion (62 bits).

Ah, right, that was also connected to my confusion about how many bits
were used to hold the state. 1 billion threads should be enough for
anybody :)

> > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't
> > use
> > COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
> > with the concurrent version, but I think we should. It is only used
> > when there are debug type units in the Dwarf. Which is not the default
> > (and some gcc hackers really don't like them) so you will probably not
> > encounter them normally. But we should probably make looking them up
> > also concurrent safe as a followup patch.
> 
> A quick grep -r shows that ITERATE and REVERSE are used for the 
> asm_symbol_tab. If iteration and insertion are not concurrent we can 
> easily add bidirectional iteration (using the same Treiber stack-like 
> structure as used for the memory management). Also COMPARE is not 
> defined to be (0) in this instance.

Yes. We would need some other solution for the libasm code. But I think
we can use the concurrent table for everything in libdw.

> > I only tested the single-threaded case. It is slightly slower, but not
> > very much. The previous patch made the code slightly faster, that
> > slight speed increase is gone now. But that also means it is about on
> > par with the old version without any concurrent improvements.
> > 
> > It does use a bit more memory though. For each CU the abbrev table
> > structure is 112 bytes larger. Given that there can be several 1000 
> > CUs
> > in a (large) Dwarf that means a couple of hundred K of extra memory
> > use. It is probably acceptable since such files contain 100 MBs of DIE
> > data.
> 
> That extra memory comes from the state for resizing mostly, so it 
> probably could be reclaimed. Maybe it could live on the master thread's 
> stack and then be accessed via pointer by the worker threads? If its 
> enough of an issue to consider fixing I can work out a simple patch for 
> it.

I am not too concerned by it. But haven't done much measurements. You
probably have one of the most demanding programs using this code. Are
you seeing a significant use in memory?

> > But I was wondering whether next_init_block and num_initialized_blocks
> > could be shared with next_move_block and num_moved_blocks. If I
> > understand the code in resize_helper() correctly then all threads are
> > either initializing or all threads are moving. So can't we just use 
> > one
> > next_block and one num_blocks field?
> 
> The main issue is that either a) somehow the fields have to be reset 
> between phases (which means something like a double barrier, and/or a 
> "chosen" thread), or b) the second phase has to go in reverse (which 
> has the complication of juggling the difference in ending conditions, 
> and dealing with the phase transition properly).
> 
> Not impossible, just probably too complex to be worth the 16 byte gain.

Yes. Now that I have a better handle on the state changes I see that
this is trickier than I thought. It looks like just 16 bytes, but this
is per CU. Given that there can be a thousand CUs that does add up. But
probably not worth it at this point (see also above about memory usage
measurements).

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-10-27 16:13                       ` Mark Wielaard
@ 2019-10-27 17:49                         ` Jonathon Anderson
  2019-10-28 14:08                           ` Mark Wielaard
  2019-10-28 20:12                         ` Mark Wielaard
  1 sibling, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-10-27 17:49 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic



On Sun, Oct 27, 2019 at 17:13, Mark Wielaard <mark@klomp.org> wrote:
> On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
>>  On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <mark@klomp.org 
>> <mailto:mark@klomp.org>> wrote:
>>  > For example I am not sure I can proof that in resize_worker() this
>>  > always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
>>  > In general the handling of the resizing_state field is pretty 
>> tricky.
>>  > What is the maximum number of concurrent threads it can handle?
>> 
>>  The master thread waits for all workers to deregister before
>>  transitioning from CLEANING to NO_RESIZING. Since the worker thread 
>> by
>>  that point has registered (and the synchronization is arranged as 
>> such)
>>  it will never see a transition all the way back to NO_RESIZING 
>> before
>>  getting its word in. This way a worker thread doesn't "miss the 
>> boat"
>>  and end up in an later resizing cycle.
> 
> Thanks. I think my confusion came from the fact that these states are
> expressed by 2 STATE_BITs. I didn't realize they are distinct states
> and cannot hold simultaneously. I somehow had the impression all 
> states
> could be "active" at the same time. The bit-fiddling with
> atomic_fetch_xor_explicit is tricky.
> 
> So the busy loop to check for GET_ACTIVE_WORKERS in resize_master 
> works
> because the master itself never does an STATE_INCREMENT itself.
> 
> Are these busy loops not expensive? There are a couple in the code
> where are thread is just spinning busily till all other threads are
> ready and/or the state is changed.

In theory (if the system isn't overloaded) the threads should finish 
their individual work at around the same time, so the amount of waiting 
any one thread would do (should be) very short. Also, this is only once 
per resize, which (shouldn't) happen very often anyway.

The other busy loops (in insert_helper) are also read-only with a 
single concurrent store, so they (should) spin in cache until the 
invalidation. Again, (should) be a short wait.

If it starts to be an issue we can add some exponential backoff, but I 
haven't noticed any issues in general (and, a nanosleep itself is a 
spin for suitably short durations, or so I've been told).

> 
>>  The maximum number of threads is a size_t minus 2 bits, so bare 
>> minimum
>>  is 16384 on a 16-bit machine (which standard C still technically
>>  supports). Any modern system will allow either a billion (30 bits) 
>> or 4
>>  quintillion (62 bits).
> 
> Ah, right, that was also connected to my confusion about how many bits
> were used to hold the state. 1 billion threads should be enough for
> anybody :)
> 
>>  > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't
>>  > use
>>  > COMPARE (nor ITERATE and REVERSE). I haven't tried to replace 
>> that one
>>  > with the concurrent version, but I think we should. It is only 
>> used
>>  > when there are debug type units in the Dwarf. Which is not the 
>> default
>>  > (and some gcc hackers really don't like them) so you will 
>> probably not
>>  > encounter them normally. But we should probably make looking them 
>> up
>>  > also concurrent safe as a followup patch.
>> 
>>  A quick grep -r shows that ITERATE and REVERSE are used for the
>>  asm_symbol_tab. If iteration and insertion are not concurrent we can
>>  easily add bidirectional iteration (using the same Treiber 
>> stack-like
>>  structure as used for the memory management). Also COMPARE is not
>>  defined to be (0) in this instance.
> 
> Yes. We would need some other solution for the libasm code. But I 
> think
> we can use the concurrent table for everything in libdw.
> 
>>  > I only tested the single-threaded case. It is slightly slower, 
>> but not
>>  > very much. The previous patch made the code slightly faster, that
>>  > slight speed increase is gone now. But that also means it is 
>> about on
>>  > par with the old version without any concurrent improvements.
>>  >
>>  > It does use a bit more memory though. For each CU the abbrev table
>>  > structure is 112 bytes larger. Given that there can be several 
>> 1000
>>  > CUs
>>  > in a (large) Dwarf that means a couple of hundred K of extra 
>> memory
>>  > use. It is probably acceptable since such files contain 100 MBs 
>> of DIE
>>  > data.
>> 
>>  That extra memory comes from the state for resizing mostly, so it
>>  probably could be reclaimed. Maybe it could live on the master 
>> thread's
>>  stack and then be accessed via pointer by the worker threads? If its
>>  enough of an issue to consider fixing I can work out a simple patch 
>> for
>>  it.
> 
> I am not too concerned by it. But haven't done much measurements. You
> probably have one of the most demanding programs using this code. Are
> you seeing a significant use in memory?

Depends on your definition of significant, anything less than 10MB 
would fall entirely below my radar, and the biggest thing I've tried 
has 935 CUs. We also tend to close our Dwarfs regularly, I think the 
maximum open we've used so far is one per thread.

That said, I can also envision cases (say, a gdb on elfutils on a small 
system) that might be prohibited (or at least annoyed) by this. I can 
lower the overhead to 64 bytes easily (moving bits to the master 
thread's stack), but that's the limit with this design.

> 
>>  > But I was wondering whether next_init_block and 
>> num_initialized_blocks
>>  > could be shared with next_move_block and num_moved_blocks. If I
>>  > understand the code in resize_helper() correctly then all threads 
>> are
>>  > either initializing or all threads are moving. So can't we just 
>> use
>>  > one
>>  > next_block and one num_blocks field?
>> 
>>  The main issue is that either a) somehow the fields have to be reset
>>  between phases (which means something like a double barrier, and/or 
>> a
>>  "chosen" thread), or b) the second phase has to go in reverse (which
>>  has the complication of juggling the difference in ending 
>> conditions,
>>  and dealing with the phase transition properly).
>> 
>>  Not impossible, just probably too complex to be worth the 16 byte 
>> gain.
> 
> Yes. Now that I have a better handle on the state changes I see that
> this is trickier than I thought. It looks like just 16 bytes, but this
> is per CU. Given that there can be a thousand CUs that does add up. 
> But
> probably not worth it at this point (see also above about memory usage
> measurements).
> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-10-27 17:49                         ` Jonathon Anderson
@ 2019-10-28 14:08                           ` Mark Wielaard
  0 siblings, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-10-28 14:08 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

On Sun, 2019-10-27 at 12:49 -0500, Jonathon Anderson wrote:
> In theory (if the system isn't overloaded) the threads should finish 
> their individual work at around the same time, so the amount of waiting 
> any one thread would do (should be) very short. Also, this is only once 
> per resize, which (shouldn't) happen very often anyway.
> 
> The other busy loops (in insert_helper) are also read-only with a 
> single concurrent store, so they (should) spin in cache until the 
> invalidation. Again, (should) be a short wait.
> 
> If it starts to be an issue we can add some exponential backoff, but I 
> haven't noticed any issues in general (and, a nanosleep itself is a 
> spin for suitably short durations, or so I've been told).

Yeah, it might even be easier to use a bit large initial (prime) size
for the hash to avoid the whole situation if it does look like it is a
problem in practice. I don't really expect it to. Just curious if there
was a different model. But sleeping might not be good either because it
might make it harder for the thread to start doing real work again if
it has to wake up the core it is running up again.


> Depends on your definition of significant, anything less than 10MB 
> would fall entirely below my radar, and the biggest thing I've tried 
> has 935 CUs. We also tend to close our Dwarfs regularly, I think the 
> maximum open we've used so far is one per thread.

Right, and that is a fairly large Dwarf which would take just 100K
extra memory. So normally it far less extra overhead.

> That said, I can also envision cases (say, a gdb on elfutils on a small 
> system) that might be prohibited (or at least annoyed) by this. I can 
> lower the overhead to 64 bytes easily (moving bits to the master 
> thread's stack), but that's the limit with this design.

Although I would like to see that if you have time, I am not that
concerned. And I think the pthread_key limit issue is more important to
resolve.

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-10-27 16:13                       ` Mark Wielaard
  2019-10-27 17:49                         ` Jonathon Anderson
@ 2019-10-28 20:12                         ` Mark Wielaard
  2019-11-04 16:21                           ` Mark Wielaard
  1 sibling, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-28 20:12 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

[-- Attachment #1: Type: text/plain, Size: 908 bytes --]

On Sun, 2019-10-27 at 17:13 +0100, Mark Wielaard wrote:
> On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> > A quick grep -r shows that ITERATE and REVERSE are used for the 
> > asm_symbol_tab. If iteration and insertion are not concurrent we can 
> > easily add bidirectional iteration (using the same Treiber stack-like 
> > structure as used for the memory management). Also COMPARE is not 
> > defined to be (0) in this instance.
> 
> Yes. We would need some other solution for the libasm code. But I think
> we can use the concurrent table for everything in libdw.

And everything in libdw besides the abbrev hash is the sig8 hash.
And adopting the concurrent dynamic size hash for that is almost
trivial. See attached.

I only tested it lightly because I don't have any large projects build
with -fdebug-types-dection. But it seems to work as intended.

Cheers,

Mark

[-- Attachment #2: 0001-libdw-Use-dynamicsizehash_concurrent-for-Dwarf_Sig8_.patch --]
[-- Type: text/x-patch, Size: 2662 bytes --]

From a4bf12a8b906911172897f14d62b3f43cc4066ad Mon Sep 17 00:00:00 2001
From: Mark Wielaard <mark@klomp.org>
Date: Mon, 28 Oct 2019 20:54:35 +0100
Subject: [PATCH] libdw: Use dynamicsizehash_concurrent for Dwarf_Sig8_Hash.

Dwarf_Sig8_Hash is used as part of a Dwarf object to quickly find a type
unit based on the type signature. Use the concurrent variant of the dynamic
size hash to make it thread-safe to use.

Signed-off-by: Mark Wielaard <mark@klomp.org>
---
 libdw/ChangeLog           | 8 ++++++++
 libdw/dwarf_formref_die.c | 2 +-
 libdw/dwarf_sig8_hash.c   | 2 +-
 libdw/dwarf_sig8_hash.h   | 9 +++++++--
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index 28563b4d..61231e5c 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,11 @@
+2019-10-28  Mark Wielaard  <mark@klomp.org>
+
+	* dwarf_sig8_hash.h: Include libdw.h. Remove COMPARE. Include
+	dynamicsizehash_concurrent.h.
+	* dwarf_sig8_hash.c: Include dynamicsizehash_concurrent.c.
+	* dwarf_formref_die.c (dwarf_formref_die): Drop NULL argument to
+	Dwarf_Sig8_Hash_find.
+
 2019-08-26  Srđan Milaković  <sm108@rice.edu@rice.edu>
 
 	* dwarf_abbrev_hash.{c,h}: Use the *_concurrent hash table.
diff --git a/libdw/dwarf_formref_die.c b/libdw/dwarf_formref_die.c
index f196331a..48ba8194 100644
--- a/libdw/dwarf_formref_die.c
+++ b/libdw/dwarf_formref_die.c
@@ -83,7 +83,7 @@ dwarf_formref_die (Dwarf_Attribute *attr, Dwarf_Die *result)
 	 have to match in the type unit headers.  */
 
       uint64_t sig = read_8ubyte_unaligned (cu->dbg, attr->valp);
-      cu = Dwarf_Sig8_Hash_find (&cu->dbg->sig8_hash, sig, NULL);
+      cu = Dwarf_Sig8_Hash_find (&cu->dbg->sig8_hash, sig);
       if (cu == NULL)
 	{
 	  /* Not seen before.  We have to scan through the type units.
diff --git a/libdw/dwarf_sig8_hash.c b/libdw/dwarf_sig8_hash.c
index 043cac78..777f9ebc 100644
--- a/libdw/dwarf_sig8_hash.c
+++ b/libdw/dwarf_sig8_hash.c
@@ -38,4 +38,4 @@
 #define next_prime __libdwarf_next_prime
 extern size_t next_prime (size_t) attribute_hidden;
 
-#include <dynamicsizehash.c>
+#include <dynamicsizehash_concurrent.c>
diff --git a/libdw/dwarf_sig8_hash.h b/libdw/dwarf_sig8_hash.h
index 705ffbcd..c399919a 100644
--- a/libdw/dwarf_sig8_hash.h
+++ b/libdw/dwarf_sig8_hash.h
@@ -29,10 +29,15 @@
 #ifndef _DWARF_SIG8_HASH_H
 #define _DWARF_SIG8_HASH_H	1
 
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include "libdw.h"
+
 #define NAME Dwarf_Sig8_Hash
 #define TYPE struct Dwarf_CU *
-#define COMPARE(a, b) (0)
 
-#include <dynamicsizehash.h>
+#include <dynamicsizehash_concurrent.h>
 
 #endif	/* dwarf_sig8_hash.h */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-10-28 20:12                         ` Mark Wielaard
@ 2019-11-04 16:21                           ` Mark Wielaard
  0 siblings, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-11-04 16:21 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

[-- Attachment #1: Type: text/plain, Size: 1036 bytes --]

Hi,

On Mon, 2019-10-28 at 21:12 +0100, Mark Wielaard wrote:
> On Sun, 2019-10-27 at 17:13 +0100, Mark Wielaard wrote:
> > On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> > > A quick grep -r shows that ITERATE and REVERSE are used for the 
> > > asm_symbol_tab. If iteration and insertion are not concurrent we can 
> > > easily add bidirectional iteration (using the same Treiber stack-like 
> > > structure as used for the memory management). Also COMPARE is not 
> > > defined to be (0) in this instance.
> > 
> > Yes. We would need some other solution for the libasm code. But I think
> > we can use the concurrent table for everything in libdw.
> 
> And everything in libdw besides the abbrev hash is the sig8 hash.
> And adopting the concurrent dynamic size hash for that is almost
> trivial. See attached.
> 
> I only tested it lightly because I don't have any large projects build
> with -fdebug-types-dection. But it seems to work as intended.

Any comment on this patch?

Thanks,

Mark

[-- Attachment #2: 0001-libdw-Use-dynamicsizehash_concurrent-for-Dwarf_Sig8_.patch --]
[-- Type: text/x-patch, Size: 2662 bytes --]

From a4bf12a8b906911172897f14d62b3f43cc4066ad Mon Sep 17 00:00:00 2001
From: Mark Wielaard <mark@klomp.org>
Date: Mon, 28 Oct 2019 20:54:35 +0100
Subject: [PATCH] libdw: Use dynamicsizehash_concurrent for Dwarf_Sig8_Hash.

Dwarf_Sig8_Hash is used as part of a Dwarf object to quickly find a type
unit based on the type signature. Use the concurrent variant of the dynamic
size hash to make it thread-safe to use.

Signed-off-by: Mark Wielaard <mark@klomp.org>
---
 libdw/ChangeLog           | 8 ++++++++
 libdw/dwarf_formref_die.c | 2 +-
 libdw/dwarf_sig8_hash.c   | 2 +-
 libdw/dwarf_sig8_hash.h   | 9 +++++++--
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index 28563b4d..61231e5c 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,11 @@
+2019-10-28  Mark Wielaard  <mark@klomp.org>
+
+	* dwarf_sig8_hash.h: Include libdw.h. Remove COMPARE. Include
+	dynamicsizehash_concurrent.h.
+	* dwarf_sig8_hash.c: Include dynamicsizehash_concurrent.c.
+	* dwarf_formref_die.c (dwarf_formref_die): Drop NULL argument to
+	Dwarf_Sig8_Hash_find.
+
 2019-08-26  Srđan Milaković  <sm108@rice.edu@rice.edu>
 
 	* dwarf_abbrev_hash.{c,h}: Use the *_concurrent hash table.
diff --git a/libdw/dwarf_formref_die.c b/libdw/dwarf_formref_die.c
index f196331a..48ba8194 100644
--- a/libdw/dwarf_formref_die.c
+++ b/libdw/dwarf_formref_die.c
@@ -83,7 +83,7 @@ dwarf_formref_die (Dwarf_Attribute *attr, Dwarf_Die *result)
 	 have to match in the type unit headers.  */
 
       uint64_t sig = read_8ubyte_unaligned (cu->dbg, attr->valp);
-      cu = Dwarf_Sig8_Hash_find (&cu->dbg->sig8_hash, sig, NULL);
+      cu = Dwarf_Sig8_Hash_find (&cu->dbg->sig8_hash, sig);
       if (cu == NULL)
 	{
 	  /* Not seen before.  We have to scan through the type units.
diff --git a/libdw/dwarf_sig8_hash.c b/libdw/dwarf_sig8_hash.c
index 043cac78..777f9ebc 100644
--- a/libdw/dwarf_sig8_hash.c
+++ b/libdw/dwarf_sig8_hash.c
@@ -38,4 +38,4 @@
 #define next_prime __libdwarf_next_prime
 extern size_t next_prime (size_t) attribute_hidden;
 
-#include <dynamicsizehash.c>
+#include <dynamicsizehash_concurrent.c>
diff --git a/libdw/dwarf_sig8_hash.h b/libdw/dwarf_sig8_hash.h
index 705ffbcd..c399919a 100644
--- a/libdw/dwarf_sig8_hash.h
+++ b/libdw/dwarf_sig8_hash.h
@@ -29,10 +29,15 @@
 #ifndef _DWARF_SIG8_HASH_H
 #define _DWARF_SIG8_HASH_H	1
 
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include "libdw.h"
+
 #define NAME Dwarf_Sig8_Hash
 #define TYPE struct Dwarf_CU *
-#define COMPARE(a, b) (0)
 
-#include <dynamicsizehash.h>
+#include <dynamicsizehash_concurrent.h>
 
 #endif	/* dwarf_sig8_hash.h */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-10-26  4:11                     ` Jonathon Anderson
  2019-10-27 16:13                       ` Mark Wielaard
@ 2019-11-04 16:19                       ` Mark Wielaard
  2019-11-04 17:03                         ` [PATCH] " Jonathon Anderson
  1 sibling, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-11-04 16:19 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

[-- Attachment #1: Type: text/plain, Size: 644 bytes --]

Hi,

On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> > I tried to simplify the code a little. You already observed that
> > COMPARE can be zero. But it always is. We never try comparing values.
> > So all the COMPARE and value passing to the find functions can simply
> > be removed. So if you agree I would like to merge the attached
> > simplification diff into this patch.
> 
> I'm fine with it, although at a glance it seems that this means 
> insert_helper will never return -1. Which doesn't quite sound accurate, 
> so I'll have to defer to Srdan on this one.

Srdan, any feedback on this?

Thanks,

Mark

[-- Attachment #2: 0001-no-compare-or-val-pass.patch --]
[-- Type: text/x-patch, Size: 6122 bytes --]

From b70b350242d9752f41407c0ed7fe4683c8f31ce6 Mon Sep 17 00:00:00 2001
From: Mark Wielaard <mark@klomp.org>
Date: Sat, 26 Oct 2019 01:54:43 +0200
Subject: [PATCH] no compare or val pass

---
 lib/dynamicsizehash_concurrent.c | 51 +++++---------------------------
 lib/dynamicsizehash_concurrent.h |  2 +-
 libdw/dwarf_abbrev_hash.h        |  1 -
 libdw/dwarf_getabbrev.c          |  2 +-
 libdw/dwarf_tag.c                |  2 +-
 5 files changed, 10 insertions(+), 48 deletions(-)

diff --git a/lib/dynamicsizehash_concurrent.c b/lib/dynamicsizehash_concurrent.c
index d645b143..5fa38713 100644
--- a/lib/dynamicsizehash_concurrent.c
+++ b/lib/dynamicsizehash_concurrent.c
@@ -36,46 +36,24 @@
 
    NAME      name of the hash table structure.
    TYPE      data type of the hash table entries
-   COMPARE   comparison function taking two pointers to TYPE objects
-
-   The following macros if present select features:
-
-   ITERATE   iterating over the table entries is possible
-   REVERSE   iterate in reverse order of insert
  */
 
 
 static size_t
-lookup (NAME *htab, HASHTYPE hval, TYPE val __attribute__ ((unused)))
+lookup (NAME *htab, HASHTYPE hval)
 {
   /* First hash function: simply take the modul but prevent zero.  Small values
       can skip the division, which helps performance when this is common.  */
   size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
 
-#if COMPARE != 0  /* A handful of tables don't actually compare the entries in
-                    the table, they instead rely on the hash.  In that case, we
-                    can skip parts that relate to the value. */
-  TYPE val_ptr;
-#endif
   HASHTYPE hash;
 
   hash = atomic_load_explicit(&htab->table[idx].hashval,
                               memory_order_acquire);
   if (hash == hval)
-    {
-#if COMPARE == 0
-      return idx;
-#else
-      val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
-                                            memory_order_acquire);
-      if (COMPARE(val_ptr, val) == 0)
-          return idx;
-#endif
-    }
+    return idx;
   else if (hash == 0)
-    {
-      return 0;
-    }
+    return 0;
 
   /* Second hash function as suggested in [Knuth].  */
   HASHTYPE second_hash = 1 + hval % (htab->size - 2);
@@ -90,20 +68,9 @@ lookup (NAME *htab, HASHTYPE hval, TYPE val __attribute__ ((unused)))
       hash = atomic_load_explicit(&htab->table[idx].hashval,
                                   memory_order_acquire);
       if (hash == hval)
-        {
-#if COMPARE == 0
-          return idx;
-#else
-          val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
-                                                memory_order_acquire);
-          if (COMPARE(val_ptr, val) == 0)
-              return idx;
-#endif
-        }
+	return idx;
       else if (hash == 0)
-        {
-          return 0;
-        }
+	return 0;
     }
 }
 
@@ -123,8 +90,6 @@ insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
     {
       val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
                                             memory_order_acquire);
-      if (COMPARE(val_ptr, val) != 0)
-          return -1;
     }
   else if (hash == 0)
     {
@@ -168,8 +133,6 @@ insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
         {
           val_ptr = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
                                                 memory_order_acquire);
-          if (COMPARE(val_ptr, val) != 0)
-              return -1;
         }
       else if (hash == 0)
         {
@@ -495,7 +458,7 @@ TYPE
 #define FIND(name) _FIND (name)
 #define _FIND(name) \
   name##_find
-FIND(NAME) (NAME *htab, HASHTYPE hval, TYPE val)
+FIND(NAME) (NAME *htab, HASHTYPE hval)
 {
   while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
       resize_worker(htab);
@@ -504,7 +467,7 @@ FIND(NAME) (NAME *htab, HASHTYPE hval, TYPE val)
 
   /* Make the hash data nonzero.  */
   hval = hval ?: 1;
-  idx = lookup(htab, hval, val);
+  idx = lookup(htab, hval);
 
   if (idx == 0)
     {
diff --git a/lib/dynamicsizehash_concurrent.h b/lib/dynamicsizehash_concurrent.h
index a137cbd0..73e66e91 100644
--- a/lib/dynamicsizehash_concurrent.h
+++ b/lib/dynamicsizehash_concurrent.h
@@ -97,7 +97,7 @@ extern int name##_free (name *htab);                                \
 extern int name##_insert (name *htab, HASHTYPE hval, TYPE data);    \
                                                                     \
 /* Find entry in hash table.  */                                    \
-extern TYPE name##_find (name *htab, HASHTYPE hval, TYPE val);
+extern TYPE name##_find (name *htab, HASHTYPE hval);
 #define FUNCTIONS(name) _FUNCTIONS (name)
 FUNCTIONS (NAME)
 
diff --git a/libdw/dwarf_abbrev_hash.h b/libdw/dwarf_abbrev_hash.h
index bc3d62c7..a368c598 100644
--- a/libdw/dwarf_abbrev_hash.h
+++ b/libdw/dwarf_abbrev_hash.h
@@ -32,7 +32,6 @@
 
 #define NAME Dwarf_Abbrev_Hash
 #define TYPE Dwarf_Abbrev *
-#define COMPARE(a, b) (0)
 
 #include <dynamicsizehash_concurrent.h>
 
diff --git a/libdw/dwarf_getabbrev.c b/libdw/dwarf_getabbrev.c
index 6a7e981b..7e767fc1 100644
--- a/libdw/dwarf_getabbrev.c
+++ b/libdw/dwarf_getabbrev.c
@@ -83,7 +83,7 @@ __libdw_getabbrev (Dwarf *dbg, struct Dwarf_CU *cu, Dwarf_Off offset,
   bool foundit = false;
   Dwarf_Abbrev *abb = NULL;
   if (cu == NULL
-      || (abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code, NULL)) == NULL)
+      || (abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code)) == NULL)
     {
       if (result == NULL)
 	abb = libdw_typed_alloc (dbg, Dwarf_Abbrev);
diff --git a/libdw/dwarf_tag.c b/libdw/dwarf_tag.c
index 331eaa0d..d784970c 100644
--- a/libdw/dwarf_tag.c
+++ b/libdw/dwarf_tag.c
@@ -45,7 +45,7 @@ __libdw_findabbrev (struct Dwarf_CU *cu, unsigned int code)
     return DWARF_END_ABBREV;
 
   /* See whether the entry is already in the hash table.  */
-  abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code, NULL);
+  abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code);
   if (abb == NULL)
     while (cu->last_abbrev_offset != (size_t) -1l)
       {
-- 
2.18.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-04 16:19                       ` Mark Wielaard
@ 2019-11-04 17:03                         ` Jonathon Anderson
  2019-11-07 11:07                           ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-11-04 17:03 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

From: SrÄ‘an MilakoviÄ‡ <sm108@rice.edu>

Signed-off-by: SrÄ‘an MilakoviÄ‡ <sm108@rice.edu>
---
Apologies for the delay, things have been busy here the past week.

> On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> > I tried to simplify the code a little. You already observed that
> > COMPARE can be zero. But it always is. We never try comparing values.
> > So all the COMPARE and value passing to the find functions can simply
> > be removed. So if you agree I would like to merge the attached
> > simplification diff into this patch.
> >
> > I'm fine with it, although at a glance it seems that this means
> > insert_helper will never return -1. Which doesn't quite sound accurate,
> > so I'll have to defer to Srdan on this one.
>
> Srdan, any feedback on this?

Apologies, I talked with Srdan in person and forgot to relay the message. He
gave me an updated version of the hash table that handles that issue (and
another previously harmless typo).

> On Mon, 2019-10-28 at 21:12 +0100, Mark Wielaard wrote:
> > On Sun, 2019-10-27 at 17:13 +0100, Mark Wielaard wrote:
> > > On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> > > > A quick grep -r shows that ITERATE and REVERSE are used for the
> > > > asm_symbol_tab. If iteration and insertion are not concurrent we can
> > > > easily add bidirectional iteration (using the same Treiber stack-like
> > > > structure as used for the memory management). Also COMPARE is not
> > > > defined to be (0) in this instance.
> > >
> > > Yes. We would need some other solution for the libasm code. But I think
> > > we can use the concurrent table for everything in libdw.
> >
> > And everything in libdw besides the abbrev hash is the sig8 hash.
> > And adopting the concurrent dynamic size hash for that is almost
> > trivial. See attached.
> >
> > I only tested it lightly because I don't have any large projects build
> > with -fdebug-types-dection. But it seems to work as intended.
>
> Any comment on this patch?

I'm a little wary of applying this to Sig8, only because the contents are
entire CUs (rather than just Abbrevs). It seems to work though, so I just
need to make a more careful pass to convince myself that it works in
general too.

This patch has the updated hash table and all the other suggested patches
integrated in, it compiles and passes both the test suite and our tests.

 lib/ChangeLog                    |   5 +
 lib/Makefile.am                  |   4 +-
 lib/dynamicsizehash_concurrent.c | 482 +++++++++++++++++++++++++++++++
 lib/dynamicsizehash_concurrent.h | 118 ++++++++
 libdw/ChangeLog                  |  12 +
 libdw/dwarf_abbrev_hash.c        |   2 +-
 libdw/dwarf_abbrev_hash.h        |   3 +-
 libdw/dwarf_formref_die.c        |   2 +-
 libdw/dwarf_getabbrev.c          |   2 +-
 libdw/dwarf_sig8_hash.c          |   2 +-
 libdw/dwarf_sig8_hash.h          |   9 +-
 libdw/dwarf_tag.c                |   2 +-
 12 files changed, 632 insertions(+), 11 deletions(-)
 create mode 100644 lib/dynamicsizehash_concurrent.c
 create mode 100644 lib/dynamicsizehash_concurrent.h

diff --git a/lib/ChangeLog b/lib/ChangeLog
index 3799c3aa..51c79841 100644
--- a/lib/ChangeLog
+++ b/lib/ChangeLog
@@ -1,3 +1,8 @@
+2019-08-25  SrÄ‘an MilakoviÄ‡  <sm108@rice.edu>
+
+	* dynamicsizehash_concurrent.{c,h}: New files.
+	* Makefile.am (noinst_HEADERS): Added dynamicsizehash_concurrent.h.
+
 2019-08-25  Jonathon Anderson  <jma14@rice.edu>
 
 	* stdatomic-fbsd.h: New file, taken from FreeBSD.
diff --git a/lib/Makefile.am b/lib/Makefile.am
index 3086cf06..97bf7329 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -39,8 +39,8 @@ libeu_a_SOURCES = xstrdup.c xstrndup.c xmalloc.c next_prime.c \
 
 noinst_HEADERS = fixedsizehash.h libeu.h system.h dynamicsizehash.h list.h \
 		 eu-config.h color.h printversion.h bpf.h \
-		 atomics.h stdatomic-fbsd.h
-EXTRA_DIST = dynamicsizehash.c
+		 atomics.h stdatomic-fbsd.h dynamicsizehash_concurrent.h
+EXTRA_DIST = dynamicsizehash.c dynamicsizehash_concurrent.c
 
 if !GPROF
 xmalloc_CFLAGS = -ffunction-sections
diff --git a/lib/dynamicsizehash_concurrent.c b/lib/dynamicsizehash_concurrent.c
new file mode 100644
index 00000000..2d53bec6
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.c
@@ -0,0 +1,482 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <stdlib.h>
+#include <system.h>
+#include <pthread.h>
+
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+ */
+
+
+static size_t
+lookup (NAME *htab, HASHTYPE hval)
+{
+  /* First hash function: simply take the modul but prevent zero.  Small values
+      can skip the division, which helps performance when this is common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    return idx;
+  else if (hash == 0)
+    return 0;
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+	return idx;
+      else if (hash == 0)
+	return 0;
+    }
+}
+
+static int
+insert_helper (NAME *htab, HASHTYPE hval, TYPE val)
+{
+  /* First hash function: simply take the modul but prevent zero.  Small values
+      can skip the division, which helps performance when this is common.  */
+  size_t idx = 1 + (hval < htab->size ? hval : hval % htab->size);
+
+  TYPE val_ptr;
+  HASHTYPE hash;
+
+  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                              memory_order_acquire);
+  if (hash == hval)
+    return -1;
+  else if (hash == 0)
+    {
+      val_ptr = NULL;
+      atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                              (uintptr_t *) &val_ptr,
+                                              (uintptr_t) val,
+                                              memory_order_acquire,
+                                              memory_order_acquire);
+
+      if (val_ptr == NULL)
+        {
+          atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                memory_order_release);
+          return 0;
+        }
+      else
+        {
+          do
+            {
+              hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                          memory_order_acquire);
+            }
+          while (hash == 0);
+          if (hash == hval)
+            return -1;
+        }
+    }
+
+  /* Second hash function as suggested in [Knuth].  */
+  HASHTYPE second_hash = 1 + hval % (htab->size - 2);
+
+  for(;;)
+    {
+      if (idx <= second_hash)
+          idx = htab->size + idx - second_hash;
+      else
+          idx -= second_hash;
+
+      hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                  memory_order_acquire);
+      if (hash == hval)
+        return -1;
+      else if (hash == 0)
+        {
+          val_ptr = NULL;
+          atomic_compare_exchange_strong_explicit(&htab->table[idx].val_ptr,
+                                                  (uintptr_t *) &val_ptr,
+                                                  (uintptr_t) val,
+                                                  memory_order_acquire,
+                                                  memory_order_acquire);
+
+          if (val_ptr == NULL)
+            {
+              atomic_store_explicit(&htab->table[idx].hashval, hval,
+                                    memory_order_release);
+              return 0;
+            }
+          else
+            {
+              do
+                {
+                  hash = atomic_load_explicit(&htab->table[idx].hashval,
+                                              memory_order_acquire);
+                }
+              while (hash == 0);
+              if (hash == hval)
+                return -1;
+            }
+        }
+    }
+}
+
+#define NO_RESIZING 0u
+#define ALLOCATING_MEMORY 1u
+#define MOVING_DATA 3u
+#define CLEANING 2u
+
+#define STATE_BITS 2u
+#define STATE_INCREMENT (1u << STATE_BITS)
+#define STATE_MASK (STATE_INCREMENT - 1)
+#define GET_STATE(A) ((A) & STATE_MASK)
+
+#define IS_NO_RESIZE_OR_CLEANING(A) (((A) & 0x1u) == 0)
+
+#define GET_ACTIVE_WORKERS(A) ((A) >> STATE_BITS)
+
+#define INITIALIZATION_BLOCK_SIZE 256
+#define MOVE_BLOCK_SIZE 256
+#define CEIL(A, B) (((A) + (B) - 1) / (B))
+
+/* Initializes records and copies the data from the old table.
+   It can share work with other threads */
+static void resize_helper(NAME *htab, int blocking)
+{
+  size_t num_old_blocks = CEIL(htab->old_size, MOVE_BLOCK_SIZE);
+  size_t num_new_blocks = CEIL(htab->size, INITIALIZATION_BLOCK_SIZE);
+
+  size_t my_block;
+  size_t num_finished_blocks = 0;
+
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_init_block, 1,
+                                                memory_order_acquire))
+                                                    < num_new_blocks)
+    {
+      size_t record_it = my_block * INITIALIZATION_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * INITIALIZATION_BLOCK_SIZE;
+      if (record_end > htab->size)
+          record_end = htab->size;
+
+      while (record_it++ != record_end)
+        {
+          atomic_init(&htab->table[record_it].hashval, (uintptr_t) NULL);
+          atomic_init(&htab->table[record_it].val_ptr, (uintptr_t) NULL);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_initialized_blocks,
+                            num_finished_blocks, memory_order_release);
+  while (atomic_load_explicit(&htab->num_initialized_blocks,
+                              memory_order_acquire) != num_new_blocks);
+
+  /* All block are initialized, start moving */
+  num_finished_blocks = 0;
+  while ((my_block = atomic_fetch_add_explicit(&htab->next_move_block, 1,
+                                                memory_order_acquire))
+                                                    < num_old_blocks)
+    {
+      size_t record_it = my_block * MOVE_BLOCK_SIZE;
+      size_t record_end = (my_block + 1) * MOVE_BLOCK_SIZE;
+      if (record_end > htab->old_size)
+          record_end = htab->old_size;
+
+      while (record_it++ != record_end)
+        {
+          TYPE val_ptr = (TYPE) atomic_load_explicit(
+              &htab->old_table[record_it].val_ptr,
+              memory_order_acquire);
+          if (val_ptr == NULL)
+              continue;
+
+          HASHTYPE hashval = atomic_load_explicit(
+              &htab->old_table[record_it].hashval,
+              memory_order_acquire);
+          assert(hashval);
+
+          insert_helper(htab, hashval, val_ptr);
+        }
+
+      num_finished_blocks++;
+    }
+
+  atomic_fetch_add_explicit(&htab->num_moved_blocks, num_finished_blocks,
+                            memory_order_release);
+
+  if (blocking)
+      while (atomic_load_explicit(&htab->num_moved_blocks,
+                                  memory_order_acquire) != num_old_blocks);
+}
+
+static void
+resize_master(NAME *htab)
+{
+  htab->old_size = htab->size;
+  htab->old_table = htab->table;
+
+  htab->size = next_prime(htab->size * 2);
+  htab->table = malloc((1 + htab->size) * sizeof(htab->table[0]));
+  assert(htab->table);
+
+  /* Change state from ALLOCATING_MEMORY to MOVING_DATA */
+  atomic_fetch_xor_explicit(&htab->resizing_state,
+                            ALLOCATING_MEMORY ^ MOVING_DATA,
+                            memory_order_release);
+
+  resize_helper(htab, 1);
+
+  /* Change state from MOVING_DATA to CLEANING */
+  size_t resize_state = atomic_fetch_xor_explicit(&htab->resizing_state,
+                                                  MOVING_DATA ^ CLEANING,
+                                                  memory_order_acq_rel);
+  while (GET_ACTIVE_WORKERS(resize_state) != 0)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* There are no more active workers */
+  atomic_store_explicit(&htab->next_init_block, 0, memory_order_relaxed);
+  atomic_store_explicit(&htab->num_initialized_blocks, 0,
+                        memory_order_relaxed);
+
+  atomic_store_explicit(&htab->next_move_block, 0, memory_order_relaxed);
+  atomic_store_explicit(&htab->num_moved_blocks, 0, memory_order_relaxed);
+
+  free(htab->old_table);
+
+  /* Change state to NO_RESIZING */
+  atomic_fetch_xor_explicit(&htab->resizing_state, CLEANING ^ NO_RESIZING,
+                            memory_order_relaxed);
+
+}
+
+static void
+resize_worker(NAME *htab)
+{
+  size_t resize_state = atomic_load_explicit(&htab->resizing_state,
+                                              memory_order_acquire);
+
+  /* If the resize has finished */
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+      return;
+
+  /* Register as worker and check if the resize has finished in the meantime*/
+  resize_state = atomic_fetch_add_explicit(&htab->resizing_state,
+                                            STATE_INCREMENT,
+                                            memory_order_acquire);
+  if (IS_NO_RESIZE_OR_CLEANING(resize_state))
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  /* Wait while the new table is being allocated. */
+  while (GET_STATE(resize_state) == ALLOCATING_MEMORY)
+      resize_state = atomic_load_explicit(&htab->resizing_state,
+                                          memory_order_acquire);
+
+  /* Check if the resize is done */
+  assert(GET_STATE(resize_state) != NO_RESIZING);
+  if (GET_STATE(resize_state) == CLEANING)
+    {
+      atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                                memory_order_relaxed);
+      return;
+    }
+
+  resize_helper(htab, 0);
+
+  /* Deregister worker */
+  atomic_fetch_sub_explicit(&htab->resizing_state, STATE_INCREMENT,
+                            memory_order_release);
+}
+
+
+int
+#define INIT(name) _INIT (name)
+#define _INIT(name) \
+  name##_init
+INIT(NAME) (NAME *htab, size_t init_size)
+{
+  /* We need the size to be a prime.  */
+  init_size = next_prime (init_size);
+
+  /* Initialize the data structure.  */
+  htab->size = init_size;
+  atomic_init(&htab->filled, 0);
+  atomic_init(&htab->resizing_state, 0);
+
+  atomic_init(&htab->next_init_block, 0);
+  atomic_init(&htab->num_initialized_blocks, 0);
+
+  atomic_init(&htab->next_move_block, 0);
+  atomic_init(&htab->num_moved_blocks, 0);
+
+  pthread_rwlock_init(&htab->resize_rwl, NULL);
+
+  htab->table = (void *) malloc ((init_size + 1) * sizeof (htab->table[0]));
+  if (htab->table == NULL)
+      return -1;
+
+  for (size_t i = 0; i <= init_size; i++)
+    {
+      atomic_init(&htab->table[i].hashval, (uintptr_t) NULL);
+      atomic_init(&htab->table[i].val_ptr, (uintptr_t) NULL);
+    }
+
+  return 0;
+}
+
+
+int
+#define FREE(name) _FREE (name)
+#define _FREE(name) \
+name##_free
+FREE(NAME) (NAME *htab)
+{
+  pthread_rwlock_destroy(&htab->resize_rwl);
+  free (htab->table);
+  return 0;
+}
+
+
+int
+#define INSERT(name) _INSERT (name)
+#define _INSERT(name) \
+name##_insert
+INSERT(NAME) (NAME *htab, HASHTYPE hval, TYPE data)
+{
+  int incremented = 0;
+
+  for(;;)
+    {
+      while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+          resize_worker(htab);
+
+      size_t filled;
+      if (!incremented)
+        {
+          filled = atomic_fetch_add_explicit(&htab->filled, 1,
+                                              memory_order_acquire);
+          incremented = 1;
+        }
+      else
+        {
+          filled = atomic_load_explicit(&htab->filled,
+                                        memory_order_acquire);
+        }
+
+
+      if (100 * filled > 90 * htab->size)
+        {
+          /* Table is filled more than 90%.  Resize the table.  */
+
+          size_t resizing_state = atomic_load_explicit(&htab->resizing_state,
+                                                        memory_order_acquire);
+          if (resizing_state == 0 &&
+              atomic_compare_exchange_strong_explicit(&htab->resizing_state,
+                                                      &resizing_state,
+                                                      ALLOCATING_MEMORY,
+                                                      memory_order_acquire,
+                                                      memory_order_acquire))
+            {
+              /* Master thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+              pthread_rwlock_wrlock(&htab->resize_rwl);
+              resize_master(htab);
+              pthread_rwlock_unlock(&htab->resize_rwl);
+
+            }
+          else
+            {
+              /* Worker thread */
+              pthread_rwlock_unlock(&htab->resize_rwl);
+              resize_worker(htab);
+            }
+        }
+      else
+        {
+          /* Lock acquired, no need for resize*/
+          break;
+        }
+    }
+
+  int ret_val = insert_helper(htab, hval, data);
+  if (ret_val == -1)
+      atomic_fetch_sub_explicit(&htab->filled, 1, memory_order_relaxed);
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
+
+
+
+TYPE
+#define FIND(name) _FIND (name)
+#define _FIND(name) \
+  name##_find
+FIND(NAME) (NAME *htab, HASHTYPE hval)
+{
+  while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
+      resize_worker(htab);
+
+  size_t idx;
+
+  /* Make the hash data nonzero.  */
+  hval = hval ?: 1;
+  idx = lookup(htab, hval);
+
+  if (idx == 0)
+    {
+      pthread_rwlock_unlock(&htab->resize_rwl);
+      return NULL;
+    }
+
+  /* get a copy before unlocking the lock */
+  TYPE ret_val = (TYPE) atomic_load_explicit(&htab->table[idx].val_ptr,
+                                             memory_order_relaxed);
+
+  pthread_rwlock_unlock(&htab->resize_rwl);
+  return ret_val;
+}
diff --git a/lib/dynamicsizehash_concurrent.h b/lib/dynamicsizehash_concurrent.h
new file mode 100644
index 00000000..73e66e91
--- /dev/null
+++ b/lib/dynamicsizehash_concurrent.h
@@ -0,0 +1,118 @@
+/* Copyright (C) 2000-2019 Red Hat, Inc.
+   This file is part of elfutils.
+   Written by Srdan Milakovic <sm108@rice.edu>, 2019.
+   Derived from Ulrich Drepper <drepper@redhat.com>, 2000.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include <stddef.h>
+#include <pthread.h>
+#include "atomics.h"
+/* Before including this file the following macros must be defined:
+
+   NAME      name of the hash table structure.
+   TYPE      data type of the hash table entries
+
+   The following macros if present select features:
+
+   ITERATE   iterating over the table entries is possible
+   HASHTYPE  integer type for hash values, default unsigned long int
+ */
+
+
+
+#ifndef HASHTYPE
+# define HASHTYPE unsigned long int
+#endif
+
+#ifndef RESIZE_BLOCK_SIZE
+# define RESIZE_BLOCK_SIZE 256
+#endif
+
+/* Defined separately.  */
+extern size_t next_prime (size_t seed);
+
+
+/* Table entry type.  */
+#define _DYNHASHCONENTTYPE(name)       \
+  typedef struct name##_ent         \
+  {                                 \
+    _Atomic(HASHTYPE) hashval;      \
+    atomic_uintptr_t val_ptr;       \
+  } name##_ent
+#define DYNHASHENTTYPE(name) _DYNHASHCONENTTYPE (name)
+DYNHASHENTTYPE (NAME);
+
+/* Type of the dynamic hash table data structure.  */
+#define _DYNHASHCONTYPE(name) \
+typedef struct                                     \
+{                                                  \
+  size_t size;                                     \
+  size_t old_size;                                 \
+  atomic_size_t filled;                            \
+  name##_ent *table;                               \
+  name##_ent *old_table;                           \
+  atomic_size_t resizing_state;                    \
+  atomic_size_t next_init_block;                   \
+  atomic_size_t num_initialized_blocks;            \
+  atomic_size_t next_move_block;                   \
+  atomic_size_t num_moved_blocks;                  \
+  pthread_rwlock_t resize_rwl;                     \
+} name
+#define DYNHASHTYPE(name) _DYNHASHCONTYPE (name)
+DYNHASHTYPE (NAME);
+
+
+
+#define _FUNCTIONS(name)                                            \
+/* Initialize the hash table.  */                                   \
+extern int name##_init (name *htab, size_t init_size);              \
+                                                                    \
+/* Free resources allocated for hash table.  */                     \
+extern int name##_free (name *htab);                                \
+                                                                    \
+/* Insert new entry.  */                                            \
+extern int name##_insert (name *htab, HASHTYPE hval, TYPE data);    \
+                                                                    \
+/* Find entry in hash table.  */                                    \
+extern TYPE name##_find (name *htab, HASHTYPE hval);
+#define FUNCTIONS(name) _FUNCTIONS (name)
+FUNCTIONS (NAME)
+
+
+#ifndef NO_UNDEF
+# undef DYNHASHENTTYPE
+# undef DYNHASHTYPE
+# undef FUNCTIONS
+# undef _FUNCTIONS
+# undef XFUNCTIONS
+# undef _XFUNCTIONS
+# undef NAME
+# undef TYPE
+# undef ITERATE
+# undef COMPARE
+# undef FIRST
+# undef NEXT
+#endif
diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index 036cee61..927793f9 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,15 @@
+2019-10-28  Mark Wielaard  <mark@klomp.org>
+
+	* dwarf_sig8_hash.h: Include libdw.h. Remove COMPARE. Include
+	dynamicsizehash_concurrent.h.
+	* dwarf_sig8_hash.c: Include dynamicsizehash_concurrent.c.
+	* dwarf_formref_die.c (dwarf_formref_die): Drop NULL argument to
+	Dwarf_Sig8_Hash_find.
+
+2019-08-26  SrÄ‘an MilakoviÄ‡  <sm108@rice.edu@rice.edu>
+
+	* dwarf_abbrev_hash.{c,h}: Use the *_concurrent hash table.
+
 2019-10-28  Jonathon Anderson  <jma14@rice.edu>
 
 	* libdw_alloc.c: Added __libdw_alloc_tail.
diff --git a/libdw/dwarf_abbrev_hash.c b/libdw/dwarf_abbrev_hash.c
index f52f5ad5..c2548140 100644
--- a/libdw/dwarf_abbrev_hash.c
+++ b/libdw/dwarf_abbrev_hash.c
@@ -38,7 +38,7 @@
 #define next_prime __libdwarf_next_prime
 extern size_t next_prime (size_t) attribute_hidden;
 
-#include <dynamicsizehash.c>
+#include <dynamicsizehash_concurrent.c>
 
 #undef next_prime
 #define next_prime attribute_hidden __libdwarf_next_prime
diff --git a/libdw/dwarf_abbrev_hash.h b/libdw/dwarf_abbrev_hash.h
index d2f02ccc..a368c598 100644
--- a/libdw/dwarf_abbrev_hash.h
+++ b/libdw/dwarf_abbrev_hash.h
@@ -32,8 +32,7 @@
 
 #define NAME Dwarf_Abbrev_Hash
 #define TYPE Dwarf_Abbrev *
-#define COMPARE(a, b) (0)
 
-#include <dynamicsizehash.h>
+#include <dynamicsizehash_concurrent.h>
 
 #endif	/* dwarf_abbrev_hash.h */
diff --git a/libdw/dwarf_formref_die.c b/libdw/dwarf_formref_die.c
index f196331a..48ba8194 100644
--- a/libdw/dwarf_formref_die.c
+++ b/libdw/dwarf_formref_die.c
@@ -83,7 +83,7 @@ dwarf_formref_die (Dwarf_Attribute *attr, Dwarf_Die *result)
 	 have to match in the type unit headers.  */
 
       uint64_t sig = read_8ubyte_unaligned (cu->dbg, attr->valp);
-      cu = Dwarf_Sig8_Hash_find (&cu->dbg->sig8_hash, sig, NULL);
+      cu = Dwarf_Sig8_Hash_find (&cu->dbg->sig8_hash, sig);
       if (cu == NULL)
 	{
 	  /* Not seen before.  We have to scan through the type units.
diff --git a/libdw/dwarf_getabbrev.c b/libdw/dwarf_getabbrev.c
index 6a7e981b..7e767fc1 100644
--- a/libdw/dwarf_getabbrev.c
+++ b/libdw/dwarf_getabbrev.c
@@ -83,7 +83,7 @@ __libdw_getabbrev (Dwarf *dbg, struct Dwarf_CU *cu, Dwarf_Off offset,
   bool foundit = false;
   Dwarf_Abbrev *abb = NULL;
   if (cu == NULL
-      || (abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code, NULL)) == NULL)
+      || (abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code)) == NULL)
     {
       if (result == NULL)
 	abb = libdw_typed_alloc (dbg, Dwarf_Abbrev);
diff --git a/libdw/dwarf_sig8_hash.c b/libdw/dwarf_sig8_hash.c
index 043cac78..777f9ebc 100644
--- a/libdw/dwarf_sig8_hash.c
+++ b/libdw/dwarf_sig8_hash.c
@@ -38,4 +38,4 @@
 #define next_prime __libdwarf_next_prime
 extern size_t next_prime (size_t) attribute_hidden;
 
-#include <dynamicsizehash.c>
+#include <dynamicsizehash_concurrent.c>
diff --git a/libdw/dwarf_sig8_hash.h b/libdw/dwarf_sig8_hash.h
index 705ffbcd..c399919a 100644
--- a/libdw/dwarf_sig8_hash.h
+++ b/libdw/dwarf_sig8_hash.h
@@ -29,10 +29,15 @@
 #ifndef _DWARF_SIG8_HASH_H
 #define _DWARF_SIG8_HASH_H	1
 
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include "libdw.h"
+
 #define NAME Dwarf_Sig8_Hash
 #define TYPE struct Dwarf_CU *
-#define COMPARE(a, b) (0)
 
-#include <dynamicsizehash.h>
+#include <dynamicsizehash_concurrent.h>
 
 #endif	/* dwarf_sig8_hash.h */
diff --git a/libdw/dwarf_tag.c b/libdw/dwarf_tag.c
index 331eaa0d..d784970c 100644
--- a/libdw/dwarf_tag.c
+++ b/libdw/dwarf_tag.c
@@ -45,7 +45,7 @@ __libdw_findabbrev (struct Dwarf_CU *cu, unsigned int code)
     return DWARF_END_ABBREV;
 
   /* See whether the entry is already in the hash table.  */
-  abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code, NULL);
+  abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code);
   if (abb == NULL)
     while (cu->last_abbrev_offset != (size_t) -1l)
       {
-- 
2.24.0

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-04 17:03                         ` [PATCH] " Jonathon Anderson
@ 2019-11-07 11:07                           ` Mark Wielaard
  2019-11-07 15:25                             ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-11-07 11:07 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Mon, 2019-11-04 at 10:39 -0600, Jonathon Anderson wrote:
> Apologies, I talked with Srdan in person and forgot to relay the
> message. He gave me an updated version of the hash table that handles
> that issue (and another previously harmless typo).

This looks good. I am about to commit this, but would like to confirm
that the fixes in this change are as intended.

Looking at the difference between the previous version and this one, it
incorporates my simplification of FIND and lookup functions. And fixes
it by making insert_helper consistently return -1 when the value was
already inserted (hash == hval). And it fixes an issue where we were
using the the table entry val_ptr instead of the hashval as hash (was
that the typo? it didn't look harmless).

Sorry for being pedantic about this. But it is really tricky code, so I
want to be 100% sure all the above changes were intended.

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-07 11:07                           ` Mark Wielaard
@ 2019-11-07 15:25                             ` Jonathon Anderson
  2019-11-08 14:07                               ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-11-07 15:25 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

On Thu, Nov 7, 2019 at 12:07, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> On Mon, 2019-11-04 at 10:39 -0600, Jonathon Anderson wrote:
>>  Apologies, I talked with Srdan in person and forgot to relay the
>>  message. He gave me an updated version of the hash table that 
>> handles
>>  that issue (and another previously harmless typo).
> 
> This looks good. I am about to commit this, but would like to confirm
> that the fixes in this change are as intended.
> 
> Looking at the difference between the previous version and this one, 
> it
> incorporates my simplification of FIND and lookup functions. And fixes
> it by making insert_helper consistently return -1 when the value was
> already inserted (hash == hval). And it fixes an issue where we were
> using the the table entry val_ptr instead of the hashval as hash (was
> that the typo? it didn't look harmless).

Yep, those are the changes (plus the Sig8 patch). That typo was 
harmless because hash would be overwritten before its next use (or just 
unused), now with the (hash == hval) clause its always read so the typo 
is fixed.

> 
> Sorry for being pedantic about this. But it is really tricky code, so 
> I
> want to be 100% sure all the above changes were intended.

No worries, I would be worried if you weren't.

Regarding the Sig8 table: I took a close look, and at the moment its 
use is in an area that isn't thread-safe anyway (in particular, 
__libdw_intern_next_unit). Depending on how that is parallelized there 
might be an issue (if its just wrapped with a separate mutex a thread 
might "miss" a CU if its not already in the table), but since that 
region would need inspection at that time anyway I'm fine with either 
option.

This isn't an issue for Dwarf_Abbrev, the worst that can happen there 
is a duplicate alloc and parsing (as long as the DWARF doesn't have 
erroneous abbrev entries, if it does we would need to thread-safe the 
error handling too).

-Jonathon

> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-07 15:25                             ` Jonathon Anderson
@ 2019-11-08 14:07                               ` Mark Wielaard
  2019-11-08 15:29                                 ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-11-08 14:07 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Thu, 2019-11-07 at 09:24 -0600, Jonathon Anderson wrote:
> On Thu, Nov 7, 2019 at 12:07, Mark Wielaard <mark@klomp.org> wrote:
> > Looking at the difference between the previous version and this one, 
> > it
> > incorporates my simplification of FIND and lookup functions. And fixes
> > it by making insert_helper consistently return -1 when the value was
> > already inserted (hash == hval). And it fixes an issue where we were
> > using the the table entry val_ptr instead of the hashval as hash (was
> > that the typo? it didn't look harmless).
> 
> Yep, those are the changes (plus the Sig8 patch). That typo was 
> harmless because hash would be overwritten before its next use (or just 
> unused), now with the (hash == hval) clause its always read so the typo 
> is fixed.

Thanks for explaining. I have now finally pushed this to master.

> Regarding the Sig8 table: I took a close look, and at the moment its 
> use is in an area that isn't thread-safe anyway (in particular, 
> __libdw_intern_next_unit). Depending on how that is parallelized there 
> might be an issue (if its just wrapped with a separate mutex a thread 
> might "miss" a CU if its not already in the table), but since that 
> region would need inspection at that time anyway I'm fine with either 
> option.

I still kept the code to handle the Sig8 table with the new concurrent-
safe code, since I think it is better if we use the new code always
(even in the single threaded case).

So to fix this we do need some mutex to protect the binary search tree
when calling tsearch/tfind? Or do you see other issues too?

> This isn't an issue for Dwarf_Abbrev, the worst that can happen there 
> is a duplicate alloc and parsing (as long as the DWARF doesn't have 
> erroneous abbrev entries, if it does we would need to thread-safe the 
> error handling too).

Unfortunately we don't always control the data, so bad abbrev entries
could happen. The extra alloc wouldn't really "leak" because it would
be freed with the Dwarf. So I am not too concerned about that. Is that
the worse that can happen in __libdw_getabbrev? When we goto invalid
the Dwarf_Abbrev would indeed "leak", but it isn't really lost, it will
get cleaned up when the Dwarf is destroyed.

Thanks,

Makr

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-08 14:07                               ` Mark Wielaard
@ 2019-11-08 15:29                                 ` Jonathon Anderson
  2019-11-10 23:24                                   ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-11-08 15:29 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic

On Fri, Nov 8, 2019 at 15:07, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> On Thu, 2019-11-07 at 09:24 -0600, Jonathon Anderson wrote:
>>  On Thu, Nov 7, 2019 at 12:07, Mark Wielaard <mark@klomp.org 
>> <mailto:mark@klomp.org>> wrote:
>>  > Looking at the difference between the previous version and this 
>> one,
>>  > it
>>  > incorporates my simplification of FIND and lookup functions. And 
>> fixes
>>  > it by making insert_helper consistently return -1 when the value 
>> was
>>  > already inserted (hash == hval). And it fixes an issue where we 
>> were
>>  > using the the table entry val_ptr instead of the hashval as hash 
>> (was
>>  > that the typo? it didn't look harmless).
>> 
>>  Yep, those are the changes (plus the Sig8 patch). That typo was
>>  harmless because hash would be overwritten before its next use (or 
>> just
>>  unused), now with the (hash == hval) clause its always read so the 
>> typo
>>  is fixed.
> 
> Thanks for explaining. I have now finally pushed this to master.
> 
>>  Regarding the Sig8 table: I took a close look, and at the moment its
>>  use is in an area that isn't thread-safe anyway (in particular,
>>  __libdw_intern_next_unit). Depending on how that is parallelized 
>> there
>>  might be an issue (if its just wrapped with a separate mutex a 
>> thread
>>  might "miss" a CU if its not already in the table), but since that
>>  region would need inspection at that time anyway I'm fine with 
>> either
>>  option.
> 
> I still kept the code to handle the Sig8 table with the new 
> concurrent-
> safe code, since I think it is better if we use the new code always
> (even in the single threaded case).
> 
> So to fix this we do need some mutex to protect the binary search tree
> when calling tsearch/tfind? Or do you see other issues too?

The search tree can be handled with a mutex, the issue is with 
next_{tu,cu}_offset and the general logic of the function. As an 
example: suppose two threads look up in the Sig8 for A and see that its 
not currently present. They'll both use __libdw_intern_next_unit to 
load CUs (or units, I suppose) until they either find it or run out of 
units.

If the entirety of intern_next_unit is wrapped in a mutex, one of the 
two will load in the missing entry and finish, while the other has 
"missed" it and will keep going until no units remain. The easy 
solution is to have the threads check the table again on next_unit 
failure for whether the entry has been added, but that incurs a 
large-ish overhead for the constant reprobing. The easiest way around 
that is to add an interface on the Sig8 table that returns a "probe" on 
lookup failure that can be continued with only a handful of atomics 
(essentially saving state from the first find). The downside to this 
approach is that unit parsing is fully serialized.

If the next_*_offset is handled with a separate mutex or as an atomic 
(say, using fetch_add), the same issue occurs but without the mutex 
there's no guarantee that another thread isn't currently parsing and 
will write the entry soon, so the easy solution doesn't work. Since the 
Sig8 key is only known after the parsing is complete, we can't even 
insert a "in progress" entry. One solution is to allow for duplicate 
parsing (but then next_*_offset would have to be updated *after* 
Sig8_Hash_insert), another is to use a condition variable on whether 
all the units have been parsed (so threads that don't find what they're 
looking for would block until its certain that it doesn't exist).

Both are viable directions, but neither are trivial.

> 
>>  This isn't an issue for Dwarf_Abbrev, the worst that can happen 
>> there
>>  is a duplicate alloc and parsing (as long as the DWARF doesn't have
>>  erroneous abbrev entries, if it does we would need to thread-safe 
>> the
>>  error handling too).
> 
> Unfortunately we don't always control the data, so bad abbrev entries
> could happen. The extra alloc wouldn't really "leak" because it would
> be freed with the Dwarf. So I am not too concerned about that. Is that
> the worse that can happen in __libdw_getabbrev? When we goto invalid
> the Dwarf_Abbrev would indeed "leak", but it isn't really lost, it 
> will
> get cleaned up when the Dwarf is destroyed.

It wouldn't "leak," but it would be taking up space until the 
dwarf_end. Not that I mind (they're really small).

I'm thinking more of the case where the Abbrev_insert returns -1 (entry 
collision), in that case the new Abbrev would stick around until the 
dwarf_end.

> 
> Thanks,
> 
> Makr

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-08 15:29                                 ` Jonathon Anderson
@ 2019-11-10 23:24                                   ` Mark Wielaard
  2019-11-11 23:38                                     ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-11-10 23:24 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

[-- Attachment #1: Type: text/plain, Size: 3515 bytes --]

Hi Jonathon,

On Fri, 2019-11-08 at 09:28 -0600, Jonathon Anderson wrote:
> > So to fix this we do need some mutex to protect the binary search
> > tree when calling tsearch/tfind? Or do you see other issues too?
> 
> The search tree can be handled with a mutex, the issue is with 
> next_{tu,cu}_offset and the general logic of the function. As an 
> example: suppose two threads look up in the Sig8 for A and see that its 
> not currently present. They'll both use __libdw_intern_next_unit to 
> load CUs (or units, I suppose) until they either find it or run out of 
> units.
> 
> If the entirety of intern_next_unit is wrapped in a mutex, one of the 
> two will load in the missing entry and finish, while the other has 
> "missed" it and will keep going until no units remain. The easy 
> solution is to have the threads check the table again on next_unit 
> failure for whether the entry has been added, but that incurs a 
> large-ish overhead for the constant reprobing. The easiest way around 
> that is to add an interface on the Sig8 table that returns a "probe" on 
> lookup failure that can be continued with only a handful of atomics 
> (essentially saving state from the first find). The downside to this 
> approach is that unit parsing is fully serialized.
> 
> If the next_*_offset is handled with a separate mutex or as an atomic 
> (say, using fetch_add), the same issue occurs but without the mutex 
> there's no guarantee that another thread isn't currently parsing and 
> will write the entry soon, so the easy solution doesn't work. Since the 
> Sig8 key is only known after the parsing is complete, we can't even 
> insert a "in progress" entry. One solution is to allow for duplicate 
> parsing (but then next_*_offset would have to be updated *after* 
> Sig8_Hash_insert), another is to use a condition variable on whether 
> all the units have been parsed (so threads that don't find what they're 
> looking for would block until its certain that it doesn't exist).
> 
> Both are viable directions, but neither are trivial.

Thanks, I missed that completely. We do need to fix this, but I might
take a look at it after the next release (which I really would like to
do in about a week). It is indeed not completely trivial. Luckily
debug_types aren't widely used. But if they are used, it would be bad
if it would break a concurrent DIE reader.

> > Unfortunately we don't always control the data, so bad abbrev entries
> > could happen. The extra alloc wouldn't really "leak" because it would
> > be freed with the Dwarf. So I am not too concerned about that. Is that
> > the worse that can happen in __libdw_getabbrev? When we goto invalid
> > the Dwarf_Abbrev would indeed "leak", but it isn't really lost, it 
> > will
> > get cleaned up when the Dwarf is destroyed.
> 
> It wouldn't "leak," but it would be taking up space until the 
> dwarf_end. Not that I mind (they're really small).
> 
> I'm thinking more of the case where the Abbrev_insert returns -1 (entry 
> collision), in that case the new Abbrev would stick around until the 
> dwarf_end.

Since the memblocks are per thread, it seems we could easily back out
an allocation we don't need as long as the thread hasn't done any other
allocation from the memblock. What do you think of the attached patch?

It is a bit hard to test without a massively parallel testcase where
things collide a lot. Since you have one, could you try this out?

Thanks,

Mark

[-- Attachment #2: Type: text/x-patch, Size: 4209 bytes --]

From 24db17a8616883b553c1deffea7950b58ba780cb Mon Sep 17 00:00:00 2001
From: Mark Wielaard <mark@klomp.org>
Date: Mon, 11 Nov 2019 00:15:55 +0100
Subject: [PATCH] libdw: Introduce libdw_unalloc to stop Dwarf_Abbrev leaks.

In the case of reading an invalid abbrev or when reading an abbrev
concurrently the Dwarf_Abbrev just created might leak because it isn't
needed after all. Introduce libdw_unalloc and libdw_typed_unalloc to
unallocate such Dwarf_Abbrevs so they don't leak.

Signed-off-by: Mark Wielaard <mark@klomp.org>
---
 libdw/ChangeLog         |  9 +++++++++
 libdw/dwarf_getabbrev.c | 10 +++++++++-
 libdw/libdwP.h          | 13 +++++++++++++
 libdw/libdw_alloc.c     | 12 ++++++++++++
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index d308172b..95ac28a7 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,12 @@
+2019-11-10  Mark Wielaard  <mark@klomp.org>
+
+	* libdwP.h (libdw_unalloc): New define.
+	(libdw_typed_unalloc): Likewise.
+	(__libdw_thread_tail): New function declaration.
+	* libdw_alloc.c (__libdw_thread_tail): New function.
+	* dwarf_getabbrev.c (__libdw_getabbrev): Call libdw_typed_unalloc
+	when reading invalid data or when hash collission detected.
+
 2019-10-28  Jonathon Anderson  <jma14@rice.edu>
 
 	* libdw_alloc.c: Added __libdw_alloc_tail.
diff --git a/libdw/dwarf_getabbrev.c b/libdw/dwarf_getabbrev.c
index 7e767fc1..13bee493 100644
--- a/libdw/dwarf_getabbrev.c
+++ b/libdw/dwarf_getabbrev.c
@@ -99,6 +99,8 @@ __libdw_getabbrev (Dwarf *dbg, struct Dwarf_CU *cu, Dwarf_Off offset,
 	  /* A duplicate abbrev code at a different offset,
 	     that should never happen.  */
 	invalid:
+	  if (! foundit)
+	    libdw_typed_unalloc (dbg, Dwarf_Abbrev);
 	  __libdw_seterrno (DWARF_E_INVALID_DWARF);
 	  return NULL;
 	}
@@ -148,7 +150,13 @@ __libdw_getabbrev (Dwarf *dbg, struct Dwarf_CU *cu, Dwarf_Off offset,
 
   /* Add the entry to the hash table.  */
   if (cu != NULL && ! foundit)
-    (void) Dwarf_Abbrev_Hash_insert (&cu->abbrev_hash, abb->code, abb);
+    if (Dwarf_Abbrev_Hash_insert (&cu->abbrev_hash, abb->code, abb) == -1)
+      {
+	/* The entry was already in the table, remove the one we just
+	   created and get the one already inserted.  */
+	libdw_typed_unalloc (dbg, Dwarf_Abbrev);
+	abb = Dwarf_Abbrev_Hash_find (&cu->abbrev_hash, code);
+      }
 
  out:
   return abb;
diff --git a/libdw/libdwP.h b/libdw/libdwP.h
index 3e1ef59b..36c2acd9 100644
--- a/libdw/libdwP.h
+++ b/libdw/libdwP.h
@@ -599,10 +599,23 @@ extern void __libdw_seterrno (int value) internal_function;
 #define libdw_typed_alloc(dbg, type) \
   libdw_alloc (dbg, type, sizeof (type), 1)
 
+/* Can only be used to undo the last libdw_alloc.  */
+#define libdw_unalloc(dbg, type, tsize, cnt) \
+  ({ struct libdw_memblock *_tail = __libdw_thread_tail (dbg);		      \
+     size_t _required = (tsize) * (cnt);				      \
+     /* We cannot know the padding, it is lost.  */			      \
+     _tail->remaining += _required; })					      \
+
+#define libdw_typed_unalloc(dbg, type) \
+  libdw_unalloc (dbg, type, sizeof (type), 1)
+
 /* Callback to choose a thread-local memory allocation stack.  */
 extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
      __nonnull_attribute__ (1);
 
+extern struct libdw_memblock *__libdw_thread_tail (Dwarf* dbg)
+     __nonnull_attribute__ (1);
+
 /* Callback to allocate more.  */
 extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
      __attribute__ ((__malloc__)) __nonnull_attribute__ (1);
diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
index 0eb02c34..e0281a3d 100644
--- a/libdw/libdw_alloc.c
+++ b/libdw/libdw_alloc.c
@@ -97,6 +97,18 @@ __libdw_alloc_tail (Dwarf *dbg)
   return result;
 }
 
+/* Can only be called after a allocation for this thread has already
+   been done, to possibly undo it.  */
+struct libdw_memblock *
+__libdw_thread_tail (Dwarf *dbg)
+{
+  struct libdw_memblock *result;
+  pthread_rwlock_rdlock (&dbg->mem_rwl);
+  result = dbg->mem_tails[thread_id];
+  pthread_rwlock_unlock (&dbg->mem_rwl);
+  return result;
+}
+
 void *
 __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
 {
-- 
2.18.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-10 23:24                                   ` Mark Wielaard
@ 2019-11-11 23:38                                     ` Jonathon Anderson
  2019-11-12 21:45                                       ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-11-11 23:38 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic



On Mon, Nov 11, 2019 at 00:23, Mark Wielaard <mark@klomp.org> wrote:
> Hi Jonathon,
> 
> On Fri, 2019-11-08 at 09:28 -0600, Jonathon Anderson wrote:
>>  > So to fix this we do need some mutex to protect the binary search
>>  > tree when calling tsearch/tfind? Or do you see other issues too?
>> 
>>  The search tree can be handled with a mutex, the issue is with
>>  next_{tu,cu}_offset and the general logic of the function. As an
>>  example: suppose two threads look up in the Sig8 for A and see that 
>> its
>>  not currently present. They'll both use __libdw_intern_next_unit to
>>  load CUs (or units, I suppose) until they either find it or run out 
>> of
>>  units.
>> 
>>  If the entirety of intern_next_unit is wrapped in a mutex, one of 
>> the
>>  two will load in the missing entry and finish, while the other has
>>  "missed" it and will keep going until no units remain. The easy
>>  solution is to have the threads check the table again on next_unit
>>  failure for whether the entry has been added, but that incurs a
>>  large-ish overhead for the constant reprobing. The easiest way 
>> around
>>  that is to add an interface on the Sig8 table that returns a 
>> "probe" on
>>  lookup failure that can be continued with only a handful of atomics
>>  (essentially saving state from the first find). The downside to this
>>  approach is that unit parsing is fully serialized.
>> 
>>  If the next_*_offset is handled with a separate mutex or as an 
>> atomic
>>  (say, using fetch_add), the same issue occurs but without the mutex
>>  there's no guarantee that another thread isn't currently parsing and
>>  will write the entry soon, so the easy solution doesn't work. Since 
>> the
>>  Sig8 key is only known after the parsing is complete, we can't even
>>  insert a "in progress" entry. One solution is to allow for duplicate
>>  parsing (but then next_*_offset would have to be updated *after*
>>  Sig8_Hash_insert), another is to use a condition variable on whether
>>  all the units have been parsed (so threads that don't find what 
>> they're
>>  looking for would block until its certain that it doesn't exist).
>> 
>>  Both are viable directions, but neither are trivial.
> 
> Thanks, I missed that completely. We do need to fix this, but I might
> take a look at it after the next release (which I really would like to
> do in about a week). It is indeed not completely trivial. Luckily
> debug_types aren't widely used. But if they are used, it would be bad
> if it would break a concurrent DIE reader.
> 
>>  > Unfortunately we don't always control the data, so bad abbrev 
>> entries
>>  > could happen. The extra alloc wouldn't really "leak" because it 
>> would
>>  > be freed with the Dwarf. So I am not too concerned about that. Is 
>> that
>>  > the worse that can happen in __libdw_getabbrev? When we goto 
>> invalid
>>  > the Dwarf_Abbrev would indeed "leak", but it isn't really lost, it
>>  > will
>>  > get cleaned up when the Dwarf is destroyed.
>> 
>>  It wouldn't "leak," but it would be taking up space until the
>>  dwarf_end. Not that I mind (they're really small).
>> 
>>  I'm thinking more of the case where the Abbrev_insert returns -1 
>> (entry
>>  collision), in that case the new Abbrev would stick around until the
>>  dwarf_end.
> 
> Since the memblocks are per thread, it seems we could easily back out
> an allocation we don't need as long as the thread hasn't done any 
> other
> allocation from the memblock. What do you think of the attached patch?

Looks good to me.

> 
> It is a bit hard to test without a massively parallel testcase where
> things collide a lot. Since you have one, could you try this out?

Everything seems to work fine on this end.

(Plus, if you replace the `Abbrev_find == NULL` with `1`, you can make 
everything collide. Which I did too, and it still works fine.)

> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.
  2019-11-11 23:38                                     ` Jonathon Anderson
@ 2019-11-12 21:45                                       ` Mark Wielaard
  0 siblings, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-11-12 21:45 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi Jonathon,

On Mon, 2019-11-11 at 17:38 -0600, Jonathon Anderson wrote:
> > Since the memblocks are per thread, it seems we could easily back out
> > an allocation we don't need as long as the thread hasn't done any 
> > other
> > allocation from the memblock. What do you think of the attached patch?
> 
> Looks good to me.
> 
> > It is a bit hard to test without a massively parallel testcase where
> > things collide a lot. Since you have one, could you try this out?
> 
> Everything seems to work fine on this end.
> 
> (Plus, if you replace the `Abbrev_find == NULL` with `1`, you can make 
> everything collide. Which I did too, and it still works fine.)

O nice trick. Just build and ran a distcheck with this to be sure
things still work.

Thanks for your review and testing. Pushed to master.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/3] Add some supporting framework for C11-style atomics.
  2019-08-29 13:16               ` Mark Wielaard
  2019-08-29 13:16                 ` [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table Mark Wielaard
@ 2019-08-29 13:16                 ` Mark Wielaard
  2019-10-22 16:31                   ` Mark Wielaard
  2019-08-29 13:16                 ` [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe Mark Wielaard
  2 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-08-29 13:16 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Jonathon Anderson, Srdan Milakovic

From: Jonathon Anderson <jma14@rice.edu>

Uses the stdatomic.h provided by FreeBSD when GCC doesn't (ie. GCC < 4.9)

Signed-off-by: Jonathon Anderson <jma14@rice.edu>
Signed-off-by: SrÄ‘an MilakoviÄ‡ <sm108@rice.edu>
---
 configure.ac         |  12 ++
 lib/ChangeLog        |   6 +
 lib/Makefile.am      |   3 +-
 lib/atomics.h        |  37 ++++
 lib/stdatomic-fbsd.h | 442 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 499 insertions(+), 1 deletion(-)
 create mode 100644 lib/atomics.h
 create mode 100644 lib/stdatomic-fbsd.h

diff --git a/configure.ac b/configure.ac
index c443fa3b..b8aba460 100644
--- a/configure.ac
+++ b/configure.ac
@@ -226,6 +226,18 @@ LDFLAGS="$save_LDFLAGS"])
 AS_IF([test "x$ac_cv_tls" != xyes],
       AC_MSG_ERROR([__thread support required]))
 
+dnl Before 4.9 gcc doesn't ship stdatomic.h, but the nessesary atomics are
+dnl available by (at least) 4.7. So if the system doesn't have a stdatomic.h we
+dnl fall back on one copied from FreeBSD that handles the difference.
+AC_CACHE_CHECK([whether gcc provides stdatomic.h], ac_cv_has_stdatomic,
+  [AC_COMPILE_IFELSE([AC_LANG_SOURCE([[#include <stdatomic.h>]])],
+		     ac_cv_has_stdatomic=yes, ac_cv_has_stdatomic=no)])
+AM_CONDITIONAL(HAVE_STDATOMIC_H, test "x$ac_cv_has_stdatomic" = xyes)
+AS_IF([test "x$ac_cv_has_stdatomic" = xyes], [AC_DEFINE(HAVE_STDATOMIC_H)])
+
+AH_TEMPLATE([HAVE_STDATOMIC_H], [Define to 1 if `stdatomic.h` is provided by the
+                                 system, 0 otherwise.])
+
 dnl This test must come as early as possible after the compiler configuration
 dnl tests, because the choice of the file model can (in principle) affect
 dnl whether functions and headers are available, whether they work, etc.
diff --git a/lib/ChangeLog b/lib/ChangeLog
index 7381860c..3799c3aa 100644
--- a/lib/ChangeLog
+++ b/lib/ChangeLog
@@ -1,3 +1,9 @@
+2019-08-25  Jonathon Anderson  <jma14@rice.edu>
+
+	* stdatomic-fbsd.h: New file, taken from FreeBSD.
+	* atomics.h: New file.
+	* Makefile.am (noinst_HEADERS): Added *.h above.
+
 2019-05-03  Rosen Penev  <rosenp@gmail.com>
 
 	* color.c (parse_opt): Cast program_invocation_short_name to char *.
diff --git a/lib/Makefile.am b/lib/Makefile.am
index 36d21a07..3086cf06 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -38,7 +38,8 @@ libeu_a_SOURCES = xstrdup.c xstrndup.c xmalloc.c next_prime.c \
 		  color.c printversion.c
 
 noinst_HEADERS = fixedsizehash.h libeu.h system.h dynamicsizehash.h list.h \
-		 eu-config.h color.h printversion.h bpf.h
+		 eu-config.h color.h printversion.h bpf.h \
+		 atomics.h stdatomic-fbsd.h
 EXTRA_DIST = dynamicsizehash.c
 
 if !GPROF
diff --git a/lib/atomics.h b/lib/atomics.h
new file mode 100644
index 00000000..ffd12f87
--- /dev/null
+++ b/lib/atomics.h
@@ -0,0 +1,37 @@
+/* Conditional wrapper header for C11-style atomics.
+   Copyright (C) 2019-2019 Red Hat, Inc.
+   This file is part of elfutils.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of either
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at
+       your option) any later version
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at
+       your option) any later version
+
+   or both in parallel, as here.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+
+#if HAVE_STDATOMIC_H
+/* If possible, use the compiler's preferred atomics.  */
+# include <stdatomic.h>
+#else
+/* Otherwise, try to use the builtins provided by this compiler.  */
+# include "stdatomic-fbsd.h"
+#endif /* HAVE_STDATOMIC_H */
diff --git a/lib/stdatomic-fbsd.h b/lib/stdatomic-fbsd.h
new file mode 100644
index 00000000..49626662
--- /dev/null
+++ b/lib/stdatomic-fbsd.h
@@ -0,0 +1,442 @@
+/*-
+ * Copyright (c) 2011 Ed Schouten <ed@FreeBSD.org>
+ *                    David Chisnall <theraven@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef _STDATOMIC_H_
+#define	_STDATOMIC_H_
+
+#include <stddef.h>
+#include <stdint.h>
+
+#if !defined(__has_feature)
+#define __has_feature(x) 0
+#endif
+#if !defined(__has_builtin)
+#define __has_builtin(x) 0
+#endif
+#if !defined(__GNUC_PREREQ__)
+#if defined(__GNUC__) && defined(__GNUC_MINOR__)
+#define __GNUC_PREREQ__(maj, min)					\
+	((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
+#else
+#define __GNUC_PREREQ__(maj, min) 0
+#endif
+#endif
+
+#if !defined(__CLANG_ATOMICS) && !defined(__GNUC_ATOMICS)
+#if __has_feature(c_atomic)
+#define	__CLANG_ATOMICS
+#elif __GNUC_PREREQ__(4, 7)
+#define	__GNUC_ATOMICS
+#elif !defined(__GNUC__)
+#error "stdatomic.h does not support your compiler"
+#endif
+#endif
+
+/*
+ * language independent type to represent a Boolean value
+ */
+
+typedef int __Bool;
+
+/*
+ * 7.17.1 Atomic lock-free macros.
+ */
+
+#ifdef __GCC_ATOMIC_BOOL_LOCK_FREE
+#define	ATOMIC_BOOL_LOCK_FREE		__GCC_ATOMIC_BOOL_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR_LOCK_FREE
+#define	ATOMIC_CHAR_LOCK_FREE		__GCC_ATOMIC_CHAR_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR16_T_LOCK_FREE
+#define	ATOMIC_CHAR16_T_LOCK_FREE	__GCC_ATOMIC_CHAR16_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_CHAR32_T_LOCK_FREE
+#define	ATOMIC_CHAR32_T_LOCK_FREE	__GCC_ATOMIC_CHAR32_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_WCHAR_T_LOCK_FREE
+#define	ATOMIC_WCHAR_T_LOCK_FREE	__GCC_ATOMIC_WCHAR_T_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_SHORT_LOCK_FREE
+#define	ATOMIC_SHORT_LOCK_FREE		__GCC_ATOMIC_SHORT_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_INT_LOCK_FREE
+#define	ATOMIC_INT_LOCK_FREE		__GCC_ATOMIC_INT_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_LONG_LOCK_FREE
+#define	ATOMIC_LONG_LOCK_FREE		__GCC_ATOMIC_LONG_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_LLONG_LOCK_FREE
+#define	ATOMIC_LLONG_LOCK_FREE		__GCC_ATOMIC_LLONG_LOCK_FREE
+#endif
+#ifdef __GCC_ATOMIC_POINTER_LOCK_FREE
+#define	ATOMIC_POINTER_LOCK_FREE	__GCC_ATOMIC_POINTER_LOCK_FREE
+#endif
+
+#if !defined(__CLANG_ATOMICS)
+#define	_Atomic(T)			struct { volatile __typeof__(T) __val; }
+#endif
+
+/*
+ * 7.17.2 Initialization.
+ */
+
+#if defined(__CLANG_ATOMICS)
+#define	ATOMIC_VAR_INIT(value)		(value)
+#define	atomic_init(obj, value)		__c11_atomic_init(obj, value)
+#else
+#define	ATOMIC_VAR_INIT(value)		{ .__val = (value) }
+#define	atomic_init(obj, value)		((void)((obj)->__val = (value)))
+#endif
+
+/*
+ * Clang and recent GCC both provide predefined macros for the memory
+ * orderings.  If we are using a compiler that doesn't define them, use the
+ * clang values - these will be ignored in the fallback path.
+ */
+
+#ifndef __ATOMIC_RELAXED
+#define __ATOMIC_RELAXED		0
+#endif
+#ifndef __ATOMIC_CONSUME
+#define __ATOMIC_CONSUME		1
+#endif
+#ifndef __ATOMIC_ACQUIRE
+#define __ATOMIC_ACQUIRE		2
+#endif
+#ifndef __ATOMIC_RELEASE
+#define __ATOMIC_RELEASE		3
+#endif
+#ifndef __ATOMIC_ACQ_REL
+#define __ATOMIC_ACQ_REL		4
+#endif
+#ifndef __ATOMIC_SEQ_CST
+#define __ATOMIC_SEQ_CST		5
+#endif
+
+/*
+ * 7.17.3 Order and consistency.
+ *
+ * The memory_order_* constants that denote the barrier behaviour of the
+ * atomic operations.
+ */
+
+typedef enum {
+    memory_order_relaxed = __ATOMIC_RELAXED,
+    memory_order_consume = __ATOMIC_CONSUME,
+    memory_order_acquire = __ATOMIC_ACQUIRE,
+    memory_order_release = __ATOMIC_RELEASE,
+    memory_order_acq_rel = __ATOMIC_ACQ_REL,
+    memory_order_seq_cst = __ATOMIC_SEQ_CST
+} memory_order;
+
+/*
+ * 7.17.4 Fences.
+ */
+
+//#define __unused
+
+//static __inline void
+//atomic_thread_fence(memory_order __order __unused)
+//{
+//
+//#ifdef __CLANG_ATOMICS
+//    __c11_atomic_thread_fence(__order);
+//#elif defined(__GNUC_ATOMICS)
+//    __atomic_thread_fence(__order);
+//#else
+//    __sync_synchronize();
+//#endif
+//}
+//
+//static __inline void
+//atomic_signal_fence(memory_order __order __unused)
+//{
+//
+//#ifdef __CLANG_ATOMICS
+//    __c11_atomic_signal_fence(__order);
+//#elif defined(__GNUC_ATOMICS)
+//    __atomic_signal_fence(__order);
+//#else
+//    __asm volatile ("" ::: "memory");
+//#endif
+//}
+
+//#undef __unused
+
+/*
+ * 7.17.5 Lock-free property.
+ */
+
+#if defined(_KERNEL)
+/* Atomics in kernelspace are always lock-free. */
+#define	atomic_is_lock_free(obj) \
+	((void)(obj), (__Bool)1)
+#elif defined(__CLANG_ATOMICS)
+#define	atomic_is_lock_free(obj) \
+	__atomic_is_lock_free(sizeof(*(obj)), obj)
+#elif defined(__GNUC_ATOMICS)
+#define	atomic_is_lock_free(obj) \
+	__atomic_is_lock_free(sizeof((obj)->__val), &(obj)->__val)
+#else
+#define	atomic_is_lock_free(obj) \
+	((void)(obj), sizeof((obj)->__val) <= sizeof(void *))
+#endif
+
+/*
+ * 7.17.6 Atomic integer types.
+ */
+
+typedef _Atomic(__Bool)			atomic_bool;
+typedef _Atomic(char)			atomic_char;
+typedef _Atomic(signed char)		atomic_schar;
+typedef _Atomic(unsigned char)		atomic_uchar;
+typedef _Atomic(short)			atomic_short;
+typedef _Atomic(unsigned short)		atomic_ushort;
+typedef _Atomic(int)			atomic_int;
+typedef _Atomic(unsigned int)		atomic_uint;
+typedef _Atomic(long)			atomic_long;
+typedef _Atomic(unsigned long)		atomic_ulong;
+typedef _Atomic(long long)		atomic_llong;
+typedef _Atomic(unsigned long long)	atomic_ullong;
+#if 0
+typedef _Atomic(char16_t)		atomic_char16_t;
+typedef _Atomic(char32_t)		atomic_char32_t;
+#endif
+typedef _Atomic(wchar_t)		atomic_wchar_t;
+typedef _Atomic(int_least8_t)		atomic_int_least8_t;
+typedef _Atomic(uint_least8_t)		atomic_uint_least8_t;
+typedef _Atomic(int_least16_t)		atomic_int_least16_t;
+typedef _Atomic(uint_least16_t)		atomic_uint_least16_t;
+typedef _Atomic(int_least32_t)		atomic_int_least32_t;
+typedef _Atomic(uint_least32_t)		atomic_uint_least32_t;
+typedef _Atomic(int_least64_t)		atomic_int_least64_t;
+typedef _Atomic(uint_least64_t)		atomic_uint_least64_t;
+typedef _Atomic(int_fast8_t)		atomic_int_fast8_t;
+typedef _Atomic(uint_fast8_t)		atomic_uint_fast8_t;
+typedef _Atomic(int_fast16_t)		atomic_int_fast16_t;
+typedef _Atomic(uint_fast16_t)		atomic_uint_fast16_t;
+typedef _Atomic(int_fast32_t)		atomic_int_fast32_t;
+typedef _Atomic(uint_fast32_t)		atomic_uint_fast32_t;
+typedef _Atomic(int_fast64_t)		atomic_int_fast64_t;
+typedef _Atomic(uint_fast64_t)		atomic_uint_fast64_t;
+typedef _Atomic(intptr_t)		atomic_intptr_t;
+typedef _Atomic(uintptr_t)		atomic_uintptr_t;
+typedef _Atomic(size_t)			atomic_size_t;
+typedef _Atomic(ptrdiff_t)		atomic_ptrdiff_t;
+typedef _Atomic(intmax_t)		atomic_intmax_t;
+typedef _Atomic(uintmax_t)		atomic_uintmax_t;
+
+/*
+ * 7.17.7 Operations on atomic types.
+ */
+
+/*
+ * Compiler-specific operations.
+ */
+
+#if defined(__CLANG_ATOMICS)
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)						\
+	__c11_atomic_compare_exchange_strong(object, expected, desired,	\
+	    success, failure)
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	__c11_atomic_compare_exchange_weak(object, expected, desired,	\
+	    success, failure)
+#define	atomic_exchange_explicit(object, desired, order)		\
+	__c11_atomic_exchange(object, desired, order)
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	__c11_atomic_fetch_add(object, operand, order)
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	__c11_atomic_fetch_and(object, operand, order)
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	__c11_atomic_fetch_or(object, operand, order)
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	__c11_atomic_fetch_sub(object, operand, order)
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	__c11_atomic_fetch_xor(object, operand, order)
+#define	atomic_load_explicit(object, order)				\
+	__c11_atomic_load(object, order)
+#define	atomic_store_explicit(object, desired, order)			\
+	__c11_atomic_store(object, desired, order)
+#elif defined(__GNUC_ATOMICS)
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)						\
+	__atomic_compare_exchange_n(&(object)->__val, expected,		\
+	    desired, 0, success, failure)
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	__atomic_compare_exchange_n(&(object)->__val, expected,		\
+	    desired, 1, success, failure)
+#define	atomic_exchange_explicit(object, desired, order)		\
+	__atomic_exchange_n(&(object)->__val, desired, order)
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	__atomic_fetch_add(&(object)->__val, operand, order)
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	__atomic_fetch_and(&(object)->__val, operand, order)
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	__atomic_fetch_or(&(object)->__val, operand, order)
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	__atomic_fetch_sub(&(object)->__val, operand, order)
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	__atomic_fetch_xor(&(object)->__val, operand, order)
+#define	atomic_load_explicit(object, order)				\
+	__atomic_load_n(&(object)->__val, order)
+#define	atomic_store_explicit(object, desired, order)			\
+	__atomic_store_n(&(object)->__val, desired, order)
+#else
+#define	__atomic_apply_stride(object, operand) \
+	(((__typeof__((object)->__val))0) + (operand))
+#define	atomic_compare_exchange_strong_explicit(object, expected,	\
+    desired, success, failure)	__extension__ ({			\
+	__typeof__(expected) __ep = (expected);				\
+	__typeof__(*__ep) __e = *__ep;					\
+	(void)(success); (void)(failure);				\
+	(__Bool)((*__ep = __sync_val_compare_and_swap(&(object)->__val,	\
+	    __e, desired)) == __e);					\
+})
+#define	atomic_compare_exchange_weak_explicit(object, expected,		\
+    desired, success, failure)						\
+	atomic_compare_exchange_strong_explicit(object, expected,	\
+		desired, success, failure)
+#if __has_builtin(__sync_swap)
+/* Clang provides a full-barrier atomic exchange - use it if available. */
+#define	atomic_exchange_explicit(object, desired, order)		\
+	((void)(order), __sync_swap(&(object)->__val, desired))
+#else
+/*
+ * __sync_lock_test_and_set() is only an acquire barrier in theory (although in
+ * practice it is usually a full barrier) so we need an explicit barrier before
+ * it.
+ */
+#define	atomic_exchange_explicit(object, desired, order)		\
+__extension__ ({							\
+	__typeof__(object) __o = (object);				\
+	__typeof__(desired) __d = (desired);				\
+	(void)(order);							\
+	__sync_synchronize();						\
+	__sync_lock_test_and_set(&(__o)->__val, __d);			\
+})
+#endif
+#define	atomic_fetch_add_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_add(&(object)->__val,		\
+	    __atomic_apply_stride(object, operand)))
+#define	atomic_fetch_and_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_and(&(object)->__val, operand))
+#define	atomic_fetch_or_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_or(&(object)->__val, operand))
+#define	atomic_fetch_sub_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_sub(&(object)->__val,		\
+	    __atomic_apply_stride(object, operand)))
+#define	atomic_fetch_xor_explicit(object, operand, order)		\
+	((void)(order), __sync_fetch_and_xor(&(object)->__val, operand))
+#define	atomic_load_explicit(object, order)				\
+	((void)(order), __sync_fetch_and_add(&(object)->__val, 0))
+#define	atomic_store_explicit(object, desired, order)			\
+	((void)atomic_exchange_explicit(object, desired, order))
+#endif
+
+/*
+ * Convenience functions.
+ *
+ * Don't provide these in kernel space. In kernel space, we should be
+ * disciplined enough to always provide explicit barriers.
+ */
+
+#ifndef _KERNEL
+#define	atomic_compare_exchange_strong(object, expected, desired)	\
+	atomic_compare_exchange_strong_explicit(object, expected,	\
+	    desired, memory_order_seq_cst, memory_order_seq_cst)
+#define	atomic_compare_exchange_weak(object, expected, desired)		\
+	atomic_compare_exchange_weak_explicit(object, expected,		\
+	    desired, memory_order_seq_cst, memory_order_seq_cst)
+#define	atomic_exchange(object, desired)				\
+	atomic_exchange_explicit(object, desired, memory_order_seq_cst)
+#define	atomic_fetch_add(object, operand)				\
+	atomic_fetch_add_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_and(object, operand)				\
+	atomic_fetch_and_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_or(object, operand)				\
+	atomic_fetch_or_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_sub(object, operand)				\
+	atomic_fetch_sub_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_fetch_xor(object, operand)				\
+	atomic_fetch_xor_explicit(object, operand, memory_order_seq_cst)
+#define	atomic_load(object)						\
+	atomic_load_explicit(object, memory_order_seq_cst)
+#define	atomic_store(object, desired)					\
+	atomic_store_explicit(object, desired, memory_order_seq_cst)
+#endif /* !_KERNEL */
+
+/*
+ * 7.17.8 Atomic flag type and operations.
+ *
+ * XXX: Assume atomic_bool can be used as an atomic_flag. Is there some
+ * kind of compiler built-in type we could use?
+ */
+
+typedef struct {
+    atomic_bool	__flag;
+} atomic_flag;
+
+#define	ATOMIC_FLAG_INIT		{ ATOMIC_VAR_INIT(0) }
+
+static __inline __Bool
+atomic_flag_test_and_set_explicit(volatile atomic_flag *__object,
+                                  memory_order __order)
+{
+    return (atomic_exchange_explicit(&__object->__flag, 1, __order));
+}
+
+static __inline void
+atomic_flag_clear_explicit(volatile atomic_flag *__object, memory_order __order)
+{
+
+    atomic_store_explicit(&__object->__flag, 0, __order);
+}
+
+#ifndef _KERNEL
+static __inline __Bool
+atomic_flag_test_and_set(volatile atomic_flag *__object)
+{
+
+    return (atomic_flag_test_and_set_explicit(__object,
+                                              memory_order_seq_cst));
+}
+
+static __inline void
+atomic_flag_clear(volatile atomic_flag *__object)
+{
+
+    atomic_flag_clear_explicit(__object, memory_order_seq_cst);
+}
+#endif /* !_KERNEL */
+
+#endif /* !_STDATOMIC_H_ */
\ No newline at end of file
-- 
2.18.1

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/3] Add some supporting framework for C11-style atomics.
  2019-08-29 13:16                 ` [PATCH 1/3] Add some supporting framework for C11-style atomics Mark Wielaard
@ 2019-10-22 16:31                   ` Mark Wielaard
  0 siblings, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-10-22 16:31 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Jonathon Anderson, Srdan Milakovic

On Thu, 2019-08-29 at 15:16 +0200, Mark Wielaard wrote:
> From: Jonathon Anderson <jma14@rice.edu>
> 
> Uses the stdatomic.h provided by FreeBSD when GCC doesn't (ie. GCC <
> 4.9)
> 
> Signed-off-by: Jonathon Anderson <jma14@rice.edu>
> Signed-off-by: Srđan Milaković <sm108@rice.edu>
>
> diff --git a/lib/ChangeLog b/lib/ChangeLog
> index 7381860c..3799c3aa 100644
> --- a/lib/ChangeLog
> +++ b/lib/ChangeLog
> @@ -1,3 +1,9 @@
> +2019-08-25  Jonathon Anderson  <jma14@rice.edu>
> +
> +	* stdatomic-fbsd.h: New file, taken from FreeBSD.
> +	* atomics.h: New file.
> +	* Makefile.am (noinst_HEADERS): Added *.h above.

Thanks. I cherry-picked and rebased this and pushed to master.

Sorry this took so long, even though the change itself was fairly small
and self-contained. I wanted to play with it in context of the other
patches first, which weren't that simple.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe.
  2019-08-29 13:16               ` Mark Wielaard
  2019-08-29 13:16                 ` [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table Mark Wielaard
  2019-08-29 13:16                 ` [PATCH 1/3] Add some supporting framework for C11-style atomics Mark Wielaard
@ 2019-08-29 13:16                 ` Mark Wielaard
  2019-10-21 16:13                   ` Mark Wielaard
  2 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-08-29 13:16 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Jonathon Anderson, Srdan Milakovic

From: Jonathon Anderson <jma14@rice.edu>

Signed-off-by: Jonathon Anderson <jma14@rice.edu>
---
 libdw/ChangeLog         |  8 ++++++
 libdw/Makefile.am       |  4 +--
 libdw/dwarf_begin_elf.c | 12 ++++-----
 libdw/dwarf_end.c       |  7 ++---
 libdw/libdwP.h          | 59 ++++++++++++++++++++++++++---------------
 libdw/libdw_alloc.c     |  5 ++--
 6 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index 1d3586f0..ec809070 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,11 @@
+2019-08-26  Jonathon Anderson  <jma14@rice.edu>
+
+	* libdw_alloc.c (__libdw_allocate): Added thread-safe stack allocator.
+	* libdwP.h (Dwarf): Likewise.
+	* dwarf_begin_elf.c (dwarf_begin_elf): Support for above.
+	* dwarf_end.c (dwarf_end): Likewise.
+	* Makefile.am: Use -pthread to provide rwlocks.
+
 2019-08-25  Jonathon Anderson  <jma14@rice.edu>
 
 	* dwarf_getcfi.c (dwarf_getcfi): Set default_same_value to false.
diff --git a/libdw/Makefile.am b/libdw/Makefile.am
index 7a3d5322..ba5745f3 100644
--- a/libdw/Makefile.am
+++ b/libdw/Makefile.am
@@ -31,7 +31,7 @@ include $(top_srcdir)/config/eu.am
 if BUILD_STATIC
 AM_CFLAGS += $(fpic_CFLAGS)
 endif
-AM_CPPFLAGS += -I$(srcdir)/../libelf -I$(srcdir)/../libdwelf
+AM_CPPFLAGS += -I$(srcdir)/../libelf -I$(srcdir)/../libdwelf -pthread
 VERSION = 1
 
 lib_LIBRARIES = libdw.a
@@ -108,7 +108,7 @@ am_libdw_pic_a_OBJECTS = $(libdw_a_SOURCES:.c=.os)
 libdw_so_LIBS = libdw_pic.a ../libdwelf/libdwelf_pic.a \
 	  ../libdwfl/libdwfl_pic.a ../libebl/libebl.a
 libdw_so_DEPS = ../lib/libeu.a ../libelf/libelf.so
-libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS)
+libdw_so_LDLIBS = $(libdw_so_DEPS) -ldl -lz $(argp_LDADD) $(zip_LIBS) -pthread
 libdw_so_SOURCES =
 libdw.so$(EXEEXT): $(srcdir)/libdw.map $(libdw_so_LIBS) $(libdw_so_DEPS)
 # The rpath is necessary for libebl because its $ORIGIN use will
diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
index 38c8f5c6..f865f69c 100644
--- a/libdw/dwarf_begin_elf.c
+++ b/libdw/dwarf_begin_elf.c
@@ -397,7 +397,7 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn *scngrp)
   assert (sizeof (struct Dwarf) < mem_default_size);
 
   /* Allocate the data structure.  */
-  Dwarf *result = (Dwarf *) calloc (1, sizeof (Dwarf) + mem_default_size);
+  Dwarf *result = (Dwarf *) calloc (1, sizeof (Dwarf));
   if (unlikely (result == NULL)
       || unlikely (Dwarf_Sig8_Hash_init (&result->sig8_hash, 11) < 0))
     {
@@ -414,14 +414,12 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn *scngrp)
   result->elf = elf;
   result->alt_fd = -1;
 
-  /* Initialize the memory handling.  */
+  /* Initialize the memory handling.  Initial blocks are allocated on first
+     actual allocation.  */
   result->mem_default_size = mem_default_size;
   result->oom_handler = __libdw_oom;
-  result->mem_tail = (struct libdw_memblock *) (result + 1);
-  result->mem_tail->size = (result->mem_default_size
-			    - offsetof (struct libdw_memblock, mem));
-  result->mem_tail->remaining = result->mem_tail->size;
-  result->mem_tail->prev = NULL;
+  pthread_key_create (&result->mem_key, NULL);
+  atomic_init (&result->mem_tail, (uintptr_t)NULL);
 
   if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
     {
diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
index 29795c10..fc573cb3 100644
--- a/libdw/dwarf_end.c
+++ b/libdw/dwarf_end.c
@@ -94,14 +94,15 @@ dwarf_end (Dwarf *dwarf)
       /* And the split Dwarf.  */
       tdestroy (dwarf->split_tree, noop_free);
 
-      struct libdw_memblock *memp = dwarf->mem_tail;
-      /* The first block is allocated together with the Dwarf object.  */
-      while (memp->prev != NULL)
+      /* Free the internally allocated memory.  */
+      struct libdw_memblock *memp = (struct libdw_memblock *)dwarf->mem_tail;
+      while (memp != NULL)
 	{
 	  struct libdw_memblock *prevp = memp->prev;
 	  free (memp);
 	  memp = prevp;
 	}
+      pthread_key_delete (dwarf->mem_key);
 
       /* Free the pubnames helper structure.  */
       free (dwarf->pubnames_sets);
diff --git a/libdw/libdwP.h b/libdw/libdwP.h
index eebb7d12..ad2599eb 100644
--- a/libdw/libdwP.h
+++ b/libdw/libdwP.h
@@ -31,9 +31,11 @@
 
 #include <libintl.h>
 #include <stdbool.h>
+#include <pthread.h>
 
 #include <libdw.h>
 #include <dwarf.h>
+#include "atomics.h"
 
 
 /* gettext helper macros.  */
@@ -147,6 +149,17 @@ enum
 
 #include "dwarf_sig8_hash.h"
 
+/* Structure for internal memory handling.  This is basically a simplified
+   reimplementation of obstacks.  Unfortunately the standard obstack
+   implementation is not usable in libraries.  */
+struct libdw_memblock
+{
+  size_t size;
+  size_t remaining;
+  struct libdw_memblock *prev;
+  char mem[0];
+};
+
 /* This is the structure representing the debugging state.  */
 struct Dwarf
 {
@@ -218,16 +231,11 @@ struct Dwarf
   /* Similar for addrx/constx, which will come from .debug_addr section.  */
   struct Dwarf_CU *fake_addr_cu;
 
-  /* Internal memory handling.  This is basically a simplified
-     reimplementation of obstacks.  Unfortunately the standard obstack
-     implementation is not usable in libraries.  */
-  struct libdw_memblock
-  {
-    size_t size;
-    size_t remaining;
-    struct libdw_memblock *prev;
-    char mem[0];
-  } *mem_tail;
+  /* Internal memory handling.  Each thread allocates separately and only
+     allocates from its own blocks, while all the blocks are pushed atomically
+     onto a unified stack for easy deallocation.  */
+  pthread_key_t mem_key;
+  atomic_uintptr_t mem_tail;
 
   /* Default size of allocated memory blocks.  */
   size_t mem_default_size;
@@ -570,21 +578,28 @@ libdw_valid_user_form (int form)
 extern void __libdw_seterrno (int value) internal_function;
 
 
-/* Memory handling, the easy parts.  This macro does not do any locking.  */
+/* Memory handling, the easy parts.  This macro does not do nor need to do any
+   locking for proper concurrent operation.  */
 #define libdw_alloc(dbg, type, tsize, cnt) \
-  ({ struct libdw_memblock *_tail = (dbg)->mem_tail;			      \
-     size_t _required = (tsize) * (cnt);				      \
-     type *_result = (type *) (_tail->mem + (_tail->size - _tail->remaining));\
-     size_t _padding = ((__alignof (type)				      \
-			 - ((uintptr_t) _result & (__alignof (type) - 1)))    \
-			& (__alignof (type) - 1));			      \
-     if (unlikely (_tail->remaining < _required + _padding))		      \
-       _result = (type *) __libdw_allocate (dbg, _required, __alignof (type));\
+  ({ struct libdw_memblock *_tail = pthread_getspecific (dbg->mem_key);       \
+     size_t _req = (tsize) * (cnt);					      \
+     type *_result;							      \
+     if (unlikely (_tail == NULL))					      \
+       _result = (type *) __libdw_allocate (dbg, _req, __alignof (type));     \
      else								      \
        {								      \
-	 _required += _padding;						      \
-	 _result = (type *) ((char *) _result + _padding);		      \
-	 _tail->remaining -= _required;					      \
+	 _result = (type *) (_tail->mem + (_tail->size - _tail->remaining));  \
+	 size_t _padding = ((__alignof (type)				      \
+			    - ((uintptr_t) _result & (__alignof (type) - 1))) \
+			       & (__alignof (type) - 1));		      \
+	 if (unlikely (_tail->remaining < _req + _padding))		      \
+	   _result = (type *) __libdw_allocate (dbg, _req, __alignof (type)); \
+	 else								      \
+	   {								      \
+	     _req += _padding;						      \
+	     _result = (type *) ((char *) _result + _padding);		      \
+	     _tail->remaining -= _req;					      \
+	   }								      \
        }								      \
      _result; })
 
diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
index f1e08714..78977e54 100644
--- a/libdw/libdw_alloc.c
+++ b/libdw/libdw_alloc.c
@@ -52,8 +52,9 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
   newp->size = size - offsetof (struct libdw_memblock, mem);
   newp->remaining = (uintptr_t) newp + size - (result + minsize);
 
-  newp->prev = dbg->mem_tail;
-  dbg->mem_tail = newp;
+  newp->prev = (struct libdw_memblock*)atomic_exchange_explicit(
+      &dbg->mem_tail, (uintptr_t)newp, memory_order_relaxed);
+  pthread_setspecific(dbg->mem_key, newp);
 
   return (void *) result;
 }
-- 
2.18.1

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe.
  2019-08-29 13:16                 ` [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe Mark Wielaard
@ 2019-10-21 16:13                   ` Mark Wielaard
  2019-10-21 16:28                     ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-21 16:13 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Jonathon Anderson, Srdan Milakovic

Hi,

Finally back to this patch series.

On Thu, 2019-08-29 at 15:16 +0200, Mark Wielaard wrote:
> diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
> index 29795c10..fc573cb3 100644
> --- a/libdw/dwarf_end.c
> +++ b/libdw/dwarf_end.c
> @@ -94,14 +94,15 @@ dwarf_end (Dwarf *dwarf)
>        /* And the split Dwarf.  */
>        tdestroy (dwarf->split_tree, noop_free);
>  
> -      struct libdw_memblock *memp = dwarf->mem_tail;
> -      /* The first block is allocated together with the Dwarf object.  */
> -      while (memp->prev != NULL)
> +      /* Free the internally allocated memory.  */
> +      struct libdw_memblock *memp = (struct libdw_memblock *)dwarf->mem_tail;
> +      while (memp != NULL)
>  	{
>  	  struct libdw_memblock *prevp = memp->prev;
>  	  free (memp);
>  	  memp = prevp;
>  	}
> +      pthread_key_delete (dwarf->mem_key);
>  
>        /* Free the pubnames helper structure.  */
>        free (dwarf->pubnames_sets);

This doesn't build on older GCCs (I am using 4.8.5) with this compile error:


libdw/dwarf_end.c: In function ‘dwarf_end’:
libdw/dwarf_end.c:98:45: error: cannot convert to a pointer type
       struct libdw_memblock *memp = (struct libdw_memblock *)dwarf->mem_tail;
                                             ^

This is because mem_tail is defined as:

> diff --git a/libdw/libdwP.h b/libdw/libdwP.h
> index eebb7d12..ad2599eb 100644
> --- a/libdw/libdwP.h
> +++ b/libdw/libdwP.h
> @@ -218,16 +231,11 @@ struct Dwarf
>    /* Similar for addrx/constx, which will come from .debug_addr section.  */
>    struct Dwarf_CU *fake_addr_cu;
>  
> -  /* Internal memory handling.  This is basically a simplified
> -     reimplementation of obstacks.  Unfortunately the standard obstack
> -     implementation is not usable in libraries.  */
> -  struct libdw_memblock
> -  {
> -    size_t size;
> -    size_t remaining;
> -    struct libdw_memblock *prev;
> -    char mem[0];
> -  } *mem_tail;
> +  /* Internal memory handling.  Each thread allocates separately and only
> +     allocates from its own blocks, while all the blocks are pushed atomically
> +     onto a unified stack for easy deallocation.  */
> +  pthread_key_t mem_key;
> +  atomic_uintptr_t mem_tail;
>  
>    /* Default size of allocated memory blocks.  */
>    size_t mem_default_size;

And for older compilers, without stdatomic.h, this means atomic_uintptr_t is really:

typedef _Atomic(uintptr_t)              atomic_uintptr_t;
#define _Atomic(T)                      struct { volatile __typeof__(T) __val; }

And you cannot cast a struct to a pointer type directly.

To make this work on both older and newer gcc versions I changed this to:

diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
index fc573cb3..76ab9954 100644
--- a/libdw/dwarf_end.c
+++ b/libdw/dwarf_end.c
@@ -95,7 +95,8 @@ dwarf_end (Dwarf *dwarf)
       tdestroy (dwarf->split_tree, noop_free);
 
       /* Free the internally allocated memory.  */
-      struct libdw_memblock *memp = (struct libdw_memblock *)dwarf->mem_tail;
+      struct libdw_memblock *memp;
+      memp = (struct libdw_memblock *) atomic_load (&dwarf->mem_tail);
       while (memp != NULL)
 	{
 	  struct libdw_memblock *prevp = memp->prev;

Does that look reasonable?

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe.
  2019-10-21 16:13                   ` Mark Wielaard
@ 2019-10-21 16:28                     ` Jonathon Anderson
  2019-10-21 18:00                       ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-10-21 16:28 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Srdan Milakovic



On Mon, Oct 21, 2019 at 18:13, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> Finally back to this patch series.
> 
> On Thu, 2019-08-29 at 15:16 +0200, Mark Wielaard wrote:
>>  diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
>>  index 29795c10..fc573cb3 100644
>>  --- a/libdw/dwarf_end.c
>>  +++ b/libdw/dwarf_end.c
>>  @@ -94,14 +94,15 @@ dwarf_end (Dwarf *dwarf)
>>         /* And the split Dwarf.  */
>>         tdestroy (dwarf->split_tree, noop_free);
>> 
>>  -      struct libdw_memblock *memp = dwarf->mem_tail;
>>  -      /* The first block is allocated together with the Dwarf 
>> object.  */
>>  -      while (memp->prev != NULL)
>>  +      /* Free the internally allocated memory.  */
>>  +      struct libdw_memblock *memp = (struct libdw_memblock 
>> *)dwarf->mem_tail;
>>  +      while (memp != NULL)
>>   	{
>>   	  struct libdw_memblock *prevp = memp->prev;
>>   	  free (memp);
>>   	  memp = prevp;
>>   	}
>>  +      pthread_key_delete (dwarf->mem_key);
>> 
>>         /* Free the pubnames helper structure.  */
>>         free (dwarf->pubnames_sets);
> 
> This doesn't build on older GCCs (I am using 4.8.5) with this compile 
> error:
> 
> 
> libdw/dwarf_end.c: In function ‘dwarf_end’:
> libdw/dwarf_end.c:98:45: error: cannot convert to a pointer type
>        struct libdw_memblock *memp = (struct libdw_memblock 
> *)dwarf->mem_tail;
>                                              ^

Ah, whoops. Thanks for catching that one.

> 
> This is because mem_tail is defined as:
> 
>>  diff --git a/libdw/libdwP.h b/libdw/libdwP.h
>>  index eebb7d12..ad2599eb 100644
>>  --- a/libdw/libdwP.h
>>  +++ b/libdw/libdwP.h
>>  @@ -218,16 +231,11 @@ struct Dwarf
>>     /* Similar for addrx/constx, which will come from .debug_addr 
>> section.  */
>>     struct Dwarf_CU *fake_addr_cu;
>> 
>>  -  /* Internal memory handling.  This is basically a simplified
>>  -     reimplementation of obstacks.  Unfortunately the standard 
>> obstack
>>  -     implementation is not usable in libraries.  */
>>  -  struct libdw_memblock
>>  -  {
>>  -    size_t size;
>>  -    size_t remaining;
>>  -    struct libdw_memblock *prev;
>>  -    char mem[0];
>>  -  } *mem_tail;
>>  +  /* Internal memory handling.  Each thread allocates separately 
>> and only
>>  +     allocates from its own blocks, while all the blocks are 
>> pushed atomically
>>  +     onto a unified stack for easy deallocation.  */
>>  +  pthread_key_t mem_key;
>>  +  atomic_uintptr_t mem_tail;
>> 
>>     /* Default size of allocated memory blocks.  */
>>     size_t mem_default_size;
> 
> And for older compilers, without stdatomic.h, this means 
> atomic_uintptr_t is really:
> 
> typedef _Atomic(uintptr_t)              atomic_uintptr_t;
> #define _Atomic(T)                      struct { volatile 
> __typeof__(T) __val; }
> 
> And you cannot cast a struct to a pointer type directly.
> 
> To make this work on both older and newer gcc versions I changed this 
> to:
> 
> diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
> index fc573cb3..76ab9954 100644
> --- a/libdw/dwarf_end.c
> +++ b/libdw/dwarf_end.c
> @@ -95,7 +95,8 @@ dwarf_end (Dwarf *dwarf)
>        tdestroy (dwarf->split_tree, noop_free);
> 
>        /* Free the internally allocated memory.  */
> -      struct libdw_memblock *memp = (struct libdw_memblock 
> *)dwarf->mem_tail;
> +      struct libdw_memblock *memp;
> +      memp = (struct libdw_memblock *) atomic_load 
> (&dwarf->mem_tail);
>        while (memp != NULL)
>  	{
>  	  struct libdw_memblock *prevp = memp->prev;
> 
> Does that look reasonable?

It does, although I would prefer:

diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
index 9ca17212..6da9e0cd 100644
--- a/libdw/dwarf_end.c
+++ b/libdw/dwarf_end.c
@@ -95,7 +95,9 @@ dwarf_end (Dwarf *dwarf)
       tdestroy (dwarf->split_tree, noop_free);

       /* Free the internally allocated memory.  */
-      struct libdw_memblock *memp = (struct libdw_memblock 
*)dwarf->mem_tail;
+      struct libdw_memblock *memp;
+      memp  = (struct libdw_memblock *)atomic_load(&dwarf->mem_tail,
+                                                   
memory_order_relaxed);
       while (memp != NULL)
         {
           struct libdw_memblock *prevp = memp->prev;

Because some idiot thought making seq_cst the default was a good idea. 
And this way it notes in the code that this load is non-synchronizing.

-Jonathon

> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe.
  2019-10-21 16:28                     ` Jonathon Anderson
@ 2019-10-21 18:00                       ` Mark Wielaard
  2019-10-24 16:47                         ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-21 18:00 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Mon, 2019-10-21 at 11:28 -0500, Jonathon Anderson wrote:
> On Mon, Oct 21, 2019 at 18:13, Mark Wielaard <mark@klomp.org> wrote:
> > Does that look reasonable?
> 
> It does, although I would prefer:
> 
> diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
> index 9ca17212..6da9e0cd 100644
> --- a/libdw/dwarf_end.c
> +++ b/libdw/dwarf_end.c
> @@ -95,7 +95,9 @@ dwarf_end (Dwarf *dwarf)
>        tdestroy (dwarf->split_tree, noop_free);
> 
>        /* Free the internally allocated memory.  */
> -      struct libdw_memblock *memp = (struct libdw_memblock 
> *)dwarf->mem_tail;
> +      struct libdw_memblock *memp;
> +      memp  = (struct libdw_memblock *)atomic_load(&dwarf->mem_tail,
> +                                                   memory_order_relaxed);
>        while (memp != NULL)
>          {
>            struct libdw_memblock *prevp = memp->prev;
> 
> Because some idiot thought making seq_cst the default was a good idea. 
> And this way it notes in the code that this load is non-synchronizing.

Lets avoid the "strong" language about people. But lets see if we can
make the load less "strong" for the atomics :)

I think we cannot use the atomic_load () function, but have to use
atomic_load_explicit. So it would become:

diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
index fc573cb3..a2e94436 100644
--- a/libdw/dwarf_end.c
+++ b/libdw/dwarf_end.c
@@ -95,7 +95,10 @@ dwarf_end (Dwarf *dwarf)
       tdestroy (dwarf->split_tree, noop_free);
 
       /* Free the internally allocated memory.  */
-      struct libdw_memblock *memp = (struct libdw_memblock *)dwarf->mem_tail;
+      struct libdw_memblock *memp;
+      memp = (struct libdw_memblock *) (atomic_load_explicit
+					(&dwarf->mem_tail,
+					 memory_order_relaxed));
       while (memp != NULL)
 	{
 	  struct libdw_memblock *prevp = memp->prev;

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe.
  2019-10-21 18:00                       ` Mark Wielaard
@ 2019-10-24 16:47                         ` Mark Wielaard
  0 siblings, 0 replies; 55+ messages in thread
From: Mark Wielaard @ 2019-10-24 16:47 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Mon, 2019-10-21 at 20:00 +0200, Mark Wielaard wrote:
> I think we cannot use the atomic_load () function, but have to use
> atomic_load_explicit. So it would become:
> 
> diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
> index fc573cb3..a2e94436 100644
> --- a/libdw/dwarf_end.c
> +++ b/libdw/dwarf_end.c
> @@ -95,7 +95,10 @@ dwarf_end (Dwarf *dwarf)
>        tdestroy (dwarf->split_tree, noop_free);
>  
>        /* Free the internally allocated memory.  */
> -      struct libdw_memblock *memp = (struct libdw_memblock *)dwarf->mem_tail;
> +      struct libdw_memblock *memp;
> +      memp = (struct libdw_memblock *) (atomic_load_explicit
> +					(&dwarf->mem_tail,
> +					 memory_order_relaxed));
>        while (memp != NULL)
>  	{
>  	  struct libdw_memblock *prevp = memp->prev;

I made two more small changes to add error checking for
pthread_key_create and pthread_setspecific, even though I couldn't
trigger any of them to fail in this code, it seemed bad to just ignore
if they would fail:

diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
index f865f69c..8d137414 100644
--- a/libdw/dwarf_begin_elf.c
+++ b/libdw/dwarf_begin_elf.c
@@ -418,7 +418,12 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn *scngrp)
      actual allocation.  */
   result->mem_default_size = mem_default_size;
   result->oom_handler = __libdw_oom;
-  pthread_key_create (&result->mem_key, NULL);
+  if (pthread_key_create (&result->mem_key, NULL) != 0)
+    {
+      free (result);
+      __libdw_seterrno (DWARF_E_NOMEM); /* no memory or max pthread keys.  */
+      return NULL;
+    }
   atomic_init (&result->mem_tail, (uintptr_t)NULL);
 
   if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
index 78977e54..f2e74d18 100644
--- a/libdw/libdw_alloc.c
+++ b/libdw/libdw_alloc.c
@@ -54,7 +54,8 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
 
   newp->prev = (struct libdw_memblock*)atomic_exchange_explicit(
       &dbg->mem_tail, (uintptr_t)newp, memory_order_relaxed);
-  pthread_setspecific(dbg->mem_key, newp);
+  if (pthread_setspecific (dbg->mem_key, newp) != 0)
+    dbg->oom_handler ();
 
   return (void *) result;
 }

With that I pushed it to master. Thanks a lot for this code.

To my surprise the code was actually slightly (although almost in the
noise) faster in the single threaded cases I tested. Well done!

Cheers,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-08-27  3:52             ` Jonathon Anderson
  2019-08-29 13:16               ` Mark Wielaard
@ 2019-10-26 10:54               ` Florian Weimer
  2019-10-26 12:06                 ` Mark Wielaard
  1 sibling, 1 reply; 55+ messages in thread
From: Florian Weimer @ 2019-10-26 10:54 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: Mark Wielaard, elfutils-devel, Srdan Milakovic

* Jonathon Anderson:

> Just finished some modifications to the patch series, git request-pull 
> output below. This rebases onto the latest master and does a little 
> diff cleaning, the major change is that I swapped out the memory 
> management to use the pthread_key_* alternative mentioned before.

This use of pthread_key_* is rather unusual.  In particular, there are
only 1024 keys supported for the entire process, so it limits the
number of Dwarf * objects that can be created by one process, even if
it does not use any threads.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 10:54               ` [PATCH] libdw: add thread-safety to dwarf_getabbrev() Florian Weimer
@ 2019-10-26 12:06                 ` Mark Wielaard
  2019-10-26 16:14                   ` Florian Weimer
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-26 12:06 UTC (permalink / raw)
  To: Florian Weimer, Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

On Sat, 2019-10-26 at 12:54 +0200, Florian Weimer wrote:
> * Jonathon Anderson:
> 
> > Just finished some modifications to the patch series, git request-pull 
> > output below. This rebases onto the latest master and does a little 
> > diff cleaning, the major change is that I swapped out the memory 
> > management to use the pthread_key_* alternative mentioned before.
> 
> This use of pthread_key_* is rather unusual.  In particular, there are
> only 1024 keys supported for the entire process, so it limits the
> number of Dwarf * objects that can be created by one process, even if
> it does not use any threads.

O, I didn't know that there was such a low limit on pthread_keys. That
is indeed a bit of a problem, especially with split-dwarf where
basically every CU could be its own Dwarf object (from a .dwo file).
That would easily exceed such a low limit.

I'll see if I can create a case where that is a problem. Then we can
see how to adjust things to use less pthread_keys. Is there a different
pattern we can use?

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 12:06                 ` Mark Wielaard
@ 2019-10-26 16:14                   ` Florian Weimer
  2019-10-26 16:45                     ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Florian Weimer @ 2019-10-26 16:14 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Jonathon Anderson, elfutils-devel, Srdan Milakovic

* Mark Wielaard:

> I'll see if I can create a case where that is a problem. Then we can
> see how to adjust things to use less pthread_keys. Is there a different
> pattern we can use?

It's unclear what purpose thread-local storage serves in this context.
You already have a Dwarf *.  I would consider adding some sort of
clone function which creates a shallow Dwarf * with its own embedded
allocator or something like that.  This assumes that memory allocation
is actually a performance problem, otherwise you could let malloc
handle the details.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 16:14                   ` Florian Weimer
@ 2019-10-26 16:45                     ` Jonathon Anderson
  2019-10-26 16:50                       ` Florian Weimer
  2019-10-26 22:50                       ` Mark Wielaard
  0 siblings, 2 replies; 55+ messages in thread
From: Jonathon Anderson @ 2019-10-26 16:45 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Mark Wielaard, elfutils-devel, Srdan Milakovic

Hello Florian Weimer,

I'm the original author of this patch, so I'll try to answer what I can.

For some overall perspective, this patch replaces the original libdw 
allocator with a thread-safe variant. The original acts both as a 
suballocator (to keep from paying the malloc tax on frequent small 
allocations) and a garbage collection list (to free internal structures 
on dwarf_end). The patch attempts to replicate the same overall 
behavior in the more volatile parallel case.

On Sat, Oct 26, 2019 at 18:14, Florian Weimer <fw@deneb.enyo.de> wrote:
> * Mark Wielaard:
> 
>>  I'll see if I can create a case where that is a problem. Then we can
>>  see how to adjust things to use less pthread_keys. Is there a 
>> different
>>  pattern we can use?
> 
> It's unclear what purpose thread-local storage serves in this context.

The thread-local storage provides the suballocator side: for each 
Dwarf, each thread has its own "top block" to perform allocations from. 
To make this simple, each Dwarf has a key to give threads local storage 
specific to that Dwarf. Or at least that was the intent, I didn't think 
to consider the limit, we didn't run into it in our use cases.

There may be other ways to handle this, I'm high-level considering 
potential alternatives (with more atomics, of course). The difficulty 
is mostly in providing the same performance in the single-threaded case.

> You already have a Dwarf *.  I would consider adding some sort of
> clone function which creates a shallow Dwarf * with its own embedded
> allocator or something like that.

The downside with this is that its an API addition, which we (the 
Dyninst + HPCToolkit projects) would need to enforce. Which isn't a 
huge deal for us, but I will need to make a case to those teams to make 
the shift.

On the upside, it does provide a very understandable semantic in the 
case of parallelism. For an API without synchronization clauses, this 
would put our work back into the realm of "reasonably correct" (from 
"technically incorrect but works.")

> This assumes that memory allocation
> is actually a performance problem, otherwise you could let malloc
> handle the details.

In our (Dyninst + HPCToolkit) work, we have found that malloc tends to 
be slow in the multithreaded case, in particular with many small 
allocations. The glibc implementation (which most of our clients use) 
uses a full mutex to provide thread-safety. While we could do a lot 
better in our own projects with regards to memory management, the fact 
remains that malloc alone is a notable facet to the performance of 
libdw.

Hopefully this helps give a little light on the issue.

-Jonathon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 16:45                     ` Jonathon Anderson
@ 2019-10-26 16:50                       ` Florian Weimer
  2019-10-26 22:53                         ` Mark Wielaard
  2019-10-26 22:50                       ` Mark Wielaard
  1 sibling, 1 reply; 55+ messages in thread
From: Florian Weimer @ 2019-10-26 16:50 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: Mark Wielaard, elfutils-devel, Srdan Milakovic

* Jonathon Anderson:

>> This assumes that memory allocation
>> is actually a performance problem, otherwise you could let malloc
>> handle the details.
>
> In our (Dyninst + HPCToolkit) work, we have found that malloc tends to 
> be slow in the multithreaded case, in particular with many small 
> allocations. The glibc implementation (which most of our clients use) 
> uses a full mutex to provide thread-safety. While we could do a lot 
> better in our own projects with regards to memory management, the fact 
> remains that malloc alone is a notable facet to the performance of 
> libdw.

Current glibc versions have a thread-local fast path, which should
address some of these concerns.  It's still not a bump-pointer
allocator, but at least there are no atomics on that path.

I'm not sure if it is prudent to code around imperfections in older
allocators.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 16:50                       ` Florian Weimer
@ 2019-10-26 22:53                         ` Mark Wielaard
  2019-10-27  8:59                           ` Florian Weimer
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-26 22:53 UTC (permalink / raw)
  To: Florian Weimer, Jonathon Anderson; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Sat, 2019-10-26 at 18:50 +0200, Florian Weimer wrote:
> * Jonathon Anderson:
> 
> > > This assumes that memory allocation
> > > is actually a performance problem, otherwise you could let malloc
> > > handle the details.
> > 
> > In our (Dyninst + HPCToolkit) work, we have found that malloc tends to 
> > be slow in the multithreaded case, in particular with many small 
> > allocations. The glibc implementation (which most of our clients use) 
> > uses a full mutex to provide thread-safety. While we could do a lot 
> > better in our own projects with regards to memory management, the fact 
> > remains that malloc alone is a notable facet to the performance of 
> > libdw.
> 
> Current glibc versions have a thread-local fast path, which should
> address some of these concerns.  It's still not a bump-pointer
> allocator, but at least there are no atomics on that path.

Since which version of glibc is there a thread-local fast path?

We don't need a bump-pointer allocator, but we do need something to
store lots of small object allocated over time together, so we can
easily dispose of them when done. The storage size only has to grow.

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 22:53                         ` Mark Wielaard
@ 2019-10-27  8:59                           ` Florian Weimer
  2019-10-27 18:11                             ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Florian Weimer @ 2019-10-27  8:59 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Jonathon Anderson, elfutils-devel, Srdan Milakovic

* Mark Wielaard:

>> Current glibc versions have a thread-local fast path, which should
>> address some of these concerns.  It's still not a bump-pointer
>> allocator, but at least there are no atomics on that path.
>
> Since which version of glibc is there a thread-local fast path?

It was added in:

commit d5c3fafc4307c9b7a4c7d5cb381fcdbfad340bcc
Author: DJ Delorie <dj@delorie.com>
Date:   Thu Jul 6 13:37:30 2017 -0400

    Add per-thread cache to malloc

So glibc 2.26.  But it is a build-time option, enabled by default, but
it can be switched off by distributions.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-27  8:59                           ` Florian Weimer
@ 2019-10-27 18:11                             ` Jonathon Anderson
  2019-10-27 18:44                               ` Florian Weimer
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-10-27 18:11 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Mark Wielaard, elfutils-devel, Srdan Milakovic

On Sun, Oct 27, 2019 at 09:59, Florian Weimer <fw@deneb.enyo.de> wrote:
> * Mark Wielaard:
> 
>>>  Current glibc versions have a thread-local fast path, which should
>>>  address some of these concerns.  It's still not a bump-pointer
>>>  allocator, but at least there are no atomics on that path.
>> 
>>  Since which version of glibc is there a thread-local fast path?
> 
> It was added in:
> 
> commit d5c3fafc4307c9b7a4c7d5cb381fcdbfad340bcc
> Author: DJ Delorie <dj@delorie.com <mailto:dj@delorie.com>>
> Date:   Thu Jul 6 13:37:30 2017 -0400
> 
>     Add per-thread cache to malloc
> 
> So glibc 2.26.  But it is a build-time option, enabled by default, but
> it can be switched off by distributions.

I doubt any non-mobile distros would disable it, the cost seems fairly 
small.

My main concern is that it seems like chunks will only enter the 
thread-local cache in the presence of free()s (since they have to enter 
the "smallbins" or "fastbins" first, and those two at a glance seem to 
be filled very lazily or on free()); since the free()s are all on 
dwarf_end this would pose an issue. I could also be entirely mistaken, 
glibc is by no means a simple piece of code.

According to the comments, there might also be a 16 byte overhead per 
allocation, which would explode the small allocations considerably.

-Jonathon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-27 18:11                             ` Jonathon Anderson
@ 2019-10-27 18:44                               ` Florian Weimer
  0 siblings, 0 replies; 55+ messages in thread
From: Florian Weimer @ 2019-10-27 18:44 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: Mark Wielaard, elfutils-devel, Srdan Milakovic

* Jonathon Anderson:

> On Sun, Oct 27, 2019 at 09:59, Florian Weimer <fw@deneb.enyo.de> wrote:
>> * Mark Wielaard:
>> 
>>>>  Current glibc versions have a thread-local fast path, which should
>>>>  address some of these concerns.  It's still not a bump-pointer
>>>>  allocator, but at least there are no atomics on that path.
>>> 
>>>  Since which version of glibc is there a thread-local fast path?
>> 
>> It was added in:
>> 
>> commit d5c3fafc4307c9b7a4c7d5cb381fcdbfad340bcc
>> Author: DJ Delorie <dj@delorie.com <mailto:dj@delorie.com>>
>> Date:   Thu Jul 6 13:37:30 2017 -0400
>> 
>>     Add per-thread cache to malloc
>> 
>> So glibc 2.26.  But it is a build-time option, enabled by default, but
>> it can be switched off by distributions.
>
> I doubt any non-mobile distros would disable it, the cost seems fairly 
> small.

It increases fragmentation.  Vmware's Photon distribution disables it.

> My main concern is that it seems like chunks will only enter the 
> thread-local cache in the presence of free()s (since they have to enter 
> the "smallbins" or "fastbins" first, and those two at a glance seem to 
> be filled very lazily or on free()); since the free()s are all on 
> dwarf_end this would pose an issue. I could also be entirely mistaken, 
> glibc is by no means a simple piece of code.

No, there is a prefill step if the cache is empty, where the cache is
populated with one arena allocation which is then split up.

> According to the comments, there might also be a 16 byte overhead per 
> allocation, which would explode the small allocations considerably.

Available allocatable sizes in bytes are congruent 8 modulo (16), and
the smallest allocatable size is 24.  In general, the overhead is
8 bytes.  (All numbers are for 64-bit architectures.)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 16:45                     ` Jonathon Anderson
  2019-10-26 16:50                       ` Florian Weimer
@ 2019-10-26 22:50                       ` Mark Wielaard
  2019-10-27  0:56                         ` Jonathon Anderson
  1 sibling, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-26 22:50 UTC (permalink / raw)
  To: Jonathon Anderson, Florian Weimer; +Cc: elfutils-devel, Srdan Milakovic

Hi,

On Sat, 2019-10-26 at 11:45 -0500, Jonathon Anderson wrote:
> For some overall perspective, this patch replaces the original libdw 
> allocator with a thread-safe variant. The original acts both as a 
> suballocator (to keep from paying the malloc tax on frequent small 
> allocations) and a garbage collection list (to free internal structures 
> on dwarf_end). The patch attempts to replicate the same overall 
> behavior in the more volatile parallel case.

That is a nice description. Basically it is a little obstack
implementation. There are a lot of small allocations which we want to
store together and free together when the Dwarf object is destroyed. 

The allocations (and parsing of DWARF structures) is done lazily. So
you only pay when you are actually using the data. e.g. if you skip a
DIE (subtree) or CU no parsing or allocations are done.

For example when parsing all of the linux kernel debug data we are
talking about ~535000 allocations, a bit less than half (~233000) are
of the same small size, 24bytes.

> On Sat, Oct 26, 2019 at 18:14, Florian Weimer <fw@deneb.enyo.de> wrote:
> > * Mark Wielaard:
> > 
> > >  I'll see if I can create a case where that is a problem. Then we can
> > >  see how to adjust things to use less pthread_keys. Is there a 
> > > different
> > >  pattern we can use?
> > 
> > It's unclear what purpose thread-local storage serves in this context.
> 
> The thread-local storage provides the suballocator side: for each 
> Dwarf, each thread has its own "top block" to perform allocations from. 
> To make this simple, each Dwarf has a key to give threads local storage 
> specific to that Dwarf. Or at least that was the intent, I didn't think 
> to consider the limit, we didn't run into it in our use cases.

I see that getconf PTHREAD_KEYS_MAX gives 1024 on my machine.
Is this tunable in any way?

> There may be other ways to handle this, I'm high-level considering 
> potential alternatives (with more atomics, of course). The difficulty 
> is mostly in providing the same performance in the single-threaded case.
> 
> > You already have a Dwarf *.  I would consider adding some sort of
> > clone function which creates a shallow Dwarf * with its own embedded
> > allocator or something like that.
> 
> The downside with this is that its an API addition, which we (the 
> Dyninst + HPCToolkit projects) would need to enforce. Which isn't a 
> huge deal for us, but I will need to make a case to those teams to make 
> the shift.
> 
> On the upside, it does provide a very understandable semantic in the 
> case of parallelism. For an API without synchronization clauses, this 
> would put our work back into the realm of "reasonably correct" (from 
> "technically incorrect but works.")

Could someone give an example of this pattern?
I don't fully understand what is being proposed and how the interface
would look exactly.

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-26 22:50                       ` Mark Wielaard
@ 2019-10-27  0:56                         ` Jonathon Anderson
  2019-10-28 13:26                           ` Mark Wielaard
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathon Anderson @ 2019-10-27  0:56 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Florian Weimer, elfutils-devel, Srdan Milakovic



On Sun, Oct 27, 2019 at 00:50, Mark Wielaard <mark@klomp.org> wrote:
> Hi,
> 
> On Sat, 2019-10-26 at 11:45 -0500, Jonathon Anderson wrote:
>>  For some overall perspective, this patch replaces the original libdw
>>  allocator with a thread-safe variant. The original acts both as a
>>  suballocator (to keep from paying the malloc tax on frequent small
>>  allocations) and a garbage collection list (to free internal 
>> structures
>>  on dwarf_end). The patch attempts to replicate the same overall
>>  behavior in the more volatile parallel case.
> 
> That is a nice description. Basically it is a little obstack
> implementation. There are a lot of small allocations which we want to
> store together and free together when the Dwarf object is destroyed.
> 
> The allocations (and parsing of DWARF structures) is done lazily. So
> you only pay when you are actually using the data. e.g. if you skip a
> DIE (subtree) or CU no parsing or allocations are done.
> 
> For example when parsing all of the linux kernel debug data we are
> talking about ~535000 allocations, a bit less than half (~233000) are
> of the same small size, 24bytes.
> 
>>  On Sat, Oct 26, 2019 at 18:14, Florian Weimer <fw@deneb.enyo.de 
>> <mailto:fw@deneb.enyo.de>> wrote:
>>  > * Mark Wielaard:
>>  >
>>  > >  I'll see if I can create a case where that is a problem. Then 
>> we can
>>  > >  see how to adjust things to use less pthread_keys. Is there a
>>  > > different
>>  > >  pattern we can use?
>>  >
>>  > It's unclear what purpose thread-local storage serves in this 
>> context.
>> 
>>  The thread-local storage provides the suballocator side: for each
>>  Dwarf, each thread has its own "top block" to perform allocations 
>> from.
>>  To make this simple, each Dwarf has a key to give threads local 
>> storage
>>  specific to that Dwarf. Or at least that was the intent, I didn't 
>> think
>>  to consider the limit, we didn't run into it in our use cases.
> 
> I see that getconf PTHREAD_KEYS_MAX gives 1024 on my machine.
> Is this tunable in any way?

 From what I can tell, no. A quick google search indicates as such, and 
its even #defined as 1024 on my machine.

> 
>>  There may be other ways to handle this, I'm high-level considering
>>  potential alternatives (with more atomics, of course). The 
>> difficulty
>>  is mostly in providing the same performance in the single-threaded 
>> case.
>> 
>>  > You already have a Dwarf *.  I would consider adding some sort of
>>  > clone function which creates a shallow Dwarf * with its own 
>> embedded
>>  > allocator or something like that.
>> 
>>  The downside with this is that its an API addition, which we (the
>>  Dyninst + HPCToolkit projects) would need to enforce. Which isn't a
>>  huge deal for us, but I will need to make a case to those teams to 
>> make
>>  the shift.
>> 
>>  On the upside, it does provide a very understandable semantic in the
>>  case of parallelism. For an API without synchronization clauses, 
>> this
>>  would put our work back into the realm of "reasonably correct" (from
>>  "technically incorrect but works.")
> 
> Could someone give an example of this pattern?
> I don't fully understand what is being proposed and how the interface
> would look exactly.

An application would do something along these lines:

Dwarf* dbg = dwarf_begin(...);
Dwarf* dbg2 = dwarf_clone(dbg, ...);
pthread_create(worker, ...);
// ...
dwarf_get_units(dbg, ...);
// ...
pthread_join(worker);
dwarf_end(dbg);

// worker:
// ...
dwarf_getabbrev(...);
// ...
dwarf_end(dbg2);

The idea being that dbg2 and dbg share most of the same internal state, 
but concurrent access to said state is between Dwarfs (or 
"Dwarf_Views", maybe?), and the state is cleaned up on the original's 
dwarf_end. I suppose in database terms the Dwarfs are acting like 
separate "cursors" for the internal DWARF data. For this particular 
instance, the "top of stack" pointers would be in dbg and dbg2 (the 
non-shared state), while the atomic mem_tail would be part of the 
internal (shared) state.

I'm not sure how viable implementing this sort of thing would be, it 
might end up overhauling a lot of internals, and I'm not familiar 
enough with all the components of the API to know whether there would 
be some quirks with this style. But at least then the blanket "Dwarfs 
must be externally synchronized (all operations issued in serial)" 
implicit clause doesn't limit the parallelism at the API level. And 
those of us who don't follow that rule wouldn't have to walk on 
eggshells to avoid segfaulting.

> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-27  0:56                         ` Jonathon Anderson
@ 2019-10-28 13:26                           ` Mark Wielaard
  2019-10-28 15:32                             ` Jonathon Anderson
  0 siblings, 1 reply; 55+ messages in thread
From: Mark Wielaard @ 2019-10-28 13:26 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: Florian Weimer, elfutils-devel, Srdan Milakovic

On Sat, 2019-10-26 at 19:56 -0500, Jonathon Anderson wrote:
> On Sun, Oct 27, 2019 at 00:50, Mark Wielaard <mark@klomp.org> wrote:
> > 
> > I see that getconf PTHREAD_KEYS_MAX gives 1024 on my machine.
> > Is this tunable in any way?
> 
>  From what I can tell, no. A quick google search indicates as such,
> and its even #defined as 1024 on my machine.

I see, it is a hardcoded constant per architecture, but it seems every
architecture simply uses 1024. I am afraid that kind of rules out
having a pthread_key per Dwarf object. It is not that large a number.

Programs are sometimes linked against 50 till 100 shared libraries, if
they use dwz/alt-files that means the potential open Dwarf objects is
several hundred already. It wouldn't be that crazy to have all of them
open at the same time. That might not reach the limit yet, but I think
in practice you could come close to half very easily. And with split-
dwarf every CU basically turns into a Dwarf object, which can easily go
past 1024.

> >  There may be other ways to handle this, I'm high-level considering
> > >  potential alternatives (with more atomics, of course). The 
> > > difficulty
> > >  is mostly in providing the same performance in the single-threaded 
> > > case.
> > > 
> > >  > You already have a Dwarf *.  I would consider adding some sort of
> > >  > clone function which creates a shallow Dwarf * with its own 
> > > embedded
> > >  > allocator or something like that.
> > > 
> > >  The downside with this is that its an API addition, which we (the
> > >  Dyninst + HPCToolkit projects) would need to enforce. Which isn't a
> > >  huge deal for us, but I will need to make a case to those teams to 
> > > make
> > >  the shift.
> > > 
> > >  On the upside, it does provide a very understandable semantic in the
> > >  case of parallelism. For an API without synchronization clauses, 
> > > this
> > >  would put our work back into the realm of "reasonably correct" (from
> > >  "technically incorrect but works.")
> > 
> > Could someone give an example of this pattern?
> > I don't fully understand what is being proposed and how the interface
> > would look exactly.
> 
> An application would do something along these lines:
> 
> Dwarf* dbg = dwarf_begin(...);
> Dwarf* dbg2 = dwarf_clone(dbg, ...);
> pthread_create(worker, ...);
> // ...
> dwarf_get_units(dbg, ...);
> // ...
> pthread_join(worker);
> dwarf_end(dbg);
> 
> // worker:
> // ...
> dwarf_getabbrev(...);
> // ...
> dwarf_end(dbg2);
> 
> The idea being that dbg2 and dbg share most of the same internal state, 
> but concurrent access to said state is between Dwarfs (or 
> "Dwarf_Views", maybe?), and the state is cleaned up on the original's 
> dwarf_end. I suppose in database terms the Dwarfs are acting like 
> separate "cursors" for the internal DWARF data. For this particular 
> instance, the "top of stack" pointers would be in dbg and dbg2 (the 
> non-shared state), while the atomic mem_tail would be part of the 
> internal (shared) state.
> 
> I'm not sure how viable implementing this sort of thing would be, it 
> might end up overhauling a lot of internals, and I'm not familiar 
> enough with all the components of the API to know whether there would 
> be some quirks with this style.

So they would have separate lazy DWARF DIE/abbrev readers and separate
allocators? And any abbrevs read in the clone would just be thrown away
after a dwarf_end?

In general I think this style is not as nice as having a shared state
of the Dwarf object that is only manipulated in a thread-safe manner.
You did have an earlier implementation that didn't use pthread_keys.
Maybe we should just fall back to that one?

Thanks,

Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()
  2019-10-28 13:26                           ` Mark Wielaard
@ 2019-10-28 15:32                             ` Jonathon Anderson
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathon Anderson @ 2019-10-28 15:32 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Florian Weimer, elfutils-devel, Srdan Milakovic



On Mon, Oct 28, 2019 at 14:26, Mark Wielaard <mark@klomp.org> wrote:
> On Sat, 2019-10-26 at 19:56 -0500, Jonathon Anderson wrote:
>>  On Sun, Oct 27, 2019 at 00:50, Mark Wielaard <mark@klomp.org 
>> <mailto:mark@klomp.org>> wrote:
>>  >
>>  > I see that getconf PTHREAD_KEYS_MAX gives 1024 on my machine.
>>  > Is this tunable in any way?
>> 
>>   From what I can tell, no. A quick google search indicates as such,
>>  and its even #defined as 1024 on my machine.
> 
> I see, it is a hardcoded constant per architecture, but it seems every
> architecture simply uses 1024. I am afraid that kind of rules out
> having a pthread_key per Dwarf object. It is not that large a number.
> 
> Programs are sometimes linked against 50 till 100 shared libraries, if
> they use dwz/alt-files that means the potential open Dwarf objects is
> several hundred already. It wouldn't be that crazy to have all of them
> open at the same time. That might not reach the limit yet, but I think
> in practice you could come close to half very easily. And with split-
> dwarf every CU basically turns into a Dwarf object, which can easily 
> go
> past 1024.
> 
>>  >  There may be other ways to handle this, I'm high-level 
>> considering
>>  > >  potential alternatives (with more atomics, of course). The
>>  > > difficulty
>>  > >  is mostly in providing the same performance in the 
>> single-threaded
>>  > > case.
>>  > >
>>  > >  > You already have a Dwarf *.  I would consider adding some 
>> sort of
>>  > >  > clone function which creates a shallow Dwarf * with its own
>>  > > embedded
>>  > >  > allocator or something like that.
>>  > >
>>  > >  The downside with this is that its an API addition, which we 
>> (the
>>  > >  Dyninst + HPCToolkit projects) would need to enforce. Which 
>> isn't a
>>  > >  huge deal for us, but I will need to make a case to those 
>> teams to
>>  > > make
>>  > >  the shift.
>>  > >
>>  > >  On the upside, it does provide a very understandable semantic 
>> in the
>>  > >  case of parallelism. For an API without synchronization 
>> clauses,
>>  > > this
>>  > >  would put our work back into the realm of "reasonably correct" 
>> (from
>>  > >  "technically incorrect but works.")
>>  >
>>  > Could someone give an example of this pattern?
>>  > I don't fully understand what is being proposed and how the 
>> interface
>>  > would look exactly.
>> 
>>  An application would do something along these lines:
>> 
>>  Dwarf* dbg = dwarf_begin(...);
>>  Dwarf* dbg2 = dwarf_clone(dbg, ...);
>>  pthread_create(worker, ...);
>>  // ...
>>  dwarf_get_units(dbg, ...);
>>  // ...
>>  pthread_join(worker);
>>  dwarf_end(dbg);
>> 
>>  // worker:
>>  // ...
>>  dwarf_getabbrev(...);
>>  // ...
>>  dwarf_end(dbg2);
>> 
>>  The idea being that dbg2 and dbg share most of the same internal 
>> state,
>>  but concurrent access to said state is between Dwarfs (or
>>  "Dwarf_Views", maybe?), and the state is cleaned up on the 
>> original's
>>  dwarf_end. I suppose in database terms the Dwarfs are acting like
>>  separate "cursors" for the internal DWARF data. For this particular
>>  instance, the "top of stack" pointers would be in dbg and dbg2 (the
>>  non-shared state), while the atomic mem_tail would be part of the
>>  internal (shared) state.
>> 
>>  I'm not sure how viable implementing this sort of thing would be, it
>>  might end up overhauling a lot of internals, and I'm not familiar
>>  enough with all the components of the API to know whether there 
>> would
>>  be some quirks with this style.
> 
> So they would have separate lazy DWARF DIE/abbrev readers and separate
> allocators? And any abbrevs read in the clone would just be thrown 
> away
> after a dwarf_end?

Separate allocators but the same lazy DIE/abbrev readers, most likely. 
So a Dwarf would be split into the concurrent Dwarf_Shared and 
non-concurrent Dwarf "View", maybe something like:

struct Dwarf_Shared
{
  pthread_rwlock_t rwl; /* For all currently non-concurrent internals */
  Elf *elf;
  char *debugdir;
  Dwarf *alt_dwarf;
  Elf_Data *sectiondata[IDX_last];
  bool other_byte_order;
  bool free_elf;
  int alt_fd;
  struct pubnames_s
  {
    Dwarf_Off cu_offset;
    Dwarf_Off set_start;
    unsigned int cu_header_size;
    int address_len;
  } *pubnames_sets;
  size_t pubnames_nsets;
  void *cu_tree;
  /* Dwarf_Off next_cu_offset; // Moved to Dwarf View */
  void *tu_tree;
  /* Dwarf_Off next_tu_offset; // Moved to Dwarf View */
  Dwarf_Sig8_Hash sig8_hash;
  void *split_tree;
  void *macro_ops;
  void *files_lines;
  Dwarf_Aranges *aranges;
  struct Dwarf_CFI_s *cfi;
  struct Dwarf_CU *fake_loc_cu;
  struct Dwarf_CU *fake_loclists_cu;
  struct Dwarf_CU *fake_addr_cu;
  /* pthread_key_t mem_key; // Implemented differently */
  atomic_uintptr_t shared_mem_tail;
  size_t mem_default_size;
  /* Dwarf_OOM oom_handler; // Moved to Dwarf View, maybe */
};
struct Dwarf /*View*/ {
  bool free_shared; /* If true, we handle cleaning up on dwarf_end */
  struct Dwarf_Shared *shared; /* All cloned() Views share the same 
Shared in the back */
  Dwarf_Off next_cu_offset;
  Dwarf_Off next_tu_offset;
  struct libdw_memblock *mem_tail;
  Dwarf_OOM oom_handler;
};

So most everything is in Dwarf_Shared, and the bits that really only 
make sense when done in serial are part of the View. And then 
allocations are done from a View-local stack, while everything is 
pushed as is now onto the Shared stack for deallocation.

> 
> In general I think this style is not as nice as having a shared state
> of the Dwarf object that is only manipulated in a thread-safe manner.

I tend to prefer styles like this only because they have a clean split 
to define what can and cannot be concurrent, even logically (What makes 
sense when iterating CUs from multiple threads? Does the OOM handler 
have to be thread-safe?). Otherwise you end up trying to answer an N^2 
problem for how all the moving pieces interact (and it often requires 
knowledge of the code to know which pieces do interact).

> You did have an earlier implementation that didn't use pthread_keys.
> Maybe we should just fall back to that one?

I can revive and rebase it, its on my list for doing ASAP after my 
other more pressing duties. It still has quirks that bother me a little 
(like not doing well without a thread pool), but it'll work fine.

> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2019-11-12 21:45 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-16 19:24 [PATCH] libdw: add thread-safety to dwarf_getabbrev() Jonathon Anderson
2019-08-21 11:16 ` Mark Wielaard
2019-08-21 14:21   ` Jonathon Anderson
2019-08-23 21:22     ` Mark Wielaard
     [not found]   ` <1566396518.5389.0@smtp.mail.rice.edu>
2019-08-23 18:25     ` Mark Wielaard
2019-08-23 22:36       ` Jonathon Anderson
2019-08-21 21:50 ` Mark Wielaard
2019-08-21 22:01   ` Mark Wielaard
2019-08-21 22:21   ` Jonathon Anderson
2019-08-23 21:26     ` Mark Wielaard
2019-08-24 23:24 ` Mark Wielaard
2019-08-25  1:11   ` Jonathon Anderson
2019-08-25 10:05     ` Mark Wielaard
2019-08-26  1:25       ` Jonathon Anderson
2019-08-26 13:18         ` Mark Wielaard
2019-08-26 13:37           ` Jonathon Anderson
2019-08-27  3:52             ` Jonathon Anderson
2019-08-29 13:16               ` Mark Wielaard
2019-08-29 13:16                 ` [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table Mark Wielaard
2019-10-25 23:50                   ` Mark Wielaard
2019-10-26  4:11                     ` Jonathon Anderson
2019-10-27 16:13                       ` Mark Wielaard
2019-10-27 17:49                         ` Jonathon Anderson
2019-10-28 14:08                           ` Mark Wielaard
2019-10-28 20:12                         ` Mark Wielaard
2019-11-04 16:21                           ` Mark Wielaard
2019-11-04 16:19                       ` Mark Wielaard
2019-11-04 17:03                         ` [PATCH] " Jonathon Anderson
2019-11-07 11:07                           ` Mark Wielaard
2019-11-07 15:25                             ` Jonathon Anderson
2019-11-08 14:07                               ` Mark Wielaard
2019-11-08 15:29                                 ` Jonathon Anderson
2019-11-10 23:24                                   ` Mark Wielaard
2019-11-11 23:38                                     ` Jonathon Anderson
2019-11-12 21:45                                       ` Mark Wielaard
2019-08-29 13:16                 ` [PATCH 1/3] Add some supporting framework for C11-style atomics Mark Wielaard
2019-10-22 16:31                   ` Mark Wielaard
2019-08-29 13:16                 ` [PATCH 2/3] libdw: Rewrite the memory handler to be thread-safe Mark Wielaard
2019-10-21 16:13                   ` Mark Wielaard
2019-10-21 16:28                     ` Jonathon Anderson
2019-10-21 18:00                       ` Mark Wielaard
2019-10-24 16:47                         ` Mark Wielaard
2019-10-26 10:54               ` [PATCH] libdw: add thread-safety to dwarf_getabbrev() Florian Weimer
2019-10-26 12:06                 ` Mark Wielaard
2019-10-26 16:14                   ` Florian Weimer
2019-10-26 16:45                     ` Jonathon Anderson
2019-10-26 16:50                       ` Florian Weimer
2019-10-26 22:53                         ` Mark Wielaard
2019-10-27  8:59                           ` Florian Weimer
2019-10-27 18:11                             ` Jonathon Anderson
2019-10-27 18:44                               ` Florian Weimer
2019-10-26 22:50                       ` Mark Wielaard
2019-10-27  0:56                         ` Jonathon Anderson
2019-10-28 13:26                           ` Mark Wielaard
2019-10-28 15:32                             ` Jonathon Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).