public inbox for elfutils@sourceware.org
* [PATCH 0/2] libdw: Rewrite the memory handler to be more robust
@ 2019-10-29 18:55 Jonathon Anderson
  2019-10-29 20:17 ` Mark Wielaard
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathon Anderson @ 2019-10-29 18:55 UTC
  To: Mark Wielaard; +Cc: elfutils-devel

Hello,

This is a (revived and rebased) version of the libdw memory manager 
that isn't affected by the PTHREAD_KEYS_MAX limit. There are some 
downsides; in particular, if an application spawns many short-lived 
threads that all touch a Dwarf (enough to cause an allocation), there 
is about 8N bytes of memory overhead (N being the number of threads).

The first patch is optional; it adds some configure-time options for 
Valgrind annotation support (although I'm not a serious autotools 
user, so it might need some work).

-Jonathon

----------------------------------------------------------------
Jonathon Anderson (2):
      Add configure options for Valgrind annotations.
      libdw: Rewrite the memory handler to be more robust.

 ChangeLog               |  5 +++++
 configure.ac            | 30 ++++++++++++++++++++++++++++++
 lib/atomics.h           |  2 ++
 libdw/ChangeLog         |  9 +++++++++
 libdw/dwarf_begin_elf.c |  7 ++++---
 libdw/dwarf_end.c       | 24 +++++++++++++-----------
 libdw/libdwP.h          | 67 ++++++++++++++++++++++++++++++++-----------------------------------
 libdw/libdw_alloc.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 8 files changed, 160 insertions(+), 53 deletions(-)


* Re: [PATCH 0/2] libdw: Rewrite the memory handler to be more robust
  2019-10-29 18:55 [PATCH 0/2] libdw: Rewrite the memory handler to be more robust Jonathon Anderson
@ 2019-10-29 20:17 ` Mark Wielaard
  2019-10-29 20:22   ` Jonathon Anderson
  0 siblings, 1 reply; 12+ messages in thread
From: Mark Wielaard @ 2019-10-29 20:17 UTC
  To: Jonathon Anderson; +Cc: elfutils-devel

Hi Jonathon,

On Tue, Oct 29, 2019 at 01:55:25PM -0500, Jonathon Anderson wrote:
> This is a (revived and rebased) version of the libdw memory manager
> that isn't affected by the PTHREAD_KEYS_MAX limit. There are some
> downsides; in particular, if an application spawns many short-lived
> threads that all touch a Dwarf (enough to cause an allocation), there
> is about 8N bytes of memory overhead (N being the number of threads).

Thanks. But it looks like your mail client munged the patches a bit,
making them tricky to apply. Could you resend them using git
send-email, or do you have a public repo I could get them from?

Thanks,

Mark


* Re: [PATCH 0/2] libdw: Rewrite the memory handler to be more robust
  2019-10-29 20:17 ` Mark Wielaard
@ 2019-10-29 20:22   ` Jonathon Anderson
  2019-10-29 21:15     ` [PATCH 1/2] Add configure options for Valgrind annotations Mark Wielaard
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathon Anderson @ 2019-10-29 20:22 UTC
  To: Mark Wielaard; +Cc: elfutils-devel

...Drat, I thought I had it this time. Oh well, sorry to make a mess 
again.

The following changes since commit 6f447ef7f0c5000e88d11312c06df9d5021d4ecd:

  libdwfl: don't bother freeing frames outside of dwfl_thread_getframes (2019-10-29 17:48:05 +0100)

are available in the Git repository at:

  <https://github.com/blue42u/elfutils.git> libdw-mem-pr-v2

for you to fetch changes up to 6813732e29766afbe9c1763a5d397f1f51a633d6:

  libdw: Rewrite the memory handler to be more robust. (2019-10-29 13:35:33 -0500)

----------------------------------------------------------------
Jonathon Anderson (2):
      Add configure options for Valgrind annotations.
      libdw: Rewrite the memory handler to be more robust.

 ChangeLog               |  5 +++++
 configure.ac            | 30 ++++++++++++++++++++++++++++++
 lib/atomics.h           |  2 ++
 libdw/ChangeLog         |  9 +++++++++
 libdw/dwarf_begin_elf.c |  7 ++++---
 libdw/dwarf_end.c       | 24 +++++++++++++-----------
 libdw/libdwP.h          | 67 ++++++++++++++++++++++++++++++++-----------------------------------
 libdw/libdw_alloc.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 8 files changed, 160 insertions(+), 53 deletions(-)

On Tue, Oct 29, 2019 at 21:17, Mark Wielaard <mark@klomp.org> wrote:
> Hi Jonathon,
> 
> On Tue, Oct 29, 2019 at 01:55:25PM -0500, Jonathon Anderson wrote:
>>  This is a (revived and rebased) version of the libdw memory manager
>>  that isn't affected by the PTHREAD_KEYS_MAX limit. There are some
>>  downsides; in particular, if an application spawns many short-lived
>>  threads that all touch a Dwarf (enough to cause an allocation),
>>  there is about 8N bytes of memory overhead (N being the number of
>>  threads).
> 
> Thanks. But it looks like your mail client munged the patches a bit,
> making them tricky to apply. Could you resend them using git
> send-email, or do you have a public repo I could get them from?
> 
> Thanks,
> 
> Mark


* [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
  2019-10-29 21:15     ` [PATCH 1/2] Add configure options for Valgrind annotations Mark Wielaard
@ 2019-10-29 21:15       ` Mark Wielaard
  2019-11-07 17:20         ` Mark Wielaard
  2019-11-07 17:19       ` [PATCH 1/2] Add configure options for Valgrind annotations Mark Wielaard
  1 sibling, 1 reply; 12+ messages in thread
From: Mark Wielaard @ 2019-10-29 21:15 UTC
  To: elfutils-devel; +Cc: Jonathon Anderson

From: Jonathon Anderson <jma14@rice.edu>

Pthread's thread-local keys are limited in number (PTHREAD_KEYS_MAX),
which makes it difficult to use many Dwarfs concurrently. This replaces
that mechanism with a less efficient (and less elegant) but more robust
method.

Signed-off-by: Jonathon Anderson <jma14@rice.edu>
---
 lib/atomics.h           |  2 ++
 libdw/ChangeLog         |  9 ++++++
 libdw/dwarf_begin_elf.c |  7 +++--
 libdw/dwarf_end.c       | 24 +++++++-------
 libdw/libdwP.h          | 67 +++++++++++++++++++--------------------
 libdw/libdw_alloc.c     | 69 ++++++++++++++++++++++++++++++++++++++---
 6 files changed, 125 insertions(+), 53 deletions(-)

diff --git a/lib/atomics.h b/lib/atomics.h
index ffd12f87..e3773759 100644
--- a/lib/atomics.h
+++ b/lib/atomics.h
@@ -31,7 +31,9 @@
 #if HAVE_STDATOMIC_H
 /* If possible, use the compiler's preferred atomics.  */
 # include <stdatomic.h>
+# include <threads.h>
 #else
 /* Otherwise, try to use the builtins provided by this compiler.  */
 # include "stdatomic-fbsd.h"
+# define thread_local __thread
 #endif /* HAVE_STDATOMIC_H */
diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index 394c0df2..8de8a837 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,12 @@
+2019-10-28  Jonathon Anderson  <jma14@rice.edu>
+
+	* libdw_alloc.c: Added __libdw_alloc_tail.
+	(__libdw_allocate): Switch to use the mem_tails array.
+	* libdwP.h (Dwarf): Likewise.
+	* dwarf_begin_elf.c (dwarf_begin_elf): Support for above.
+	* dwarf_end.c (dwarf_end): Likewise.
+	* atomics.h: Add support for thread_local.
+
 2019-08-26  Jonathon Anderson  <jma14@rice.edu>
 
 	* libdw_alloc.c (__libdw_allocate): Added thread-safe stack allocator.
diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
index 8d137414..eadff13b 100644
--- a/libdw/dwarf_begin_elf.c
+++ b/libdw/dwarf_begin_elf.c
@@ -418,13 +418,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn *scngrp)
      actual allocation.  */
   result->mem_default_size = mem_default_size;
   result->oom_handler = __libdw_oom;
-  if (pthread_key_create (&result->mem_key, NULL) != 0)
+  if (pthread_rwlock_init(&result->mem_rwl, NULL) != 0)
     {
       free (result);
-      __libdw_seterrno (DWARF_E_NOMEM); /* no memory or max pthread keys.  */
+      __libdw_seterrno (DWARF_E_NOMEM); /* no memory.  */
       return NULL;
     }
-  atomic_init (&result->mem_tail, (uintptr_t)NULL);
+  result->mem_stacks = 0;
+  result->mem_tails = NULL;
 
   if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
     {
diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
index a2e94436..3fd2836d 100644
--- a/libdw/dwarf_end.c
+++ b/libdw/dwarf_end.c
@@ -95,17 +95,19 @@ dwarf_end (Dwarf *dwarf)
       tdestroy (dwarf->split_tree, noop_free);
 
       /* Free the internally allocated memory.  */
-      struct libdw_memblock *memp;
-      memp = (struct libdw_memblock *) (atomic_load_explicit
-					(&dwarf->mem_tail,
-					 memory_order_relaxed));
-      while (memp != NULL)
-	{
-	  struct libdw_memblock *prevp = memp->prev;
-	  free (memp);
-	  memp = prevp;
-	}
-      pthread_key_delete (dwarf->mem_key);
+      for (size_t i = 0; i < dwarf->mem_stacks; i++)
+        {
+          struct libdw_memblock *memp = dwarf->mem_tails[i];
+          while (memp != NULL)
+	    {
+	      struct libdw_memblock *prevp = memp->prev;
+	      free (memp);
+	      memp = prevp;
+	    }
+        }
+      if (dwarf->mem_tails != NULL)
+        free (dwarf->mem_tails);
+      pthread_rwlock_destroy (&dwarf->mem_rwl);
 
       /* Free the pubnames helper structure.  */
       free (dwarf->pubnames_sets);
diff --git a/libdw/libdwP.h b/libdw/libdwP.h
index ad2599eb..3e1ef59b 100644
--- a/libdw/libdwP.h
+++ b/libdw/libdwP.h
@@ -149,17 +149,6 @@ enum
 
 #include "dwarf_sig8_hash.h"
 
-/* Structure for internal memory handling.  This is basically a simplified
-   reimplementation of obstacks.  Unfortunately the standard obstack
-   implementation is not usable in libraries.  */
-struct libdw_memblock
-{
-  size_t size;
-  size_t remaining;
-  struct libdw_memblock *prev;
-  char mem[0];
-};
-
 /* This is the structure representing the debugging state.  */
 struct Dwarf
 {
@@ -231,11 +220,22 @@ struct Dwarf
   /* Similar for addrx/constx, which will come from .debug_addr section.  */
   struct Dwarf_CU *fake_addr_cu;
 
-  /* Internal memory handling.  Each thread allocates separately and only
-     allocates from its own blocks, while all the blocks are pushed atomically
-     onto a unified stack for easy deallocation.  */
-  pthread_key_t mem_key;
-  atomic_uintptr_t mem_tail;
+  /* Supporting lock for internal memory handling.  Ensures threads that have
+     an entry in the mem_tails array are not disturbed by new threads doing
+     allocations for this Dwarf.  */
+  pthread_rwlock_t mem_rwl;
+
+  /* Internal memory handling.  This is basically a simplified thread-local
+     reimplementation of obstacks.  Unfortunately the standard obstack
+     implementation is not usable in libraries.  */
+  size_t mem_stacks;
+  struct libdw_memblock
+  {
+    size_t size;
+    size_t remaining;
+    struct libdw_memblock *prev;
+    char mem[0];
+  } **mem_tails;
 
   /* Default size of allocated memory blocks.  */
   size_t mem_default_size;
@@ -578,34 +578,31 @@ libdw_valid_user_form (int form)
 extern void __libdw_seterrno (int value) internal_function;
 
 
-/* Memory handling, the easy parts.  This macro does not do nor need to do any
-   locking for proper concurrent operation.  */
+/* Memory handling, the easy parts.  */
 #define libdw_alloc(dbg, type, tsize, cnt) \
-  ({ struct libdw_memblock *_tail = pthread_getspecific (dbg->mem_key);       \
-     size_t _req = (tsize) * (cnt);					      \
-     type *_result;							      \
-     if (unlikely (_tail == NULL))					      \
-       _result = (type *) __libdw_allocate (dbg, _req, __alignof (type));     \
+  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg);		      \
+     size_t _required = (tsize) * (cnt);				      \
+     type *_result = (type *) (_tail->mem + (_tail->size - _tail->remaining));\
+     size_t _padding = ((__alignof (type)				      \
+			 - ((uintptr_t) _result & (__alignof (type) - 1)))    \
+			& (__alignof (type) - 1));			      \
+     if (unlikely (_tail->remaining < _required + _padding))		      \
+       _result = (type *) __libdw_allocate (dbg, _required, __alignof (type));\
      else								      \
        {								      \
-	 _result = (type *) (_tail->mem + (_tail->size - _tail->remaining));  \
-	 size_t _padding = ((__alignof (type)				      \
-			    - ((uintptr_t) _result & (__alignof (type) - 1))) \
-			       & (__alignof (type) - 1));		      \
-	 if (unlikely (_tail->remaining < _req + _padding))		      \
-	   _result = (type *) __libdw_allocate (dbg, _req, __alignof (type)); \
-	 else								      \
-	   {								      \
-	     _req += _padding;						      \
-	     _result = (type *) ((char *) _result + _padding);		      \
-	     _tail->remaining -= _req;					      \
-	   }								      \
+	 _required += _padding;						      \
+	 _result = (type *) ((char *) _result + _padding);		      \
+	 _tail->remaining -= _required;					      \
        }								      \
      _result; })
 
 #define libdw_typed_alloc(dbg, type) \
   libdw_alloc (dbg, type, sizeof (type), 1)
 
+/* Callback to choose a thread-local memory allocation stack.  */
+extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
+     __nonnull_attribute__ (1);
+
 /* Callback to allocate more.  */
 extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
      __attribute__ ((__malloc__)) __nonnull_attribute__ (1);
diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
index f2e74d18..86ca7032 100644
--- a/libdw/libdw_alloc.c
+++ b/libdw/libdw_alloc.c
@@ -35,7 +35,68 @@
 #include <stdlib.h>
 #include "libdwP.h"
 #include "system.h"
+#include "atomics.h"
+#if USE_VG_ANNOTATIONS == 1
+#include <helgrind.h>
+#include <drd.h>
+#else
+#define ANNOTATE_HAPPENS_BEFORE(X)
+#define ANNOTATE_HAPPENS_AFTER(X)
+#endif
 
+#define THREAD_ID_UNSET ((size_t) -1)
+static thread_local size_t thread_id = THREAD_ID_UNSET;
+static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
+
+struct libdw_memblock *
+__libdw_alloc_tail (Dwarf *dbg)
+{
+  if (thread_id == THREAD_ID_UNSET)
+    thread_id = atomic_fetch_add (&next_id, 1);
+
+  pthread_rwlock_rdlock (&dbg->mem_rwl);
+  if (thread_id >= dbg->mem_stacks)
+    {
+      pthread_rwlock_unlock (&dbg->mem_rwl);
+      pthread_rwlock_wrlock (&dbg->mem_rwl);
+
+      /* Another thread may have already reallocated. In theory using an
+         atomic would be faster, but given that this only happens once per
+         thread per Dwarf, some minor slowdown should be fine.  */
+      if (thread_id >= dbg->mem_stacks)
+        {
+          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
+                                    * sizeof (struct libdw_memblock *));
+          if (dbg->mem_tails == NULL)
+            {
+              pthread_rwlock_unlock (&dbg->mem_rwl);
+              dbg->oom_handler();
+            }
+          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
+            dbg->mem_tails[i] = NULL;
+          dbg->mem_stacks = thread_id + 1;
+          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
+        }
+
+      pthread_rwlock_unlock (&dbg->mem_rwl);
+      pthread_rwlock_rdlock (&dbg->mem_rwl);
+    }
+
+  /* At this point, we have an entry in the tail array.  */
+  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
+  struct libdw_memblock *result = dbg->mem_tails[thread_id];
+  if (result == NULL)
+    {
+      result = malloc (dbg->mem_default_size);
+      result->size = dbg->mem_default_size
+                     - offsetof (struct libdw_memblock, mem);
+      result->remaining = result->size;
+      result->prev = NULL;
+      dbg->mem_tails[thread_id] = result;
+    }
+  pthread_rwlock_unlock (&dbg->mem_rwl);
+  return result;
+}
 
 void *
 __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
@@ -52,10 +113,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
   newp->size = size - offsetof (struct libdw_memblock, mem);
   newp->remaining = (uintptr_t) newp + size - (result + minsize);
 
-  newp->prev = (struct libdw_memblock*)atomic_exchange_explicit(
-      &dbg->mem_tail, (uintptr_t)newp, memory_order_relaxed);
-  if (pthread_setspecific (dbg->mem_key, newp) != 0)
-    dbg->oom_handler ();
+  pthread_rwlock_rdlock (&dbg->mem_rwl);
+  newp->prev = dbg->mem_tails[thread_id];
+  dbg->mem_tails[thread_id] = newp;
+  pthread_rwlock_unlock (&dbg->mem_rwl);
 
   return (void *) result;
 }
-- 
2.20.1
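
For context, the caller-side interface is unchanged by this patch:
internal libdw code keeps allocating per-Dwarf objects through the
same macros. An illustrative use (not a hunk from the patch):

    /* Allocate one object tied to the Dwarf's lifetime; all blocks
       are freed in bulk by dwarf_end.  The macro picks this thread's
       memblock via __libdw_alloc_tail and falls back to
       __libdw_allocate when the current block is full.  */
    Dwarf_Abbrev *abb = libdw_typed_alloc (dbg, Dwarf_Abbrev);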


* [PATCH 1/2] Add configure options for Valgrind annotations.
  2019-10-29 20:22   ` Jonathon Anderson
@ 2019-10-29 21:15     ` Mark Wielaard
  2019-10-29 21:15       ` [PATCH 2/2] libdw: Rewrite the memory handler to be more robust Mark Wielaard
  2019-11-07 17:19       ` [PATCH 1/2] Add configure options for Valgrind annotations Mark Wielaard
  0 siblings, 2 replies; 12+ messages in thread
From: Mark Wielaard @ 2019-10-29 21:15 UTC
  To: elfutils-devel; +Cc: Jonathon Anderson

From: Jonathon Anderson <jma14@rice.edu>

Signed-off-by: Jonathon Anderson <jma14@rice.edu>
---
 ChangeLog    |  5 +++++
 configure.ac | 30 ++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/ChangeLog b/ChangeLog
index 911cf354..433a5f3c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2019-08-25  Jonathon Anderson <jma14@rice.edu>
+
+	* configure.ac: Add new --enable-valgrind-annotations
+	* configure.ac: Add new --with-valgrind (headers only)
+
 2019-07-05  Omar Sandoval  <osandov@fb.com>
 
 	* configure.ac: Get rid of --enable-libebl-subdir.
diff --git a/configure.ac b/configure.ac
index 9be34d12..7fc3acb6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -335,6 +335,35 @@ if test "$use_valgrind" = yes; then
 fi
 AM_CONDITIONAL(USE_VALGRIND, test "$use_valgrind" = yes)
 
+AC_ARG_WITH([valgrind],
+AS_HELP_STRING([--with-valgrind],[include directory for Valgrind headers]),
+[with_valgrind_headers=$withval], [with_valgrind_headers=no])
+if test "x$with_valgrind_headers" != xno; then
+    save_CFLAGS="$CFLAGS"
+    CFLAGS="$CFLAGS -I$with_valgrind_headers"
+    AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
+      #include <valgrind/valgrind.h>
+      int main() { return 0; }
+    ]])], [ HAVE_VALGRIND_HEADERS="yes"
+            CFLAGS="$save_CFLAGS -I$with_valgrind_headers" ],
+          [ AC_MSG_ERROR([invalid valgrind include directory: $with_valgrind_headers]) ])
+fi
+
+AC_ARG_ENABLE([valgrind-annotations],
+AS_HELP_STRING([--enable-valgrind-annotations],[insert extra annotations for better valgrind support]),
+[use_vg_annotations=$enableval], [use_vg_annotations=no])
+if test "$use_vg_annotations" = yes; then
+    if test "x$HAVE_VALGRIND_HEADERS" != "xyes"; then
+      AC_MSG_CHECKING([whether Valgrind headers are available])
+      AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
+        #include <valgrind/valgrind.h>
+        int main() { return 0; }
+      ]])], [ AC_MSG_RESULT([yes]) ],
+            [ AC_MSG_ERROR([valgrind annotations requested but no headers are available]) ])
+    fi
+fi
+AM_CONDITIONAL(USE_VG_ANNOTATIONS, test "$use_vg_annotations" = yes)
+
 AC_ARG_ENABLE([install-elfh],
 AS_HELP_STRING([--enable-install-elfh],[install elf.h in include dir]),
                [install_elfh=$enableval], [install_elfh=no])
@@ -669,6 +698,7 @@ AC_MSG_NOTICE([
   OTHER FEATURES
     Deterministic archives by default  : ${default_ar_deterministic}
     Native language support            : ${USE_NLS}
+    Extra Valgrind annotations         : ${use_vg_annotations}
 
   EXTRA TEST FEATURES (used with make check)
     have bunzip2 installed (required)  : ${HAVE_BUNZIP2}
-- 
2.20.1


* Re: [PATCH 1/2] Add configure options for Valgrind annotations.
  2019-10-29 21:15     ` [PATCH 1/2] Add configure options for Valgrind annotations Mark Wielaard
  2019-10-29 21:15       ` [PATCH 2/2] libdw: Rewrite the memory handler to be more robust Mark Wielaard
@ 2019-11-07 17:19       ` Mark Wielaard
  1 sibling, 0 replies; 12+ messages in thread
From: Mark Wielaard @ 2019-11-07 17:19 UTC
  To: elfutils-devel; +Cc: Jonathon Anderson

Hi Jonathon,

On Tue, 2019-10-29 at 22:14 +0100, Mark Wielaard wrote:
> +2019-08-25  Jonathon Anderson <jma14@rice.edu>
> +
> +	* configure.ac: Add new --enable-valgrind-annotations
> +	* configure.ac: Add new --with-valgrind (headers only)

I think this looks fine. In theory we could also try to use pkg-config
checks, since valgrind does ship a valgrind.pc file. But various
distros don't actually ship that :{ So it is probably not worth it.
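
For reference, such a check might look something like the sketch
below (untested, and assuming a valgrind.pc is actually installed,
which, as noted, many distros don't do):

    PKG_CHECK_MODULES([VALGRIND], [valgrind],
      [HAVE_VALGRIND_HEADERS=yes
       CFLAGS="$CFLAGS $VALGRIND_CFLAGS"],
      [HAVE_VALGRIND_HEADERS=no])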

I must admit that I don't actually understand how the annotations
really work, see comments on the other patch.

Thanks,

Mark


* Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
  2019-10-29 21:15       ` [PATCH 2/2] libdw: Rewrite the memory handler to be more robust Mark Wielaard
@ 2019-11-07 17:20         ` Mark Wielaard
  2019-11-07 18:13           ` Jonathon Anderson
  2019-11-07 18:40           ` Jonathon Anderson
  0 siblings, 2 replies; 12+ messages in thread
From: Mark Wielaard @ 2019-11-07 17:20 UTC
  To: elfutils-devel; +Cc: Jonathon Anderson

Hi Jonathon,

On Tue, 2019-10-29 at 22:14 +0100, Mark Wielaard wrote:
> Pthread's thread-local keys are limited in number (PTHREAD_KEYS_MAX),
> which makes it difficult to use many Dwarfs concurrently. This
> replaces that mechanism with a less efficient (and less elegant) but
> more robust method.

Thanks, it looks good (similar to a variant I already reviewed
earlier). Some small comments below.

I haven't benchmarked this version. Did you do any?

> diff --git a/lib/atomics.h b/lib/atomics.h
> index ffd12f87..e3773759 100644
> --- a/lib/atomics.h
> +++ b/lib/atomics.h
> @@ -31,7 +31,9 @@
>  #if HAVE_STDATOMIC_H
>  /* If possible, use the compiler's preferred atomics.  */
>  # include <stdatomic.h>
> +# include <threads.h>
>  #else
>  /* Otherwise, try to use the builtins provided by this compiler.  */
>  # include "stdatomic-fbsd.h"
> +# define thread_local __thread
>  #endif /* HAVE_STDATOMIC_H */

Do we really need this?
We already use __thread unconditionally in the rest of the code.
The usage of threads.h seems to imply we actually want C11
_Thread_local. Is that what you really want, or can we just use
__thread in libdw_alloc.c for thread_id?

> diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
> index 8d137414..eadff13b 100644
> --- a/libdw/dwarf_begin_elf.c
> +++ b/libdw/dwarf_begin_elf.c
> @@ -418,13 +418,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn *scngrp)
>       actual allocation.  */
>    result->mem_default_size = mem_default_size;
>    result->oom_handler = __libdw_oom;
> -  if (pthread_key_create (&result->mem_key, NULL) != 0)
> +  if (pthread_rwlock_init(&result->mem_rwl, NULL) != 0)
>      {
>        free (result);
> -      __libdw_seterrno (DWARF_E_NOMEM); /* no memory or max pthread keys.  */
> +      __libdw_seterrno (DWARF_E_NOMEM); /* no memory.  */
>        return NULL;
>      }
> -  atomic_init (&result->mem_tail, (uintptr_t)NULL);
> +  result->mem_stacks = 0;
> +  result->mem_tails = NULL;
>  
>    if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
>      {

OK.

> diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
> index a2e94436..3fd2836d 100644
> --- a/libdw/dwarf_end.c
> +++ b/libdw/dwarf_end.c
> @@ -95,17 +95,19 @@ dwarf_end (Dwarf *dwarf)
>        tdestroy (dwarf->split_tree, noop_free);
>  
>        /* Free the internally allocated memory.  */
> -      struct libdw_memblock *memp;
> -      memp = (struct libdw_memblock *) (atomic_load_explicit
> -					(&dwarf->mem_tail,
> -					 memory_order_relaxed));
> -      while (memp != NULL)
> -	{
> -	  struct libdw_memblock *prevp = memp->prev;
> -	  free (memp);
> -	  memp = prevp;
> -	}
> -      pthread_key_delete (dwarf->mem_key);
> +      for (size_t i = 0; i < dwarf->mem_stacks; i++)
> +        {
> +          struct libdw_memblock *memp = dwarf->mem_tails[i];
> +          while (memp != NULL)
> +	    {
> +	      struct libdw_memblock *prevp = memp->prev;
> +	      free (memp);
> +	      memp = prevp;
> +	    }
> +        }
> +      if (dwarf->mem_tails != NULL)
> +        free (dwarf->mem_tails);
> +      pthread_rwlock_destroy (&dwarf->mem_rwl);
>  
>        /* Free the pubnames helper structure.  */
>        free (dwarf->pubnames_sets);

OK. Might it be an idea to have some call to reset next_id (see below)?

> diff --git a/libdw/libdwP.h b/libdw/libdwP.h
> index ad2599eb..3e1ef59b 100644
> --- a/libdw/libdwP.h
> +++ b/libdw/libdwP.h
> @@ -149,17 +149,6 @@ enum
>  
>  #include "dwarf_sig8_hash.h"
>  
> -/* Structure for internal memory handling.  This is basically a simplified
> -   reimplementation of obstacks.  Unfortunately the standard obstack
> -   implementation is not usable in libraries.  */
> -struct libdw_memblock
> -{
> -  size_t size;
> -  size_t remaining;
> -  struct libdw_memblock *prev;
> -  char mem[0];
> -};
> -
>  /* This is the structure representing the debugging state.  */
>  struct Dwarf
>  {
> @@ -231,11 +220,22 @@ struct Dwarf
>    /* Similar for addrx/constx, which will come from .debug_addr section.  */
>    struct Dwarf_CU *fake_addr_cu;
>  
> -  /* Internal memory handling.  Each thread allocates separately and only
> -     allocates from its own blocks, while all the blocks are pushed atomically
> -     onto a unified stack for easy deallocation.  */
> -  pthread_key_t mem_key;
> -  atomic_uintptr_t mem_tail;
> +  /* Supporting lock for internal memory handling.  Ensures threads that have
> +     an entry in the mem_tails array are not disturbed by new threads doing
> +     allocations for this Dwarf.  */
> +  pthread_rwlock_t mem_rwl;
> +
> +  /* Internal memory handling.  This is basically a simplified thread-local
> +     reimplementation of obstacks.  Unfortunately the standard obstack
> +     implementation is not usable in libraries.  */
> +  size_t mem_stacks;
> +  struct libdw_memblock
> +  {
> +    size_t size;
> +    size_t remaining;
> +    struct libdw_memblock *prev;
> +    char mem[0];
> +  } **mem_tails;
>  
>    /* Default size of allocated memory blocks.  */
>    size_t mem_default_size;

OK.

> @@ -578,34 +578,31 @@ libdw_valid_user_form (int form)
>  extern void __libdw_seterrno (int value) internal_function;
>  
>  
> -/* Memory handling, the easy parts.  This macro does not do nor need to do any
> -   locking for proper concurrent operation.  */
> +/* Memory handling, the easy parts.  */
>  #define libdw_alloc(dbg, type, tsize, cnt) \
> -  ({ struct libdw_memblock *_tail = pthread_getspecific (dbg->mem_key);       \
> -     size_t _req = (tsize) * (cnt);					      \
> -     type *_result;							      \
> -     if (unlikely (_tail == NULL))					      \
> -       _result = (type *) __libdw_allocate (dbg, _req, __alignof (type));     \
> +  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg);		      \
> +     size_t _required = (tsize) * (cnt);				      \
> +     type *_result = (type *) (_tail->mem + (_tail->size - _tail->remaining));\
> +     size_t _padding = ((__alignof (type)				      \
> +			 - ((uintptr_t) _result & (__alignof (type) - 1)))    \
> +			& (__alignof (type) - 1));			      \
> +     if (unlikely (_tail->remaining < _required + _padding))		      \
> +       _result = (type *) __libdw_allocate (dbg, _required, __alignof (type));\
>       else								      \
>         {								      \
> -	 _result = (type *) (_tail->mem + (_tail->size - _tail->remaining));  \
> -	 size_t _padding = ((__alignof (type)				      \
> -			    - ((uintptr_t) _result & (__alignof (type) - 1))) \
> -			       & (__alignof (type) - 1));		      \
> -	 if (unlikely (_tail->remaining < _req + _padding))		      \
> -	   _result = (type *) __libdw_allocate (dbg, _req, __alignof (type)); \
> -	 else								      \
> -	   {								      \
> -	     _req += _padding;						      \
> -	     _result = (type *) ((char *) _result + _padding);		      \
> -	     _tail->remaining -= _req;					      \
> -	   }								      \
> +	 _required += _padding;						      \
> +	 _result = (type *) ((char *) _result + _padding);		      \
> +	 _tail->remaining -= _required;					      \
>         }								      \
>       _result; })

OK. Maybe add a comment that __libdw_alloc_tail makes sure that we get
a thread local libdw_memblock to work with.
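
For anyone reading along, the padding expression in the macro just
rounds the current offset up to __alignof (type). A standalone sketch
of the same arithmetic (pad_for_align is a made-up name; it assumes,
as __alignof guarantees, that the alignment is a power of two):

    #include <stdint.h>
    #include <stdio.h>

    /* Bytes needed to round ADDR up to ALIGN (a power of two), as in
       the libdw_alloc macro.  */
    static size_t
    pad_for_align (uintptr_t addr, size_t align)
    {
      return (align - (addr & (align - 1))) & (align - 1);
    }

    int
    main (void)
    {
      printf ("%zu\n", pad_for_align (0x1005, 8)); /* 3: 0x1005 -> 0x1008 */
      printf ("%zu\n", pad_for_align (0x1008, 8)); /* 0: already aligned */
      return 0;
    }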

>  #define libdw_typed_alloc(dbg, type) \
>    libdw_alloc (dbg, type, sizeof (type), 1)
>  
> +/* Callback to choose a thread-local memory allocation stack.  */
> +extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
> +     __nonnull_attribute__ (1);
> +
>  /* Callback to allocate more.  */
>  extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>       __attribute__ ((__malloc__)) __nonnull_attribute__ (1);

O. There is the comment already :)

> diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
> index f2e74d18..86ca7032 100644
> --- a/libdw/libdw_alloc.c
> +++ b/libdw/libdw_alloc.c
> @@ -35,7 +35,68 @@
>  #include <stdlib.h>
>  #include "libdwP.h"
>  #include "system.h"
> +#include "atomics.h"
> +#if USE_VG_ANNOTATIONS == 1
> +#include <helgrind.h>
> +#include <drd.h>

I think if you include helgrind.h you won't get the drd.h
ANNOTATE_HAPPENS_BEFORE/AFTER. So do you also need to include drd.h?

> +#else
> +#define ANNOTATE_HAPPENS_BEFORE(X)
> +#define ANNOTATE_HAPPENS_AFTER(X)
> +#endif

Could you explain the usage of the happens_before/after annotations in
this code. I must admit that I don't fully understand why/how it works
in this case. Specifically since realloc might change the address that
mem_tails points to.

> +#define THREAD_ID_UNSET ((size_t) -1)
> +static thread_local size_t thread_id = THREAD_ID_UNSET;
> +static atomic_size_t next_id = ATOMIC_VAR_INIT(0);

OK, but maybe use static __thread size_t thread_id as explained above?

> +struct libdw_memblock *
> +__libdw_alloc_tail (Dwarf *dbg)
> +{
> +  if (thread_id == THREAD_ID_UNSET)
> +    thread_id = atomic_fetch_add (&next_id, 1);
> +
> +  pthread_rwlock_rdlock (&dbg->mem_rwl);
> +  if (thread_id >= dbg->mem_stacks)
> +    {
> +      pthread_rwlock_unlock (&dbg->mem_rwl);
> +      pthread_rwlock_wrlock (&dbg->mem_rwl);
> +
> +      /* Another thread may have already reallocated. In theory using an
> +         atomic would be faster, but given that this only happens once per
> +         thread per Dwarf, some minor slowdown should be fine.  */
> +      if (thread_id >= dbg->mem_stacks)
> +        {
> +          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
> +                                    * sizeof (struct libdw_memblock *));
> +          if (dbg->mem_tails == NULL)
> +            {
> +              pthread_rwlock_unlock (&dbg->mem_rwl);
> +              dbg->oom_handler();
> +            }
> +          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
> +            dbg->mem_tails[i] = NULL;
> +          dbg->mem_stacks = thread_id + 1;
> +          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
> +        }
> +
> +      pthread_rwlock_unlock (&dbg->mem_rwl);
> +      pthread_rwlock_rdlock (&dbg->mem_rwl);
> +    }

OK. So next_id only increases, which means it might leak a bit
without thread pools.

Might it make sense to have some __libdw_destroy_tail () function that
could be called from dwarf_end that checks if this was the last Dwarf
in use so we could reset next_id? It would only work in some cases, if
there are multiple Dwarfs in use it probably is useless. I guess it is
too much trickery for an odd corner case?

O, and I now think you would then also need something for dwarf_begin
to reset any set thread_ids... bleah. So probably way too complicated.
So lets not, unless you think this is actually simple.

> +  /* At this point, we have an entry in the tail array.  */
> +  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
> +  struct libdw_memblock *result = dbg->mem_tails[thread_id];
> +  if (result == NULL)
> +    {
> +      result = malloc (dbg->mem_default_size);
> +      result->size = dbg->mem_default_size
> +                     - offsetof (struct libdw_memblock, mem);
> +      result->remaining = result->size;
> +      result->prev = NULL;
> +      dbg->mem_tails[thread_id] = result;
> +    }
> +  pthread_rwlock_unlock (&dbg->mem_rwl);
> +  return result;
> +}

OK.

>  void *
>  __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
> @@ -52,10 +113,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>    newp->size = size - offsetof (struct libdw_memblock, mem);
>    newp->remaining = (uintptr_t) newp + size - (result + minsize);
>  
> -  newp->prev = (struct libdw_memblock*)atomic_exchange_explicit(
> -      &dbg->mem_tail, (uintptr_t)newp, memory_order_relaxed);
> -  if (pthread_setspecific (dbg->mem_key, newp) != 0)
> -    dbg->oom_handler ();
> +  pthread_rwlock_rdlock (&dbg->mem_rwl);
> +  newp->prev = dbg->mem_tails[thread_id];
> +  dbg->mem_tails[thread_id] = newp;
> +  pthread_rwlock_unlock (&dbg->mem_rwl);
>  
>    return (void *) result;
>  }

OK. Since this is only called after __libdw_alloc_tail you know that
thread_id will be set.

Thanks,

Mark


* Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
  2019-11-07 17:20         ` Mark Wielaard
@ 2019-11-07 18:13           ` Jonathon Anderson
  2019-11-08 16:22             ` Mark Wielaard
  2019-11-07 18:40           ` Jonathon Anderson
  1 sibling, 1 reply; 12+ messages in thread
From: Jonathon Anderson @ 2019-11-07 18:13 UTC
  To: Mark Wielaard; +Cc: elfutils-devel



On Thu, Nov 7, 2019 at 18:20, Mark Wielaard <mark@klomp.org> wrote:
> Hi Jonathon,
> 
> On Tue, 2019-10-29 at 22:14 +0100, Mark Wielaard wrote:
>>  Pthread's thread-local keys are limited in number (PTHREAD_KEYS_MAX),
>>  which makes it difficult to use many Dwarfs concurrently. This
>>  replaces that mechanism with a less efficient (and less elegant) but
>>  more robust method.
> 
> Thanks, it looks good (similar to a variant I already reviewed
> earlier). Some small comments below.
> 
> I haven't benchmarked this version. Did you do any?
> 
>>  diff --git a/lib/atomics.h b/lib/atomics.h
>>  index ffd12f87..e3773759 100644
>>  --- a/lib/atomics.h
>>  +++ b/lib/atomics.h
>>  @@ -31,7 +31,9 @@
>>   #if HAVE_STDATOMIC_H
>>   /* If possible, use the compiler's preferred atomics.  */
>>   # include <stdatomic.h>
>>  +# include <threads.h>
>>   #else
>>   /* Otherwise, try to use the builtins provided by this compiler.  
>> */
>>   # include "stdatomic-fbsd.h"
>>  +# define thread_local __thread
>>   #endif /* HAVE_STDATOMIC_H */
> 
> Do we really need this?
> We already use __thread unconditionally in the rest of the code.
> The usage of threads.h seems to imply we actually want C11
> _Thread_local. Is that what you really want, or can we just use
> __thread in libdw_alloc.c for thread_id?

We don't really need it; I just got in the habit of writing
thread_local (and it gives proper C11 compatibility). __thread is
perfectly fine for thread_id.
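
(For concreteness, the declaration would then simply read:

    /* GCC/Clang __thread extension, already used unconditionally
       elsewhere in elfutils.  */
    static __thread size_t thread_id = THREAD_ID_UNSET;

which the committed version can use in place of thread_local.)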

> 
>>  diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
>>  index 8d137414..eadff13b 100644
>>  --- a/libdw/dwarf_begin_elf.c
>>  +++ b/libdw/dwarf_begin_elf.c
>>  @@ -418,13 +418,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, 
>> Elf_Scn *scngrp)
>>        actual allocation.  */
>>     result->mem_default_size = mem_default_size;
>>     result->oom_handler = __libdw_oom;
>>  -  if (pthread_key_create (&result->mem_key, NULL) != 0)
>>  +  if (pthread_rwlock_init(&result->mem_rwl, NULL) != 0)
>>       {
>>         free (result);
>>  -      __libdw_seterrno (DWARF_E_NOMEM); /* no memory or max 
>> pthread keys.  */
>>  +      __libdw_seterrno (DWARF_E_NOMEM); /* no memory.  */
>>         return NULL;
>>       }
>>  -  atomic_init (&result->mem_tail, (uintptr_t)NULL);
>>  +  result->mem_stacks = 0;
>>  +  result->mem_tails = NULL;
>> 
>>     if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
>>       {
> 
> OK.
> 
>>  diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
>>  index a2e94436..3fd2836d 100644
>>  --- a/libdw/dwarf_end.c
>>  +++ b/libdw/dwarf_end.c
>>  @@ -95,17 +95,19 @@ dwarf_end (Dwarf *dwarf)
>>         tdestroy (dwarf->split_tree, noop_free);
>> 
>>         /* Free the internally allocated memory.  */
>>  -      struct libdw_memblock *memp;
>>  -      memp = (struct libdw_memblock *) (atomic_load_explicit
>>  -					(&dwarf->mem_tail,
>>  -					 memory_order_relaxed));
>>  -      while (memp != NULL)
>>  -	{
>>  -	  struct libdw_memblock *prevp = memp->prev;
>>  -	  free (memp);
>>  -	  memp = prevp;
>>  -	}
>>  -      pthread_key_delete (dwarf->mem_key);
>>  +      for (size_t i = 0; i < dwarf->mem_stacks; i++)
>>  +        {
>>  +          struct libdw_memblock *memp = dwarf->mem_tails[i];
>>  +          while (memp != NULL)
>>  +	    {
>>  +	      struct libdw_memblock *prevp = memp->prev;
>>  +	      free (memp);
>>  +	      memp = prevp;
>>  +	    }
>>  +        }
>>  +      if (dwarf->mem_tails != NULL)
>>  +        free (dwarf->mem_tails);
>>  +      pthread_rwlock_destroy (&dwarf->mem_rwl);
>> 
>>         /* Free the pubnames helper structure.  */
>>         free (dwarf->pubnames_sets);
> 
> OK. Might it be an idea to have some call to reset next_id (see 
> below)?

Generally not a good idea (see below).

> [...]
> 
>>  diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
>>  index f2e74d18..86ca7032 100644
>>  --- a/libdw/libdw_alloc.c
>>  +++ b/libdw/libdw_alloc.c
>>  @@ -35,7 +35,68 @@
>>   #include <stdlib.h>
>>   #include "libdwP.h"
>>   #include "system.h"
>>  +#include "atomics.h"
>>  +#if USE_VG_ANNOTATIONS == 1
>>  +#include <helgrind.h>
>>  +#include <drd.h>
> 
> I think if you include helgrind.h you won't get the drd.h
> ANNOTATE_HAPPENS_BEFORE/AFTER. So do you also need to include drd.h?

Not really, just another habit. Since this file only needs the
HAPPENS_* annotations, helgrind.h is sufficient.

> 
>>  +#else
>>  +#define ANNOTATE_HAPPENS_BEFORE(X)
>>  +#define ANNOTATE_HAPPENS_AFTER(X)
>>  +#endif
> 
> Could you explain the usage of the happens_before/after annotations in
> this code. I must admit that I don't fully understand why/how it works
> in this case. Specifically since realloc might change the address that
> mem_tails points to.

Reader-writer locks ensure no "readers" are present whenever a "writer" 
is around. In this case we use the "write" side for resizing mem_tails 
and the "read" side when mem_tails needs to stay stable. Which is why 
most of the time we have a read lock and then promote to a write lock 
when we need to reallocate.

The annotations are there to clean up a minor deficiency in Helgrind:
for whatever reason, if you do writes under a read lock, it reports
races between those writes and the writes done under the write lock
(in this case, __libdw_allocate and the realloc). I haven't dug deep
enough to know exactly why it happens, just that it does and that
adding this H-B arc seems to fix the issue.
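
Schematically, the pattern described above is (condensed from the
patch, with error handling omitted):

    pthread_rwlock_unlock (&dbg->mem_rwl);  /* drop the read lock...  */
    pthread_rwlock_wrlock (&dbg->mem_rwl);  /* ...take the write lock */
    if (thread_id >= dbg->mem_stacks)       /* re-check: another writer
                                               may have grown the array
                                               during the unlocked gap */
      {
        /* grow_mem_tails is a hypothetical name for the realloc +
           NULL-fill step in the patch.  */
        grow_mem_tails (dbg, thread_id);
        ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
      }
    pthread_rwlock_unlock (&dbg->mem_rwl);
    pthread_rwlock_rdlock (&dbg->mem_rwl);  /* back to the read side */
    /* Every path that reads mem_tails then passes this point: */
    ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);

The HAPPENS_BEFORE on the write side pairs with the HAPPENS_AFTER
that every reader passes, giving Helgrind the ordering edge it
otherwise misses.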

> 
>>  +#define THREAD_ID_UNSET ((size_t) -1)
>>  +static thread_local size_t thread_id = THREAD_ID_UNSET;
>>  +static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
> 
> OK, but maybe use static __thread size_t thread_id as explained above?

Fine by me.

> 
>>  +struct libdw_memblock *
>>  +__libdw_alloc_tail (Dwarf *dbg)
>>  +{
>>  +  if (thread_id == THREAD_ID_UNSET)
>>  +    thread_id = atomic_fetch_add (&next_id, 1);
>>  +
>>  +  pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +  if (thread_id >= dbg->mem_stacks)
>>  +    {
>>  +      pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +      pthread_rwlock_wrlock (&dbg->mem_rwl);
>>  +
>>  +      /* Another thread may have already reallocated. In theory 
>> using an
>>  +         atomic would be faster, but given that this only happens 
>> once per
>>  +         thread per Dwarf, some minor slowdown should be fine.  */
>>  +      if (thread_id >= dbg->mem_stacks)
>>  +        {
>>  +          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
>>  +                                    * sizeof (struct 
>> libdw_memblock *));
>>  +          if (dbg->mem_tails == NULL)
>>  +            {
>>  +              pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +              dbg->oom_handler();
>>  +            }
>>  +          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
>>  +            dbg->mem_tails[i] = NULL;
>>  +          dbg->mem_stacks = thread_id + 1;
>>  +          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
>>  +        }
>>  +
>>  +      pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +      pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +    }
> 
> OK. So next_id only increases, which means it might leak a bit
> without thread pools.
> 
> Might it make sense to have some __libdw_destroy_tail () function that
> could be called from dwarf_end that checks if this was the last Dwarf
> in use so we could reset next_id? It would only work in some cases, if
> there are multiple Dwarfs in use it probably is useless. I guess it is
> too much trickery for an odd corner case?
> 
> O, and I now think you would then also need something for dwarf_begin
> to reset any set thread_ids... bleah. So probably way too complicated.
> So lets not, unless you think this is actually simple.

Which is why I didn't want to do that.

The other option was to have a sort of free list for ids, but in that 
case the cleanup isn't great (sometime after all threads have 
completed... if you consider detached threads things get hairy). Plus 
it requires a fully concurrent stack or queue, which is a complicated 
data structure itself.
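
To make the trade-off concrete, a naive version of that free list
might look like the sketch below (purely hypothetical, not proposed
for the patch; mutex-protected rather than the fully concurrent
structure a real version would need, and with the hard part left out,
namely running id_release reliably at every thread exit, detached
threads included):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;
    static size_t *free_ids = NULL;      /* stack of returned ids */
    static size_t n_free = 0;
    static size_t next_fresh = 0;

    static size_t
    id_acquire (void)
    {
      pthread_mutex_lock (&id_lock);
      size_t id = (n_free > 0) ? free_ids[--n_free] : next_fresh++;
      pthread_mutex_unlock (&id_lock);
      return id;
    }

    static void
    id_release (size_t id)  /* would have to run at thread exit */
    {
      pthread_mutex_lock (&id_lock);
      size_t *p = realloc (free_ids, (n_free + 1) * sizeof *p);
      if (p != NULL)        /* on failure, just leak the id */
        {
          free_ids = p;
          free_ids[n_free++] = id;
        }
      pthread_mutex_unlock (&id_lock);
    }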

> 
>>  +  /* At this point, we have an entry in the tail array.  */
>>  +  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
>>  +  struct libdw_memblock *result = dbg->mem_tails[thread_id];
>>  +  if (result == NULL)
>>  +    {
>>  +      result = malloc (dbg->mem_default_size);
>>  +      result->size = dbg->mem_default_size
>>  +                     - offsetof (struct libdw_memblock, mem);
>>  +      result->remaining = result->size;
>>  +      result->prev = NULL;
>>  +      dbg->mem_tails[thread_id] = result;
>>  +    }
>>  +  pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +  return result;
>>  +}
> 
> OK.
> 
>>   void *
>>   __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>>  @@ -52,10 +113,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, 
>> size_t align)
>>     newp->size = size - offsetof (struct libdw_memblock, mem);
>>     newp->remaining = (uintptr_t) newp + size - (result + minsize);
>> 
>>  -  newp->prev = (struct libdw_memblock*)atomic_exchange_explicit(
>>  -      &dbg->mem_tail, (uintptr_t)newp, memory_order_relaxed);
>>  -  if (pthread_setspecific (dbg->mem_key, newp) != 0)
>>  -    dbg->oom_handler ();
>>  +  pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +  newp->prev = dbg->mem_tails[thread_id];
>>  +  dbg->mem_tails[thread_id] = newp;
>>  +  pthread_rwlock_unlock (&dbg->mem_rwl);
>> 
>>     return (void *) result;
>>   }
> 
> OK. Since this is only called after __libdw_alloc_tail you know that
> thread_id will be set.

Exactly.

> 
> Thanks,
> 
> Mark


* Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
  2019-11-07 17:20         ` Mark Wielaard
  2019-11-07 18:13           ` Jonathon Anderson
@ 2019-11-07 18:40           ` Jonathon Anderson
  2019-11-08 16:22             ` Mark Wielaard
  1 sibling, 1 reply; 12+ messages in thread
From: Jonathon Anderson @ 2019-11-07 18:40 UTC
  To: Mark Wielaard; +Cc: elfutils-devel



On Thu, Nov 7, 2019 at 18:20, Mark Wielaard <mark@klomp.org> wrote:
> Hi Jonathon,
> 
> On Tue, 2019-10-29 at 22:14 +0100, Mark Wielaard wrote:
>>  Pthread's thread-local keys are limited in number (PTHREAD_KEYS_MAX),
>>  which makes it difficult to use many Dwarfs concurrently. This
>>  replaces that mechanism with a less efficient (and less elegant) but
>>  more robust method.
> 
> Thanks, it looks good (similar to a variant I already reviewed
> earlier). Some small comments below.
> 
> I haven't benchmarked this version. Did you do any?

I haven't benchmarked this version, but I did benchmark the equivalent
earlier version (this version is almost literally a rebase of the
other). I don't have the exact results on hand; what I remember is that
the pthread_key method was faster (and handled the many-thread case
better), by maybe a factor of 1.5x-2x in parallel. In serial the
overhead was minimal (just an extra pointer indirection on
allocations).

The fastest (I think) would be the "cloned" Dwarf idea, but that 
requires an overhaul.

> 
>>  diff --git a/lib/atomics.h b/lib/atomics.h
>>  index ffd12f87..e3773759 100644
>>  --- a/lib/atomics.h
>>  +++ b/lib/atomics.h
>>  @@ -31,7 +31,9 @@
>>   #if HAVE_STDATOMIC_H
>>   /* If possible, use the compiler's preferred atomics.  */
>>   # include <stdatomic.h>
>>  +# include <threads.h>
>>   #else
>>   /* Otherwise, try to use the builtins provided by this compiler.  
>> */
>>   # include "stdatomic-fbsd.h"
>>  +# define thread_local __thread
>>   #endif /* HAVE_STDATOMIC_H */
> 
> Do we really need this?
> We already use __thread unconditionally in the rest of the code.
> The usage of threads.h seems to imply we actually want C11
> _Thread_local. Is that what you really want, or can we just use
> __thread in libdw_alloc.c for thread_id?
> 
>>  diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
>>  index 8d137414..eadff13b 100644
>>  --- a/libdw/dwarf_begin_elf.c
>>  +++ b/libdw/dwarf_begin_elf.c
>>  @@ -418,13 +418,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, 
>> Elf_Scn *scngrp)
>>        actual allocation.  */
>>     result->mem_default_size = mem_default_size;
>>     result->oom_handler = __libdw_oom;
>>  -  if (pthread_key_create (&result->mem_key, NULL) != 0)
>>  +  if (pthread_rwlock_init(&result->mem_rwl, NULL) != 0)
>>       {
>>         free (result);
>>  -      __libdw_seterrno (DWARF_E_NOMEM); /* no memory or max 
>> pthread keys.  */
>>  +      __libdw_seterrno (DWARF_E_NOMEM); /* no memory.  */
>>         return NULL;
>>       }
>>  -  atomic_init (&result->mem_tail, (uintptr_t)NULL);
>>  +  result->mem_stacks = 0;
>>  +  result->mem_tails = NULL;
>> 
>>     if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
>>       {
> 
> OK.
> 
>>  diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
>>  index a2e94436..3fd2836d 100644
>>  --- a/libdw/dwarf_end.c
>>  +++ b/libdw/dwarf_end.c
>>  @@ -95,17 +95,19 @@ dwarf_end (Dwarf *dwarf)
>>         tdestroy (dwarf->split_tree, noop_free);
>> 
>>         /* Free the internally allocated memory.  */
>>  -      struct libdw_memblock *memp;
>>  -      memp = (struct libdw_memblock *) (atomic_load_explicit
>>  -					(&dwarf->mem_tail,
>>  -					 memory_order_relaxed));
>>  -      while (memp != NULL)
>>  -	{
>>  -	  struct libdw_memblock *prevp = memp->prev;
>>  -	  free (memp);
>>  -	  memp = prevp;
>>  -	}
>>  -      pthread_key_delete (dwarf->mem_key);
>>  +      for (size_t i = 0; i < dwarf->mem_stacks; i++)
>>  +        {
>>  +          struct libdw_memblock *memp = dwarf->mem_tails[i];
>>  +          while (memp != NULL)
>>  +	    {
>>  +	      struct libdw_memblock *prevp = memp->prev;
>>  +	      free (memp);
>>  +	      memp = prevp;
>>  +	    }
>>  +        }
>>  +      if (dwarf->mem_tails != NULL)
>>  +        free (dwarf->mem_tails);
>>  +      pthread_rwlock_destroy (&dwarf->mem_rwl);
>> 
>>         /* Free the pubnames helper structure.  */
>>         free (dwarf->pubnames_sets);
> 
> OK. Might it be an idea to have some call to reset next_id (see 
> below)?
> 
>>  diff --git a/libdw/libdwP.h b/libdw/libdwP.h
>>  index ad2599eb..3e1ef59b 100644
>>  --- a/libdw/libdwP.h
>>  +++ b/libdw/libdwP.h
>>  @@ -149,17 +149,6 @@ enum
>> 
>>   #include "dwarf_sig8_hash.h"
>> 
>>  -/* Structure for internal memory handling.  This is basically a 
>> simplified
>>  -   reimplementation of obstacks.  Unfortunately the standard 
>> obstack
>>  -   implementation is not usable in libraries.  */
>>  -struct libdw_memblock
>>  -{
>>  -  size_t size;
>>  -  size_t remaining;
>>  -  struct libdw_memblock *prev;
>>  -  char mem[0];
>>  -};
>>  -
>>   /* This is the structure representing the debugging state.  */
>>   struct Dwarf
>>   {
>>  @@ -231,11 +220,22 @@ struct Dwarf
>>     /* Similar for addrx/constx, which will come from .debug_addr 
>> section.  */
>>     struct Dwarf_CU *fake_addr_cu;
>> 
>>  -  /* Internal memory handling.  Each thread allocates separately 
>> and only
>>  -     allocates from its own blocks, while all the blocks are 
>> pushed atomically
>>  -     onto a unified stack for easy deallocation.  */
>>  -  pthread_key_t mem_key;
>>  -  atomic_uintptr_t mem_tail;
>>  +  /* Supporting lock for internal memory handling.  Ensures 
>> threads that have
>>  +     an entry in the mem_tails array are not disturbed by new 
>> threads doing
>>  +     allocations for this Dwarf.  */
>>  +  pthread_rwlock_t mem_rwl;
>>  +
>>  +  /* Internal memory handling.  This is basically a simplified 
>> thread-local
>>  +     reimplementation of obstacks.  Unfortunately the standard 
>> obstack
>>  +     implementation is not usable in libraries.  */
>>  +  size_t mem_stacks;
>>  +  struct libdw_memblock
>>  +  {
>>  +    size_t size;
>>  +    size_t remaining;
>>  +    struct libdw_memblock *prev;
>>  +    char mem[0];
>>  +  } **mem_tails;
>> 
>>     /* Default size of allocated memory blocks.  */
>>     size_t mem_default_size;
> 
> OK.
> 
>>  @@ -578,34 +578,31 @@ libdw_valid_user_form (int form)
>>   extern void __libdw_seterrno (int value) internal_function;
>> 
>> 
>>  -/* Memory handling, the easy parts.  This macro does not do nor 
>> need to do any
>>  -   locking for proper concurrent operation.  */
>>  +/* Memory handling, the easy parts.  */
>>   #define libdw_alloc(dbg, type, tsize, cnt) \
>>  -  ({ struct libdw_memblock *_tail = pthread_getspecific 
>> (dbg->mem_key);       \
>>  -     size_t _req = (tsize) * (cnt);					      \
>>  -     type *_result;							      \
>>  -     if (unlikely (_tail == NULL))					      \
>>  -       _result = (type *) __libdw_allocate (dbg, _req, __alignof 
>> (type));     \
>>  +  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg);		     
>>  \
>>  +     size_t _required = (tsize) * (cnt);				      \
>>  +     type *_result = (type *) (_tail->mem + (_tail->size - 
>> _tail->remaining));\
>>  +     size_t _padding = ((__alignof (type)				      \
>>  +			 - ((uintptr_t) _result & (__alignof (type) - 1)))    \
>>  +			& (__alignof (type) - 1));			      \
>>  +     if (unlikely (_tail->remaining < _required + _padding))		     
>>  \
>>  +       _result = (type *) __libdw_allocate (dbg, _required, 
>> __alignof (type));\
>>        else								      \
>>          {								      \
>>  -	 _result = (type *) (_tail->mem + (_tail->size - 
>> _tail->remaining));  \
>>  -	 size_t _padding = ((__alignof (type)				      \
>>  -			    - ((uintptr_t) _result & (__alignof (type) - 1))) \
>>  -			       & (__alignof (type) - 1));		      \
>>  -	 if (unlikely (_tail->remaining < _req + _padding))		      \
>>  -	   _result = (type *) __libdw_allocate (dbg, _req, __alignof 
>> (type)); \
>>  -	 else								      \
>>  -	   {								      \
>>  -	     _req += _padding;						      \
>>  -	     _result = (type *) ((char *) _result + _padding);		      \
>>  -	     _tail->remaining -= _req;					      \
>>  -	   }								      \
>>  +	 _required += _padding;						      \
>>  +	 _result = (type *) ((char *) _result + _padding);		      \
>>  +	 _tail->remaining -= _required;					      \
>>          }								      \
>>        _result; })
> 
> OK. Maybe add a comment that __libdw_alloc_tail makes sure that we get
> a thread local libdw_memblock to work with.
> 
>>   #define libdw_typed_alloc(dbg, type) \
>>     libdw_alloc (dbg, type, sizeof (type), 1)
>> 
>>  +/* Callback to choose a thread-local memory allocation stack.  */
>>  +extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg)
>>  +     __nonnull_attribute__ (1);
>>  +
>>   /* Callback to allocate more.  */
>>   extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t 
>> align)
>>        __attribute__ ((__malloc__)) __nonnull_attribute__ (1);
> 
> O. There is the comment already :)
> 
>>  diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c
>>  index f2e74d18..86ca7032 100644
>>  --- a/libdw/libdw_alloc.c
>>  +++ b/libdw/libdw_alloc.c
>>  @@ -35,7 +35,68 @@
>>   #include <stdlib.h>
>>   #include "libdwP.h"
>>   #include "system.h"
>>  +#include "atomics.h"
>>  +#if USE_VG_ANNOTATIONS == 1
>>  +#include <helgrind.h>
>>  +#include <drd.h>
> 
> I think if you include helgrind.h you won't get the drd.h
> ANNOTATE_HAPPENS_BEFORE/AFTER. So do you also need to include drd.h?
> 
>>  +#else
>>  +#define ANNOTATE_HAPPENS_BEFORE(X)
>>  +#define ANNOTATE_HAPPENS_AFTER(X)
>>  +#endif
> 
> Could you explain the usage of the happens_before/after annotations in
> this code. I must admit that I don't fully understand why/how it works
> in this case. Specifically since realloc might change the address that
> mem_tails points to.
> 
>>  +#define THREAD_ID_UNSET ((size_t) -1)
>>  +static thread_local size_t thread_id = THREAD_ID_UNSET;
>>  +static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
> 
> OK, but maybe use static __thread size_t thread_id as explained above?
> 
>>  +struct libdw_memblock *
>>  +__libdw_alloc_tail (Dwarf *dbg)
>>  +{
>>  +  if (thread_id == THREAD_ID_UNSET)
>>  +    thread_id = atomic_fetch_add (&next_id, 1);
>>  +
>>  +  pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +  if (thread_id >= dbg->mem_stacks)
>>  +    {
>>  +      pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +      pthread_rwlock_wrlock (&dbg->mem_rwl);
>>  +
>>  +      /* Another thread may have already reallocated. In theory using an
>>  +         atomic would be faster, but given that this only happens once per
>>  +         thread per Dwarf, some minor slowdown should be fine.  */
>>  +      if (thread_id >= dbg->mem_stacks)
>>  +        {
>>  +          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
>>  +                                    * sizeof (struct libdw_memblock *));
>>  +          if (dbg->mem_tails == NULL)
>>  +            {
>>  +              pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +              dbg->oom_handler();
>>  +            }
>>  +          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
>>  +            dbg->mem_tails[i] = NULL;
>>  +          dbg->mem_stacks = thread_id + 1;
>>  +          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
>>  +        }
>>  +
>>  +      pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +      pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +    }
> 
> OK. So next_id only increases, so it might leak a bit without thread
> pools.
> 
> Might it make sense to have some __libdw_destroy_tail () function that
> could be called from dwarf_end that checks if this was the last Dwarf
> in use so we could reset next_id? It would only work in some cases, if
> there are multiple Dwarfs in use it probably is useless. I guess it is
> too much trickery for an odd corner case?
> 
> O, and I now think you would then also need something for dwarf_begin
> to reset any set thread_ids... bleah. So probably way too complicated.
> So let's not, unless you think this is actually simple.
> 
>>  +  /* At this point, we have an entry in the tail array.  */
>>  +  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
>>  +  struct libdw_memblock *result = dbg->mem_tails[thread_id];
>>  +  if (result == NULL)
>>  +    {
>>  +      result = malloc (dbg->mem_default_size);
>>  +      result->size = dbg->mem_default_size
>>  +                     - offsetof (struct libdw_memblock, mem);
>>  +      result->remaining = result->size;
>>  +      result->prev = NULL;
>>  +      dbg->mem_tails[thread_id] = result;
>>  +    }
>>  +  pthread_rwlock_unlock (&dbg->mem_rwl);
>>  +  return result;
>>  +}
> 
> OK.
> 
>>   void *
>>   __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>>  @@ -52,10 +113,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>>     newp->size = size - offsetof (struct libdw_memblock, mem);
>>     newp->remaining = (uintptr_t) newp + size - (result + minsize);
>> 
>>  -  newp->prev = (struct libdw_memblock*)atomic_exchange_explicit(
>>  -      &dbg->mem_tail, (uintptr_t)newp, memory_order_relaxed);
>>  -  if (pthread_setspecific (dbg->mem_key, newp) != 0)
>>  -    dbg->oom_handler ();
>>  +  pthread_rwlock_rdlock (&dbg->mem_rwl);
>>  +  newp->prev = dbg->mem_tails[thread_id];
>>  +  dbg->mem_tails[thread_id] = newp;
>>  +  pthread_rwlock_unlock (&dbg->mem_rwl);
>> 
>>     return (void *) result;
>>   }
> 
> OK. Since this is only called after __libdw_alloc_tail you know that
> thread_id will be set.
> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
  2019-11-07 18:13           ` Jonathon Anderson
@ 2019-11-08 16:22             ` Mark Wielaard
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Wielaard @ 2019-11-08 16:22 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel

Hi,

On Thu, 2019-11-07 at 12:13 -0600, Jonathon Anderson wrote:
> On Thu, Nov 7, 2019 at 18:20, Mark Wielaard <mark@klomp.org> wrote:
> > Do we really need this?
> > We already use __thread unconditionally in the rest of the code.
> > The usage of threads.h seems to imply we actually want C11
> > _Thread_local. Is that what you really want, or can we just use
> > __thread in libdw_alloc.c for thread_id?
> 
> We don't really need it; I just got in the habit of writing 
> thread_local (and proper C11 compat). __thread is perfectly fine 
> for thread_id.

Great, removed.
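
For reference, the two spellings are interchangeable here; a minimal
sketch (illustrative, not the patch itself):

#include <stddef.h>

#define THREAD_ID_UNSET ((size_t) -1)

/* GNU C spelling, as used elsewhere in elfutils: */
static __thread size_t thread_id = THREAD_ID_UNSET;

/* The C11 spelling would be:
   static _Thread_local size_t thread_id = THREAD_ID_UNSET;  */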

> > I think if you include helgrind.h you won't get the drd.h
> > ANNOTATE_HAPPENS_BEFORE/AFTER. So do you also need to include
> > drd.h?
> 
> Not really, just another habit. Since this file only needs 
> HAPPENS_*, helgrind.h is sufficient.

Thanks. drd.h include removed.

> > 
> > >  +#else
> > >  +#define ANNOTATE_HAPPENS_BEFORE(X)
> > >  +#define ANNOTATE_HAPPENS_AFTER(X)
> > >  +#endif
> > 
> > Could you explain the usage of the happens_before/after annotations in
> > this code. I must admit that I don't fully understand why/how it works
> > in this case. Specifically since realloc might change the address that
> > mem_tails points to.
> 
> Reader-writer locks ensure no "readers" are present whenever a "writer" 
> is around. In this case we use the "write" side for resizing mem_tails 
> and the "read" side when mem_tails needs to stay stable, which is why 
> most of the time we hold a read lock and promote to a write lock only 
> when we need to reallocate.
> 
> The annotations are to clean up a minor deficiency in Helgrind: for 
> whatever reason, if you do writes under a read lock it reports races 
> with the writes from under the write lock (in this case, 
> __libdw_allocate and the realloc). I haven't dug deep enough to know 
> exactly why it happens, just that it does, and adding this H-B arc 
> seems to fix the issue.

OK, let's keep them in for now. They are disabled by default anyway. For
now people who want a "helgrindable" libdw will need to rebuild libdw
with them enabled.
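
To make the pattern concrete, here is a self-contained sketch of the
shape being annotated (illustrative names, not the actual libdw code;
error handling elided):

#include <pthread.h>
#include <stdlib.h>

#if USE_VG_ANNOTATIONS == 1
# include <helgrind.h>
#else
# define ANNOTATE_HAPPENS_BEFORE(X)
# define ANNOTATE_HAPPENS_AFTER(X)
#endif

static pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
static size_t nslots = 0;
static int **table = NULL;

void
touch_slot (size_t id, int *value)
{
  pthread_rwlock_rdlock (&rwl);
  if (id >= nslots)
    {
      /* rwlocks cannot be upgraded atomically: drop the lock, retake
         it for writing, and recheck in case another thread already
         grew the table in between.  */
      pthread_rwlock_unlock (&rwl);
      pthread_rwlock_wrlock (&rwl);
      if (id >= nslots)
        {
          table = realloc (table, (id + 1) * sizeof *table);
          for (size_t i = nslots; i <= id; i++)
            table[i] = NULL;
          nslots = id + 1;
          /* Writer side of the H-B arc: order this realloc before
             any later reader that sees the new table.  */
          ANNOTATE_HAPPENS_BEFORE (&table);
        }
      pthread_rwlock_unlock (&rwl);
      pthread_rwlock_rdlock (&rwl);
    }
  /* Reader side of the arc; the write below happens under the read
     lock, which is what confuses Helgrind without the annotations.  */
  ANNOTATE_HAPPENS_AFTER (&table);
  table[id] = value;
  pthread_rwlock_unlock (&rwl);
}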

> > >  +#define THREAD_ID_UNSET ((size_t) -1)
> > >  +static thread_local size_t thread_id = THREAD_ID_UNSET;
> > >  +static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
> > 
> > OK, but maybe use static __thread size_t thread_id as explained
> > above?
> 
> Fine by me.

Done.

> > O, and I now think you would then also need something for dwarf_begin
> > to reset any set thread_ids... bleah. So probably way too complicated.
> > So let's not, unless you think this is actually simple.
> 
> Which is why I didn't want to do that.
> 
> The other option was to have a sort of free list for ids, but in that 
> case the cleanup isn't great (sometime after all threads have 
> completed... if you consider detached threads things get hairy). Plus 
> it requires a fully concurrent stack or queue, which is a complicated 
> data structure itself.

Yeah, agreed, let's keep it with a simple monotonically increasing
next_id. Things need to get really big before this ever becomes a
problem. And I don't think programs will keep spawning new threads and
using Dwarfs on each of them anyway. I expect longer-running processes
that do need to handle Dwarfs in a concurrent fashion to use thread pools.

Pushed with the small changes noted above.

Thanks,

Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
  2019-11-07 18:40           ` Jonathon Anderson
@ 2019-11-08 16:22             ` Mark Wielaard
  2019-11-08 17:41               ` Jonathon Anderson
  0 siblings, 1 reply; 12+ messages in thread
From: Mark Wielaard @ 2019-11-08 16:22 UTC (permalink / raw)
  To: Jonathon Anderson; +Cc: elfutils-devel

On Thu, 2019-11-07 at 12:40 -0600, Jonathon Anderson wrote:
> I haven't benchmarked this version, but I did benchmark the equivalent 
> earlier version (this version is almost literally a rebase of the 
> other). I don't have the exact results on hand, but what I remember is 
> that the pthread_key method was faster (and handled the many-thread 
> case better), by maybe a factor of 1.5x-2x in parallel. In serial the 
> overhead was minimal (just an extra pointer indirection on allocations).

I just tested the single-threaded case a bit and it is not measurably
slower than the previous version; compared to 0.177 things are
maybe ~1% slower (so probably in the noise).

A factor of 1.5x-2.0x slower in parallel does seem significant. Is that
in the case of many threads that are colliding a lot, or in general?

Thanks,

Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
  2019-11-08 16:22             ` Mark Wielaard
@ 2019-11-08 17:41               ` Jonathon Anderson
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathon Anderson @ 2019-11-08 17:41 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel



On Fri, Nov 8, 2019 at 17:22, Mark Wielaard <mark@klomp.org> wrote:
> On Thu, 2019-11-07 at 12:40 -0600, Jonathon Anderson wrote:
>>  I haven't benchmarked this version, but I did benchmark the equivalent
>>  earlier version (this version is almost literally a rebase of the
>>  other). I don't have the exact results on hand, but what I remember is
>>  that the pthread_key method was faster (and handled the many-thread case
>>  better), by maybe a factor of 1.5x-2x in parallel. In serial the
>>  overhead was minimal (just an extra pointer indirection on allocations).
> 
> I just tested the single-threaded case a bit and it is not measurably
> slower than the previous version; compared to 0.177 things are
> maybe ~1% slower (so probably in the noise).
> 
> A factor of 1.5x-2.0x slower in parallel does seem significant. Is that
> in the case of many threads that are colliding a lot, or in general?

I believe it was 64 threads colliding a lot (on the reader side of 
mem_rwl). That said, this is all based on my memory from before the 
semester started. (They may also be numbers munged out of a larger 
benchmark, so don't trust them too much).

As it happens, on our end any slowdown is entirely hidden by all the 
other work we do while reading DIEs, so it's not a critical concern. Our 
code opens a Dwarf and then uses a #pragma omp parallel for across the 
CUs (using a serial recursion to read the DIEs); if you want to 
benchmark it, that pattern (sketched below) should suffice on a large 
enough example.

> 
> Thanks,
> 
> Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-11-08 17:41 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-29 18:55 [PATCH 0/2] libdw: Rewrite the memory handler to be more robust Jonathon Anderson
2019-10-29 20:17 ` Mark Wielaard
2019-10-29 20:22   ` Jonathon Anderson
2019-10-29 21:15     ` [PATCH 1/2] Add configure options for Valgrind annotations Mark Wielaard
2019-10-29 21:15       ` [PATCH 2/2] libdw: Rewrite the memory handler to be more robust Mark Wielaard
2019-11-07 17:20         ` Mark Wielaard
2019-11-07 18:13           ` Jonathon Anderson
2019-11-08 16:22             ` Mark Wielaard
2019-11-07 18:40           ` Jonathon Anderson
2019-11-08 16:22             ` Mark Wielaard
2019-11-08 17:41               ` Jonathon Anderson
2019-11-07 17:19       ` [PATCH 1/2] Add configure options for Valgrind annotations Mark Wielaard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).