From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.
From: Mark Wielaard
To: elfutils-devel@sourceware.org
Cc: Jonathon Anderson
Date: Thu, 07 Nov 2019 17:20:00 -0000
Message-ID: <93a4d8983b6ec43c09c6c3b3f6ed8d358321bb9d.camel@klomp.org>
In-Reply-To: <20191029211437.3268-2-mark@klomp.org>
References: <1572380520.19948.0@rice.edu>
 <20191029211437.3268-1-mark@klomp.org>
 <20191029211437.3268-2-mark@klomp.org>

Hi Jonathon,

On Tue, 2019-10-29 at 22:14 +0100, Mark Wielaard wrote:
> Pthread's thread-local variables are highly limited, which makes
> it difficult to use many Dwarfs.  This replaces that with a
> less efficient (or elegant) but more robust method.

Thanks, it looks good (similar to a variant I already reviewed
earlier).  Some small comments below.

I haven't benchmarked this version. Did you do any?

> diff --git a/lib/atomics.h b/lib/atomics.h
> index ffd12f87..e3773759 100644
> --- a/lib/atomics.h
> +++ b/lib/atomics.h
> @@ -31,7 +31,9 @@
>  #if HAVE_STDATOMIC_H
>  /* If possible, use the compiler's preferred atomics.  */
>  # include <stdatomic.h>
> +# include <threads.h>
>  #else
>  /* Otherwise, try to use the builtins provided by this compiler.  */
>  # include "stdatomic-fbsd.h"
> +# define thread_local __thread
>  #endif /* HAVE_STDATOMIC_H */

Do we really need this? We already use __thread unconditionally in
the rest of the code. The usage of threads.h seems to imply we
actually want C11 _Thread_local. Is that what you really want, or can
we just use __thread in libdw_alloc.c for thread_id?
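
For reference, as far as I can see all three spellings below behave
the same for a static variable like thread_id. Just a quick
illustration of what the question is about, not something for the
patch:

/* One instance per thread in all three cases.  thread_local is just a
   macro for the C11 keyword _Thread_local, and __thread is the
   GNU/Clang spelling that the rest of the code already uses.  */
#include <stddef.h>
#include <threads.h>            /* only needed for the thread_local macro */

static thread_local size_t a;   /* C11 macro from <threads.h> */
static _Thread_local size_t b;  /* C11 keyword, no header needed */
static __thread size_t c;       /* GNU extension */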
> diff --git a/libdw/dwarf_begin_elf.c b/libdw/dwarf_begin_elf.c
> index 8d137414..eadff13b 100644
> --- a/libdw/dwarf_begin_elf.c
> +++ b/libdw/dwarf_begin_elf.c
> @@ -418,13 +418,14 @@ dwarf_begin_elf (Elf *elf, Dwarf_Cmd cmd, Elf_Scn *scngrp)
>       actual allocation.  */
>    result->mem_default_size = mem_default_size;
>    result->oom_handler = __libdw_oom;
> -  if (pthread_key_create (&result->mem_key, NULL) != 0)
> +  if (pthread_rwlock_init(&result->mem_rwl, NULL) != 0)
>      {
>        free (result);
> -      __libdw_seterrno (DWARF_E_NOMEM); /* no memory or max pthread keys.  */
> +      __libdw_seterrno (DWARF_E_NOMEM); /* no memory.  */
>        return NULL;
>      }
> -  atomic_init (&result->mem_tail, (uintptr_t)NULL);
> +  result->mem_stacks = 0;
> +  result->mem_tails = NULL;
>
>    if (cmd == DWARF_C_READ || cmd == DWARF_C_RDWR)
>      {

OK.

> diff --git a/libdw/dwarf_end.c b/libdw/dwarf_end.c
> index a2e94436..3fd2836d 100644
> --- a/libdw/dwarf_end.c
> +++ b/libdw/dwarf_end.c
> @@ -95,17 +95,19 @@ dwarf_end (Dwarf *dwarf)
>        tdestroy (dwarf->split_tree, noop_free);
>
>        /* Free the internally allocated memory.  */
> -      struct libdw_memblock *memp;
> -      memp = (struct libdw_memblock *) (atomic_load_explicit
> -                                        (&dwarf->mem_tail,
> -                                         memory_order_relaxed));
> -      while (memp != NULL)
> -        {
> -          struct libdw_memblock *prevp = memp->prev;
> -          free (memp);
> -          memp = prevp;
> -        }
> -      pthread_key_delete (dwarf->mem_key);
> +      for (size_t i = 0; i < dwarf->mem_stacks; i++)
> +        {
> +          struct libdw_memblock *memp = dwarf->mem_tails[i];
> +          while (memp != NULL)
> +            {
> +              struct libdw_memblock *prevp = memp->prev;
> +              free (memp);
> +              memp = prevp;
> +            }
> +        }
> +      if (dwarf->mem_tails != NULL)
> +        free (dwarf->mem_tails);
> +      pthread_rwlock_destroy (&dwarf->mem_rwl);
>
>        /* Free the pubnames helper structure.  */
>        free (dwarf->pubnames_sets);

OK. Might it be an idea to have some call to reset next_id (see
below)?

> diff --git a/libdw/libdwP.h b/libdw/libdwP.h
> index ad2599eb..3e1ef59b 100644
> --- a/libdw/libdwP.h
> +++ b/libdw/libdwP.h
> @@ -149,17 +149,6 @@ enum
>
>  #include "dwarf_sig8_hash.h"
>
> -/* Structure for internal memory handling.  This is basically a simplified
> -   reimplementation of obstacks.  Unfortunately the standard obstack
> -   implementation is not usable in libraries.  */
> -struct libdw_memblock
> -{
> -  size_t size;
> -  size_t remaining;
> -  struct libdw_memblock *prev;
> -  char mem[0];
> -};
> -
>  /* This is the structure representing the debugging state.  */
>  struct Dwarf
>  {
> @@ -231,11 +220,22 @@ struct Dwarf
>    /* Similar for addrx/constx, which will come from .debug_addr section.  */
>    struct Dwarf_CU *fake_addr_cu;
>
> -  /* Internal memory handling.  Each thread allocates separately and only
> -     allocates from its own blocks, while all the blocks are pushed atomically
> -     onto a unified stack for easy deallocation.  */
> -  pthread_key_t mem_key;
> -  atomic_uintptr_t mem_tail;
> +  /* Supporting lock for internal memory handling.  Ensures threads that have
> +     an entry in the mem_tails array are not disturbed by new threads doing
> +     allocations for this Dwarf.  */
> +  pthread_rwlock_t mem_rwl;
> +
> +  /* Internal memory handling.  This is basically a simplified thread-local
> +     reimplementation of obstacks.  Unfortunately the standard obstack
> +     implementation is not usable in libraries.  */
> +  size_t mem_stacks;
> +  struct libdw_memblock
> +  {
> +    size_t size;
> +    size_t remaining;
> +    struct libdw_memblock *prev;
> +    char mem[0];
> +  } **mem_tails;
>
>    /* Default size of allocated memory blocks.  */
>    size_t mem_default_size;

OK.

> @@ -578,34 +578,31 @@ libdw_valid_user_form (int form)
>  extern void __libdw_seterrno (int value) internal_function;
>
>
> -/* Memory handling, the easy parts.  This macro does not do nor need to do any
> -   locking for proper concurrent operation.  */
> +/* Memory handling, the easy parts.  */
>  #define libdw_alloc(dbg, type, tsize, cnt) \
> -  ({ struct libdw_memblock *_tail = pthread_getspecific (dbg->mem_key); \
> -     size_t _req = (tsize) * (cnt); \
> -     type *_result; \
> -     if (unlikely (_tail == NULL)) \
> -       _result = (type *) __libdw_allocate (dbg, _req, __alignof (type)); \
> +  ({ struct libdw_memblock *_tail = __libdw_alloc_tail(dbg); \
> +     size_t _required = (tsize) * (cnt); \
> +     type *_result = (type *) (_tail->mem + (_tail->size - _tail->remaining));\
> +     size_t _padding = ((__alignof (type) \
> +                         - ((uintptr_t) _result & (__alignof (type) - 1))) \
> +                        & (__alignof (type) - 1)); \
> +     if (unlikely (_tail->remaining < _required + _padding)) \
> +       _result = (type *) __libdw_allocate (dbg, _required, __alignof (type));\
>       else \
>         { \
> -         _result = (type *) (_tail->mem + (_tail->size - _tail->remaining)); \
> -         size_t _padding = ((__alignof (type) \
> -                             - ((uintptr_t) _result & (__alignof (type) - 1))) \
> -                            & (__alignof (type) - 1)); \
> -         if (unlikely (_tail->remaining < _req + _padding)) \
> -           _result = (type *) __libdw_allocate (dbg, _req, __alignof (type)); \
> -         else \
> -           { \
> -             _req += _padding; \
> -             _result = (type *) ((char *) _result + _padding); \
> -             _tail->remaining -= _req; \
> -           } \
> +         _required += _padding; \
> +         _result = (type *) ((char *) _result + _padding); \
> +         _tail->remaining -= _required; \
>         } \
>       _result; })

OK. Maybe add a comment that __libdw_alloc_tail makes sure that we
get a thread-local libdw_memblock to work with.
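
Not really a comment on the patch, but to convince myself the padding
expression in libdw_alloc does what it should, I played with this
little standalone check (padding_for is just a made-up name for the
expression):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The expression from libdw_alloc: how many bytes to skip so that
   ADDR becomes aligned to ALIGN (ALIGN must be a power of two).  */
static size_t
padding_for (uintptr_t addr, size_t align)
{
  return (align - (addr & (align - 1))) & (align - 1);
}

int
main (void)
{
  assert (padding_for (0x1000, 8) == 0);   /* already aligned */
  assert (padding_for (0x1001, 8) == 7);   /* worst case */
  assert (padding_for (0x1004, 8) == 4);
  assert (padding_for (0x1007, 2) == 1);
  puts ("padding arithmetic checks out");
  return 0;
}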
> @@ -578,34 +578,31 @@ libdw_valid_user_form (int form) > extern void __libdw_seterrno (int value) internal_function; >=20=20 >=20=20 > -/* Memory handling, the easy parts. This macro does not do nor need to = do any > - locking for proper concurrent operation. */ > +/* Memory handling, the easy parts. */ > #define libdw_alloc(dbg, type, tsize, cnt) \ > - ({ struct libdw_memblock *_tail =3D pthread_getspecific (dbg->mem_key)= ; \ > - size_t _req =3D (tsize) * (cnt); \ > - type *_result; \ > - if (unlikely (_tail =3D=3D NULL)) \ > - _result =3D (type *) __libdw_allocate (dbg, _req, __alignof (type= )); \ > + ({ struct libdw_memblock *_tail =3D __libdw_alloc_tail(dbg); \ > + size_t _required =3D (tsize) * (cnt); \ > + type *_result =3D (type *) (_tail->mem + (_tail->size - _tail->rema= ining));\ > + size_t _padding =3D ((__alignof (type) \ > + - ((uintptr_t) _result & (__alignof (type) - 1))) \ > + & (__alignof (type) - 1)); \ > + if (unlikely (_tail->remaining < _required + _padding)) \ > + _result =3D (type *) __libdw_allocate (dbg, _required, __alignof = (type));\ > else \ > { \ > - _result =3D (type *) (_tail->mem + (_tail->size - _tail->remaining)); = \ > - size_t _padding =3D ((__alignof (type) \ > - - ((uintptr_t) _result & (__alignof (type) - 1))) \ > - & (__alignof (type) - 1)); \ > - if (unlikely (_tail->remaining < _req + _padding)) \ > - _result =3D (type *) __libdw_allocate (dbg, _req, __alignof (type));= \ > - else \ > - { \ > - _req +=3D _padding; \ > - _result =3D (type *) ((char *) _result + _padding); \ > - _tail->remaining -=3D _req; \ > - } \ > + _required +=3D _padding; \ > + _result =3D (type *) ((char *) _result + _padding); \ > + _tail->remaining -=3D _required; \ > } \ > _result; }) OK. Maybe add a comment that __libdw_alloc_tail makes sure that we get a thread local libdw_memblock to work with. > #define libdw_typed_alloc(dbg, type) \ > libdw_alloc (dbg, type, sizeof (type), 1) >=20=20 > +/* Callback to choose a thread-local memory allocation stack. */ > +extern struct libdw_memblock *__libdw_alloc_tail (Dwarf* dbg) > + __nonnull_attribute__ (1); > + > /* Callback to allocate more. */ > extern void *__libdw_allocate (Dwarf *dbg, size_t minsize, size_t align) > __attribute__ ((__malloc__)) __nonnull_attribute__ (1); O. There is the comment already :) > diff --git a/libdw/libdw_alloc.c b/libdw/libdw_alloc.c > index f2e74d18..86ca7032 100644 > --- a/libdw/libdw_alloc.c > +++ b/libdw/libdw_alloc.c > @@ -35,7 +35,68 @@ > #include > #include "libdwP.h" > #include "system.h" > +#include "atomics.h" > +#if USE_VG_ANNOTATIONS =3D=3D 1 > +#include > +#include I think if you include helgrind.h you won't get the drd.h ANNOTATE_HAPPENS_BEFORE/AFTER. So do you also need to include drd.h? > +#else > +#define ANNOTATE_HAPPENS_BEFORE(X) > +#define ANNOTATE_HAPPENS_AFTER(X) > +#endif Could you explain the usage of the happens_before/after annotations in this code. I must admit that I don't fully understand why/how it works in this case. Specifically since realloc might change the address that mem_tails points to. > +#define THREAD_ID_UNSET ((size_t) -1) > +static thread_local size_t thread_id =3D THREAD_ID_UNSET; > +static atomic_size_t next_id =3D ATOMIC_VAR_INIT(0); OK, but maybe use static __thread size_t thread_id as explained above? 
> +struct libdw_memblock *
> +__libdw_alloc_tail (Dwarf *dbg)
> +{
> +  if (thread_id == THREAD_ID_UNSET)
> +    thread_id = atomic_fetch_add (&next_id, 1);
> +
> +  pthread_rwlock_rdlock (&dbg->mem_rwl);
> +  if (thread_id >= dbg->mem_stacks)
> +    {
> +      pthread_rwlock_unlock (&dbg->mem_rwl);
> +      pthread_rwlock_wrlock (&dbg->mem_rwl);
> +
> +      /* Another thread may have already reallocated.  In theory using an
> +         atomic would be faster, but given that this only happens once per
> +         thread per Dwarf, some minor slowdown should be fine.  */
> +      if (thread_id >= dbg->mem_stacks)
> +        {
> +          dbg->mem_tails = realloc (dbg->mem_tails, (thread_id+1)
> +                                    * sizeof (struct libdw_memblock *));
> +          if (dbg->mem_tails == NULL)
> +            {
> +              pthread_rwlock_unlock (&dbg->mem_rwl);
> +              dbg->oom_handler();
> +            }
> +          for (size_t i = dbg->mem_stacks; i <= thread_id; i++)
> +            dbg->mem_tails[i] = NULL;
> +          dbg->mem_stacks = thread_id + 1;
> +          ANNOTATE_HAPPENS_BEFORE (&dbg->mem_tails);
> +        }
> +
> +      pthread_rwlock_unlock (&dbg->mem_rwl);
> +      pthread_rwlock_rdlock (&dbg->mem_rwl);
> +    }

OK. So next_id only increases, which means it might leak a bit
without thread pools. Might it make sense to have some
__libdw_destroy_tail () function that could be called from dwarf_end
that checks if this was the last Dwarf in use, so we could reset
next_id? It would only work in some cases; if there are multiple
Dwarfs in use it is probably useless. I guess it is too much trickery
for an odd corner case?

O, and I now think you would then also need something for dwarf_begin
to reset any set thread_ids... bleah. So probably way too
complicated. So let's not, unless you think this is actually simple.

> +  /* At this point, we have an entry in the tail array.  */
> +  ANNOTATE_HAPPENS_AFTER (&dbg->mem_tails);
> +  struct libdw_memblock *result = dbg->mem_tails[thread_id];
> +  if (result == NULL)
> +    {
> +      result = malloc (dbg->mem_default_size);
> +      result->size = dbg->mem_default_size
> +                     - offsetof (struct libdw_memblock, mem);
> +      result->remaining = result->size;
> +      result->prev = NULL;
> +      dbg->mem_tails[thread_id] = result;
> +    }
> +  pthread_rwlock_unlock (&dbg->mem_rwl);
> +  return result;
> +}

OK.

>  void *
>  __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
> @@ -52,10 +113,10 @@ __libdw_allocate (Dwarf *dbg, size_t minsize, size_t align)
>    newp->size = size - offsetof (struct libdw_memblock, mem);
>    newp->remaining = (uintptr_t) newp + size - (result + minsize);
>
> -  newp->prev = (struct libdw_memblock*)atomic_exchange_explicit(
> -    &dbg->mem_tail, (uintptr_t)newp, memory_order_relaxed);
> -  if (pthread_setspecific (dbg->mem_key, newp) != 0)
> -    dbg->oom_handler ();
> +  pthread_rwlock_rdlock (&dbg->mem_rwl);
> +  newp->prev = dbg->mem_tails[thread_id];
> +  dbg->mem_tails[thread_id] = newp;
> +  pthread_rwlock_unlock (&dbg->mem_rwl);
>
>    return (void *) result;
>  }

OK. Since this is only called after __libdw_alloc_tail you know that
thread_id will be set.

Thanks,

Mark
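
P.S. For anyone else following along, the essential bit in
__libdw_alloc_tail is the re-check after trading the rdlock for the
wrlock. Boiled down to a generic sketch (this is not libdw code, all
names are made up):

#include <pthread.h>
#include <stdlib.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static void **slots;            /* one slot per id, grown on demand */
static size_t nslots;

/* Return the slot value for ID, growing the shared array if needed.
   Only the thread owning ID ever writes slots[ID], so a read lock is
   enough for everything except resizing the array itself.  */
static void *
get_slot (size_t id)
{
  pthread_rwlock_rdlock (&lock);
  if (id >= nslots)
    {
      pthread_rwlock_unlock (&lock);
      pthread_rwlock_wrlock (&lock);
      /* Re-check: another thread may have grown the array between our
         unlock and wrlock.  Without this, a second (possibly smaller)
         realloc could throw away slots already handed out.  */
      if (id >= nslots)
        {
          void **new_slots = realloc (slots, (id + 1) * sizeof *slots);
          if (new_slots == NULL)
            {
              pthread_rwlock_unlock (&lock);
              return NULL;
            }
          for (size_t i = nslots; i <= id; i++)
            new_slots[i] = NULL;
          slots = new_slots;
          nslots = id + 1;
        }
      pthread_rwlock_unlock (&lock);
      pthread_rwlock_rdlock (&lock);
    }
  if (slots[id] == NULL)
    slots[id] = malloc (64);    /* stand-in for the real per-thread state */
  void *result = slots[id];
  pthread_rwlock_unlock (&lock);
  return result;
}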