public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* GCC DWARF unwinder _Unwind_Find_FDE acceleration
@ 2021-10-29 19:13 Florian Weimer
  0 siblings, 0 replies; only message in thread
From: Florian Weimer @ 2021-10-29 19:13 UTC (permalink / raw)
  To: gcc; +Cc: Jason Merrill, Jakub Jelinek, libc-alpha

I've got a mostly-complete implementation of a faster _Unwind_Find_FDE
implementation which lock-free and async-signal-safe.  The core idea is
to turn into effectively this.

  const fde *
  _Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases)
  {
    struct unw_eh_callback_data data;
    const fde *ret;
  
    ret = _Unwind_Find_registered_FDE (pc, bases);
    if (ret != NULL)
      return ret;

     void *eh_frame = _dl_find_eh_frame (pc);
     if (eh_frame == NULL)
       return NULL;
  
     struct find_fde_tail_result result
       = find_fde_tail ((_Unwind_Ptr) pc, eh_frame, NULL);
     if (result.entry != NULL)
       {
         bases->tbase = NULL;
         bases->dbase = NULL;
         bases->func = result.func;
       }
    return result.entry;
  }

_dl_find_eh_frame is implemented by glibc and is very similar to
__gnu_Unwind_Find_exidx, except that it returns the PT_GNU_EH_FRAME
address.  The real implementation of _Unwind_Find_FDE (not the
simplified one above) can use a weak symbol for _dl_find_eh_frame and
has still the old dl_iterate_phdr-based implementation as a fallback
(for non-glibc older glibc).  There's also some hackery for passing
around the dbase value for DW_EH_PE_datarel.

I split the common tail of the old and new implementations into
find_fde_tail, included below for reference.  Previously, the tail was
covered by the loader lock implied by dl_iterate_phdr.  With the new
implementation, that's no longer true.  I do not think this matters
because in the current implement in trunk, we continue DWARF parsing
after _Unwind_Find_FDE returns, relative to the returned FDE pointer.  A
concurrent dlclose will proceed to unmap the data while it is being
read.  So the race condition already exists on trunk, and we can have
crashes due to concurrent dlclose of code for a stack frame that is
processed by unwinding.

The new implementation I have is as fast as the trunk implementation for
the best case for the trunk implementation: when there are no threads
and we achieve 100% cache hit rate.  Even with 1000 shared objects (of
which less than 8 participate in unwinding, to achieve 100% cache hits),
my new implementation seems to be 10% faster.  Once the cache isn't 100%
effective, performance is really bad on the trunk, and likewise for
multi-threading.  (There are no writes to shared data in my new
implementation and only two 64-bit acquire-MO loads (worst case), so
benchmarking multi-threaded workloads against the trunk implementation
isn't very interesting.  32-bit support will likely need 4 acquire-MO
loads.)  There is a slight performance regression with lots and lots of
dlopen calls because the new data structures for the EH frame location
are somewhat costly to upgrade (but dlopen still spends twice as much
time in strcmp for soname matching, so I'm not worried).

There's room for one more straightforward optimization: move
_Unwind_Find_FDE into glibc, and not just its PT_GNU_EH_FRAME lookup.
We would have to teach glibc a little bit about the PT_GNU_EH_FRAME
layout (the parts parsed in find_fde_tail below).  Once in glibc, we can
easily cache the presence and location of the table of FDEs used in the
binary search inside find_fde_tail.  This means we can start the search
right away.  But maybe this isn't necessary given that my new
implementation is already as fast as the old one even in the cases most
benefiting the old implementation.

A future implementation of _Unwind_Find_FDE inside glibc could also
cache the PC-to-FDE mapping inside glibc, skipping the binary search for
the FDE and taking advantage of efficient cache invalidation during
dlopen and dlclose.  For an object of 20,000 functions, the binary
search in find_fde_tail takes about 10% to 15% of the time during
unwinding, so it's not insignificant.  But it's also unclear if we can
actually a achieve a decent cache hit rate for any useful cache size.

So I would like to stick to my current approach, with a function like
_dl_find_eh_frame.

I will hopefully complete 32-bit support and tests next week, so that I
have patches to post.

Comments?  Thoughts?

Thanks,
Florian

/* Result type of find_fde_tail below.  */
struct find_fde_tail_result
{
  const fde *entry;
  void *func;
};

/* Find the FDE for the program counter PC, in a previously located
   PT_GNU_EH_FRAME data region.  */
static struct find_fde_tail_result
find_fde_tail (_Unwind_Ptr pc,
	       const struct unw_eh_frame_hdr *hdr,
	       _Unwind_Ptr dbase)
{
  const unsigned char *p = (const unsigned char *) (hdr + 1);
  _Unwind_Ptr eh_frame;
  struct object ob;

  if (hdr->version != 1)
    return (struct find_fde_tail_result) { NULL, };

  p = read_encoded_value_with_base (hdr->eh_frame_ptr_enc,
				    base_from_cb_data (hdr->eh_frame_ptr_enc,
						       dbase),
				    p, &eh_frame);

  /* We require here specific table encoding to speed things up.
     Also, DW_EH_PE_datarel here means using PT_GNU_EH_FRAME start
     as base, not the processor specific DW_EH_PE_datarel.  */
  if (hdr->fde_count_enc != DW_EH_PE_omit
      && hdr->table_enc == (DW_EH_PE_datarel | DW_EH_PE_sdata4))
    {
      _Unwind_Ptr fde_count;

      p = read_encoded_value_with_base (hdr->fde_count_enc,
					base_from_cb_data (hdr->fde_count_enc,
							   dbase),
					p, &fde_count);
      /* Shouldn't happen.  */
      if (fde_count == 0)
	return (struct find_fde_tail_result) { NULL, };
      if ((((_Unwind_Ptr) p) & 3) == 0)
	{
	  struct fde_table {
	    signed initial_loc __attribute__ ((mode (SI)));
	    signed fde __attribute__ ((mode (SI)));
	  };
	  const struct fde_table *table = (const struct fde_table *) p;
	  size_t lo, hi, mid;
	  _Unwind_Ptr data_base = (_Unwind_Ptr) hdr;
	  fde *f;
	  unsigned int f_enc, f_enc_size;
	  _Unwind_Ptr range;

	  mid = fde_count - 1;
	  if (pc < table[0].initial_loc + data_base)
	    return (struct find_fde_tail_result) { NULL, };
	  else if (pc < table[mid].initial_loc + data_base)
	    {
	      lo = 0;
	      hi = mid;

	      while (lo < hi)
		{
		  mid = (lo + hi) / 2;
		  if (pc < table[mid].initial_loc + data_base)
		    hi = mid;
		  else if (pc >= table[mid + 1].initial_loc + data_base)
		    lo = mid + 1;
		  else
		    break;
		}

	      gcc_assert (lo < hi);
	    }

	  f = (fde *) (table[mid].fde + data_base);
	  f_enc = get_fde_encoding (f);
	  f_enc_size = size_of_encoded_value (f_enc);
	  read_encoded_value_with_base (f_enc & 0x0f, 0,
					&f->pc_begin[f_enc_size], &range);
	  void *func = (void *) (table[mid].initial_loc + data_base);
	  if (pc < table[mid].initial_loc + data_base + range)
	    return (struct find_fde_tail_result) { f, func };
	  else
	    return (struct find_fde_tail_result) { NULL, func };
	}
    }

  /* We have no sorted search table, so need to go the slow way.
     As soon as GLIBC will provide API so to notify that a library has been
     removed, we could cache this (and thus use search_object).  */
  ob.pc_begin = NULL;
  ob.tbase = NULL;
  ob.dbase = (void *) dbase;
  ob.u.single = (fde *) eh_frame;
  ob.s.i = 0;
  ob.s.b.mixed_encoding = 1;  /* Need to assume worst case.  */
  const fde *entry = linear_search_fdes (&ob, (fde *) eh_frame, (void *) pc);
  if (entry != NULL)
    {
      _Unwind_Ptr func;
      unsigned int encoding = get_fde_encoding (entry);

      read_encoded_value_with_base (encoding,
				    base_from_cb_data (encoding, dbase),
				    entry->pc_begin, &func);
      return (struct find_fde_tail_result) { entry, (void *) func };
    }
  return (struct find_fde_tail_result) { NULL, };
}


^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2021-10-29 19:13 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-29 19:13 GCC DWARF unwinder _Unwind_Find_FDE acceleration Florian Weimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).