* [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Steinar H. Gunderson @ 2022-04-04  7:32 UTC
  To: binutils; +Cc: sesse, Steinar H. Gunderson

When using perf to profile large binaries, _bfd_dwarf2_find_nearest_line()
becomes a hotspot, as perf wants to get line number information
(for inline-detection purposes) for each and every sample. In Chromium
in particular (the content_shell binary), this entails going through
475k address ranges, which takes a long time when done repeatedly.

Add a radix-256 trie over the address space to quickly map addresses to
compilation units; for content_shell, which is 1.6 GB when some
(but not full) debug information is turned on, we go from 6 ms to
0.006 ms (6 µs) for each lookup from address to compilation unit, a 1000x
speedup.

There is a modest RAM increase of 180 MB in this binary (the existing
linked list over ranges uses about 10 MB, and the entire perf job uses
between 2 and 3 GB for a medium-size profile); for smaller binaries with
few ranges, there should be hardly any extra RAM usage at all.
---
 bfd/dwarf2.c | 372 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 354 insertions(+), 18 deletions(-)

diff --git a/bfd/dwarf2.c b/bfd/dwarf2.c
index 404f35df62b..61ca68d4e97 100644
--- a/bfd/dwarf2.c
+++ b/bfd/dwarf2.c
@@ -82,6 +82,76 @@ struct adjusted_section
   bfd_vma adj_vma;
 };
 
+/* A trie to map quickly from address range to compilation unit.
+
+   This is a fairly standard radix-256 trie, used to quickly locate which
+   compilation unit any given address belongs to.  Given that each compilation
+   unit may register hundreds of very small and unaligned ranges (which may
+   potentially overlap, due to inlining and other concerns), and a large
+   program may end up containing hundreds of thousands of such ranges, we cannot
+   scan through them linearly without undue slowdown.
+
+   We use a hybrid trie to avoid memory explosion: There are two types of trie
+   nodes, leaves and interior nodes.  (Almost all nodes are leaves, so they
+   take up the bulk of the memory usage.) Leaves contain a simple array of
+   ranges (high/low address) and which compilation unit contains those ranges,
+   and when we get to a leaf, we scan through it linearly.  Interior nodes
+   contain pointers to 256 other nodes, keyed by the next byte of the address.
+   So for a 64-bit address like 0x1234567abcd, we would start at the root and go
+   down child[0x00]->child[0x00]->child[0x01]->child[0x23]->child[0x45] etc.,
+   until we hit a leaf.  (Nodes are, in general, leaves until they exceed the
+   default allocation of 16 elements, at which point they are converted to
+   an interior node if possible.)  This gives us near-constant lookup times;
+   the only thing that can be costly is if there are lots of overlapping ranges
+   within a single 256-byte segment of the binary, in which case we have to
+   scan through them all to find the best match.
+
+   For a binary with few ranges, we will in practice only have a single leaf node
+   at the root, containing a simple array.  Thus, the scheme is efficient for
+   both small and large binaries.
+ */
+
+/* Experiments have shown 16 to be a memory-efficient default leaf size.
+   The only case where a leaf will hold more memory than this is at the
+   bottommost level (covering 256 bytes in the binary), where we'll expand
+   the leaf to be able to hold more ranges if needed.
+ */
+#define TRIE_LEAF_SIZE 16
+
+/* All trie_node pointers will really be trie_leaf or trie_interior,
+   but they have this common head.  */
+struct trie_node
+{
+  /* If zero, we are an interior node.
+     Otherwise, how many ranges we have room for in this leaf.  */
+  int num_room_in_leaf;
+};
+
+struct trie_leaf
+{
+  struct trie_node head;
+  int num_stored_in_leaf;
+  struct {
+    const struct comp_unit *unit;
+    bfd_vma low_pc, high_pc;
+  } ranges[TRIE_LEAF_SIZE];
+};
+
+struct trie_interior
+{
+  struct trie_node head;
+  struct trie_node *children[256];
+};
+
+static struct trie_node *alloc_trie_leaf (bfd *abfd)
+{
+  struct trie_leaf *leaf =
+    (struct trie_leaf *) bfd_zalloc (abfd, sizeof (struct trie_leaf));
+  if (leaf != NULL)
+    leaf->head.num_room_in_leaf = TRIE_LEAF_SIZE;
+  return (struct trie_node *) leaf;
+}
+
 struct dwarf2_debug_file
 {
   /* The actual bfd from which debug info was loaded.  Might be
@@ -139,6 +209,9 @@ struct dwarf2_debug_file
   /* A list of all previously read comp_units.  */
   struct comp_unit *all_comp_units;
 
+  /* A list of all previously read comp_units with no ranges (yet).  */
+  struct comp_unit *all_comp_units_without_ranges;
+
   /* Last comp unit in list above.  */
   struct comp_unit *last_comp_unit;
 
@@ -147,6 +220,9 @@ struct dwarf2_debug_file
 
   /* Hash table to map offsets to decoded abbrevs.  */
   htab_t abbrev_offsets;
+
+  /* Root of a trie to map addresses to compilation units.  */
+  struct trie_node *trie_root;
 };
 
 struct dwarf2_debug
@@ -220,6 +296,11 @@ struct comp_unit
   /* Chain the previously read compilation units.  */
   struct comp_unit *next_unit;
 
+  /* Chain the previously read compilation units that have no ranges yet.
+     We scan these separately when we have a trie over the ranges.
+     Unused if arange.high != 0. */
+  struct comp_unit *next_unit_without_ranges;
+
   /* Likewise, chain the compilation unit read after this one.
      The comp units are stored in reversed reading order.  */
   struct comp_unit *prev_unit;
@@ -296,6 +377,10 @@ struct comp_unit
 
   /* TRUE if symbols are cached in hash table for faster lookup by name.  */
   bool cached;
+
+  /* Used when iterating over trie leaves to know which units we have
+     already seen in this iteration.  */
+  bool mark;
 };
 
 /* This data structure holds the information of an abbrev.  */
@@ -1767,9 +1852,183 @@ concat_filename (struct line_info_table *table, unsigned int file)
   return strdup (filename);
 }
 
+/* Number of bits in a bfd_vma.  */
+#define VMA_BITS (8 * sizeof (bfd_vma))
+
+/* Check whether [low1, high1) can be combined with [low2, high2),
+   i.e., they touch or overlap.  */
+static bool
+ranges_overlap (bfd_vma low1, bfd_vma high1, bfd_vma low2, bfd_vma high2)
+{
+  if (low1 == low2 || high1 == high2)
+    return true;
+
+  /* Sort so that low1 is below low2. */
+  if (low1 > low2)
+    {
+      bfd_vma tmp;
+
+      tmp = low1;
+      low1 = low2;
+      low2 = tmp;
+
+      tmp = high1;
+      high1 = high2;
+      high2 = tmp;
+    }
+
+  /* We touch iff low2 == high1.
+     We overlap iff low2 is within [low1, high1). */
+  return (low2 <= high1);
+}
+
+/* Insert an address range in the trie mapping addresses to compilation units.
+   Will return the new trie node (usually the same as the one sent in, but in
+   case of a leaf-to-interior conversion, or expansion of a leaf, it may be
+   different), or NULL on failure.  */
+static struct trie_node *
+insert_arange_in_trie (bfd *abfd,
+		       struct trie_node *trie,
+		       bfd_vma trie_pc,
+		       int trie_pc_bits,
+		       const struct comp_unit *unit,
+		       bfd_vma low_pc,
+		       bfd_vma high_pc)
+{
+  bfd_vma clamped_low_pc, clamped_high_pc;
+  int ch, from_ch, to_ch;
+  bool is_full_leaf = false;
+
+  /* See if we can extend any of the existing ranges.  This merging
+     isn't perfect (if merging opens up the possibility of merging two existing
+     ranges, we won't find them), but it catches the majority of cases.  */
+  if (trie->num_room_in_leaf > 0)
+    {
+      struct trie_leaf *leaf = (struct trie_leaf *) trie;
+      int i;
+
+      for (i = 0; i < leaf->num_stored_in_leaf; ++i)
+	{
+	  if (leaf->ranges[i].unit == unit
+	      && ranges_overlap (low_pc, high_pc,
+				 leaf->ranges[i].low_pc, leaf->ranges[i].high_pc))
+	    {
+	      if (low_pc < leaf->ranges[i].low_pc)
+		leaf->ranges[i].low_pc = low_pc;
+	      if (high_pc > leaf->ranges[i].high_pc)
+		leaf->ranges[i].high_pc = high_pc;
+	      return trie;
+	    }
+	}
+
+      is_full_leaf = leaf->num_stored_in_leaf == trie->num_room_in_leaf;
+    }
+
+  /* If we're a leaf with no more room and we're _not_ at the bottom,
+     convert to an interior node.  */
+  if (is_full_leaf && trie_pc_bits < (int)VMA_BITS)
+    {
+      const struct trie_leaf *leaf = (struct trie_leaf *) trie;
+      int i;
+
+      trie = (struct trie_node *) bfd_zalloc (abfd, sizeof (struct trie_interior));
+      if (!trie)
+	return NULL;
+      is_full_leaf = false;
+
+      /* TODO: If we wanted to save a little more memory at the cost of
+	 complexity, we could have reused the old leaf node as one of the
+	 children of the new interior node, instead of throwing it away.  */
+      for (i = 0; i < leaf->num_stored_in_leaf; ++i)
+        {
+	  if (!insert_arange_in_trie (abfd, trie, trie_pc, trie_pc_bits,
+				      leaf->ranges[i].unit, leaf->ranges[i].low_pc,
+				      leaf->ranges[i].high_pc))
+	    return NULL;
+	}
+    }
+
+  /* If we're a leaf with no more room and we _are_ at the bottom,
+     we have no choice but to just make it larger. */
+  if (is_full_leaf)
+    {
+      const struct trie_leaf *leaf = (struct trie_leaf *) trie;
+      int new_room_in_leaf = trie->num_room_in_leaf * 2;
+      struct trie_leaf *new_leaf;
+
+      new_leaf = (struct trie_leaf *) bfd_zalloc (abfd,
+	sizeof (struct trie_leaf) + (new_room_in_leaf - TRIE_LEAF_SIZE) * sizeof (leaf->ranges[0]));
+      if (!new_leaf)
+	return NULL;
+      new_leaf->head.num_room_in_leaf = new_room_in_leaf;
+      new_leaf->num_stored_in_leaf = leaf->num_stored_in_leaf;
+
+      memcpy (new_leaf->ranges,
+	      leaf->ranges,
+	      leaf->num_stored_in_leaf * sizeof (leaf->ranges[0]));
+      trie = (struct trie_node *) new_leaf;
+      is_full_leaf = false;
+
+      /* Now the insert below will go through.  */
+    }
+
+  /* If we're a leaf (now with room), we can just insert at the end.  */
+  if (trie->num_room_in_leaf > 0)
+    {
+      struct trie_leaf *leaf = (struct trie_leaf *) trie;
+
+      int i = leaf->num_stored_in_leaf++;
+      leaf->ranges[i].unit = unit;
+      leaf->ranges[i].low_pc = low_pc;
+      leaf->ranges[i].high_pc = high_pc;
+      return trie;
+    }
+
+  /* Now we are definitely an interior node, so recurse into all the relevant buckets.  */
+
+  /* Clamp the range to the current trie bucket.  */
+  clamped_low_pc = low_pc;
+  clamped_high_pc = high_pc;
+  if (trie_pc_bits > 0)
+    {
+      bfd_vma bucket_high_pc = trie_pc + ((bfd_vma)-1 >> trie_pc_bits);  /* Inclusive.  */
+      if (clamped_low_pc < trie_pc)
+	clamped_low_pc = trie_pc;
+      if (clamped_high_pc > bucket_high_pc)
+	clamped_high_pc = bucket_high_pc;
+    }
+
+  /* Insert the ranges in all buckets that it spans.  */
+  from_ch = (clamped_low_pc >> (VMA_BITS - trie_pc_bits - 8)) & 0xff;
+  to_ch = ((clamped_high_pc - 1) >> (VMA_BITS - trie_pc_bits - 8)) & 0xff;
+  for (ch = from_ch; ch <= to_ch; ++ch)
+    {
+      struct trie_interior *interior = (struct trie_interior *) trie;
+      struct trie_node *child = interior->children[ch];
+
+      if (child == NULL)
+        {
+	  child = alloc_trie_leaf (abfd);
+	  if (!child)
+	    return NULL;
+	}
+      child = insert_arange_in_trie (abfd,
+				     child,
+				     trie_pc + ((bfd_vma)ch << (VMA_BITS - trie_pc_bits - 8)),
+				     trie_pc_bits + 8,
+				     unit,
+				     low_pc,
+				     high_pc);
+      if (!child)
+	return NULL;
+
+      interior->children[ch] = child;
+    }
+
+  return trie;
+}
+
+
 static bool
 arange_add (const struct comp_unit *unit, struct arange *first_arange,
-	    bfd_vma low_pc, bfd_vma high_pc)
+	    struct trie_node **trie_root, bfd_vma low_pc, bfd_vma high_pc)
 {
   struct arange *arange;
 
@@ -1777,6 +2036,19 @@ arange_add (const struct comp_unit *unit, struct arange *first_arange,
   if (low_pc == high_pc)
     return true;
 
+  if (trie_root != NULL)
+    {
+      *trie_root = insert_arange_in_trie (unit->file->bfd_ptr,
+					  *trie_root,
+					  0,
+					  0,
+					  unit,
+					  low_pc,
+					  high_pc);
+      if (*trie_root == NULL)
+	return false;
+    }
+
   /* If the first arange is empty, use it.  */
   if (first_arange->high == 0)
     {
@@ -2411,7 +2683,7 @@ decode_line_info (struct comp_unit *unit)
 		    low_pc = address;
 		  if (address > high_pc)
 		    high_pc = address;
-		  if (!arange_add (unit, &unit->arange, low_pc, high_pc))
+		  if (!arange_add (unit, &unit->arange, &unit->file->trie_root, low_pc, high_pc))
 		    goto line_fail;
 		  break;
 		case DW_LNE_set_address:
@@ -3134,7 +3406,7 @@ find_abstract_instance (struct comp_unit *unit,
 
 static bool
 read_ranges (struct comp_unit *unit, struct arange *arange,
-	     bfd_uint64_t offset)
+	     struct trie_node **trie_root, bfd_uint64_t offset)
 {
   bfd_byte *ranges_ptr;
   bfd_byte *ranges_end;
@@ -3169,7 +3441,7 @@ read_ranges (struct comp_unit *unit, struct arange *arange,
 	base_address = high_pc;
       else
 	{
-	  if (!arange_add (unit, arange,
+	  if (!arange_add (unit, arange, trie_root,
 			   base_address + low_pc, base_address + high_pc))
 	    return false;
 	}
@@ -3179,7 +3451,7 @@ read_ranges (struct comp_unit *unit, struct arange *arange,
 
 static bool
 read_rnglists (struct comp_unit *unit, struct arange *arange,
-	       bfd_uint64_t offset)
+	       struct trie_node **trie_root, bfd_uint64_t offset)
 {
   bfd_byte *rngs_ptr;
   bfd_byte *rngs_end;
@@ -3253,19 +3525,19 @@ read_rnglists (struct comp_unit *unit, struct arange *arange,
 	  return false;
 	}
 
-      if (!arange_add (unit, arange, low_pc, high_pc))
+      if (!arange_add (unit, arange, trie_root, low_pc, high_pc))
 	return false;
     }
 }
 
 static bool
 read_rangelist (struct comp_unit *unit, struct arange *arange,
-		bfd_uint64_t offset)
+		struct trie_node **trie_root, bfd_uint64_t offset)
 {
   if (unit->version <= 4)
-    return read_ranges (unit, arange, offset);
+    return read_ranges (unit, arange, trie_root, offset);
   else
-    return read_rnglists (unit, arange, offset);
+    return read_rnglists (unit, arange, trie_root, offset);
 }
 
 static struct funcinfo *
@@ -3617,7 +3889,7 @@ scan_unit_for_symbols (struct comp_unit *unit)
 
 		case DW_AT_ranges:
 		  if (is_int_form (&attr)
-		      && !read_rangelist (unit, &func->arange, attr.u.val))
+		      && !read_rangelist (unit, &func->arange, &unit->file->trie_root, attr.u.val))
 		    goto fail;
 		  break;
 
@@ -3733,7 +4005,7 @@ scan_unit_for_symbols (struct comp_unit *unit)
 
       if (func && high_pc != 0)
 	{
-	  if (!arange_add (unit, &func->arange, low_pc, high_pc))
+	  if (!arange_add (unit, &func->arange, &unit->file->trie_root, low_pc, high_pc))
 	    goto fail;
 	}
     }
@@ -3931,7 +4203,7 @@ parse_comp_unit (struct dwarf2_debug *stash,
 
 	case DW_AT_ranges:
 	  if (is_int_form (&attr)
-	      && !read_rangelist (unit, &unit->arange, attr.u.val))
+	      && !read_rangelist (unit, &unit->arange, &unit->file->trie_root, attr.u.val))
 	    return NULL;
 	  break;
 
@@ -3973,7 +4245,7 @@ parse_comp_unit (struct dwarf2_debug *stash,
     high_pc += low_pc;
   if (high_pc != 0)
     {
-      if (!arange_add (unit, &unit->arange, low_pc, high_pc))
+      if (!arange_add (unit, &unit->arange, &unit->file->trie_root, low_pc, high_pc))
 	return NULL;
     }
 
@@ -4772,6 +5044,14 @@ _bfd_dwarf2_slurp_debug_info (bfd *abfd, bfd *debug_bfd,
   if (!stash->alt.abbrev_offsets)
     return false;
 
+  stash->f.trie_root = alloc_trie_leaf (abfd);
+  if (!stash->f.trie_root)
+    return false;
+
+  stash->alt.trie_root = alloc_trie_leaf (abfd);
+  if (!stash->alt.trie_root)
+    return false;
+
   *pinfo = stash;
 
   if (debug_bfd == NULL)
@@ -4943,6 +5223,12 @@ stash_comp_unit (struct dwarf2_debug *stash, struct dwarf2_debug_file *file)
 	  each->next_unit = file->all_comp_units;
 	  file->all_comp_units = each;
 
+	  if (each->arange.high == 0)
+	    {
+	      each->next_unit_without_ranges = file->all_comp_units_without_ranges;
+	      file->all_comp_units_without_ranges = each;
+	    }
+
 	  file->info_ptr += length;
 	  return each;
 	}
@@ -5185,17 +5471,67 @@ _bfd_dwarf2_find_nearest_line (bfd *abfd,
     }
   else
     {
-      for (each = stash->f.all_comp_units; each; each = each->next_unit)
+      struct trie_node *trie = stash->f.trie_root;
+      int bits = VMA_BITS - 8;
+      struct comp_unit **prev_each;
+
+      /* Traverse interior nodes until we get to a leaf.  */
+      while (trie && trie->num_room_in_leaf == 0)
 	{
-	  found = ((each->arange.high == 0
-		    || comp_unit_contains_address (each, addr))
-		   && comp_unit_find_nearest_line (each, addr,
+	  int ch = (addr >> bits) & 0xff;
+	  trie = ((struct trie_interior *) trie)->children[ch];
+	  bits -= 8;
+	}
+
+      if (trie)
+	{
+	  const struct trie_leaf *leaf = (struct trie_leaf *) trie;
+	  int i;
+
+	  for (i = 0; i < leaf->num_stored_in_leaf; ++i)
+	    {
+	      struct comp_unit *unit = (struct comp_unit *) leaf->ranges[i].unit;
+	      unit->mark = false;
+	    }
+
+	  for (i = 0; i < leaf->num_stored_in_leaf; ++i)
+	    {
+	      struct comp_unit *unit = (struct comp_unit *) leaf->ranges[i].unit;
+	      if (unit->mark
+		  || addr < leaf->ranges[i].low_pc
+		  || addr >= leaf->ranges[i].high_pc)
+	        continue;
+	      unit->mark = true;
+
+	      found = comp_unit_find_nearest_line (unit, addr,
 						   filename_ptr,
 						   &function,
 						   linenumber_ptr,
-						   discriminator_ptr));
+						   discriminator_ptr);
+	      if (found)
+		goto done;
+	   }
+	}
+
+      /* Also scan through all compilation units without any ranges,
+         taking them out of the list if they have acquired any since last time.  */
+      prev_each = &stash->f.all_comp_units_without_ranges;
+      for (each = *prev_each; each; each = each->next_unit_without_ranges)
+        {
+	  if (each->arange.high != 0)
+	    {
+	      *prev_each = each->next_unit_without_ranges;
+	      continue;
+	    }
+
+	  found = comp_unit_find_nearest_line (each, addr,
+					       filename_ptr,
+					       &function,
+					       linenumber_ptr,
+					       discriminator_ptr);
 	  if (found)
 	    goto done;
+	  prev_each = &each->next_unit_without_ranges;
 	}
     }
 
-- 
2.35.1


* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Jan Beulich @ 2022-04-08  8:05 UTC
  To: Steinar H. Gunderson; +Cc: sesse, binutils

On 04.04.2022 09:32, Steinar H. Gunderson via Binutils wrote:
> When using perf to profile large binaries, _bfd_dwarf2_find_nearest_line()
> becomes a hotspot, as perf wants to get line number information
> (for inline-detection purposes) for each and every sample. In Chromium
> in particular (the content_shell binary), this entails going through
> 475k address ranges, which takes a long time when done repeatedly.
> 
> Add a radix-256 trie over the address space to quickly map addresses to
> compilation units; for content_shell, which is 1.6 GB when some
> (but not full) debug information is turned on, we go from 6 ms to
> 0.006 ms (6 µs) for each lookup from address to compilation unit, a 1000x
> speedup.
> 
> There is a modest RAM increase of 180 MB in this binary (the existing
> linked list over ranges uses about 10 MB, and the entire perf job uses
> between 2 and 3 GB for a medium-size profile); for smaller binaries with
> few ranges, there should be hardly any extra RAM usage at all.

While the change looks okay to me in principle (a few comments further
down), I'm concerned about it being tailored to your use case: a
long-running process will surely benefit from the speedup. For a
short-running process (e.g. addr2line) this may be quite different, and
the increased memory demand and trie setup overhead may become dominant.
I guess the application using libbfd may want control over this.
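
As a sketch of what I mean -- note that no such API exists today, and
all names here are invented purely for illustration:

  /* Hypothetical opt-in hint; the default would keep today's
     behaviour.  */
  bfd_set_lookup_hint (abfd, BFD_LOOKUP_HINT_MANY_LINE_LOOKUPS);

perf would announce many lookups this way; a one-shot addr2line run
would not, and would then skip building the trie.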

> +/* All trie_node pointers will really be trie_leaf or trie_interior,
> +   but they have this common head.  */
> +struct trie_node
> +{
> +  /* If zero, we are an interior node.
> +     Otherwise, how many ranges we have room for in this leaf.  */
> +  int num_room_in_leaf;

Here and in several more places further down, I think it would be better
to use "unsigned int". That's imo a general pattern to follow when
values can't go negative. But then again I'm still quite new as a
general maintainer, so I may not know of (unwritten?) rules saying
otherwise.

> +};
> +
> +struct trie_leaf
> +{
> +  struct trie_node head;
> +  int num_stored_in_leaf;
> +  struct {
> +    const struct comp_unit *unit;
> +    bfd_vma low_pc, high_pc;
> +  } ranges[TRIE_LEAF_SIZE];
> +};
> +
> +struct trie_interior
> +{
> +  struct trie_node head;
> +  struct trie_node *children[256];
> +};
> +
> +static struct trie_node *alloc_trie_leaf (bfd *abfd)
> +{
> +  struct trie_leaf *leaf =
> +    (struct trie_leaf *) bfd_zalloc (abfd, sizeof (struct trie_leaf));

With C99 now being a requirement (and K&R long not having been supported)
I don't think you need a cast here (and elsewhere in similar cases).

> +  if (leaf != NULL)
> +    leaf->head.num_room_in_leaf = TRIE_LEAF_SIZE;
> +  return (struct trie_node *) leaf;

Furthermore, with casts being somewhat risky in general (and there not
being more fine-grained C++-like casts in C), I think it would be better
to use &leaf->head in such cases. For the opposite conversion Linux and
other projects have a container_of() construct - I couldn't find
anything similar in bfd or libiberty, but I wonder whether something
like that would be worthwhile introducing to increase type-safety.
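
Such a helper could be as simple as the following sketch (to be clear,
nothing like this exists in bfd or libiberty today; the name is made
up):

  #include <stddef.h>

  #define CONTAINER_OF(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof (type, member)))

  /* Recover the enclosing leaf from its embedded head, without an
     open-coded pointer cast:  */
  struct trie_leaf *leaf = CONTAINER_OF (node, struct trie_leaf, head);
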
There are several more casts throughout the patch, with one more flavor
to specifically call out:

> +      if (trie)
> +	{
> +	  const struct trie_leaf *leaf = (struct trie_leaf *) trie;
> +	  int i;
> +
> +	  for (i = 0; i < leaf->num_stored_in_leaf; ++i)
> +	    {
> +	      struct comp_unit *unit = (struct comp_unit *) leaf->ranges[i].unit;

Here you cast away constness. Imo such should only be done in very rare
cases. Avoiding such a cast is as simple as dropping the "const" from
the struct field declaration.

Finally one other concern: Recently a patch was contributed to make
libopcodes usable (for x86) from multi-threaded applications. I don't
know what the respective aims are with libbfd. If the library is meant
to be usable that way, then something would need to be done about races
in the trie accesses. I would hope e.g. Nick or Alan could help clarify
this.

Jan


* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Steinar H. Gunderson @ 2022-04-08  8:38 UTC
  To: Jan Beulich; +Cc: sesse, binutils

On Fri, Apr 08, 2022 at 10:05:55AM +0200, Jan Beulich wrote:
> While the change looks okay to me in principle (a few comments further
> down), I'm concerned of it being tailored to your use case: A long
> running process surely will benefit from the speedup. To a short
> running process (e.g. addr2line) this may be quite different, and the
> increased memory demand and trie setup overhead may become dominant. I
> guess there may want to be control over this by the application using
> libbfd.

Well, if you only run it once, the memory usage won't really be that
high either, as it stops as soon as it finds an appropriate compilation
unit (it inserts incrementally into the tree). So it scales pretty well
downwards as well, just like the existing code does. But that aside,
is it a goal to have a non-long-running addr2line use as little peak
memory as possible?

An alternative would be something like only building the trie after the
first 100 lookups or similar. It should be fairly non-intrusive, but it
would require us to keep more of the old paths around.

> Here and in several more places further down, I think it would be better
> to use "unsigned int". That's imo a general pattern to follow when
> values can't go negative. But then again I'm still quite new as a
> general maintainer, so I may not know of (unwritten?) rules saying
> otherwise.

I'm fine with anything as long as there is some rule. :-) The rest of
the file appeared to me to use mostly int, so I was trying to follow that.

>> +static struct trie_node *alloc_trie_leaf (bfd *abfd)
>> +{
>> +  struct trie_leaf *leaf =
>> +    (struct trie_leaf *) bfd_zalloc (abfd, sizeof (struct trie_leaf));
> With C99 now being a requirement (and K&R long not having been supported)
> I don't think you need a cast here (and elsewhere in similar cases).

Again, this is just following the rest of the style (and implicit casts
from void* have not changed in C99, from what I know?). Do you want me
to diverge? Change the existing calls?

> Furthermore, with casts being somewhat risky in general (and there not
> being more fine-grained C++-like casts in C), I think it would be better
> to use &leaf->head in such cases.

Sure, I can change that.

>> +      if (trie)
>> +	{
>> +	  const struct trie_leaf *leaf = (struct trie_leaf *) trie;
>> +	  int i;
>> +
>> +	  for (i = 0; i < leaf->num_stored_in_leaf; ++i)
>> +	    {
>> +	      struct comp_unit *unit = (struct comp_unit *) leaf->ranges[i].unit;
> Here you cast away constness. Imo such should only be done in very rare
> cases. Avoiding such a cast is as simple as dropping the "const" from
> the struct field declaration.

The problem is that when we assign it to that struct field, we have it
only as a const. So we need to either const-cast here, at that point,
or all the way up to arange_add().
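
Concretely, I understand the suggested alternative to be something like
this sketch:

  struct {
    struct comp_unit *unit;  /* no longer const */
    bfd_vma low_pc, high_pc;
  } ranges[TRIE_LEAF_SIZE];

plus arange_add() and insert_arange_in_trie() taking a plain
struct comp_unit * instead of a const one.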

> Finally one other concern: Recently a patch was contributed to make
> libopcodes usable (for x86) from multi-threaded applications. I don't
> know what the respective aims are with libbfd. If the library is meant
> to be usable that way, then something would need to be done about races
> in the trie accesses. I would hope e.g. Nick or Alan could help clarify
> this.

The code already isn't thread-safe, though. It liberally reads in new
debug information from compilation units and inserts into various lists
as it goes. My patch does not really change anything here.

/* Steinar */

* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Jan Beulich @ 2022-04-08  8:50 UTC
  To: Steinar H. Gunderson; +Cc: sesse, binutils

On 08.04.2022 10:38, Steinar H. Gunderson wrote:
> On Fri, Apr 08, 2022 at 10:05:55AM +0200, Jan Beulich wrote:
>> While the change looks okay to me in principle (a few comments further
>> down), I'm concerned about it being tailored to your use case: a
>> long-running process will surely benefit from the speedup. For a
>> short-running process (e.g. addr2line) this may be quite different, and
>> the increased memory demand and trie setup overhead may become dominant.
>> I guess the application using libbfd may want control over this.
> 
> Well, if you only run it once, the memory usage won't really be that
> high either, as it stops as soon as it finds an appropriate compilation
> unit (it inserts incrementally into the tree). So it scales pretty well
> downwards as well, just like the existing code does. But that aside,
> is it a goal to have a non-long-running addr2line use as little peak
> memory as possible?

If we were talking of just a few kb, I think it wouldn't matter. But
for large binaries I think the peak could be quite a bit higher with
your change in place.

> An alternative would be something like only building the trie after the
> first 100 lookups or similar. It should be fairly non-intrusive, but it
> would require us to keep more of the old paths around.

Some heuristic like this may also do, yes. Nevertheless I think it
would be better for the application to provide an indication. After
all, if addr2line was called with dozens of addresses, it might
benefit as well (so such a heuristic would better live there than
in the library).

>> Here and in several more places further down, I think it would be better
>> to use "unsigned int". That's imo a general pattern to follow when
>> values can't go negative. But then again I'm still quite new as a
>> general maintainer, so I may not know of (unwritten?) rules saying
>> otherwise.
> 
> I'm fine with anything as long as there is some rule. :-) The rest of
> the file appeared to me to use mostly int, so I was trying to follow that.

Well, yes - I can only defer to other maintainers in this regard.
My general view (shared in other projects) is that it's better to
stop following a bad habit in new code, even if pre-existing code
isn't immediately converted. Usually such is done over time, as
code is being touched anyway for other reasons.

>>> +static struct trie_node *alloc_trie_leaf (bfd *abfd)
>>> +{
>>> +  struct trie_leaf *leaf =
>>> +    (struct trie_leaf *) bfd_zalloc (abfd, sizeof (struct trie_leaf));
>> With C99 now being a requirement (and K&R long not having been supported)
>> I don't think you need a cast here (and elsewhere in similar cases).
> 
> Again, this is just following the rest of the style (and implicit casts
> from void* have not changed in C99, from what I know?).

They haven't changed in C99, but apparently there were very old
compilers (hence my reference to K&R) which issued diagnostics for
such conversions.

> Do you want me
> to diverge? Change the existing calls?

As per above - if you want to change existing uses, that would
certainly be nice, but I don't view such as a prereq to breaking
with a specific habit in new code.

>>> +      if (trie)
>>> +	{
>>> +	  const struct trie_leaf *leaf = (struct trie_leaf *) trie;
>>> +	  int i;
>>> +
>>> +	  for (i = 0; i < leaf->num_stored_in_leaf; ++i)
>>> +	    {
>>> +	      struct comp_unit *unit = (struct comp_unit *) leaf->ranges[i].unit;
>> Here you cast away constness. Imo such should only be done in very rare
>> cases. Avoiding such a cast is as simple as dropping the "const" from
>> the struct field declaration.
> 
> The problem is that when we assign it to that struct field, we have it
> only as a const. So we need to either const-cast here, at that point,
> or all the way up to arange_add().

My personal view is that properly dropping const wherever necessary is
preferable over adding such casts.

Jan


* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Alan Modra @ 2022-04-08 10:38 UTC
  To: Jan Beulich; +Cc: Steinar H. Gunderson, binutils, sesse

On Fri, Apr 08, 2022 at 10:05:55AM +0200, Jan Beulich via Binutils wrote:
> On 04.04.2022 09:32, Steinar H. Gunderson via Binutils wrote:
> > +  if (leaf != NULL)
> > +    leaf->head.num_room_in_leaf = TRIE_LEAF_SIZE;
> > +  return (struct trie_node *) leaf;
> 
> Furthermore, with casts being somewhat risky in general (and there not
> being more fine-grained C++-like casts in C), I think it would be better
> to use &leaf->head in such cases.

Except that if leaf can be NULL then &leaf->head is undefined
according to the C standard.  ubsan will complain.  I dislike casts
too, but this is one annoying case where C requires one.
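
To make the failure mode concrete (sketch):

  struct trie_leaf *leaf = bfd_zalloc (abfd, sizeof (struct trie_leaf));
  return &leaf->head;  /* If leaf is NULL, evaluating leaf->head is
			  already undefined behaviour, even though only
			  the address is taken.  */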

-- 
Alan Modra
Australia Development Lab, IBM

* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Jan Beulich @ 2022-04-08 10:43 UTC
  To: Alan Modra; +Cc: Steinar H. Gunderson, binutils, sesse

On 08.04.2022 12:38, Alan Modra wrote:
> On Fri, Apr 08, 2022 at 10:05:55AM +0200, Jan Beulich via Binutils wrote:
>> On 04.04.2022 09:32, Steinar H. Gunderson via Binutils wrote:
>>> +  if (leaf != NULL)
>>> +    leaf->head.num_room_in_leaf = TRIE_LEAF_SIZE;
>>> +  return (struct trie_node *) leaf;
>>
>> Furthermore, with casts being somewhat risky in general (and there not
>> being more fine-grained C++-like casts in C), I think it would be better
>> to use &leaf->head in such cases.
> 
> Except that if leaf can be NULL then &leaf->head is undefined
> according to the C standard.  ubsan will complain.  I dislike casts
> too, but this is one annoying case where C requires one.

  if (leaf == NULL)
    return NULL;
  leaf->head.num_room_in_leaf = TRIE_LEAF_SIZE;
  return &leaf->head;

?

Jan


* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Steinar H. Gunderson @ 2022-04-08 11:03 UTC
  To: Jan Beulich; +Cc: sesse, binutils

On Fri, Apr 08, 2022 at 10:50:37AM +0200, Jan Beulich wrote:
>> Well, if you only run it once, the memory usage won't really be that
>> high either, as it stops as soon as it finds an appropriate compilation
>> unit (it inserts incrementally into the tree). So it scales pretty well
>> downwards as well, just like the existing code does. But that aside,
>> is it a goal to have a non-long-running addr2line use as little peak
>> memory as possible?
> If we were talking of just a few kb, I think it wouldn't matter. But
> for large binaries I think the peak could be quite a bit higher with
> your change in place.

Let's try to quantify it a bit. Here's content_shell from chromium,
running addr2line over a single address and then immediately exiting.
Before this change:

  calls to allocation functions: 172864 (406738/s)
  temporary memory allocations: 62124 (146174/s)
  peak heap memory consumption: 1.46G
  peak RSS (including heaptrack overhead): 1.47G
  total memory leaked: 5.60K

After:

  calls to allocation functions: 193686 (412097/s)
  temporary memory allocations: 62124 (132178/s)
  peak heap memory consumption: 1.54G
  peak RSS (including heaptrack overhead): 1.55G
  total memory leaked: 5.60K

So it's up about 5% in terms of RSS (this doesn't count the OS cache
needed to read in the binary, if any). Let's look at the case where I
give it an address that doesn't correspond to a source line at all, i.e.,
it has to consume the entire binary. Before:

  calls to allocation functions: 38786514 (1640368/s)
  temporary memory allocations: 6312506 (266970/s)
  peak heap memory consumption: 7.68G
  peak RSS (including heaptrack overhead): 8.18G
  total memory leaked: 8.53M

After:

  calls to allocation functions: 38835333 (1516886/s)
  temporary memory allocations: 6311800 (246535/s)
  peak heap memory consumption: 7.88G
  peak RSS (including heaptrack overhead): 8.37G
  total memory leaked: 8.53M

So that's a bit over 2% in terms of RSS. And this is memory that goes
away immediately after addr2line exits, i.e. the number of byte-seconds is
going to be low. (And you could argue many other things are more
wasteful; e.g. we spend 1.5 GB of that RSS holding tons and tons of
copies of the exact same filenames, from concat_filename().)

For a smaller binary like /bin/ls, we go from 5.43 MB to 5.46 MB
(for reading all of it).

>> An alternative would be something like only building the trie after the
>> first 100 lookups or similar. It should be fairly non-intrusive, but it
>> would require us to keep more of the old paths around.
> Some heuristic like this may also do, yes. Nevertheless I think it
> would be better for the application to provide an indication. After
> all, if addr2line was called with dozens of addresses, it might
> benefit as well (so such a heuristic would better live there than
> in the library).

I'm not too sold on moving the heuristics into the caller, for the
simple reason that there are a lot of potential callers. This means we'd
first need to define a new API, get a new binutils version out, go hunt
all relevant callers that could ever be interested in looking up debug
lines, and convince each of their maintainers to add a check for the new
binutils version to call that API (plus heuristics). All to get behavior
of “reasonable performance for large binaries”, which I believe is
something most of them already expected. :-)

If you put the heuristic in the library, addr2line would automatically
benefit without changing it. There's already precedent for this kind of
heuristic in libbfd as far as I can see, pretty much in the same area of
code:

  /* Number of times find_line is called.  This is used in
     the heuristic for enabling the info hash tables.  */
  int info_hash_count;

  #define STASH_INFO_HASH_TRIGGER    100
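
A trie trigger along the same lines could look like this -- sketch
only, the counter field and the build helper are hypothetical:

  #define STASH_TRIE_TRIGGER    100

  /* In find_nearest_line: keep using the linear scan until the
     workload looks lookup-heavy, then build the trie once.  */
  if (++stash->trie_lookup_count == STASH_TRIE_TRIGGER
      && !build_trie_from_aranges (stash))
    return false;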

> They haven't changed in C99, but apparently there were very old
> compilers (hence my reference to K&R) which issued diagnostics for
> such conversions.

GCC has an optional warning for it, FWIW.

/* Steinar */

* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Alan Modra @ 2022-04-08 12:10 UTC
  To: Jan Beulich; +Cc: Steinar H. Gunderson, binutils, sesse

On Fri, Apr 08, 2022 at 12:43:14PM +0200, Jan Beulich wrote:
> On 08.04.2022 12:38, Alan Modra wrote:
> > On Fri, Apr 08, 2022 at 10:05:55AM +0200, Jan Beulich via Binutils wrote:
> >> On 04.04.2022 09:32, Steinar H. Gunderson via Binutils wrote:
> >>> +  if (leaf != NULL)
> >>> +    leaf->head.num_room_in_leaf = TRIE_LEAF_SIZE;
> >>> +  return (struct trie_node *) leaf;
> >>
> >> Furthermore, with casts being somewhat risky in general (and there not
> >> being more fine-grained C++-like casts in C), I think it would be better
> >> to use &leaf->head in such cases.
> > 
> > Except that if leaf can be NULL then &leaf->head is undefined
> > according to the C standard.  ubsan will complain.  I dislike casts
> > too, but this is one annoying case where C requires one.
> 
>   if (leaf == NULL)
>     return NULL;
>   leaf->head.num_room_in_leaf = TRIE_LEAF_SIZE;
>   return &leaf->head;

Yes, that's fine.

  return leaf ? &leaf->head : NULL;

works too but I find it a little awkward.

BTW, a cast to the proper type is no better really than
  return (void *) leaf;
Minimal typing and makes C++ heads explode.  :-)

I'm not trying to lay down style rules here.  Anything that avoids
ubsan errors is OK as far as I'm concerned.

-- 
Alan Modra
Australia Development Lab, IBM

* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Jan Beulich @ 2022-04-08 12:18 UTC
  To: Steinar H. Gunderson; +Cc: sesse, binutils

On 08.04.2022 13:03, Steinar H. Gunderson wrote:
> On Fri, Apr 08, 2022 at 10:50:37AM +0200, Jan Beulich wrote:
>>> Well, if you only run it once, the memory usage won't really be that
>>> high either, as it stops as soon as it finds an appropriate compilation
>>> unit (it inserts incrementally into the tree). So it scales pretty well
>>> downwards as well, just like the existing code does. But that aside,
>>> is it a goal to have a non-long-running addr2line use as little peak
>>> memory as possible?
>> If we were talking of just a few kb, I think it wouldn't matter. But
>> for large binaries I think the peak could be quite a bit higher with
>> your change in place.
> 
> Let's try to quantify it a bit. Here's content_shell from chromium,
> running addr2line over a single address and then immediately exiting.
> Before this change:
> 
>   calls to allocation functions: 172864 (406738/s)
>   temporary memory allocations: 62124 (146174/s)
>   peak heap memory consumption: 1.46G
>   peak RSS (including heaptrack overhead): 1.47G
>   total memory leaked: 5.60K
> 
> After:
> 
>   calls to allocation functions: 193686 (412097/s)
>   temporary memory allocations: 62124 (132178/s)
>   peak heap memory consumption: 1.54G
>   peak RSS (including heaptrack overhead): 1.55G
>   total memory leaked: 5.60K
> 
> So it's up about 5% in terms of RSS (this doesn't count the OS cache
> needed to read in the binary, if any). Let's look at the case where I
> give it an address that doesn't correspond to a source line at all, i.e.,
> it has to consume the entire binary. Before:
> 
>   calls to allocation functions: 38786514 (1640368/s)
>   temporary memory allocations: 6312506 (266970/s)
>   peak heap memory consumption: 7.68G
>   peak RSS (including heaptrack overhead): 8.18G
>   total memory leaked: 8.53M
> 
> After:
> 
>   calls to allocation functions: 38835333 (1516886/s)
>   temporary memory allocations: 6311800 (246535/s)
>   peak heap memory consumption: 7.88G
>   peak RSS (including heaptrack overhead): 8.37G
>   total memory leaked: 8.53M
> 
> So that's a bit over 2% in terms of RSS. And this is memory that goes
> away immediately after addr2line exits, i.e. the number of byte-seconds is
> going to be low. (And you could argue many other things are more
> wasteful; e.g. we spend 1.5 GB of that RSS holding tons and tons of
> copies of the exact same filenames, from concat_filename().)
> 
> For a smaller binary like /bin/ls, we go from 5.43 MB to 5.46 MB
> (for reading all of it).

Hmm, okay, that's not all that much of an increase. I withdraw my
respective request.

Jan


* Re: [PATCH v2] Add a trie to map quickly from address range to compilation unit.
From: Steinar H. Gunderson @ 2022-04-08 13:57 UTC
  To: Jan Beulich; +Cc: sesse, binutils

On Fri, Apr 08, 2022 at 02:18:54PM +0200, Jan Beulich wrote:
> Hmm, okay, that's not all that much of an increase. I withdraw my
> respective request.

Great; I'll send a v3 with the other things you wanted fixed
(unsigned, const, cast changes).

/* Steinar */
