public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH v2 2/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader
@ 2020-05-19 13:25 Chung-Lin Tang
  2020-06-29 21:29 ` Carlos O'Donell
  0 siblings, 1 reply; 3+ messages in thread
From: Chung-Lin Tang @ 2020-05-19 13:25 UTC (permalink / raw)
  To: GNU C Library, Carlos O'Donell; +Cc: Catherine Moore

[-- Attachment #1: Type: text/plain, Size: 3033 bytes --]

Hi Carlos,
This part is the actual code changes in elf/, changes since the last version[1] are:

1. As you suggested, the use of a glibc tunable (glibc.rtld.dynamic_sort)
to allow specifying which sorting algorithm to use. "1" for the current algorithm,
"2" for the new DFS based one. The default is 1 to be conservative.

I tried a few styles on how this should be best written, and settled for a
static function pointer set on the first run of _dl_sort_maps(). I'm open to
better suggestions.

If glibc is configured with --enable-tunables=no, the behavior is to only use
the current old algorithm.

2. Addition of l_map_used/l_map_done bitfields in struct link_map, to be used
in _dl_close_worker(), instead of overloading on l_reserved. This has the benefit
of not needing to clean up at the end, saving a few CPU cycles. (BTW, 'l_used' was
already taken, thus named l_map_*)

3. Now _dl_sort_maps() has the used[] parameter removed, but due to many of the
current sorting call-sites skipping the first map, while the new algorithm does
not want this kind of +1/-1, a new 'unsigned int skip' argument is added. This is
due to both algorithms existing together as an option, and it is a little unclear
whether removing "skip-the-first" is detrimental to the current algorithm in practice
(and I didn't want to regress it when it is still default), so doing this to be
conservative.

There was a small change that removed a redundant call of _dl_sort_maps in dl-open.c,
which was already accomplished during this time by the fix for #16272, thus of course
dropped in this patch.

As for the issue of "interleaved scopes" in my algorithm, actually, when I tried
comparing the LD_DEBUG output of both algorithms, the scope list appears to be the
same, i.e. both have libc.so.6 and ld.so interleaved in the sequence, and the sequences
appears to be the same. This is quite easy to test now, using GLIBC_TUNABLES=glibc.rtld.dynamic_sort=[12]
to switch in between.

I hope this is still in time to be in 2.32, as the non-default option.

Thanks,
Chung-Lin

[1] https://sourceware.org/pipermail/libc-alpha/2019-July/105200.html

2020-05-19  Chung-Lin Tang  <cltang@codesourcery.com>

	[BZ #17645]
	[BZ #15311]
	[BZ #15310]
	* elf/dl-close.c (_dl_close_worker): Remove used[], done[] char arrays,
	use l_map_used/l_map_done bitfields instead. Adjust call to _dl_sort_maps.
	* elf/dl-deps.c (_dl_map_object_deps): Adjust call to _dl_sort_maps.
	* elf/dl-fini.c (_dl_fini): Likewise.
	* elf/dl-sort_maps.c
	(_dl_sort_maps_original): Adapted from original _dl_sort_maps.
	(dfs_traversal): New static function.
	(_dl_sort_maps_dfs): New Reverse-Postorder (RPO) based implementation.
	(_dl_sort_maps): Select and call either _dl_sort_maps_original or
	_dl_sort_maps_dfs based on glibc.rtld.dynamic_sort.
	* elf/dl-tunables.list (rtld): New namespace with tunable 'dynamic_sort'.
	* include/link.h (struct link_map): Add l_visited:1, l_map_used:1, and
	l_map_done:1 bitfields.
	* sysdeps/generic/ldsodefs.h (_dl_sort_maps): Adjust declaration.

[-- Attachment #2: dlsort-02-changes-v2.patch --]
[-- Type: text/plain, Size: 13840 bytes --]

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 73b2817bbf..3d52126200 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -164,8 +164,6 @@ _dl_close_worker (struct link_map *map, bool force)
 
   bool any_tls = false;
   const unsigned int nloaded = ns->_ns_nloaded;
-  char used[nloaded];
-  char done[nloaded];
   struct link_map *maps[nloaded];
 
   /* Run over the list and assign indexes to the link maps and enter
@@ -173,24 +171,21 @@ _dl_close_worker (struct link_map *map, bool force)
   int idx = 0;
   for (struct link_map *l = ns->_ns_loaded; l != NULL; l = l->l_next)
     {
+      l->l_map_used = 0;
+      l->l_map_done = 0;
       l->l_idx = idx;
       maps[idx] = l;
       ++idx;
-
     }
   assert (idx == nloaded);
 
-  /* Prepare the bitmaps.  */
-  memset (used, '\0', sizeof (used));
-  memset (done, '\0', sizeof (done));
-
   /* Keep track of the lowest index link map we have covered already.  */
   int done_index = -1;
   while (++done_index < nloaded)
     {
       struct link_map *l = maps[done_index];
 
-      if (done[done_index])
+      if (l->l_map_done)
 	/* Already handled.  */
 	continue;
 
@@ -201,12 +196,12 @@ _dl_close_worker (struct link_map *map, bool force)
 	  /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
 	     acquire is sufficient and correct.  */
 	  && atomic_load_acquire (&l->l_tls_dtor_count) == 0
-	  && !used[done_index])
+	  && !l->l_map_used)
 	continue;
 
       /* We need this object and we handle it now.  */
-      done[done_index] = 1;
-      used[done_index] = 1;
+      l->l_map_used = 1;
+      l->l_map_done = 1;
       /* Signal the object is still needed.  */
       l->l_idx = IDX_STILL_USED;
 
@@ -222,9 +217,9 @@ _dl_close_worker (struct link_map *map, bool force)
 		{
 		  assert ((*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded);
 
-		  if (!used[(*lp)->l_idx])
+		  if (!(*lp)->l_map_used)
 		    {
-		      used[(*lp)->l_idx] = 1;
+		      (*lp)->l_map_used = 1;
 		      /* If we marked a new object as used, and we've
 			 already processed it, then we need to go back
 			 and process again from that point forward to
@@ -247,9 +242,9 @@ _dl_close_worker (struct link_map *map, bool force)
 	      {
 		assert (jmap->l_idx >= 0 && jmap->l_idx < nloaded);
 
-		if (!used[jmap->l_idx])
+		if (!jmap->l_map_used)
 		  {
-		    used[jmap->l_idx] = 1;
+		    jmap->l_map_used = 1;
 		    if (jmap->l_idx - 1 < done_index)
 		      done_index = jmap->l_idx - 1;
 		  }
@@ -259,8 +254,7 @@ _dl_close_worker (struct link_map *map, bool force)
 
   /* Sort the entries.  We can skip looking for the binary itself which is
      at the front of the search list for the main namespace.  */
-  _dl_sort_maps (maps + (nsid == LM_ID_BASE), nloaded - (nsid == LM_ID_BASE),
-		 used + (nsid == LM_ID_BASE), true);
+  _dl_sort_maps (maps, nloaded, (nsid == LM_ID_BASE), true);
 
   /* Call all termination functions at once.  */
 #ifdef SHARED
@@ -277,7 +271,7 @@ _dl_close_worker (struct link_map *map, bool force)
       /* All elements must be in the same namespace.  */
       assert (imap->l_ns == nsid);
 
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded && !imap->l_nodelete_active);
 
@@ -330,7 +324,7 @@ _dl_close_worker (struct link_map *map, bool force)
 	  if (i < first_loaded)
 	    first_loaded = i;
 	}
-      /* Else used[i].  */
+      /* Else imap->l_map_used.  */
       else if (imap->l_type == lt_loaded)
 	{
 	  struct r_scope_elem *new_list = NULL;
@@ -554,7 +548,7 @@ _dl_close_worker (struct link_map *map, bool force)
   for (unsigned int i = first_loaded; i < nloaded; ++i)
     {
       struct link_map *imap = maps[i];
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded);
 
diff --git a/elf/dl-deps.c b/elf/dl-deps.c
index b5a43232a7..0851971dca 100644
--- a/elf/dl-deps.c
+++ b/elf/dl-deps.c
@@ -611,7 +611,7 @@ Filters not supported with LD_TRACE_PRELINKING"));
     memcpy (l_initfini, map->l_searchlist.r_list,
 	    nlist * sizeof (struct link_map *));
 
-  _dl_sort_maps (&l_initfini[1], nlist - 1, NULL, false);
+  _dl_sort_maps (l_initfini, nlist, 1, false);
 
   /* Terminate the list of dependencies.  */
   l_initfini[nlist] = NULL;
diff --git a/elf/dl-fini.c b/elf/dl-fini.c
index 231db3d66d..5b1bf6371c 100644
--- a/elf/dl-fini.c
+++ b/elf/dl-fini.c
@@ -92,8 +92,7 @@ _dl_fini (void)
 	  /* Now we have to do the sorting.  We can skip looking for the
 	     binary itself which is at the front of the search list for
 	     the main namespace.  */
-	  _dl_sort_maps (maps + (ns == LM_ID_BASE), nmaps - (ns == LM_ID_BASE),
-			 NULL, true);
+	  _dl_sort_maps (maps, nmaps, (ns == LM_ID_BASE), true);
 
 	  /* We do not rely on the linked list of loaded object anymore
 	     from this point on.  We have our own list here (maps).  The
diff --git a/elf/dl-sort-maps.c b/elf/dl-sort-maps.c
index 86f1e23c85..7d143cc487 100644
--- a/elf/dl-sort-maps.c
+++ b/elf/dl-sort-maps.c
@@ -16,16 +16,24 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#include <assert.h>
 #include <ldsodefs.h>
+#include <elf/dl-tunables.h>
 
+/* Note: this is the older, "original" sorting algorithm, being used as
+   default up to 2.32.
 
-/* Sort array MAPS according to dependencies of the contained objects.
-   Array USED, if non-NULL, is permutated along MAPS.  If FOR_FINI this is
-   called for finishing an object.  */
-void
-_dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
-	       bool for_fini)
+   Sort array MAPS according to dependencies of the contained objects.
+   If FOR_FINI is true, this is called for finishing an object.  */
+static void
+_dl_sort_maps_original (struct link_map **maps, unsigned int nmaps,
+			unsigned int skip, bool for_fini)
 {
+  /* Allows caller to do the common optimization of skipping the first map,
+     usually the main binary.  */
+  maps += skip;
+  nmaps -= skip;
+
   /* A list of one element need not be sorted.  */
   if (nmaps <= 1)
     return;
@@ -66,14 +74,6 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
 			   (k - i) * sizeof (maps[0]));
 		  maps[k] = thisp;
 
-		  if (used != NULL)
-		    {
-		      char here_used = used[i];
-		      memmove (&used[i], &used[i + 1],
-			       (k - i) * sizeof (used[0]));
-		      used[k] = here_used;
-		    }
-
 		  if (seen[i + 1] > nmaps - i)
 		    {
 		      ++i;
@@ -120,3 +120,160 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
     next:;
     }
 }
+
+#if !HAVE_TUNABLES
+/* In this case, just default to the original algorithm.  */
+strong_alias (_dl_sort_maps_original, _dl_sort_maps);
+#else
+
+/* We use a recursive function due to its better clarity and ease of
+   implementation, as well as faster execution speed. We already use
+   alloca() for list allocation during the breadth-first search of
+   dependencies in _dl_map_object_deps(), and this should be on the
+   same order of worst-case stack usage.  */
+
+static void
+dfs_traversal (struct link_map ***rpo, struct link_map *map,
+	       bool *do_reldeps)
+{
+  if (map->l_visited)
+    return;
+
+  map->l_visited = 1;
+
+  if (map->l_initfini)
+    {
+      for (int i = 0; map->l_initfini[i] != NULL; i++)
+	{
+	  struct link_map *dep = map->l_initfini[i];
+	  if (dep->l_visited == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  if (__glibc_unlikely (do_reldeps != NULL && map->l_reldeps != NULL))
+    {
+      /* Indicate that we encountered relocation dependencies during
+	 traversal.  */
+      *do_reldeps = true;
+
+      for (int m = map->l_reldeps->act - 1; m >= 0; m--)
+	{
+	  struct link_map *dep = map->l_reldeps->list[m];
+	  if (dep->l_visited == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  *rpo -= 1;
+  **rpo = map;
+}
+
+/* Topologically sort array MAPS according to dependencies of the contained
+   objects.  */
+
+static void
+_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
+		   unsigned int skip __attribute__ ((unused)), bool for_fini)
+{
+  for (int i = 0; i < nmaps; i++)
+    maps[i]->l_visited = 0;
+
+  /* We apply DFS traversal for each of maps[i] until the whole total order
+     is found and we're at the start of the Reverse-Postorder (RPO) sequence,
+     which is a topological sort.
+
+     We go from maps[nmaps - 1] backwards towards maps[0] at this level.
+     Due to the breadth-first search (BFS) ordering we receive, going
+     backwards usually gives a more shallow depth-first recursion depth,
+     adding more stack usage safety. Also, combined with the natural
+     processing order of l_initfini[] at each node during DFS, this maintains
+     an ordering closer to the original link ordering in the sorting results
+     under most simpler cases.
+
+     Another reason we order the top level backwards, it that maps[0] is
+     usually exactly the main object of which we're in the midst of
+     _dl_map_object_deps() processing, and maps[0]->l_initfini[] is still
+     blank. If we start the traversal from maps[0], since having no
+     dependencies yet filled in, maps[0] will always be immediately
+     incorrectly placed at the last place in the order (first in reverse).
+     Adjusting the order so that maps[0] is last traversed naturally avoids
+     this problem.
+
+     Further, the old "optimization" of skipping the main object at maps[0]
+     from the call-site (i.e. _dl_sort_maps(maps+1,nmaps-1)) is in general
+     no longer valid, since traversing along object dependency-links
+     may "find" the main object even when it is not included in the initial
+     order (e.g. a dlopen()'ed shared object can have circular dependencies
+     linked back to itself). In such a case, traversing N-1 objects will
+     create a N-object result, and raise problems.
+
+     To summarize, just passing in the full list, and iterating from back
+     to front makes things much more straightforward.  */
+
+  struct link_map *rpo[nmaps];
+  struct link_map **rpo_head = &rpo[nmaps];
+
+  bool do_reldeps = false;
+  bool *do_reldeps_ref = (for_fini ? &do_reldeps : NULL);
+
+  for (int i = nmaps - 1; i >= 0; i--)
+    {
+      dfs_traversal (&rpo_head, maps[i], do_reldeps_ref);
+
+      /* We can break early if all objects are already placed.  */
+      if (rpo_head == rpo)
+	goto end;
+    }
+  assert (rpo_head == rpo);
+
+ end:
+  /* This is avoided if !FOR_FINI or if we didn't find any reldeps in
+     the first DFS traversal.  */
+  if (do_reldeps)
+    {
+      for (int i = 0; i < nmaps; i++)
+	rpo[i]->l_visited = 0;
+
+      struct link_map **maps_head = &maps[nmaps];
+      for (int i = nmaps - 1; i >= 0; i--)
+	{
+	  dfs_traversal (&maps_head, rpo[i], NULL);
+
+	  /* We can break early if all objects are already placed.
+	     The below memcpy is not needed in the do_reldeps case here,
+	     since we wrote back to maps[] during DFS traversal.  */
+	  if (maps_head == maps)
+	    return;
+	}
+      assert (maps_head == maps);
+      return;
+    }
+
+  memcpy (maps, rpo, sizeof (struct link_map *) * nmaps);
+}
+
+void
+_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
+	       unsigned int skip, bool for_fini)
+{
+  /* Function pointer to sorting algorithm currently in use.  */
+  static void (*sort_maps) (struct link_map **, unsigned int,
+			    unsigned int, bool) = NULL;
+  if (__glibc_unlikely (sort_maps == NULL))
+    {
+      /* Select sorting algorithm based on tunable value.  */
+      int32_t algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
+				       int32_t, NULL);
+      if (__glibc_likely (algorithm == 1))
+	sort_maps = _dl_sort_maps_original;
+      else if (algorithm == 2)
+	sort_maps = _dl_sort_maps_dfs;
+      else
+	__builtin_unreachable ();
+    }
+
+  sort_maps (maps, nmaps, skip, for_fini);
+}
+
+#endif /* HAVE_TUNABLES.  */
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index 0d398dd251..aeeafe704a 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -126,4 +126,13 @@ glibc {
       default: 3
     }
   }
+
+  rtld {
+    dynamic_sort {
+      type: INT_32
+      minval: 1
+      maxval: 2
+      default: 1
+    }
+  }
 }
diff --git a/include/link.h b/include/link.h
index aea268439c..48c11c5d68 100644
--- a/include/link.h
+++ b/include/link.h
@@ -177,6 +177,10 @@ struct link_map
     unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
     unsigned int l_global:1;	/* Nonzero if object in _dl_global_scope.  */
     unsigned int l_reserved:2;	/* Reserved for internal use.  */
+    unsigned int l_visited:1;   /* Used internally for map dependency
+				   graph traversal.  */
+    unsigned int l_map_used:1;  /* These two bits are used during traversal */
+    unsigned int l_map_done:1;  /* of maps in _dl_close_worker(). */
     unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
 					to by `l_phdr' is allocated.  */
     unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index 5ff4a2831b..a0f60df81c 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -1017,8 +1017,8 @@ extern void _dl_init (struct link_map *main_map, int argc, char **argv,
 extern void _dl_fini (void) attribute_hidden;
 
 /* Sort array MAPS according to dependencies of the contained objects.  */
-extern void _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
-			   char *used, bool for_fini) attribute_hidden;
+extern void _dl_sort_maps (struct link_map **, unsigned int, unsigned int,
+			   bool) attribute_hidden;
 
 /* The dynamic linker calls this function before and having changing
    any shared object mappings.  The `r_state' member of `struct r_debug'

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2 2/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader
  2020-05-19 13:25 [PATCH v2 2/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader Chung-Lin Tang
@ 2020-06-29 21:29 ` Carlos O'Donell
  2020-07-22 18:10   ` [PATCH v3 " Chung-Lin Tang
  0 siblings, 1 reply; 3+ messages in thread
From: Carlos O'Donell @ 2020-06-29 21:29 UTC (permalink / raw)
  To: cltang, GNU C Library; +Cc: Catherine Moore

On 5/19/20 9:25 AM, Chung-Lin Tang wrote:
> Hi Carlos,
> This part is the actual code changes in elf/, changes since the last version[1] are:

I have 1 remaining question. Please see "QUESTION:" below.

I have 1 remaining nit, which is just to use named arguments in ldsodefs.h.

I have 1 code change regarding PTR_MANGLE/PTR_DEMANGLE to obfuscate the stored
function pointer.

Please post v3. One way to answer my question is to post a v3 with a large enough
comment in that section that explains *why* we resort without taking l_reldeps into
account.

Looking very good. Thanks for all the work here.
 
> 1. As you suggested, the use of a glibc tunable (glibc.rtld.dynamic_sort)
> to allow specifying which sorting algorithm to use. "1" for the current algorithm,
> "2" for the new DFS based one. The default is 1 to be conservative.

Thank you. I think that overall I'd want to take the following strategy:

(1) Test glibc.rtld.dynamic_sort=2 in Fedora Rawhide for 6 months.
(2) Switch glibc.rtld.dynamic_sort default to 2 in glibc 2.33, and declare
    the version 1 deprecated.
(3) In glibc 2.35 or glibc 2.36 remove version 1. That gives us a 1.5 to 2 year
    window where we have deprecated and then removed the old algorithm. Users
    will have deployed environments at this point with the default set correctly
    and we'll have seen the most egregious problems sorted out.

Not everyone may agree with this strategy. However, I'm hesitant to take a more
strict approach to this problem. We need a good transitional period with both 
options.
 
> I tried a few styles on how this should be best written, and settled for a
> static function pointer set on the first run of _dl_sort_maps(). I'm open to
> better suggestions.

I like that option. We might want to obfuscate that pointer to avoid hijacking.
 
> If glibc is configured with --enable-tunables=no, the behavior is to only use
> the current old algorithm.

That's OK. We will have to switch that when we deprecate the old algorithm.
 
> 2. Addition of l_map_used/l_map_done bitfields in struct link_map, to be used
> in _dl_close_worker(), instead of overloading on l_reserved. This has the benefit
> of not needing to clean up at the end, saving a few CPU cycles. (BTW, 'l_used' was
> already taken, thus named l_map_*)

Agreed. Good plan there.
 
> 3. Now _dl_sort_maps() has the used[] parameter removed, but due to many of the
> current sorting call-sites skipping the first map, while the new algorithm does
> not want this kind of +1/-1, a new 'unsigned int skip' argument is added. This is
> due to both algorithms existing together as an option, and it is a little unclear
> whether removing "skip-the-first" is detrimental to the current algorithm in practice
> (and I didn't want to regress it when it is still default), so doing this to be
> conservative.

Agreed.

> There was a small change that removed a redundant call of _dl_sort_maps in dl-open.c,
> which was already accomplished during this time by the fix for #16272, thus of course
> dropped in this patch.

Thanks.

> As for the issue of "interleaved scopes" in my algorithm, actually, when I tried
> comparing the LD_DEBUG output of both algorithms, the scope list appears to be the
> same, i.e. both have libc.so.6 and ld.so interleaved in the sequence, and the sequences
> appears to be the same. This is quite easy to test now, using GLIBC_TUNABLES=glibc.rtld.dynamic_sort=[12]
> to switch in between.

Perfect. I could have made a mistake during testing and I'll review.

> I hope this is still in time to be in 2.32, as the non-default option.
> 
> Thanks,
> Chung-Lin
> 
> [1] https://sourceware.org/pipermail/libc-alpha/2019-July/105200.html
> 
> 2020-05-19  Chung-Lin Tang  <cltang@codesourcery.com>
> 
>     [BZ #17645]
>     [BZ #15311]
>     [BZ #15310]
>     * elf/dl-close.c (_dl_close_worker): Remove used[], done[] char arrays,
>     use l_map_used/l_map_done bitfields instead. Adjust call to _dl_sort_maps.
>     * elf/dl-deps.c (_dl_map_object_deps): Adjust call to _dl_sort_maps.
>     * elf/dl-fini.c (_dl_fini): Likewise.
>     * elf/dl-sort_maps.c
>     (_dl_sort_maps_original): Adapted from original _dl_sort_maps.
>     (dfs_traversal): New static function.
>     (_dl_sort_maps_dfs): New Reverse-Postorder (RPO) based implementation.
>     (_dl_sort_maps): Select and call either _dl_sort_maps_original or
>     _dl_sort_maps_dfs based on glibc.rtld.dynamic_sort.
>     * elf/dl-tunables.list (rtld): New namespace with tunable 'dynamic_sort'.
>     * include/link.h (struct link_map): Add l_visited:1, l_map_used:1, and
>     l_map_done:1 bitfields.
>     * sysdeps/generic/ldsodefs.h (_dl_sort_maps): Adjust declaration.


> diff --git a/elf/dl-close.c b/elf/dl-close.c
> index 73b2817bbf..3d52126200 100644
> --- a/elf/dl-close.c
> +++ b/elf/dl-close.c
> @@ -164,8 +164,6 @@ _dl_close_worker (struct link_map *map, bool force)
>  
>    bool any_tls = false;
>    const unsigned int nloaded = ns->_ns_nloaded;
> -  char used[nloaded];
> -  char done[nloaded];

OK. Moving this into the struct.

>    struct link_map *maps[nloaded];
>  
>    /* Run over the list and assign indexes to the link maps and enter
> @@ -173,24 +171,21 @@ _dl_close_worker (struct link_map *map, bool force)
>    int idx = 0;
>    for (struct link_map *l = ns->_ns_loaded; l != NULL; l = l->l_next)
>      {
> +      l->l_map_used = 0;
> +      l->l_map_done = 0;

OK. Initialize.

>        l->l_idx = idx;
>        maps[idx] = l;
>        ++idx;
> -
>      }
>    assert (idx == nloaded);
>  
> -  /* Prepare the bitmaps.  */
> -  memset (used, '\0', sizeof (used));
> -  memset (done, '\0', sizeof (done));
> -

OK. Not needed.

>    /* Keep track of the lowest index link map we have covered already.  */
>    int done_index = -1;
>    while (++done_index < nloaded)
>      {
>        struct link_map *l = maps[done_index];
>  
> -      if (done[done_index])
> +      if (l->l_map_done)

OK.

>  	/* Already handled.  */
>  	continue;
>  
> @@ -201,12 +196,12 @@ _dl_close_worker (struct link_map *map, bool force)
>  	  /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
>  	     acquire is sufficient and correct.  */
>  	  && atomic_load_acquire (&l->l_tls_dtor_count) == 0
> -	  && !used[done_index])
> +	  && !l->l_map_used)

OK.

>  	continue;
>  
>        /* We need this object and we handle it now.  */
> -      done[done_index] = 1;
> -      used[done_index] = 1;
> +      l->l_map_used = 1;
> +      l->l_map_done = 1;

OK.

>        /* Signal the object is still needed.  */
>        l->l_idx = IDX_STILL_USED;
>  
> @@ -222,9 +217,9 @@ _dl_close_worker (struct link_map *map, bool force)
>  		{
>  		  assert ((*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded);
>  
> -		  if (!used[(*lp)->l_idx])
> +		  if (!(*lp)->l_map_used)

OK.

>  		    {
> -		      used[(*lp)->l_idx] = 1;
> +		      (*lp)->l_map_used = 1;

OK.

>  		      /* If we marked a new object as used, and we've
>  			 already processed it, then we need to go back
>  			 and process again from that point forward to
> @@ -247,9 +242,9 @@ _dl_close_worker (struct link_map *map, bool force)
>  	      {
>  		assert (jmap->l_idx >= 0 && jmap->l_idx < nloaded);
>  
> -		if (!used[jmap->l_idx])
> +		if (!jmap->l_map_used)

OK. We have the map and just use it directly instead of indexing into used[jmap->l_idx].

>  		  {
> -		    used[jmap->l_idx] = 1;
> +		    jmap->l_map_used = 1;
>  		    if (jmap->l_idx - 1 < done_index)
>  		      done_index = jmap->l_idx - 1;
>  		  }
> @@ -259,8 +254,7 @@ _dl_close_worker (struct link_map *map, bool force)
>  
>    /* Sort the entries.  We can skip looking for the binary itself which is
>       at the front of the search list for the main namespace.  */
> -  _dl_sort_maps (maps + (nsid == LM_ID_BASE), nloaded - (nsid == LM_ID_BASE),
> -		 used + (nsid == LM_ID_BASE), true);
> +  _dl_sort_maps (maps, nloaded, (nsid == LM_ID_BASE), true);

OK. Real sorting done.

>  
>    /* Call all termination functions at once.  */
>  #ifdef SHARED
> @@ -277,7 +271,7 @@ _dl_close_worker (struct link_map *map, bool force)
>        /* All elements must be in the same namespace.  */
>        assert (imap->l_ns == nsid);
>  
> -      if (!used[i])
> +      if (!imap->l_map_used)

OK.

>  	{
>  	  assert (imap->l_type == lt_loaded && !imap->l_nodelete_active);
>  
> @@ -330,7 +324,7 @@ _dl_close_worker (struct link_map *map, bool force)
>  	  if (i < first_loaded)
>  	    first_loaded = i;
>  	}
> -      /* Else used[i].  */
> +      /* Else imap->l_map_used.  */

OK.

>        else if (imap->l_type == lt_loaded)
>  	{
>  	  struct r_scope_elem *new_list = NULL;
> @@ -554,7 +548,7 @@ _dl_close_worker (struct link_map *map, bool force)
>    for (unsigned int i = first_loaded; i < nloaded; ++i)
>      {
>        struct link_map *imap = maps[i];
> -      if (!used[i])
> +      if (!imap->l_map_used)

OK.

>  	{
>  	  assert (imap->l_type == lt_loaded);
>  
> diff --git a/elf/dl-deps.c b/elf/dl-deps.c
> index b5a43232a7..0851971dca 100644
> --- a/elf/dl-deps.c
> +++ b/elf/dl-deps.c
> @@ -611,7 +611,7 @@ Filters not supported with LD_TRACE_PRELINKING"));
>      memcpy (l_initfini, map->l_searchlist.r_list,
>  	    nlist * sizeof (struct link_map *));
>  
> -  _dl_sort_maps (&l_initfini[1], nlist - 1, NULL, false);
> +  _dl_sort_maps (l_initfini, nlist, 1, false);

OK. Real sorting. Always base ns. This sorts the init.

>  
>    /* Terminate the list of dependencies.  */
>    l_initfini[nlist] = NULL;
> diff --git a/elf/dl-fini.c b/elf/dl-fini.c
> index 231db3d66d..5b1bf6371c 100644
> --- a/elf/dl-fini.c
> +++ b/elf/dl-fini.c
> @@ -92,8 +92,7 @@ _dl_fini (void)
>  	  /* Now we have to do the sorting.  We can skip looking for the
>  	     binary itself which is at the front of the search list for
>  	     the main namespace.  */
> -	  _dl_sort_maps (maps + (ns == LM_ID_BASE), nmaps - (ns == LM_ID_BASE),
> -			 NULL, true);
> +	  _dl_sort_maps (maps, nmaps, (ns == LM_ID_BASE), true);

OK. Sorting the fini.

>  
>  	  /* We do not rely on the linked list of loaded object anymore
>  	     from this point on.  We have our own list here (maps).  The
> diff --git a/elf/dl-sort-maps.c b/elf/dl-sort-maps.c
> index 86f1e23c85..7d143cc487 100644
> --- a/elf/dl-sort-maps.c
> +++ b/elf/dl-sort-maps.c
> @@ -16,16 +16,24 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>  
> +#include <assert.h>
>  #include <ldsodefs.h>
> +#include <elf/dl-tunables.h>
>  
> +/* Note: this is the older, "original" sorting algorithm, being used as
> +   default up to 2.32.
>  
> -/* Sort array MAPS according to dependencies of the contained objects.
> -   Array USED, if non-NULL, is permutated along MAPS.  If FOR_FINI this is
> -   called for finishing an object.  */
> -void
> -_dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
> -	       bool for_fini)
> +   Sort array MAPS according to dependencies of the contained objects.
> +   If FOR_FINI is true, this is called for finishing an object.  */
> +static void
> +_dl_sort_maps_original (struct link_map **maps, unsigned int nmaps,
> +			unsigned int skip, bool for_fini)
>  {
> +  /* Allows caller to do the common optimization of skipping the first map,
> +     usually the main binary.  */
> +  maps += skip;
> +  nmaps -= skip;

OK.

> +
>    /* A list of one element need not be sorted.  */
>    if (nmaps <= 1)
>      return;
> @@ -66,14 +74,6 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
>  			   (k - i) * sizeof (maps[0]));
>  		  maps[k] = thisp;
>  
> -		  if (used != NULL)
> -		    {
> -		      char here_used = used[i];
> -		      memmove (&used[i], &used[i + 1],
> -			       (k - i) * sizeof (used[0]));
> -		      used[k] = here_used;
> -		    }

OK. We don't have 'used' anymore. We use the link map value.

> -
>  		  if (seen[i + 1] > nmaps - i)
>  		    {
>  		      ++i;
> @@ -120,3 +120,160 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
>      next:;
>      }
>  }
> +
> +#if !HAVE_TUNABLES
> +/* In this case, just default to the original algorithm.  */
> +strong_alias (_dl_sort_maps_original, _dl_sort_maps);
> +#else
> +
> +/* We use a recursive function due to its better clarity and ease of
> +   implementation, as well as faster execution speed. We already use
> +   alloca() for list allocation during the breadth-first search of
> +   dependencies in _dl_map_object_deps(), and this should be on the
> +   same order of worst-case stack usage.  */
> +
> +static void
> +dfs_traversal (struct link_map ***rpo, struct link_map *map,
> +	       bool *do_reldeps)
> +{
> +  if (map->l_visited)

OK. This is the terminated case, we have seen this node, go back up the edge.

> +    return;
> +
> +  map->l_visited = 1;

OK. Set visited.

> +
> +  if (map->l_initfini)
> +    {
> +      for (int i = 0; map->l_initfini[i] != NULL; i++)
> +	{
> +	  struct link_map *dep = map->l_initfini[i];
> +	  if (dep->l_visited == 0)
> +	    dfs_traversal (rpo, dep, do_reldeps);
> +	}

OK. If we have initfini then walk that list and travel down those nodes.

> +    }
> +
> +  if (__glibc_unlikely (do_reldeps != NULL && map->l_reldeps != NULL))
> +    {
> +      /* Indicate that we encountered relocation dependencies during
> +	 traversal.  */
> +      *do_reldeps = true;
> +
> +      for (int m = map->l_reldeps->act - 1; m >= 0; m--)
> +	{
> +	  struct link_map *dep = map->l_reldeps->list[m];
> +	  if (dep->l_visited == 0)
> +	    dfs_traversal (rpo, dep, do_reldeps);
> +	}

OK, we saw relocation dependencies. If we haven't visisted them then consider
them as additional edges in the graph to visit *after* we visit the normal
initfini edges. We travel down those edges possibly visiting those nodes in
order. What order? The order of relocations as _dl_lookup_symbol_x is being
called to then call add_dependency() which can be lazy bound based on dlopen
and a number of other factors.

> +    }
> +
> +  *rpo -= 1;

OK. Walk back up the list towards &rpo[0].

> +  **rpo = map;

OK. Record this node. Return back up the traversal.

> +}
> +
> +/* Topologically sort array MAPS according to dependencies of the contained
> +   objects.  */
> +
> +static void
> +_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
> +		   unsigned int skip __attribute__ ((unused)), bool for_fini)
> +{
> +  for (int i = 0; i < nmaps; i++)
> +    maps[i]->l_visited = 0;

OK. Clear the l_visited because we're starting from fresh.

> +
> +  /* We apply DFS traversal for each of maps[i] until the whole total order
> +     is found and we're at the start of the Reverse-Postorder (RPO) sequence,
> +     which is a topological sort.

OK. Agreed.

> +
> +     We go from maps[nmaps - 1] backwards towards maps[0] at this level.
> +     Due to the breadth-first search (BFS) ordering we receive, going
> +     backwards usually gives a more shallow depth-first recursion depth,
> +     adding more stack usage safety. Also, combined with the natural
> +     processing order of l_initfini[] at each node during DFS, this maintains
> +     an ordering closer to the original link ordering in the sorting results
> +     under most simpler cases.

OK. Good, keeping the link ordering is important (and possibly the next step
in the future given the C++ requirements for this).

> +
> +     Another reason we order the top level backwards, it that maps[0] is
> +     usually exactly the main object of which we're in the midst of
> +     _dl_map_object_deps() processing, and maps[0]->l_initfini[] is still
> +     blank. If we start the traversal from maps[0], since having no
> +     dependencies yet filled in, maps[0] will always be immediately
> +     incorrectly placed at the last place in the order (first in reverse).
> +     Adjusting the order so that maps[0] is last traversed naturally avoids
> +     this problem.

OK.

> +
> +     Further, the old "optimization" of skipping the main object at maps[0]
> +     from the call-site (i.e. _dl_sort_maps(maps+1,nmaps-1)) is in general
> +     no longer valid, since traversing along object dependency-links
> +     may "find" the main object even when it is not included in the initial
> +     order (e.g. a dlopen()'ed shared object can have circular dependencies
> +     linked back to itself). In such a case, traversing N-1 objects will
> +     create a N-object result, and raise problems.

OK. Agreed, I always found such optimization questionable.

> +
> +     To summarize, just passing in the full list, and iterating from back
> +     to front makes things much more straightforward.  */
> +
> +  struct link_map *rpo[nmaps];

OK. Array to hold rpo pointers.

> +  struct link_map **rpo_head = &rpo[nmaps];

Required: Please add a comment that we start 1 past the end of this list because
of the *pro -= 1; we use in dfs_traveral().

OK. This is the "head" which is the last map in the list. Initially it actually
points to unmapped memory but becuase we do *rpo -= 1; we always walk it back
one before writing anything. So the pointer has to start outside of the list.

> +
> +  bool do_reldeps = false;
> +  bool *do_reldeps_ref = (for_fini ? &do_reldeps : NULL);

OK. Keeping the algorithm close what we did before, which is to use reldeps for
fini sorting only.

> +
> +  for (int i = nmaps - 1; i >= 0; i--)
> +    {
> +      dfs_traversal (&rpo_head, maps[i], do_reldeps_ref);
> +
> +      /* We can break early if all objects are already placed.  */
> +      if (rpo_head == rpo)
> +	goto end;

OK. This indicates the dfs_traversal moved rpo_head all the way down
the list to rpo, and so we finished sorting, all accounted objects
were placed in a position among nmaps number of entries. We
called dfs_traversal() initially with an empty rpo[] and rpo_head at the
end of &rpo[nmaps] and the map we want being maps[nmaps-1]. Looks good to me.

> +    }
> +  assert (rpo_head == rpo);
> +
> + end:
> +  /* This is avoided if !FOR_FINI or if we didn't find any reldeps in
> +     the first DFS traversal.  */

OK. Even if we find everything sorted we might have to check reldeps.
You could adjust the logic in dfs_traversal to get rid of this here, but
it would complicate the logic there.

> +  if (do_reldeps)
> +    {
> +      for (int i = 0; i < nmaps; i++)
> +	rpo[i]->l_visited = 0;

OK. Clear rpo in order setting l_visited to 0, and we have to sort all over again.

> +
> +      struct link_map **maps_head = &maps[nmaps];
> +      for (int i = nmaps - 1; i >= 0; i--)
> +	{
> +	  dfs_traversal (&maps_head, rpo[i], NULL);

QUESTION:

We saw symbol dependencies.

The objects have to already be in maps[] or it's an unsatisfied symbol dependency.

We already took into account the l_reldeps in dfs_traversal() the first time.

Now we sort again *without* taking them into account.

However, we use the rpo[] maps list which includes the sorted order of the relocation
dependencies taken into account.

Why are we doing this? Is it to take DT_NEEDED as a preference over symbol
relocation dependencies?

> +
> +	  /* We can break early if all objects are already placed.
> +	     The below memcpy is not needed in the do_reldeps case here,
> +	     since we wrote back to maps[] during DFS traversal.  */
> +	  if (maps_head == maps)
> +	    return;
> +	}
> +      assert (maps_head == maps);
> +      return;

OK. Finished sorting again.

> +    }
> +
> +  memcpy (maps, rpo, sizeof (struct link_map *) * nmaps);

OK. Copy result.

> +}
> +
> +void
> +_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
> +	       unsigned int skip, bool for_fini)
> +{
> +  /* Function pointer to sorting algorithm currently in use.  */
> +  static void (*sort_maps) (struct link_map **, unsigned int,
> +			    unsigned int, bool) = NULL;
> +  if (__glibc_unlikely (sort_maps == NULL))
> +    {
> +      /* Select sorting algorithm based on tunable value.  */
> +      int32_t algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
> +				       int32_t, NULL);
> +      if (__glibc_likely (algorithm == 1))
> +	sort_maps = _dl_sort_maps_original;
> +      else if (algorithm == 2)
> +	sort_maps = _dl_sort_maps_dfs;

Both of these need function pointer encryption e.g. PTR_MANGLE.

> +      else
> +	__builtin_unreachable ();
> +    }
> +
> +  sort_maps (maps, nmaps, skip, for_fini);

This should decrypt the function pointer and call it e.g. PTR_DEMANGLE.

> +}
> +
> +#endif /* HAVE_TUNABLES.  */
> diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
> index 0d398dd251..aeeafe704a 100644
> --- a/elf/dl-tunables.list
> +++ b/elf/dl-tunables.list
> @@ -126,4 +126,13 @@ glibc {
>        default: 3
>      }
>    }
> +
> +  rtld {
> +    dynamic_sort {
> +      type: INT_32
> +      minval: 1
> +      maxval: 2
> +      default: 1

OK.

> +    }
> +  }
>  }
> diff --git a/include/link.h b/include/link.h
> index aea268439c..48c11c5d68 100644
> --- a/include/link.h
> +++ b/include/link.h
> @@ -177,6 +177,10 @@ struct link_map
>      unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
>      unsigned int l_global:1;	/* Nonzero if object in _dl_global_scope.  */
>      unsigned int l_reserved:2;	/* Reserved for internal use.  */
> +    unsigned int l_visited:1;   /* Used internally for map dependency
> +				   graph traversal.  */
> +    unsigned int l_map_used:1;  /* These two bits are used during traversal */
> +    unsigned int l_map_done:1;  /* of maps in _dl_close_worker(). */

OK.

>      unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
>  					to by `l_phdr' is allocated.  */
>      unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in
> diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
> index 5ff4a2831b..a0f60df81c 100644
> --- a/sysdeps/generic/ldsodefs.h
> +++ b/sysdeps/generic/ldsodefs.h
> @@ -1017,8 +1017,8 @@ extern void _dl_init (struct link_map *main_map, int argc, char **argv,
>  extern void _dl_fini (void) attribute_hidden;
>  
>  /* Sort array MAPS according to dependencies of the contained objects.  */
> -extern void _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
> -			   char *used, bool for_fini) attribute_hidden;
> +extern void _dl_sort_maps (struct link_map **, unsigned int, unsigned int,
> +			   bool) attribute_hidden;

Please name the arguments. They should be matching names to the implementation
and help the reader understand the arguments.

>  
>  /* The dynamic linker calls this function before and having changing
>     any shared object mappings.  The `r_state' member of `struct r_debug'


-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH v3 2/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader
  2020-06-29 21:29 ` Carlos O'Donell
@ 2020-07-22 18:10   ` Chung-Lin Tang
  0 siblings, 0 replies; 3+ messages in thread
From: Chung-Lin Tang @ 2020-07-22 18:10 UTC (permalink / raw)
  To: Carlos O'Donell, cltang, GNU C Library; +Cc: Catherine Moore

[-- Attachment #1: Type: text/plain, Size: 9368 bytes --]

Hi Carlos, here's the v3 patch for the actual sorting changes.

On 2020/6/30 5:29 AM, Carlos O'Donell wrote:
> On 5/19/20 9:25 AM, Chung-Lin Tang wrote:
>> Hi Carlos,
>> This part is the actual code changes in elf/, changes since the last version[1] are:
> 
> I have 1 remaining question. Please see "QUESTION:" below.

Please see below.

> I have 1 remaining nit, which is just to use named arguments in ldsodefs.h.

Done.

> I have 1 code change regarding PTR_MANGLE/PTR_DEMANGLE to obfuscate the stored
> function pointer.

I have been performance testing the patch, and have concluded that, although
just in the hundreds of cycles as measured by hp-timing.h facilities, the speed of
using a simple if-case and direct calls is faster than calling through a function
pointer (current CPUs still doesn't do indirect branches efficiently enough), and
adding pointer mangle/demangle will just add on to the overhead. So in this version
I've changed to using just a static int32_t code instead of function pointer.

(this matters a bit because as I've tested, while the new DFS algorithm is more scalable,
for the common small-number-of-deps case the new _dl_sort_maps routine is actually
slightly slower than the old algorithm; trying to recover this gap as much as possible)

> Please post v3. One way to answer my question is to post a v3 with a large enough
> comment in that section that explains *why* we resort without taking l_reldeps into
> account.
> 
> Looking very good. Thanks for all the work here.
>   
>> 1. As you suggested, the use of a glibc tunable (glibc.rtld.dynamic_sort)
>> to allow specifying which sorting algorithm to use. "1" for the current algorithm,
>> "2" for the new DFS based one. The default is 1 to be conservative.
> 
> Thank you. I think that overall I'd want to take the following strategy:
> 
> (1) Test glibc.rtld.dynamic_sort=2 in Fedora Rawhide for 6 months.
> (2) Switch glibc.rtld.dynamic_sort default to 2 in glibc 2.33, and declare
>      the version 1 deprecated.
> (3) In glibc 2.35 or glibc 2.36 remove version 1. That gives us a 1.5 to 2 year
>      window where we have deprecated and then removed the old algorithm. Users
>      will have deployed environments at this point with the default set correctly
>      and we'll have seen the most egregious problems sorted out.
> 
> Not everyone may agree with this strategy. However, I'm hesitant to take a more
> strict approach to this problem. We need a good transitional period with both
> options.

I personally think this is fine, nothing to object.


>> +static void
>> +_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
>> +		   unsigned int skip __attribute__ ((unused)), bool for_fini)
>> +{
>> +  for (int i = 0; i < nmaps; i++)
>> +    maps[i]->l_visited = 0;

The two loops where we initialize the l_visited bits have been adjusted to
'for(int i = nmaps-1; i >= 0; i--)'.  The change compare against zero generally
saves one register pressure in the loop body.

> 
>> +     To summarize, just passing in the full list, and iterating from back
>> +     to front makes things much more straightforward.  */
>> +
>> +  struct link_map *rpo[nmaps];
> 
> OK. Array to hold rpo pointers.
> 
>> +  struct link_map **rpo_head = &rpo[nmaps];
> 
> Required: Please add a comment that we start 1 past the end of this list because
> of the *pro -= 1; we use in dfs_traveral().

Done.
>> + end:
>> +  /* This is avoided if !FOR_FINI or if we didn't find any reldeps in
>> +     the first DFS traversal.  */
> 
> OK. Even if we find everything sorted we might have to check reldeps.
> You could adjust the logic in dfs_traversal to get rid of this here, but
> it would complicate the logic there.
> 
>> +  if (do_reldeps)
>> +    {
>> +      for (int i = 0; i < nmaps; i++)
>> +	rpo[i]->l_visited = 0;
> 
> OK. Clear rpo in order setting l_visited to 0, and we have to sort all over again.
> 
>> +
>> +      struct link_map **maps_head = &maps[nmaps];
>> +      for (int i = nmaps - 1; i >= 0; i--)
>> +	{
>> +	  dfs_traversal (&maps_head, rpo[i], NULL);
> 
> QUESTION:
> 
> We saw symbol dependencies.
> 
> The objects have to already be in maps[] or it's an unsatisfied symbol dependency.
> 
> We already took into account the l_reldeps in dfs_traversal() the first time.
> 
> Now we sort again *without* taking them into account.
> 
> However, we use the rpo[] maps list which includes the sorted order of the relocation
> dependencies taken into account.
> 
> Why are we doing this? Is it to take DT_NEEDED as a preference over symbol
> relocation dependencies?

Yes, the discussion of #15311 was that in the destructor ordering, static DT_NEEDED
dependencies should not be violated in the face of relocation dependencies, and a
second pass of sorting using only static deps should be a viable solution to enforce
this.

To illustrate, given the tst-bz15311 test description in elf/dso-sort-tests-1.def
in the testing patch:

Test description: {+a;+e;+f;+g;+d;%d;-d;-g;-f;-e;-a};a->b->c->d;d=>[ba];c=>a;b=>e=>a;c=>f=>b;d=>g=>c

(1) Output of the current "old" algorithm:
{+a[d>c>b>a>];+e[e>];+f[f>];+g[g>];+d[];%d(b(e(a()))a()g(c(a()f(b(e(a()))))));-d[];-g[];-f[];-e[];-a[<a<c<d<g<f<b<e];}

(2) Output of the "new" DFS-based algorithm:
{+a[d>c>b>a>];+e[e>];+f[f>];+g[g>];+d[];%d(b(e(a()))a()g(c(a()f(b(e(a()))))));-d[];-g[];-f[];-e[];-a[<g<f<a<b<c<d<e];}

(3) Output of the "new" DFS-based algorithm, but with the second sorting pass *removed*:
{+a[d>c>b>a>];+e[e>];+f[f>];+g[g>];+d[];%d(b(e(a()))a()g(c(a()f(b(e(a()))))));-d[];-g[];-f[];-e[];-a[<g<c<f<d<a<b<e];}

The difference is only at the ending "-a[...]" part (e.g. dlclose(a) operation). The deemed reasonable behavior
is to respect the "a->b->c->d" static dependency restrictions, even in the face of circular dependency ambiguity.
Currently, only (2) does this.

But you did mention the test is actually underlinked, and the proper fix is to properly link it.
So I think there's also a question of: should #15311 be fixed at all? Is the current "beyond specification" handling
of this testcase desired, or should we just accept such a testcase as undefined behavior?
(if we don't need to handle this, the code could be leaner)

FYI, the tst-bz15311 testcase is the only test that is currently affected by this second pass sorting,
i.e. removing it does not cause any regressions in other elf/ tests.

>> +void
>> +_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
>> +	       unsigned int skip, bool for_fini)
>> +{
>> +  /* Function pointer to sorting algorithm currently in use.  */
>> +  static void (*sort_maps) (struct link_map **, unsigned int,
>> +			    unsigned int, bool) = NULL;
>> +  if (__glibc_unlikely (sort_maps == NULL))
>> +    {
>> +      /* Select sorting algorithm based on tunable value.  */
>> +      int32_t algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
>> +				       int32_t, NULL);
>> +      if (__glibc_likely (algorithm == 1))
>> +	sort_maps = _dl_sort_maps_original;
>> +      else if (algorithm == 2)
>> +	sort_maps = _dl_sort_maps_dfs;
> 
> Both of these need function pointer encryption e.g. PTR_MANGLE.

This part has been changed to (pasting changed fragment here to ease review):

   /* Index code for sorting algorithm currently in use.  */
   static int32_t algorithm = 0;
   if (__glibc_unlikely (algorithm == 0))
     algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
                              int32_t, NULL);

   /* It can be tempting to use a static function pointer to store and call
      the current selected sorting algorithm routine, but experimentation
      shows that current processors still do not handle indirect branches
      that efficiently, plus a static function pointer will involve
      PTR_MANGLE/DEMANGLE, further impairing performance of small, common
      input cases. A simple if-case with direct function calls appears to
      be the fastest.  */
   if (__glibc_likely (algorithm == 1))
     _dl_sort_maps_original (maps, nmaps, skip, for_fini);
   else if (algorithm == 2)
     _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
   else
     __builtin_unreachable ();


The reasons of change are mainly performance related as mentioned above,
I don't think it will affect understanding though.

>>       unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
>>   					to by `l_phdr' is allocated.  */
>>       unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in
>> diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
>> index 5ff4a2831b..a0f60df81c 100644
>> --- a/sysdeps/generic/ldsodefs.h
>> +++ b/sysdeps/generic/ldsodefs.h
>> @@ -1017,8 +1017,8 @@ extern void _dl_init (struct link_map *main_map, int argc, char **argv,
>>   extern void _dl_fini (void) attribute_hidden;
>>   
>>   /* Sort array MAPS according to dependencies of the contained objects.  */
>> -extern void _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
>> -			   char *used, bool for_fini) attribute_hidden;
>> +extern void _dl_sort_maps (struct link_map **, unsigned int, unsigned int,
>> +			   bool) attribute_hidden;
> 
> Please name the arguments. They should be matching names to the implementation
> and help the reader understand the arguments.

Done.

The sorting part of the v3 patch is attached, please see if this is ready to commit.

Thanks,
Chung-Lin

[-- Attachment #2: dlsort-02-changes-v3.patch --]
[-- Type: text/plain, Size: 14930 bytes --]

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 8e146ecee1..8bc7eeb04b 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -164,8 +164,6 @@ _dl_close_worker (struct link_map *map, bool force)
 
   bool any_tls = false;
   const unsigned int nloaded = ns->_ns_nloaded;
-  char used[nloaded];
-  char done[nloaded];
   struct link_map *maps[nloaded];
 
   /* Run over the list and assign indexes to the link maps and enter
@@ -173,24 +171,21 @@ _dl_close_worker (struct link_map *map, bool force)
   int idx = 0;
   for (struct link_map *l = ns->_ns_loaded; l != NULL; l = l->l_next)
     {
+      l->l_map_used = 0;
+      l->l_map_done = 0;
       l->l_idx = idx;
       maps[idx] = l;
       ++idx;
-
     }
   assert (idx == nloaded);
 
-  /* Prepare the bitmaps.  */
-  memset (used, '\0', sizeof (used));
-  memset (done, '\0', sizeof (done));
-
   /* Keep track of the lowest index link map we have covered already.  */
   int done_index = -1;
   while (++done_index < nloaded)
     {
       struct link_map *l = maps[done_index];
 
-      if (done[done_index])
+      if (l->l_map_done)
 	/* Already handled.  */
 	continue;
 
@@ -201,12 +196,12 @@ _dl_close_worker (struct link_map *map, bool force)
 	  /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
 	     acquire is sufficient and correct.  */
 	  && atomic_load_acquire (&l->l_tls_dtor_count) == 0
-	  && !used[done_index])
+	  && !l->l_map_used)
 	continue;
 
       /* We need this object and we handle it now.  */
-      done[done_index] = 1;
-      used[done_index] = 1;
+      l->l_map_used = 1;
+      l->l_map_done = 1;
       /* Signal the object is still needed.  */
       l->l_idx = IDX_STILL_USED;
 
@@ -222,9 +217,9 @@ _dl_close_worker (struct link_map *map, bool force)
 		{
 		  assert ((*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded);
 
-		  if (!used[(*lp)->l_idx])
+		  if (!(*lp)->l_map_used)
 		    {
-		      used[(*lp)->l_idx] = 1;
+		      (*lp)->l_map_used = 1;
 		      /* If we marked a new object as used, and we've
 			 already processed it, then we need to go back
 			 and process again from that point forward to
@@ -247,9 +242,9 @@ _dl_close_worker (struct link_map *map, bool force)
 	      {
 		assert (jmap->l_idx >= 0 && jmap->l_idx < nloaded);
 
-		if (!used[jmap->l_idx])
+		if (!jmap->l_map_used)
 		  {
-		    used[jmap->l_idx] = 1;
+		    jmap->l_map_used = 1;
 		    if (jmap->l_idx - 1 < done_index)
 		      done_index = jmap->l_idx - 1;
 		  }
@@ -259,8 +254,7 @@ _dl_close_worker (struct link_map *map, bool force)
 
   /* Sort the entries.  We can skip looking for the binary itself which is
      at the front of the search list for the main namespace.  */
-  _dl_sort_maps (maps + (nsid == LM_ID_BASE), nloaded - (nsid == LM_ID_BASE),
-		 used + (nsid == LM_ID_BASE), true);
+  _dl_sort_maps (maps, nloaded, (nsid == LM_ID_BASE), true);
 
   /* Call all termination functions at once.  */
 #ifdef SHARED
@@ -277,7 +271,7 @@ _dl_close_worker (struct link_map *map, bool force)
       /* All elements must be in the same namespace.  */
       assert (imap->l_ns == nsid);
 
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded && !imap->l_nodelete_active);
 
@@ -330,7 +324,7 @@ _dl_close_worker (struct link_map *map, bool force)
 	  if (i < first_loaded)
 	    first_loaded = i;
 	}
-      /* Else used[i].  */
+      /* Else imap->l_map_used.  */
       else if (imap->l_type == lt_loaded)
 	{
 	  struct r_scope_elem *new_list = NULL;
@@ -554,7 +548,7 @@ _dl_close_worker (struct link_map *map, bool force)
   for (unsigned int i = first_loaded; i < nloaded; ++i)
     {
       struct link_map *imap = maps[i];
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded);
 
diff --git a/elf/dl-deps.c b/elf/dl-deps.c
index b5a43232a7..0851971dca 100644
--- a/elf/dl-deps.c
+++ b/elf/dl-deps.c
@@ -611,7 +611,7 @@ Filters not supported with LD_TRACE_PRELINKING"));
     memcpy (l_initfini, map->l_searchlist.r_list,
 	    nlist * sizeof (struct link_map *));
 
-  _dl_sort_maps (&l_initfini[1], nlist - 1, NULL, false);
+  _dl_sort_maps (l_initfini, nlist, 1, false);
 
   /* Terminate the list of dependencies.  */
   l_initfini[nlist] = NULL;
diff --git a/elf/dl-fini.c b/elf/dl-fini.c
index 231db3d66d..5b1bf6371c 100644
--- a/elf/dl-fini.c
+++ b/elf/dl-fini.c
@@ -92,8 +92,7 @@ _dl_fini (void)
 	  /* Now we have to do the sorting.  We can skip looking for the
 	     binary itself which is at the front of the search list for
 	     the main namespace.  */
-	  _dl_sort_maps (maps + (ns == LM_ID_BASE), nmaps - (ns == LM_ID_BASE),
-			 NULL, true);
+	  _dl_sort_maps (maps, nmaps, (ns == LM_ID_BASE), true);
 
 	  /* We do not rely on the linked list of loaded object anymore
 	     from this point on.  We have our own list here (maps).  The
diff --git a/elf/dl-sort-maps.c b/elf/dl-sort-maps.c
index 86f1e23c85..9513155bdf 100644
--- a/elf/dl-sort-maps.c
+++ b/elf/dl-sort-maps.c
@@ -16,16 +16,24 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#include <assert.h>
 #include <ldsodefs.h>
+#include <elf/dl-tunables.h>
 
+/* Note: this is the older, "original" sorting algorithm, being used as
+   default up to 2.32.
 
-/* Sort array MAPS according to dependencies of the contained objects.
-   Array USED, if non-NULL, is permutated along MAPS.  If FOR_FINI this is
-   called for finishing an object.  */
-void
-_dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
-	       bool for_fini)
+   Sort array MAPS according to dependencies of the contained objects.
+   If FOR_FINI is true, this is called for finishing an object.  */
+static void
+_dl_sort_maps_original (struct link_map **maps, unsigned int nmaps,
+			unsigned int skip, bool for_fini)
 {
+  /* Allows caller to do the common optimization of skipping the first map,
+     usually the main binary.  */
+  maps += skip;
+  nmaps -= skip;
+
   /* A list of one element need not be sorted.  */
   if (nmaps <= 1)
     return;
@@ -66,14 +74,6 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
 			   (k - i) * sizeof (maps[0]));
 		  maps[k] = thisp;
 
-		  if (used != NULL)
-		    {
-		      char here_used = used[i];
-		      memmove (&used[i], &used[i + 1],
-			       (k - i) * sizeof (used[0]));
-		      used[k] = here_used;
-		    }
-
 		  if (seen[i + 1] > nmaps - i)
 		    {
 		      ++i;
@@ -120,3 +120,177 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
     next:;
     }
 }
+
+#if !HAVE_TUNABLES
+/* In this case, just default to the original algorithm.  */
+strong_alias (_dl_sort_maps_original, _dl_sort_maps);
+#else
+
+/* We use a recursive function due to its better clarity and ease of
+   implementation, as well as faster execution speed. We already use
+   alloca() for list allocation during the breadth-first search of
+   dependencies in _dl_map_object_deps(), and this should be on the
+   same order of worst-case stack usage.  */
+
+static void
+dfs_traversal (struct link_map ***rpo, struct link_map *map,
+	       bool *do_reldeps)
+{
+  if (map->l_visited)
+    return;
+
+  map->l_visited = 1;
+
+  if (map->l_initfini)
+    {
+      for (int i = 0; map->l_initfini[i] != NULL; i++)
+	{
+	  struct link_map *dep = map->l_initfini[i];
+	  if (dep->l_visited == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  if (__glibc_unlikely (do_reldeps != NULL && map->l_reldeps != NULL))
+    {
+      /* Indicate that we encountered relocation dependencies during
+	 traversal.  */
+      *do_reldeps = true;
+
+      for (int m = map->l_reldeps->act - 1; m >= 0; m--)
+	{
+	  struct link_map *dep = map->l_reldeps->list[m];
+	  if (dep->l_visited == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  *rpo -= 1;
+  **rpo = map;
+}
+
+/* Topologically sort array MAPS according to dependencies of the contained
+   objects.  */
+
+static void
+_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
+		   unsigned int skip __attribute__ ((unused)), bool for_fini)
+{
+  for (int i = nmaps - 1; i >= 0; i--)
+    maps[i]->l_visited = 0;
+
+  /* We apply DFS traversal for each of maps[i] until the whole total order
+     is found and we're at the start of the Reverse-Postorder (RPO) sequence,
+     which is a topological sort.
+
+     We go from maps[nmaps - 1] backwards towards maps[0] at this level.
+     Due to the breadth-first search (BFS) ordering we receive, going
+     backwards usually gives a more shallow depth-first recursion depth,
+     adding more stack usage safety. Also, combined with the natural
+     processing order of l_initfini[] at each node during DFS, this maintains
+     an ordering closer to the original link ordering in the sorting results
+     under most simpler cases.
+
+     Another reason we order the top level backwards, it that maps[0] is
+     usually exactly the main object of which we're in the midst of
+     _dl_map_object_deps() processing, and maps[0]->l_initfini[] is still
+     blank. If we start the traversal from maps[0], since having no
+     dependencies yet filled in, maps[0] will always be immediately
+     incorrectly placed at the last place in the order (first in reverse).
+     Adjusting the order so that maps[0] is last traversed naturally avoids
+     this problem.
+
+     Further, the old "optimization" of skipping the main object at maps[0]
+     from the call-site (i.e. _dl_sort_maps(maps+1,nmaps-1)) is in general
+     no longer valid, since traversing along object dependency-links
+     may "find" the main object even when it is not included in the initial
+     order (e.g. a dlopen()'ed shared object can have circular dependencies
+     linked back to itself). In such a case, traversing N-1 objects will
+     create a N-object result, and raise problems.
+
+     To summarize, just passing in the full list, and iterating from back
+     to front makes things much more straightforward.  */
+
+  /* Array to hold RPO sorting results, before we copy back to maps[].  */
+  struct link_map *rpo[nmaps];
+
+  /* The 'head' position during each DFS iteration. Note that we start at
+     one past the last element due to first-decrement-then-store (see the
+     bottom of above dfs_traversal() routine).  */
+  struct link_map **rpo_head = &rpo[nmaps];
+
+  bool do_reldeps = false;
+  bool *do_reldeps_ref = (for_fini ? &do_reldeps : NULL);
+
+  for (int i = nmaps - 1; i >= 0; i--)
+    {
+      dfs_traversal (&rpo_head, maps[i], do_reldeps_ref);
+
+      /* We can break early if all objects are already placed.  */
+      if (rpo_head == rpo)
+	goto end;
+    }
+  assert (rpo_head == rpo);
+
+ end:
+  /* Here we may do a second pass of sorting, using only l_initfini[]
+     static dependency links. This is avoided if !FOR_FINI or if we didn't
+     find any reldeps in the first DFS traversal.
+
+     The reason we do this is: while it is unspecified how circular
+     dependencies should be handled, the presumed reasonable behavior is to
+     have destructors to respect static dependency links as much as possible,
+     overriding reldeps if needed. And the first sorting pass, which takes
+     l_initfini/l_reldeps links equally, may not preserve this priority.
+
+     Hence we do a 2nd sorting pass, taking only DT_NEEDED links into account
+     (see how the do_reldeps argument to dfs_traversal() is NULL below).  */
+  if (do_reldeps)
+    {
+      for (int i = nmaps - 1; i >= 0; i--)
+	rpo[i]->l_visited = 0;
+
+      struct link_map **maps_head = &maps[nmaps];
+      for (int i = nmaps - 1; i >= 0; i--)
+	{
+	  dfs_traversal (&maps_head, rpo[i], NULL);
+
+	  /* We can break early if all objects are already placed.
+	     The below memcpy is not needed in the do_reldeps case here,
+	     since we wrote back to maps[] during DFS traversal.  */
+	  if (maps_head == maps)
+	    return;
+	}
+      assert (maps_head == maps);
+      return;
+    }
+
+  memcpy (maps, rpo, sizeof (struct link_map *) * nmaps);
+}
+
+void
+_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
+	       unsigned int skip, bool for_fini)
+{
+  /* Index code for sorting algorithm currently in use.  */
+  static int32_t algorithm = 0;
+  if (__glibc_unlikely (algorithm == 0))
+    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
+			     int32_t, NULL);
+
+  /* It can be tempting to use a static function pointer to store and call
+     the current selected sorting algorithm routine, but experimentation
+     shows that current processors still do not handle indirect branches
+     that efficiently, plus a static function pointer will involve
+     PTR_MANGLE/DEMANGLE, further impairing performance of small, common
+     input cases. A simple if-case with direct function calls appears to
+     be the fastest.  */
+  if (__glibc_likely (algorithm == 1))
+    _dl_sort_maps_original (maps, nmaps, skip, for_fini);
+  else if (algorithm == 2)
+    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
+  else
+    __builtin_unreachable ();
+}
+
+#endif /* HAVE_TUNABLES.  */
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index 35634ef24d..a38937d6a8 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -140,4 +140,13 @@ glibc {
       default: 512
     }
   }
+
+  rtld {
+    dynamic_sort {
+      type: INT_32
+      minval: 1
+      maxval: 2
+      default: 1
+    }
+  }
 }
diff --git a/include/link.h b/include/link.h
index aea268439c..48c11c5d68 100644
--- a/include/link.h
+++ b/include/link.h
@@ -177,6 +177,10 @@ struct link_map
     unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
     unsigned int l_global:1;	/* Nonzero if object in _dl_global_scope.  */
     unsigned int l_reserved:2;	/* Reserved for internal use.  */
+    unsigned int l_visited:1;   /* Used internally for map dependency
+				   graph traversal.  */
+    unsigned int l_map_used:1;  /* These two bits are used during traversal */
+    unsigned int l_map_done:1;  /* of maps in _dl_close_worker(). */
     unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
 					to by `l_phdr' is allocated.  */
     unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index ba114ab4b1..4ad0cbc692 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -1025,7 +1025,7 @@ extern void _dl_fini (void) attribute_hidden;
 
 /* Sort array MAPS according to dependencies of the contained objects.  */
 extern void _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
-			   char *used, bool for_fini) attribute_hidden;
+			   unsigned int skip, bool for_fini) attribute_hidden;
 
 /* The dynamic linker calls this function before and having changing
    any shared object mappings.  The `r_state' member of `struct r_debug'

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-07-22 18:10 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 13:25 [PATCH v2 2/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader Chung-Lin Tang
2020-06-29 21:29 ` Carlos O'Donell
2020-07-22 18:10   ` [PATCH v3 " Chung-Lin Tang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).