public inbox for gdb-patches@sourceware.org
* [RFC][gdbsupport] Improve thread scheduling in parallel_for_each
@ 2022-07-15  9:40 Tom de Vries
  2022-07-15 19:05 ` Tom Tromey
  0 siblings, 1 reply; 3+ messages in thread
From: Tom de Vries @ 2022-07-15  9:40 UTC (permalink / raw)
  To: gdb-patches

Hi,

In https://sourceware.org/pipermail/gdb-patches/2022-July/190757.html I added
some debug statements to the thread scheduling in parallel_for_each and
noticed that only 3 out of 4 worker threads were used:
...
Parallel for: n_elements: 7271
Parallel for: minimum elements per thread: 10
Parallel for: elts_per_thread: 1817
Parallel for: worker threads: 4
Parallel for: worker threads used: 3
Parallel for: elements 0-1816 on thread 0: 1817
Parallel for: elements 1817-3633 on thread 1: 1817
Parallel for: elements 3634-5450 on thread 2: 1817
Parallel for: elements 5451-7270 on main thread: 1820
...

This simple fix:
...
-  size_t count = n_threads == 0 ? 0 : n_threads - 1;
+  size_t count = n_threads == 0 ? 0 : n_threads;
...
would give thread 3 a share of the work, such that we have instead:
...
Parallel for: worker threads used: 4
Parallel for: elements 0-1816 on thread 0: 1817
Parallel for: elements 1817-3633 on thread 1: 1817
Parallel for: elements 3634-5450 on thread 2: 1817
Parallel for: elements 5451-7267 on thread 3: 1817
Parallel for: elements 7268-7270 on main thread: 3
...
but that would leave the main thread with very few iterations (always less
than the number of worker threads).

Fix this instead by also counting the main thread when distributing the work,
such that we have:
...
Parallel for: elts_per_thread: 1454
Parallel for: worker threads: 4
Parallel for: worker threads used: 4
Parallel for: elements 0-1453 on thread 0: 1454
Parallel for: elements 1454-2907 on thread 1: 1454
Parallel for: elements 2908-4361 on thread 2: 1454
Parallel for: elements 4362-5815 on thread 3: 1454
Parallel for: elements 5816-7270 on main thread: 1455
...

This introduces a performance regression on a particular test-case I happened
to use:
...
$ for n in $(seq 1 10); do \
    time gdb -q -batch libxul.so.debug 2>&1 | grep real:; \
  done
...
so revert to the original schedule by reducing the number of worker threads:
...
     n_threads = std::thread::hardware_concurrency ();
+    if (n_threads > 0)
+      /* Account for main thread.  */
+      n_threads--;
...

So, it looks like the only thing we've achieved is not having an idle worker
thread anymore.

Redoing the performance experiment still yields a slight performance loss.

Try to improve things a tiny bit by spreading the leftover elements over
all threads rather than just the main thread:
...
Parallel for: elements 0-1817 on thread 0: 1818
Parallel for: elements 1818-3635 on thread 1: 1818
Parallel for: elements 3636-5453 on thread 2: 1818
Parallel for: elements 5454-7270 on main thread: 1817
...

Still, the performance experiment yields a slight performance loss.

Tested on x86_64-linux.

Any comments?

Thanks,
- Tom

[gdbsupport] Improve thread scheduling in parallel_for_each

---
 gdb/maint.c               |  7 ++++++-
 gdbsupport/parallel-for.h | 27 ++++++++++++++++-----------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/gdb/maint.c b/gdb/maint.c
index 289560957f2..cc3be7b7d82 100644
--- a/gdb/maint.c
+++ b/gdb/maint.c
@@ -850,7 +850,12 @@ update_thread_pool_size ()
   int n_threads = n_worker_threads;
 
   if (n_threads < 0)
-    n_threads = std::thread::hardware_concurrency ();
+    {
+      n_threads = std::thread::hardware_concurrency ();
+      if (n_threads > 0)
+	/* Account for main thread.  */
+	n_threads--;
+    }
 
   gdb::thread_pool::g_thread_pool->set_thread_count (n_threads);
 #endif
diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index a614fc35766..2bf8de25bc9 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -139,25 +139,30 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
   using result_type
     = typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type;
 
-  size_t n_threads = thread_pool::g_thread_pool->thread_count ();
+  size_t n_threads
+    = (thread_pool::g_thread_pool->thread_count ()
+       /* Also count the main thread as a thread to distribute work over.  */
+       + 1);
   size_t n_elements = last - first;
   size_t elts_per_thread = 0;
-  if (n_threads > 1)
-    {
-      /* Require that there should be at least N elements in a
-	 thread.  */
-      gdb_assert (n > 0);
-      if (n_elements / n_threads < n)
-	n_threads = std::max (n_elements / n, (size_t) 1);
-      elts_per_thread = n_elements / n_threads;
-    }
 
-  size_t count = n_threads == 0 ? 0 : n_threads - 1;
+  /* Require that there should be at least N elements in a
+     thread.  */
+  gdb_assert (n > 0);
+  if (n_elements / n_threads < n)
+    n_threads = std::max (n_elements / n, (size_t) 1);
+  elts_per_thread = n_elements / n_threads;
+  gdb_assert (elts_per_thread >= n || n_threads == 1);
+  size_t left_over = n_elements % n_threads;
+
+  size_t count = n_threads - 1;
   gdb::detail::par_for_accumulator<result_type> results (count);
 
   for (int i = 0; i < count; ++i)
     {
       RandomIt end = first + elts_per_thread;
+      if (i < left_over)
+	end++;
       results.post (i, [=] ()
         {
 	  return callback (first, end);


* Re: [RFC][gdbsupport] Improve thread scheduling in parallel_for_each
  2022-07-15  9:40 [RFC][gdbsupport] Improve thread scheduling in parallel_for_each Tom de Vries
@ 2022-07-15 19:05 ` Tom Tromey
  2022-07-18  7:30   ` Tom de Vries
  0 siblings, 1 reply; 3+ messages in thread
From: Tom Tromey @ 2022-07-15 19:05 UTC (permalink / raw)
  To: Tom de Vries via Gdb-patches

>>>>> "Tom" == Tom de Vries via Gdb-patches <gdb-patches@sourceware.org> writes:

Tom> This introduces a performance regression on a particular test-case I happened
Tom> to use:
Tom> ...
Tom> $ for n in $(seq 1 10); do \
Tom>     time gdb -q -batch libxul.so.debug 2>&1 | grep real:; \
Tom>   done
Tom> ...
Tom> so revert to the original schedule by reducing the worker threads:
Tom> ...

This seems like making a change and then undoing it somewhere else?

Tom> Still, the performance experiment yields a slight performance loss.

Sounds bad.

Tom>    if (n_threads < 0)
Tom> -    n_threads = std::thread::hardware_concurrency ();
Tom> +    {
Tom> +      n_threads = std::thread::hardware_concurrency ();
Tom> +      if (n_threads > 0)
Tom> +	/* Account for main thread.  */
Tom> +	n_threads--;
Tom> +    }

I think it's better if the setting just directly controls how many
threads there are.  Then elsewhere we can decide what that means -- like
if it performs better with the defaults to not do any work in the main
thread, then parallel_for_each can be modified to just send tasks to the
workers and do nothing in the main thread except wait for results.

Tom>    size_t elts_per_thread = 0;
[...]
Tom> +  elts_per_thread = n_elements / n_threads;

The initial declaration can be removed and then this latter line can
declare the variable as well.

Tom>    for (int i = 0; i < count; ++i)
Tom>      {
Tom>        RandomIt end = first + elts_per_thread;
Tom> +      if (i < left_over)
Tom> +	end++;

It may be nice to mention the distribution of leftovers in a comment
somewhere.

Tom


* Re: [RFC][gdbsupport] Improve thread scheduling in parallel_for_each
  2022-07-15 19:05 ` Tom Tromey
@ 2022-07-18  7:30   ` Tom de Vries
  0 siblings, 0 replies; 3+ messages in thread
From: Tom de Vries @ 2022-07-18  7:30 UTC (permalink / raw)
  To: Tom Tromey, Tom de Vries via Gdb-patches


On 7/15/22 21:05, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries via Gdb-patches <gdb-patches@sourceware.org> writes:
> 
> Tom> This introduces a performance regression on a particular test-case I happened
> Tom> to use:
> Tom> ...
> Tom> $ for n in $(seq 1 10); do \
> Tom>     time gdb -q -batch libxul.so.debug 2>&1 | grep real:; \
> Tom>   done
> Tom> ...
> Tom> so revert to the original schedule by reducing the worker threads:
> Tom> ...
> 
> This seems like making a change and then undoing it somewhere else?
> 
> Tom> Still, the performance experiment yields a slight performance loss.
> 
> Sounds bad.
> 
> Tom>    if (n_threads < 0)
> Tom> -    n_threads = std::thread::hardware_concurrency ();
> Tom> +    {
> Tom> +      n_threads = std::thread::hardware_concurrency ();
> Tom> +      if (n_threads > 0)
> Tom> +	/* Account for main thread.  */
> Tom> +	n_threads--;
> Tom> +    }
> 
> I think it's better if the setting just directly controls how many
> threads there are.  Then elsewhere we can decide what that means -- like
> if it performs better with the defaults to not do any work in the main
> thread, then parallel_for_each can be modified to just send tasks to the
> workers and do nothing in the main thread except wait for results.
> 
> Tom>    size_t elts_per_thread = 0;
> [...]
> Tom> +  elts_per_thread = n_elements / n_threads;
> 
> The initial declaration can be removed and then this latter line can
> declare the variable as well.
> 
> Tom>    for (int i = 0; i < count; ++i)
> Tom>      {
> Tom>        RandomIt end = first + elts_per_thread;
> Tom> +      if (i < left_over)
> Tom> +	end++;
> 
> It may be nice to mention the distribution of leftovers in a comment
> somewhere.

I've ended up committing this patch, which just does the leftovers 
distribution part, with some comments added.

Thanks,
- Tom

[-- Attachment #2: 0001-gdbsupport-Improve-thread-scheduling-in-parallel_for_each.patch --]

[gdbsupport] Improve thread scheduling in parallel_for_each

When running a task using parallel_for_each, we get the following
distribution:
...
Parallel for: n_elements: 7271
Parallel for: minimum elements per thread: 10
Parallel for: elts_per_thread: 1817
Parallel for: elements on worker thread 0       : 1817
Parallel for: elements on worker thread 1       : 1817
Parallel for: elements on worker thread 2       : 1817
Parallel for: elements on worker thread 3       : 0
Parallel for: elements on main thread           : 1820
...

Note that there are 4 active threads, and scheduling elts_per_thread on each
of those handles 4 * 1817 == 7268, leaving 3 "left over" elements.

These leftovers are currently handled in the main thread.

That doesn't seem to matter much for this example, but for say 10 threads and
99 elements, you'd have 9 threads handling 9 elements and 1 thread handling 18
elements.

Instead, distribute the left over elements over the worker threads, such that
we have:
...
Parallel for: elements on worker thread 0       : 1818
Parallel for: elements on worker thread 1       : 1818
Parallel for: elements on worker thread 2       : 1818
Parallel for: elements on worker thread 3       : 0
Parallel for: elements on main thread           : 1817
...

Tested on x86_64-linux.

---
 gdbsupport/parallel-for.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index cfe8a6e4f09..bf40f125f0f 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -147,6 +147,8 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
   size_t n_threads = n_worker_threads;
   size_t n_elements = last - first;
   size_t elts_per_thread = 0;
+  size_t elts_left_over = 0;
+
   if (n_threads > 1)
     {
       /* Require that there should be at least N elements in a
@@ -155,6 +157,8 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
       if (n_elements / n_threads < n)
 	n_threads = std::max (n_elements / n, (size_t) 1);
       elts_per_thread = n_elements / n_threads;
+      elts_left_over = n_elements % n_threads;
+      /* n_elements == n_threads * elts_per_thread + elts_left_over. */
     }
 
   size_t count = n_threads == 0 ? 0 : n_threads - 1;
@@ -170,6 +174,10 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
   for (int i = 0; i < count; ++i)
     {
       RandomIt end = first + elts_per_thread;
+      if (i < elts_left_over)
+	/* Distribute the leftovers over the worker threads, to avoid having
+	   to handle all of them in a single thread.  */
+	end++;
       if (parallel_for_each_debug)
 	debug_printf (_("Parallel for: elements on worker thread %i\t: %zu\n"),
 		      i, (size_t)(end - first));

