public inbox for gcc-patches@gcc.gnu.org
From: Julian Brown <julian@codesourcery.com>
To: Alexander Monakov <amonakov@ispras.ru>
Cc: Jakub Jelinek <jakub@redhat.com>, <gcc-patches@gcc.gnu.org>,
	"Thomas Schwinge" <thomas@codesourcery.com>,
	Tom de Vries <tdevries@suse.de>,
	Chung-Lin Tang <cltang@codesourcery.com>
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Date: Fri, 13 Nov 2020 20:54:54 +0000	[thread overview]
Message-ID: <20201113205454.65f1539d@squid.athome> (raw)
In-Reply-To: <alpine.LNX.2.20.13.2011100000050.9902@monopod.intra.ispras.ru>

[-- Attachment #1: Type: text/plain, Size: 7743 bytes --]

Hi Alexander,

Thanks for the review! Comments below.

On Tue, 10 Nov 2020 00:32:36 +0300
Alexander Monakov <amonakov@ispras.ru> wrote:

> On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> 
> > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:  
> > > This patch adds caching for the stack block allocated for
> > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > optimisation -- we observed an average 11% or so performance
> > > improvement with this patch across a set of accelerated GPU
> > > benchmarks on one machine (results vary according to individual
> > > benchmark and with hardware used).  
> 
> In this patch you're folding two changes together: reuse of allocated
> stacks and removing one host-device synchronization.  Why is that?
> Can you report performance change separately for each change (and
> split out the patches)?

An accident of the development process of the patch, really -- the idea
for removing the post-kernel-launch synchronisation came from the
OpenACC side, and adapting it to OpenMP meant the stacks had to remain
allocated after the return of the GOMP_OFFLOAD_run function.

> > > A given kernel launch will reuse the stack block from the
> > > previous launch if it is large enough, else it is freed and
> > > reallocated. A slight caveat is that memory will not be freed
> > > until the device is closed, so e.g. if code is using highly
> > > variable launch geometries and large amounts of GPU RAM, you
> > > might run out of resources slightly quicker with this patch.
> > > 
> > > Another way this patch gains performance is by omitting the
> > > synchronisation at the end of an OpenMP offload kernel launch --
> > > it's safe for the GPU and CPU to continue executing in parallel
> > > at that point, because e.g. copies-back from the device will be
> > > synchronised properly with kernel completion anyway.  
> 
> I don't think this explanation is sufficient. My understanding is
> that OpenMP forbids the host to proceed asynchronously after the
> target construct unless it is a 'target nowait' construct. This may
> be observable if there's a printf in the target region for example
> (or if it accesses memory via host pointers).
> 
> So this really needs to be a separate patch with more explanation why
> this is okay (if it is okay).

As long as the offload kernel only touches GPU memory and does not have
any CPU-visible side effects (like the printf you mentioned -- I hadn't
really considered that, oops!), it's probably OK.

But anyway, the benefit of omitting the synchronisation at the end of
GOMP_OFFLOAD_run, measured on OpenMP code (the same set of benchmarks
as before), seems minimal. So it's good enough to just do the stacks
caching, and leave out the synchronisation removal for now. (It might
still be worth considering later, as long as we can show that a given
kernel doesn't use printf or access memory via host pointers -- I guess
the former might be easier to prove than the latter. I have observed
the equivalent OpenACC patch provide a significant boost on some
benchmarks, so there's probably something to be gained on the OpenMP
side too.)
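
For illustration only -- this is not part of the patch -- the sanctioned
way to get that kind of overlap at the OpenMP level is an explicit
'nowait' on the target construct, with the host synchronising itself
before it looks at the results again; 'do_other_host_work' below is a
made-up placeholder:

  int x = 0;
  #pragma omp target nowait map(tofrom: x)
  x++;                    /* Device work, allowed to run asynchronously.  */

  do_other_host_work ();  /* Hypothetical host-side work, overlapped.  */

  #pragma omp taskwait    /* Required before the host reads 'x' again.  */

Without the 'nowait', the host must not observably run ahead of the
target region, which is why dropping the synchronisation is only safe
when the kernel has no host-visible side effects.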

The benefit with the attached patch -- just stacks caching, no
synchronisation removal -- is about 12% on the same set of benchmarks
as before. Results are a little noisy on the machine I'm benchmarking
on, so this isn't necessarily proof that the synchronisation removal is
harmful for performance!

> > > In turn, the last part necessitates a change to the way "(perhaps
> > > abort was called)" errors are detected and reported.  
> 
> As already mentioned using callbacks is problematic. Plus, I'm sure
> the way you lock out other threads is a performance loss when
> multiple threads have target regions: even though they will not run
> concurrently on the GPU, you still want to allow host threads to
> submit GPU jobs while the GPU is occupied.
> 
> I would suggest to have a small pool (up to 3 entries perhaps) of
> stacks. Then you can arrange reuse without totally serializing host
> threads on target regions.

I'm really wary of the additional complexity of adding a stack pool,
and the memory allocation/freeing code paths in CUDA appear to be so
slow that we get a benefit with this patch even when the GPU stream has
to wait for the CPU to unlock the stacks block. Also, for large GPU
launches, the size of the soft-stacks block isn't really trivial (I've
seen something like 50MB on the hardware I'm using, with default
options), and multiplying that by 3 could start to eat into the GPU
heap memory for "useful data" quite significantly.
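
For reference, this is roughly the per-device state a small pool would
need -- a hypothetical sketch only, none of these names exist in the
attached patch:

  #define OMP_STACKS_POOL_ENTRIES 3

  struct omp_stacks_entry
  {
    CUdeviceptr ptr;   /* Device address of this stacks block, or 0.  */
    size_t size;       /* Allocated size of the block in bytes.  */
    bool in_use;       /* Owned by a host thread's in-flight launch.  */
  };

  struct omp_stacks_pool
  {
    struct omp_stacks_entry entries[OMP_STACKS_POOL_ENTRIES];
    pthread_mutex_t lock;      /* Protects the entries and flags.  */
    pthread_cond_t available;  /* Signalled when an entry is released.  */
  };

Each launch would have to pick (or grow) a suitable entry, mark it busy,
and release it once the kernel completes -- that's the extra complexity
(and the up-to-3x memory footprint) I'd rather avoid for now.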

Consider the attached (probably not amazingly-written) microbenchmark.
It spawns 8 threads which each launch lots of OpenMP kernels
performing some trivial work, then joins the threads and checks the
results. As a baseline, with the "FEWER_KERNELS" parameters set (256
kernel launches over 8 threads), this gives us, over 5 runs:

real    3m55.375s
user    7m14.192s
sys     0m30.148s

real    3m54.487s
user    7m6.775s
sys     0m34.678s

real    3m54.633s
user    7m20.381s
sys     0m30.620s

real    3m54.992s
user    7m12.464s
sys     0m29.610s

real    3m55.471s
user    7m14.342s
sys     0m29.815s

With a version of the attached patch, we instead get:

real    3m53.404s
user    3m39.869s
sys     0m16.149s

real    3m54.713s
user    3m41.018s
sys     0m16.129s

real    3m55.242s
user    3m55.148s
sys     0m17.130s

real    3m55.374s
user    3m40.411s
sys     0m15.818s

real    3m55.189s
user    3m40.144s
sys     0m15.846s

That is: real time is about the same, but user/sys time are reduced.

Without FEWER_KERNELS (1048576 kernel launches over 8 threads), the
baseline is:

real    12m29.975s
user    24m2.244s
sys     8m8.153s

real    12m15.391s
user    23m51.018s
sys     8m0.809s

real    12m5.424s
user    23m38.585s
sys     7m47.714s

real    12m10.456s
user    23m51.691s
sys     7m54.324s

real    12m37.735s
user    24m19.671s
sys     8m15.752s

And with the patch, we get:

real    4m42.600s
user    16m14.593s
sys     0m40.444s

real    4m43.579s
user    15m33.805s
sys     0m38.537s

real    4m42.211s
user    16m32.926s
sys     0m40.271s

real    4m44.256s
user    15m49.290s
sys     0m39.116s

real    4m42.013s
user    15m39.447s
sys     0m38.517s

Real, user, and sys time are all dramatically lower. So I'd suggest that
the attached patch is an improvement over the status quo, even though we
could still experiment with the stacks pool idea as a further
improvement later on.

The attached patch also implements a size limit for retention of the
soft-stack block -- freeing it before allocating more memory, rather
than at the start of a kernel launch, so bigger blocks can still be
shared between kernel launches if there's no memory allocation between
them. It also tries freeing smaller cached soft-stack blocks and
retrying memory allocation in out-of-memory situations.
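
In outline, the GOMP_OFFLOAD_alloc path in the attached patch then
becomes (simplified from the diff below; see the patch for the real
code):

  /* Free the cached soft-stacks block only if it has grown beyond
     SOFTSTACK_CACHE_LIMIT.  */
  nvptx_stacks_free (ptx_dev, false);

  void *d = nvptx_alloc (size, true);  /* Suppress out-of-memory errors.  */
  if (d)
    return d;

  /* Allocation failed: drop the cached stacks block unconditionally and
     retry, reporting any error this time.  */
  nvptx_stacks_free (ptx_dev, true);
  return nvptx_alloc (size, false);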

Re-tested with offloading to NVPTX. OK for trunk?

Thanks,

Julian

ChangeLog

2020-11-13  Julian Brown  <julian@codesourcery.com>

libgomp/
    * plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
    (struct ptx_device): Add omp_stacks struct.
    (nvptx_open_device): Initialise cached-stacks housekeeping info.
    (nvptx_close_device): Free cached stacks block and mutex.
    (nvptx_stacks_free): New function.
    (nvptx_alloc): Add SUPPRESS_ERRORS parameter.
    (GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks block.
    (nvptx_stacks_alloc): Rename to...
    (nvptx_stacks_acquire): This.  Cache stacks block between runs if same
    size or smaller is required.
    (nvptx_stacks_free): Remove.
    (GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks block
    during kernel execution.

[-- Attachment #2: mt.c --]
[-- Type: text/x-c++src, Size: 1565 bytes --]

#include <pthread.h>
#include <assert.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#ifdef FEWER_KERNELS
#define THREADS 8
#define BLOCKSIZE (8 * 1024 * 1024)
#define REPS 32
#else
#define THREADS 8
#define BLOCKSIZE (8 * 256)
#define REPS (32 * 1024 * 4)
#endif

/* A worker thread.  Launch a pile of offload kernels and do some work.  */

void *
target_incr (void *arg)
{
  int *myarr = (int *) arg;

  for (int r = 0; r < REPS; r++)
    {
      #pragma omp target map(tofrom: myarr[0:BLOCKSIZE])
      {
	#pragma omp for
	for (int i = 0; i < BLOCKSIZE; i++)
	  {
	    myarr[i]++; 
	  }
      }
    }

  return NULL;
}

int
main (int argc, char* argv[])
{
  int *arr[THREADS];
  pthread_t workerthread[THREADS];
  int i;

  for (i = 0; i < THREADS; i++)
    {
      arr[i] = malloc (BLOCKSIZE * sizeof (int));
      memset (arr[i], 0, BLOCKSIZE * sizeof (int));
    }

  for (i = 0; i < THREADS; i++)
    {
      int *tmp = arr[i];
      #pragma omp target enter data map(to: tmp[0:BLOCKSIZE])
    }

  for (i = 0; i < THREADS; i++)
    pthread_create (&workerthread[i], NULL, target_incr, (void *) arr[i]);

  for (i = 0; i < THREADS; i++)
    {
      void *rv;
      pthread_join (workerthread[i], &rv);
      assert (rv == NULL);
    }

  for (i = 0; i < THREADS; i++)
    {
      int *tmp = arr[i];
      #pragma omp target exit data map(from: tmp[0:BLOCKSIZE])
    }

  for (i = 0; i < THREADS; i++)
    for (int j = 0; j < BLOCKSIZE; j++)
      assert (arr[i][j] == REPS);

  for (i = 0; i < THREADS; i++)
    free (arr[i]);

  return 0;
}

[-- Attachment #3: nvptx-stacks-caching-3.diff --]
[-- Type: text/x-patch, Size: 7677 bytes --]

commit eea42d570664fa3370732d504425508593735899
Author: Julian Brown <julian@codesourcery.com>
Date:   Wed Oct 21 10:00:19 2020 -0700

    nvptx: Cache stacks block for OpenMP kernel launch
    
    2020-11-13  Julian Brown  <julian@codesourcery.com>
    
    libgomp/
            * plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
            (struct ptx_device): Add omp_stacks struct.
            (nvptx_open_device): Initialise cached-stacks housekeeping info.
            (nvptx_close_device): Free cached stacks block and mutex.
            (nvptx_stacks_free): New function.
            (nvptx_alloc): Add SUPPRESS_ERRORS parameter.
            (GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks block.
            (nvptx_stacks_alloc): Rename to...
            (nvptx_stacks_acquire): This.  Cache stacks block between runs if same
            size or smaller is required.
            (nvptx_stacks_free): Remove.
            (GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks block
            during kernel execution.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..2261f367cc2c 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -49,6 +49,15 @@
 #include <assert.h>
 #include <errno.h>
 
+/* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
+   block to cache between kernel invocations.  For soft-stacks blocks bigger
+   than this, we will free the block before attempting another GPU memory
+   allocation (i.e. in GOMP_OFFLOAD_alloc).  Otherwise, if an allocation fails,
+   we will free the cached soft-stacks block anyway then retry the
+   allocation.  If that fails too, we lose.  */
+
+#define SOFTSTACK_CACHE_LIMIT 134217728
+
 #if CUDA_VERSION < 6000
 extern CUresult cuGetErrorString (CUresult, const char **);
 #define CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR 82
@@ -307,6 +316,14 @@ struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+    {
+      CUdeviceptr ptr;
+      size_t size;
+      pthread_mutex_t lock;
+    } omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +531,10 @@ nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +555,11 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+    CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
     CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -999,12 +1025,39 @@ goacc_profiling_acc_ev_alloc (struct goacc_thread *thr, void *dp, size_t s)
   GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, api_info);
 }
 
+/* Free the cached soft-stacks block if it is above the SOFTSTACK_CACHE_LIMIT
+   size threshold, or if FORCE is true.  */
+
+static void
+nvptx_stacks_free (struct ptx_device *ptx_dev, bool force)
+{
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+  if (force || ptx_dev->omp_stacks.size > SOFTSTACK_CACHE_LIMIT)
+    {
+      CUresult r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+      ptx_dev->omp_stacks.ptr = 0;
+      ptx_dev->omp_stacks.size = 0;
+    }
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
+}
+
 static void *
-nvptx_alloc (size_t s)
+nvptx_alloc (size_t s, bool suppress_errors)
 {
   CUdeviceptr d;
 
-  CUDA_CALL_ERET (NULL, cuMemAlloc, &d, s);
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &d, s);
+  if (suppress_errors && r == CUDA_ERROR_OUT_OF_MEMORY)
+    return NULL;
+  else if (r != CUDA_SUCCESS)
+    {
+      GOMP_PLUGIN_error ("nvptx_alloc error: %s", cuda_error (r));
+      return NULL;
+    }
+
+  /* NOTE: We only do profiling stuff if the memory allocation succeeds.  */
   struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
   bool profiling_p
     = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
@@ -1352,6 +1405,8 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
 
+  nvptx_stacks_free (ptx_dev, false);
+
   while (blocks)
     {
       tmp = blocks->next;
@@ -1360,7 +1415,16 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
       blocks = tmp;
     }
 
-  return nvptx_alloc (size);
+  void *d = nvptx_alloc (size, true);
+  if (d)
+    return d;
+  else
+    {
+      /* Memory allocation failed.  Try freeing the stacks block, and
+	 retrying.  */
+      nvptx_stacks_free (ptx_dev, true);
+      return nvptx_alloc (size, false);
+    }
 }
 
 bool
@@ -1866,26 +1930,36 @@ nvptx_stacks_size ()
   return 128 * 1024;
 }
 
-/* Return contiguous storage for NUM stacks, each SIZE bytes.  */
+/* Return contiguous storage for NUM stacks, each SIZE bytes.  The lock for
+   the storage should be held on entry, and remains held on exit.  */
 
 static void *
-nvptx_stacks_alloc (size_t size, int num)
+nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 {
-  CUdeviceptr stacks;
-  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &stacks, size * num);
+  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
+    return (void *) ptx_dev->omp_stacks.ptr;
+
+  /* Free the old, too-small stacks.  */
+  if (ptx_dev->omp_stacks.ptr)
+    {
+      CUresult r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s\n", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+    }
+
+  /* Make new and bigger stacks, and remember where we put them and how big
+     they are.  */
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &ptx_dev->omp_stacks.ptr,
+				  size * num);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
-  return (void *) stacks;
-}
 
-/* Release storage previously allocated by nvptx_stacks_alloc.  */
+  ptx_dev->omp_stacks.size = size * num;
 
-static void
-nvptx_stacks_free (void *p, int num)
-{
-  CUresult r = CUDA_CALL_NOCHECK (cuMemFree, (CUdeviceptr) p);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+  return (void *) ptx_dev->omp_stacks.ptr;
 }
 
 void
@@ -1922,7 +1996,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
-  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
+
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
   void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
   size_t fn_args_size = sizeof fn_args;
   void *config[] = {
@@ -1944,7 +2020,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 		       maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-  nvptx_stacks_free (stacks, teams * threads);
+
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
 }
 
 /* TODO: Implement GOMP_OFFLOAD_async_run. */

Thread overview: 16+ messages
2020-10-26 14:14 Julian Brown
2020-10-26 14:26 ` Jakub Jelinek
2020-11-09 21:32   ` Alexander Monakov
2020-11-13 20:54     ` Julian Brown [this message]
2020-12-08  1:13       ` Julian Brown
2020-12-08 17:11         ` Alexander Monakov
2020-12-15 13:39           ` Julian Brown
2020-12-15 13:49             ` Jakub Jelinek
2020-12-15 16:49               ` Julian Brown
2020-12-15 17:00                 ` Jakub Jelinek
2020-12-15 23:16                   ` Julian Brown
2021-01-05 12:13                     ` Julian Brown
2021-01-05 15:32                       ` Jakub Jelinek
2020-10-27 13:17 ` Julian Brown
2020-10-28  7:25   ` Chung-Lin Tang
2020-10-28 11:32     ` Julian Brown
