Date: Fri, 13 Nov 2020 20:54:54 +0000
From: Julian Brown
To: Alexander Monakov
CC: Jakub Jelinek, gcc-patches@gcc.gnu.org, Thomas Schwinge,
	Tom de Vries, Chung-Lin Tang
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Message-ID: <20201113205454.65f1539d@squid.athome>
References: <20201026141448.109041-1-julian@codesourcery.com>
	<20201026142634.GI7080@tucnak>

--MP_/ytq3AIt_PefKyNSQhTfAuKE
Content-Type: text/plain; charset="US-ASCII"
Content-Disposition: inline

Hi Alexander,

Thanks for the review! Comments below.

On Tue, 10 Nov 2020 00:32:36 +0300
Alexander Monakov wrote:

> On Mon, 26 Oct 2020, Jakub Jelinek wrote:
>
> > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > > This patch adds caching for the stack block allocated for
> > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > optimisation -- we observed an average 11% or so performance
> > > improvement with this patch across a set of accelerated GPU
> > > benchmarks on one machine (results vary according to individual
> > > benchmark and with hardware used).
>
> In this patch you're folding two changes together: reuse of allocated
> stacks and removing one host-device synchronization. Why is that?
> Can you report performance change separately for each change (and
> split out the patches)?

An accident of the development process of the patch, really -- the idea
for removing the post-kernel-launch synchronisation came from the
OpenACC side, and adapting it to OpenMP meant the stacks had to remain
allocated after the return of the GOMP_OFFLOAD_run function.

> > > A given kernel launch will reuse the stack block from the
> > > previous launch if it is large enough, else it is freed and
> > > reallocated. A slight caveat is that memory will not be freed
> > > until the device is closed, so e.g. if code is using highly
> > > variable launch geometries and large amounts of GPU RAM, you
> > > might run out of resources slightly quicker with this patch.
> > >
> > > Another way this patch gains performance is by omitting the
> > > synchronisation at the end of an OpenMP offload kernel launch --
> > > it's safe for the GPU and CPU to continue executing in parallel
> > > at that point, because e.g. copies-back from the device will be
> > > synchronised properly with kernel completion anyway.
>
> I don't think this explanation is sufficient. My understanding is
> that OpenMP forbids the host to proceed asynchronously after the
> target construct unless it is a 'target nowait' construct. This may
> be observable if there's a printf in the target region for example
> (or if it accesses memory via host pointers).
>
> So this really needs to be a separate patch with more explanation why
> this is okay (if it is okay).

As long as the offload kernel only touches GPU memory and does not have
any CPU-visible side effects (like the printf you mentioned -- I hadn't
really considered that, oops!), it's probably OK. But anyway, the
benefit of omitting the synchronisation at the end of GOMP_OFFLOAD_run
(measured on the same set of benchmarks as before) seems minimal. So
it's good enough to just do the stacks caching, and leave out the
synchronisation removal for now. (It might still be worth considering
later, as long as we can show that a given kernel doesn't use printf or
access memory via host pointers -- I guess the former might be easier
than the latter. I have observed the equivalent OpenACC patch provide a
significant boost on some benchmarks, so there's probably something
that could be gained on the OpenMP side too.)
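To make that observability point concrete, a program along these lines
(hypothetical -- it's not part of the benchmark attached below) could
tell the difference:

#include <stdio.h>

int
main (void)
{
#pragma omp target
  printf ("hello from the device\n");

  /* The target construct above is not 'nowait', so OpenMP requires it
     to have completed here, and the device-side output should already
     be visible.  If GOMP_OFFLOAD_run returns before the kernel has
     drained, the two lines could appear in either order.  */
  printf ("hello from the host\n");

  return 0;
}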
The benefit with the attached patch -- just stacks caching, no
synchronisation removal -- is about 12% on the same set of benchmarks
as before. Results are a little noisy on the machine I'm benchmarking
on, so this isn't necessarily proof that the synchronisation removal is
harmful for performance!

> > > In turn, the last part necessitates a change to the way "(perhaps
> > > abort was called)" errors are detected and reported.
>
> As already mentioned using callbacks is problematic. Plus, I'm sure
> the way you lock out other threads is a performance loss when
> multiple threads have target regions: even though they will not run
> concurrently on the GPU, you still want to allow host threads to
> submit GPU jobs while the GPU is occupied.
>
> I would suggest to have a small pool (up to 3 entries perhaps) of
> stacks. Then you can arrange reuse without totally serializing host
> threads on target regions.

I'm really wary of the additional complexity of adding a stack pool,
and the memory allocation/freeing code paths in CUDA appear to be so
slow that we get a benefit with this patch even when the GPU stream has
to wait for the CPU to unlock the stacks block.

Also, for large GPU launches, the size of the soft-stacks block isn't
really trivial (I've seen something like 50MB on the hardware I'm
using, with default options), and multiplying that by 3 could start to
eat into the GPU heap memory for "useful data" quite significantly.
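For the record, here's roughly the shape I'd imagine such a pool taking
if we did experiment with it later (a sketch for discussion only -- the
names and the replacement policy are invented here, and none of this is
in the attached patch):

/* A small fixed-size pool of cached stacks blocks, so host threads
   serialize only around the entry table rather than around whole
   kernel executions.  Sketch only; no names below come from the
   attached patch.  */

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <cuda.h>

#define STACKS_POOL_ENTRIES 3

struct stacks_pool_entry
{
  CUdeviceptr ptr;	/* Cached stacks block (0 if not yet allocated).  */
  size_t size;		/* Size of the block in bytes.  */
  bool in_use;		/* Claimed by a host thread launching a kernel.  */
};

struct stacks_pool
{
  pthread_mutex_t lock;	/* Guards the entry table, not kernel execution.  */
  struct stacks_pool_entry entries[STACKS_POOL_ENTRIES];
};

/* Claim a block of at least SIZE bytes: prefer the smallest free entry
   that is already big enough, else grow a free entry.  Returns NULL if
   all entries are busy or allocation fails.  */

static struct stacks_pool_entry *
stacks_pool_acquire (struct stacks_pool *pool, size_t size)
{
  struct stacks_pool_entry *fit = NULL, *spare = NULL;

  pthread_mutex_lock (&pool->lock);
  for (int i = 0; i < STACKS_POOL_ENTRIES; i++)
    {
      struct stacks_pool_entry *e = &pool->entries[i];
      if (e->in_use)
	continue;
      if (e->size >= size && (!fit || e->size < fit->size))
	fit = e;
      else if (e->size < size && !spare)
	spare = e;
    }
  if (!fit && spare)
    {
      /* Grow (or first-allocate) a free entry that is too small.  */
      if (spare->ptr)
	cuMemFree (spare->ptr);
      spare->ptr = 0;
      spare->size = 0;
      if (cuMemAlloc (&spare->ptr, size) == CUDA_SUCCESS)
	{
	  spare->size = size;
	  fit = spare;
	}
    }
  if (fit)
    fit->in_use = true;
  pthread_mutex_unlock (&pool->lock);
  return fit;
}

/* Return ENTRY to POOL once the kernel using it has completed.  */

static void
stacks_pool_release (struct stacks_pool *pool,
		     struct stacks_pool_entry *entry)
{
  pthread_mutex_lock (&pool->lock);
  entry->in_use = false;
  pthread_mutex_unlock (&pool->lock);
}

Even then, each host thread pays for the entry-table locking on every
launch, and the worst case still pins three large soft-stacks blocks at
once, which is the memory-footprint concern above.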
Consider the attached (probably not amazingly-written) microbenchmark.
It spawns 8 threads which each launch lots of OpenMP kernels performing
some trivial work, then joins the threads and checks the results.

As a baseline, with the "FEWER_KERNELS" parameters set (256 kernel
launches over 8 threads), this gives us, over 5 runs:

real 3m55.375s  user 7m14.192s  sys 0m30.148s
real 3m54.487s  user 7m6.775s   sys 0m34.678s
real 3m54.633s  user 7m20.381s  sys 0m30.620s
real 3m54.992s  user 7m12.464s  sys 0m29.610s
real 3m55.471s  user 7m14.342s  sys 0m29.815s

With a version of the attached patch, we instead get:

real 3m53.404s  user 3m39.869s  sys 0m16.149s
real 3m54.713s  user 3m41.018s  sys 0m16.129s
real 3m55.242s  user 3m55.148s  sys 0m17.130s
real 3m55.374s  user 3m40.411s  sys 0m15.818s
real 3m55.189s  user 3m40.144s  sys 0m15.846s

That is: real time is about the same, but user/sys time are reduced.

Without FEWER_KERNELS (1048576 kernel launches over 8 threads), the
baseline is:

real 12m29.975s  user 24m2.244s   sys 8m8.153s
real 12m15.391s  user 23m51.018s  sys 8m0.809s
real 12m5.424s   user 23m38.585s  sys 7m47.714s
real 12m10.456s  user 23m51.691s  sys 7m54.324s
real 12m37.735s  user 24m19.671s  sys 8m15.752s

And with the patch, we get:

real 4m42.600s  user 16m14.593s  sys 0m40.444s
real 4m43.579s  user 15m33.805s  sys 0m38.537s
real 4m42.211s  user 16m32.926s  sys 0m40.271s
real 4m44.256s  user 15m49.290s  sys 0m39.116s
real 4m42.013s  user 15m39.447s  sys 0m38.517s

Real, user and sys time are all dramatically lower. So I'd suggest that
the attached patch is an improvement over the status quo, even if we
could experiment with the stacks-pool idea as a further improvement
later on.

The attached patch also implements a size limit for retention of the
soft-stacks block -- freeing it before allocating more memory, rather
than at the start of a kernel launch, so bigger blocks can still be
shared between kernel launches if there's no memory allocation between
them. It also tries freeing smaller cached soft-stacks blocks and
retrying memory allocation in out-of-memory situations.

Re-tested with offloading to NVPTX. OK for trunk?

Thanks,

Julian

ChangeLog

2020-11-13  Julian Brown

libgomp/
	* plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
	(struct ptx_device): Add omp_stacks struct.
	(nvptx_open_device): Initialise cached-stacks housekeeping info.
	(nvptx_close_device): Free cached stacks block and mutex.
	(nvptx_stacks_free): New function.
	(nvptx_alloc): Add SUPPRESS_ERRORS parameter.
	(GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks block.
	(nvptx_stacks_alloc): Rename to...
	(nvptx_stacks_acquire): This.  Cache stacks block between runs if same
	size or smaller is required.
	(nvptx_stacks_free): Remove.
	(GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks block
	during kernel execution.

--MP_/ytq3AIt_PefKyNSQhTfAuKE
Content-Type: text/x-c++src
Content-Disposition: attachment; filename="mt.c"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <assert.h>

#ifdef FEWER_KERNELS
#define THREADS 8
#define BLOCKSIZE (8 * 1024 * 1024)
#define REPS 32
#else
#define THREADS 8
#define BLOCKSIZE (8 * 256)
#define REPS (32 * 1024 * 4)
#endif

/* A worker thread.  Launch a pile of offload kernels and do some
   work.  */
void *
target_incr (void *arg)
{
  int *myarr = (int *) arg;

  for (int r = 0; r < REPS; r++)
    {
      #pragma omp target map(tofrom: myarr[0:BLOCKSIZE])
      {
	#pragma omp for
	for (int i = 0; i < BLOCKSIZE; i++)
	  {
	    myarr[i]++;
	  }
      }
    }

  return NULL;
}

int
main (int argc, char* argv[])
{
  int *arr[THREADS];
  pthread_t workerthread[THREADS];
  int i;

  for (i = 0; i < THREADS; i++)
    {
      arr[i] = malloc (BLOCKSIZE * sizeof (int));
      memset (arr[i], 0, BLOCKSIZE * sizeof (int));
    }

  for (i = 0; i < THREADS; i++)
    {
      int *tmp = arr[i];
      #pragma omp target enter data map(to: tmp[0:BLOCKSIZE])
    }

  for (i = 0; i < THREADS; i++)
    pthread_create (&workerthread[i], NULL, target_incr, (void *) arr[i]);

  for (i = 0; i < THREADS; i++)
    {
      void *rv;
      pthread_join (workerthread[i], &rv);
      assert (rv == NULL);
    }

  for (i = 0; i < THREADS; i++)
    {
      int *tmp = arr[i];
      #pragma omp target exit data map(from: tmp[0:BLOCKSIZE])
    }

  for (i = 0; i < THREADS; i++)
    for (int j = 0; j < BLOCKSIZE; j++)
      assert (arr[i][j] == REPS);

  for (i = 0; i < THREADS; i++)
    free (arr[i]);

  return 0;
}

--MP_/ytq3AIt_PefKyNSQhTfAuKE
Content-Type: text/x-patch
Content-Disposition: attachment; filename="nvptx-stacks-caching-3.diff"

commit eea42d570664fa3370732d504425508593735899
Author: Julian Brown
Date:   Wed Oct 21 10:00:19 2020 -0700

    nvptx: Cache stacks block for OpenMP kernel launch

    2020-11-13  Julian Brown

    libgomp/
	* plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
	(struct ptx_device): Add omp_stacks struct.
	(nvptx_open_device): Initialise cached-stacks housekeeping info.
	(nvptx_close_device): Free cached stacks block and mutex.
	(nvptx_stacks_free): New function.
	(nvptx_alloc): Add SUPPRESS_ERRORS parameter.
	(GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks block.
	(nvptx_stacks_alloc): Rename to...
	(nvptx_stacks_acquire): This.  Cache stacks block between runs if same
	size or smaller is required.
	(nvptx_stacks_free): Remove.
	(GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks block
	during kernel execution.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..2261f367cc2c 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -49,6 +49,15 @@
 #include <assert.h>
 #include <errno.h>
 
+/* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
+   block to cache between kernel invocations.  For soft-stacks blocks bigger
+   than this, we will free the block before attempting another GPU memory
+   allocation (i.e. in GOMP_OFFLOAD_alloc).  Otherwise, if an allocation fails,
+   we will free the cached soft-stacks block anyway then retry the
+   allocation.  If that fails too, we lose.  */
+
+#define SOFTSTACK_CACHE_LIMIT 134217728
+
 #if CUDA_VERSION < 6000
 extern CUresult cuGetErrorString (CUresult, const char **);
 #define CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR 82
@@ -307,6 +316,14 @@ struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+    {
+      CUdeviceptr ptr;
+      size_t size;
+      pthread_mutex_t lock;
+    } omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +531,10 @@ nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +555,11 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+    CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
     CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -999,12 +1025,39 @@ goacc_profiling_acc_ev_alloc (struct goacc_thread *thr, void *dp, size_t s)
   GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, api_info);
 }
 
+/* Free the cached soft-stacks block if it is above the SOFTSTACK_CACHE_LIMIT
+   size threshold, or if FORCE is true.  */
+
+static void
+nvptx_stacks_free (struct ptx_device *ptx_dev, bool force)
+{
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+  if (force || ptx_dev->omp_stacks.size > SOFTSTACK_CACHE_LIMIT)
+    {
+      CUresult r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+      ptx_dev->omp_stacks.ptr = 0;
+      ptx_dev->omp_stacks.size = 0;
+    }
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
+}
+
 static void *
-nvptx_alloc (size_t s)
+nvptx_alloc (size_t s, bool suppress_errors)
 {
   CUdeviceptr d;
 
-  CUDA_CALL_ERET (NULL, cuMemAlloc, &d, s);
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &d, s);
+  if (suppress_errors && r == CUDA_ERROR_OUT_OF_MEMORY)
+    return NULL;
+  else if (r != CUDA_SUCCESS)
+    {
+      GOMP_PLUGIN_error ("nvptx_alloc error: %s", cuda_error (r));
+      return NULL;
+    }
+
+  /* NOTE: We only do profiling stuff if the memory allocation succeeds.  */
   struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
   bool profiling_p
     = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
@@ -1352,6 +1405,8 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
 
+  nvptx_stacks_free (ptx_dev, false);
+
   while (blocks)
     {
       tmp = blocks->next;
@@ -1360,7 +1415,16 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
       blocks = tmp;
     }
 
-  return nvptx_alloc (size);
+  void *d = nvptx_alloc (size, true);
+  if (d)
+    return d;
+  else
+    {
+      /* Memory allocation failed.  Try freeing the stacks block, and
+	 retrying.  */
+      nvptx_stacks_free (ptx_dev, true);
+      return nvptx_alloc (size, false);
+    }
 }
 
 bool
@@ -1866,26 +1930,36 @@ nvptx_stacks_size ()
   return 128 * 1024;
 }
 
-/* Return contiguous storage for NUM stacks, each SIZE bytes.  */
+/* Return contiguous storage for NUM stacks, each SIZE bytes.  The lock for
+   the storage should be held on entry, and remains held on exit.  */
 
 static void *
-nvptx_stacks_alloc (size_t size, int num)
+nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 {
-  CUdeviceptr stacks;
-  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &stacks, size * num);
+  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
+    return (void *) ptx_dev->omp_stacks.ptr;
+
+  /* Free the old, too-small stacks.  */
+  if (ptx_dev->omp_stacks.ptr)
+    {
+      CUresult r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s\n", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+    }
+
+  /* Make new and bigger stacks, and remember where we put them and how big
+     they are.  */
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &ptx_dev->omp_stacks.ptr,
+				  size * num);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
-  return (void *) stacks;
-}
 
-/* Release storage previously allocated by nvptx_stacks_alloc.  */
+  ptx_dev->omp_stacks.size = size * num;
 
-static void
-nvptx_stacks_free (void *p, int num)
-{
-  CUresult r = CUDA_CALL_NOCHECK (cuMemFree, (CUdeviceptr) p);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+  return (void *) ptx_dev->omp_stacks.ptr;
 }
 
 void
@@ -1922,7 +1996,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
-  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
+
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
   void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
   size_t fn_args_size = sizeof fn_args;
   void *config[] = {
@@ -1944,7 +2020,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
			     maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-  nvptx_stacks_free (stacks, teams * threads);
+
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
 }
 
 /* TODO: Implement GOMP_OFFLOAD_async_run.  */

--MP_/ytq3AIt_PefKyNSQhTfAuKE--