From: Alexander Monakov <amonakov@ispras.ru>
To: Julian Brown <julian@codesourcery.com>
Cc: Jakub Jelinek <jakub@redhat.com>,
	gcc-patches@gcc.gnu.org,
	 Thomas Schwinge <thomas@codesourcery.com>,
	Tom de Vries <tdevries@suse.de>
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Date: Tue, 8 Dec 2020 20:11:38 +0300 (MSK)
Message-ID: <alpine.LNX.2.20.13.2012082009570.2504@monopod.intra.ispras.ru>
In-Reply-To: <20201208011348.41209b59@squid.athome>


On Tue, 8 Dec 2020, Julian Brown wrote:

> Ping?

This has addressed my concerns, thanks.

Alexander

> On Fri, 13 Nov 2020 20:54:54 +0000
> Julian Brown <julian@codesourcery.com> wrote:
> 
> > Hi Alexander,
> > 
> > Thanks for the review! Comments below.
> > 
> > On Tue, 10 Nov 2020 00:32:36 +0300
> > Alexander Monakov <amonakov@ispras.ru> wrote:
> > 
> > > On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> > >   
> > > > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:    
> > > > > This patch adds caching for the stack block allocated for
> > > > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > > > optimisation -- we observed an average 11% or so performance
> > > > > improvement with this patch across a set of accelerated GPU
> > > > > benchmarks on one machine (results vary according to individual
> > > > > benchmark and with hardware used).    
> > > 
> > > In this patch you're folding two changes together: reuse of
> > > allocated stacks and removing one host-device synchronization.  Why
> > > is that? Can you report performance change separately for each
> > > change (and split out the patches)?  
> > 
> > An accident of the development process of the patch, really -- the
> > idea for removing the post-kernel-launch synchronisation came from the
> > OpenACC side, and adapting it to OpenMP meant the stacks had to remain
> > allocated after the return of the GOMP_OFFLOAD_run function.
> > 
> > > > > A given kernel launch will reuse the stack block from the
> > > > > previous launch if it is large enough, else it is freed and
> > > > > reallocated. A slight caveat is that memory will not be freed
> > > > > until the device is closed, so e.g. if code is using highly
> > > > > variable launch geometries and large amounts of GPU RAM, you
> > > > > might run out of resources slightly quicker with this patch.
> > > > > 
> > > > > Another way this patch gains performance is by omitting the
> > > > > synchronisation at the end of an OpenMP offload kernel launch --
> > > > > it's safe for the GPU and CPU to continue executing in parallel
> > > > > at that point, because e.g. copies-back from the device will be
> > > > > synchronised properly with kernel completion anyway.    
> > > 
> > > I don't think this explanation is sufficient. My understanding is
> > > that OpenMP forbids the host to proceed asynchronously after the
> > > target construct unless it is a 'target nowait' construct. This may
> > > be observable if there's a printf in the target region for example
> > > (or if it accesses memory via host pointers).
> > > 
> > > So this really needs to be a separate patch with more explanation
> > > why this is okay (if it is okay).  
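As a purely illustrative example of the concern (not taken from the patch or
the attachments), a target region whose only effect is visible on the host
makes the missing synchronisation observable:

/* For a plain (non-nowait) target construct, the host thread may not
   proceed past the construct until the region has completed, so the
   device-side printf below has to have taken effect before the host-side
   one runs.  If the plugin returned from the kernel launch without
   synchronising, that ordering would no longer be guaranteed.  */
#include <stdio.h>

int
main (void)
{
#pragma omp target
  printf ("from the target region\n");

  printf ("host continues here\n");
  return 0;
}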
> > 
> > As long as the offload kernel only touches GPU memory and does not
> > have any CPU-visible side effects (like the printf you mentioned -- I
> > hadn't really considered that, oops!), it's probably OK.
> > 
> > But anyway, the benefit of omitting the synchronisation at the end of
> > GOMP_OFFLOAD_run seems minimal for OpenMP code (measured on the same
> > set of benchmarks as before). So it's good enough to just do the
> > stacks caching and leave out the synchronisation removal for now. (It
> > might still be worth considering later, as long as we can show that a
> > given kernel doesn't use printf or access memory via host pointers --
> > the former is probably easier to detect than the latter. I have
> > observed the equivalent OpenACC patch provide a significant boost on
> > some benchmarks, so there's probably something to be gained on the
> > OpenMP side too.)
> > 
> > The benefit with the attached patch -- just stacks caching, no
> > synchronisation removal -- is about 12% on the same set of benchmarks
> > as before. Results are a little noisy on the machine I'm benchmarking
> > on, so this isn't necessarily proof that the synchronisation removal
> > is harmful for performance!
> > 
> > > > > In turn, the last part necessitates a change to the way
> > > > > "(perhaps abort was called)" errors are detected and reported.
> > > > >   
> > > 
> > > As already mentioned, using callbacks is problematic. Plus, I'm sure
> > > the way you lock out other threads is a performance loss when
> > > multiple threads have target regions: even though they will not run
> > > concurrently on the GPU, you still want to allow host threads to
> > > submit GPU jobs while the GPU is occupied.
> > > 
> > > I would suggest having a small pool (up to 3 entries, perhaps) of
> > > stacks. Then you can arrange reuse without totally serializing host
> > > threads on target regions.  
> > 
> > I'm really wary of the additional complexity of adding a stack pool,
> > and the memory allocation/freeing code paths in CUDA appear to be so
> > slow that we get a benefit with this patch even when the GPU stream
> > has to wait for the CPU to unlock the stacks block. Also, for large
> > GPU launches, the size of the soft-stacks block isn't really trivial
> > (I've seen something like 50MB on the hardware I'm using, with default
> > options), and multiplying that by 3 could start to eat into the GPU
> > heap memory for "useful data" quite significantly.
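To put rough numbers on that (illustrative figures only, not measurements
from this thread): the soft-stacks block scales with the launch geometry,
roughly teams x threads-per-team x per-thread stack size, so for example

  50 teams x 8 threads/team x 128 KiB/thread = 400 x 128 KiB = 50 MiB

and a three-entry pool of such blocks would pin something like 150 MiB of
device memory for stacks alone.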
> > 
> > Consider the attached (probably not amazingly-written) microbenchmark.
> > It spawns 8 threads which each launch lots of OpenMP kernels
> > performing some trivial work, then joins the threads and checks the
> > results. As a baseline, with the "FEWER_KERNELS" parameters set (256
> > kernel launches over 8 threads), this gives us over 5 runs:
> > 
> > real    3m55.375s
> > user    7m14.192s
> > sys     0m30.148s
> > 
> > real    3m54.487s
> > user    7m6.775s
> > sys     0m34.678s
> > 
> > real    3m54.633s
> > user    7m20.381s
> > sys     0m30.620s
> > 
> > real    3m54.992s
> > user    7m12.464s
> > sys     0m29.610s
> > 
> > real    3m55.471s
> > user    7m14.342s
> > sys     0m29.815s
> > 
> > With a version of the attached patch, we instead get:
> > 
> > real    3m53.404s
> > user    3m39.869s
> > sys     0m16.149s
> > 
> > real    3m54.713s
> > user    3m41.018s
> > sys     0m16.129s
> > 
> > real    3m55.242s
> > user    3m55.148s
> > sys     0m17.130s
> > 
> > real    3m55.374s
> > user    3m40.411s
> > sys     0m15.818s
> > 
> > real    3m55.189s
> > user    3m40.144s
> > sys     0m15.846s
> > 
> > That is: real time is about the same, but user/sys time are reduced.
> > 
> > Without FEWER_KERNELS (1048576 kernel launches over 8 threads), the
> > baseline is:
> > 
> > real    12m29.975s
> > user    24m2.244s
> > sys     8m8.153s
> > 
> > real    12m15.391s
> > user    23m51.018s
> > sys     8m0.809s
> > 
> > real    12m5.424s
> > user    23m38.585s
> > sys     7m47.714s
> > 
> > real    12m10.456s
> > user    23m51.691s
> > sys     7m54.324s
> > 
> > real    12m37.735s
> > user    24m19.671s
> > sys     8m15.752s
> > 
> > And with the patch, we get:
> > 
> > real    4m42.600s
> > user    16m14.593s
> > sys     0m40.444s
> > 
> > real    4m43.579s
> > user    15m33.805s
> > sys     0m38.537s
> > 
> > real    4m42.211s
> > user    16m32.926s
> > sys     0m40.271s
> > 
> > real    4m44.256s
> > user    15m49.290s
> > sys     0m39.116s
> > 
> > real    4m42.013s
> > user    15m39.447s
> > sys     0m38.517s
> > 
> > Real, user and sys time are all dramatically less. So I'd suggest that
> > the attached patch is an improvement over the status quo, even if we
> > could experiment with the stacks pool idea as a further improvement
> > later on.
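The attached microbenchmark itself is not reproduced in the archive. As a
rough stand-in with the same shape -- eight host threads, each launching
many trivial target regions, with the results checked at the end --
something like the following could be used; the kernel and iteration counts
are made up rather than taken from the attachment.

/* Hypothetical stand-in for the attached microbenchmark, not the
   attachment itself: NUM_THREADS host threads each launch NUM_KERNELS
   trivial target regions, so the soft-stacks block is (re)acquired on
   every launch.  Build with something like:
   gcc -fopenmp -foffload=nvptx-none stress.c  */
#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 8
#define NUM_KERNELS 4096   /* Per host thread; made-up figure.  */
#define N 256

int
main (void)
{
  long sums[NUM_THREADS] = { 0 };

#pragma omp parallel num_threads (NUM_THREADS)
  {
    int tid = omp_get_thread_num ();
    for (int k = 0; k < NUM_KERNELS; k++)
      {
        long s = 0;
        /* Each iteration is a separate kernel launch.  */
#pragma omp target teams distribute parallel for reduction (+: s) map (tofrom: s)
        for (int i = 0; i < N; i++)
          s += i;
        sums[tid] += s;
      }
  }

  long expected = (long) NUM_THREADS * NUM_KERNELS * N * (N - 1) / 2;
  long total = 0;
  for (int t = 0; t < NUM_THREADS; t++)
    total += sums[t];
  printf ("%s\n", total == expected ? "OK" : "FAIL");
  return 0;
}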
> > 
> > The attached patch also implements a size limit for retention of the
> > soft-stack block -- freeing it before allocating more memory, rather
> > than at the start of a kernel launch, so bigger blocks can still be
> > shared between kernel launches if there's no memory allocation between
> > them. It also tries freeing smaller cached soft-stack blocks and
> > retrying memory allocation in out-of-memory situations.
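A rough sketch of that allocation strategy follows; the helper below is
hypothetical, and the names, fields and signatures are guessed from the
ChangeLog further down rather than copied from the actual patch.

/* Hypothetical sketch, not the actual patch.  Assumes the plugin's
   internal declarations (struct ptx_device with an omp_stacks record,
   nvptx_alloc with a suppress-errors flag, nvptx_stacks_free) are in
   scope.  */
static void *
alloc_with_stacks_reclaim (struct ptx_device *ptx_dev, size_t size)
{
  /* Drop a cached soft-stacks block that has grown past the retention
     limit before allocating more device memory.  */
  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  if (ptx_dev->omp_stacks.size > SOFTSTACK_CACHE_LIMIT)
    nvptx_stacks_free (ptx_dev);
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);

  void *d = nvptx_alloc (size, /* suppress_errors= */ true);
  if (d != NULL)
    return d;

  /* Out of memory: free whatever cached stacks block remains and retry,
     reporting an error only if the second attempt also fails.  */
  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  nvptx_stacks_free (ptx_dev);
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
  return nvptx_alloc (size, /* suppress_errors= */ false);
}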
> > 
> > Re-tested with offloading to NVPTX. OK for trunk?
> > 
> > Thanks,
> > 
> > Julian
> > 
> > ChangeLog
> > 
> > 2020-11-13  Julian Brown  <julian@codesourcery.com>
> > 
> > libgomp/
> >     * plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
> >     (struct ptx_device): Add omp_stacks struct.
> >     (nvptx_open_device): Initialise cached-stacks housekeeping info.
> >     (nvptx_close_device): Free cached stacks block and mutex.
> >     (nvptx_stacks_free): New function.
> >     (nvptx_alloc): Add SUPPRESS_ERRORS parameter.
> >     (GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks block.
> >     (nvptx_stacks_alloc): Rename to...
> >     (nvptx_stacks_acquire): This.  Cache stacks block between runs if
> >     same size or smaller is required.
> >     (nvptx_stacks_free): Remove.
> >     (GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks
> >     block during kernel execution.
> 
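For readers without the attachment, a minimal sketch of the reuse logic the
ChangeLog describes for nvptx_stacks_acquire; the field names and locking
discipline here are assumptions rather than the committed code.

/* Hypothetical sketch, not the committed patch.  Reuse the cached stacks
   block when it is large enough; otherwise free it and allocate a bigger
   one.  The lock is deliberately left held in all paths: the caller
   releases it only once the kernel launched on this block has finished.  */
static void *
nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
{
  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);

  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
    return (void *) ptx_dev->omp_stacks.ptr;   /* Reuse cached block.  */

  if (ptx_dev->omp_stacks.ptr)
    CUDA_CALL_ASSERT (cuMemFree, ptx_dev->omp_stacks.ptr);

  CUDA_CALL_ASSERT (cuMemAlloc, &ptx_dev->omp_stacks.ptr, size * num);
  ptx_dev->omp_stacks.size = size * num;

  return (void *) ptx_dev->omp_stacks.ptr;
}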


Thread overview: 16+ messages
2020-10-26 14:14 Julian Brown
2020-10-26 14:26 ` Jakub Jelinek
2020-11-09 21:32   ` Alexander Monakov
2020-11-13 20:54     ` Julian Brown
2020-12-08  1:13       ` Julian Brown
2020-12-08 17:11         ` Alexander Monakov [this message]
2020-12-15 13:39           ` Julian Brown
2020-12-15 13:49             ` Jakub Jelinek
2020-12-15 16:49               ` Julian Brown
2020-12-15 17:00                 ` Jakub Jelinek
2020-12-15 23:16                   ` Julian Brown
2021-01-05 12:13                     ` Julian Brown
2021-01-05 15:32                       ` Jakub Jelinek
2020-10-27 13:17 ` Julian Brown
2020-10-28  7:25   ` Chung-Lin Tang
2020-10-28 11:32     ` Julian Brown
