Date: Tue, 8 Dec 2020 01:13:48 +0000
From: Julian Brown
To: Alexander Monakov
CC: Jakub Jelinek, Thomas Schwinge, Tom de Vries
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Message-ID: <20201208011348.41209b59@squid.athome>
In-Reply-To: <20201113205454.65f1539d@squid.athome>
References: <20201026141448.109041-1-julian@codesourcery.com>
	<20201026142634.GI7080@tucnak>
	<20201113205454.65f1539d@squid.athome>
Organization: Mentor Graphics

Ping?

Thanks,

Julian

On Fri, 13 Nov 2020 20:54:54 +0000
Julian Brown wrote:

> Hi Alexander,
> 
> Thanks for the review! Comments below.
> 
> On Tue, 10 Nov 2020 00:32:36 +0300
> Alexander Monakov wrote:
> 
> > On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> > 
> > > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > > > This patch adds caching for the stack block allocated for
> > > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > > optimisation -- we observed an average 11% or so performance
> > > > improvement with this patch across a set of accelerated GPU
> > > > benchmarks on one machine (results vary according to individual
> > > > benchmark and with hardware used).
> > 
> > In this patch you're folding two changes together: reuse of
> > allocated stacks and removing one host-device synchronization. Why
> > is that?
> > Can you report performance change separately for each change (and
> > split out the patches)?
> 
> An accident of the development process of the patch, really -- the
> idea for removing the post-kernel-launch synchronisation came from the
> OpenACC side, and adapting it to OpenMP meant the stacks had to remain
> allocated after the return of the GOMP_OFFLOAD_run function.
> 
> > > > A given kernel launch will reuse the stack block from the
> > > > previous launch if it is large enough, else it is freed and
> > > > reallocated. A slight caveat is that memory will not be freed
> > > > until the device is closed, so e.g. if code is using highly
> > > > variable launch geometries and large amounts of GPU RAM, you
> > > > might run out of resources slightly quicker with this patch.
> > > > 
> > > > Another way this patch gains performance is by omitting the
> > > > synchronisation at the end of an OpenMP offload kernel launch --
> > > > it's safe for the GPU and CPU to continue executing in parallel
> > > > at that point, because e.g. copies-back from the device will be
> > > > synchronised properly with kernel completion anyway.
> > 
> > I don't think this explanation is sufficient. My understanding is
> > that OpenMP forbids the host to proceed asynchronously after the
> > target construct unless it is a 'target nowait' construct. This may
> > be observable if there's a printf in the target region for example
> > (or if it accesses memory via host pointers).
> > 
> > So this really needs to be a separate patch with more explanation
> > why this is okay (if it is okay).
> 
> As long as the offload kernel only touches GPU memory and does not
> have any CPU-visible side effects (like the printf you mentioned -- I
> hadn't really considered that, oops!), it's probably OK.
> 
> But anyway, the benefit obtained on OpenMP code (the same set of
> benchmarks run before) of omitting the synchronisation at the end of
> GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the
> stacks caching, and miss out the synchronisation removal for now. (It
> might still be something worth considering later, perhaps, as long as
> we can show some given kernel doesn't use printf or access memory via
> host pointers -- I guess the former might be easier than the latter. I
> have observed the equivalent OpenACC patch provide a significant boost
> on some benchmarks, so there's probably something that could be gained
> on the OpenMP side too.)
> 
> The benefit with the attached patch -- just stacks caching, no
> synchronisation removal -- is about 12% on the same set of benchmarks
> as before. Results are a little noisy on the machine I'm benchmarking
> on, so this isn't necessarily proof that the synchronisation removal
> is harmful for performance!
> 
> > > > In turn, the last part necessitates a change to the way
> > > > "(perhaps abort was called)" errors are detected and reported.
> > 
> > As already mentioned using callbacks is problematic. Plus, I'm sure
> > the way you lock out other threads is a performance loss when
> > multiple threads have target regions: even though they will not run
> > concurrently on the GPU, you still want to allow host threads to
> > submit GPU jobs while the GPU is occupied.
> > 
> > I would suggest to have a small pool (up to 3 entries perhaps) of
> > stacks. Then you can arrange reuse without totally serializing host
> > threads on target regions.
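[For illustration, a pool along the lines suggested above might look
roughly like the sketch below. This is hypothetical code rather than
anything from the posted patch: the entry count, the names and the
trylock scheme are assumptions made only to show the shape of the idea
and the extra bookkeeping it implies.]

/* Hypothetical sketch of the suggested stacks pool -- not code from the
   posted patch.  Each entry is an independently lockable soft-stacks
   block, so several host threads can launch kernels without serialising
   on a single cached block.  The mutexes are assumed to be initialised
   when the device is opened.  */

#include <stddef.h>
#include <pthread.h>
#include <cuda.h>

#define STACK_POOL_ENTRIES 3

struct stack_pool_entry
{
  pthread_mutex_t lock;   /* Held while this block is used by a launch.  */
  CUdeviceptr ptr;        /* Device allocation, or 0 if none yet.  */
  size_t size;            /* Size of the allocation in bytes.  */
};

static struct stack_pool_entry stack_pool[STACK_POOL_ENTRIES];

/* Grab any currently-free entry, growing its allocation if the requested
   SIZE does not fit; if every entry is busy, block on the first one.
   The caller unlocks the entry's mutex once the kernel has finished.  */

static struct stack_pool_entry *
stack_pool_acquire (size_t size)
{
  struct stack_pool_entry *entry = NULL;

  /* Try to find a free entry without blocking.  */
  for (int i = 0; i < STACK_POOL_ENTRIES && !entry; i++)
    if (pthread_mutex_trylock (&stack_pool[i].lock) == 0)
      entry = &stack_pool[i];

  /* All entries busy: wait for the first one.  */
  if (!entry)
    {
      entry = &stack_pool[0];
      pthread_mutex_lock (&entry->lock);
    }

  /* Grow this entry's block if the requested size does not fit.  */
  if (entry->size < size)
    {
      if (entry->ptr)
	cuMemFree (entry->ptr);
      entry->ptr = 0;
      entry->size = 0;
      if (cuMemAlloc (&entry->ptr, size) != CUDA_SUCCESS)
	{
	  pthread_mutex_unlock (&entry->lock);
	  return NULL;
	}
      entry->size = size;
    }

  return entry;
}

[Note that each entry retains its own soft-stacks block, which is exactly
the memory-footprint concern raised in the reply below.]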
> 
> I'm really wary of the additional complexity of adding a stack pool,
> and the memory allocation/freeing code paths in CUDA appear to be so
> slow that we get a benefit with this patch even when the GPU stream
> has to wait for the CPU to unlock the stacks block. Also, for large
> GPU launches, the size of the soft-stacks block isn't really trivial
> (I've seen something like 50MB on the hardware I'm using, with default
> options), and multiplying that by 3 could start to eat into the GPU
> heap memory for "useful data" quite significantly.
> 
> Consider the attached (probably not amazingly-written) microbenchmark.
> It spawns 8 threads which each launch lots of OpenMP kernels
> performing some trivial work, then joins the threads and checks the
> results. As a baseline, with the "FEWER_KERNELS" parameters set (256
> kernel launches over 8 threads), this gives us over 5 runs:
> 
> real    3m55.375s
> user    7m14.192s
> sys     0m30.148s
> 
> real    3m54.487s
> user    7m6.775s
> sys     0m34.678s
> 
> real    3m54.633s
> user    7m20.381s
> sys     0m30.620s
> 
> real    3m54.992s
> user    7m12.464s
> sys     0m29.610s
> 
> real    3m55.471s
> user    7m14.342s
> sys     0m29.815s
> 
> With a version of the attached patch, we instead get:
> 
> real    3m53.404s
> user    3m39.869s
> sys     0m16.149s
> 
> real    3m54.713s
> user    3m41.018s
> sys     0m16.129s
> 
> real    3m55.242s
> user    3m55.148s
> sys     0m17.130s
> 
> real    3m55.374s
> user    3m40.411s
> sys     0m15.818s
> 
> real    3m55.189s
> user    3m40.144s
> sys     0m15.846s
> 
> That is: real time is about the same, but user/sys time are reduced.
> 
> Without FEWER_KERNELS (1048576 kernel launches over 8 threads), the
> baseline is:
> 
> real    12m29.975s
> user    24m2.244s
> sys     8m8.153s
> 
> real    12m15.391s
> user    23m51.018s
> sys     8m0.809s
> 
> real    12m5.424s
> user    23m38.585s
> sys     7m47.714s
> 
> real    12m10.456s
> user    23m51.691s
> sys     7m54.324s
> 
> real    12m37.735s
> user    24m19.671s
> sys     8m15.752s
> 
> And with the patch, we get:
> 
> real    4m42.600s
> user    16m14.593s
> sys     0m40.444s
> 
> real    4m43.579s
> user    15m33.805s
> sys     0m38.537s
> 
> real    4m42.211s
> user    16m32.926s
> sys     0m40.271s
> 
> real    4m44.256s
> user    15m49.290s
> sys     0m39.116s
> 
> real    4m42.013s
> user    15m39.447s
> sys     0m38.517s
> 
> Real, user and sys time are all dramatically less. So I'd suggest that
> the attached patch is an improvement over the status quo, even if we
> could experiment with the stacks pool idea as a further improvement
> later on.
> 
> The attached patch also implements a size limit for retention of the
> soft-stack block -- freeing it before allocating more memory, rather
> than at the start of a kernel launch, so bigger blocks can still be
> shared between kernel launches if there's no memory allocation between
> them. It also tries freeing smaller cached soft-stack blocks and
> retrying memory allocation in out-of-memory situations.
> 
> Re-tested with offloading to NVPTX. OK for trunk?
> 
> Thanks,
> 
> Julian
> 
> ChangeLog
> 
> 2020-11-13  Julian Brown
> 
> libgomp/
> 	* plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
> 	(struct ptx_device): Add omp_stacks struct.
> 	(nvptx_open_device): Initialise cached-stacks housekeeping info.
> 	(nvptx_close_device): Free cached stacks block and mutex.
> 	(nvptx_stacks_free): New function.
> 	(nvptx_alloc): Add SUPPRESS_ERRORS parameter.
> 	(GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks
> 	block.
> 	(nvptx_stacks_alloc): Rename to...
> 	(nvptx_stacks_acquire): This. Cache stacks block between runs if
> 	same size or smaller is required.
> 	(nvptx_stacks_free): Remove.
> 	(GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks
> 	block during kernel execution.
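[Since the patch itself is attached rather than inlined, the following
is a rough sketch of the caching scheme the ChangeLog above describes.
It is illustrative only: the structure layout, names and error handling
are simplified assumptions, and the real plugin code goes through its
own wrappers around the CUDA driver API rather than calling it
directly.]

/* Illustrative sketch only -- not the posted patch.  One soft-stacks
   block is cached per device: a launch reuses it when the requested
   size fits, otherwise the block is freed and reallocated.  The mutex
   and fields are assumed to be initialised when the device is opened;
   the mutex is taken here and only released by the caller once the
   kernel (and any dependent copies back from the device) has
   completed.  */

#include <stddef.h>
#include <pthread.h>
#include <cuda.h>

struct omp_stacks_cache
{
  pthread_mutex_t lock;   /* Held while the stacks block is in use.  */
  CUdeviceptr ptr;        /* Cached device allocation, or 0 if none.  */
  size_t size;            /* Size of the cached allocation in bytes.  */
};

static CUdeviceptr
stacks_acquire (struct omp_stacks_cache *cache, size_t size)
{
  pthread_mutex_lock (&cache->lock);

  /* Fast path: reuse the block left over from a previous launch.  */
  if (cache->ptr && cache->size >= size)
    return cache->ptr;

  /* Otherwise drop the old block (if any) and allocate a bigger one.  */
  if (cache->ptr)
    cuMemFree (cache->ptr);
  cache->ptr = 0;
  cache->size = 0;

  if (cuMemAlloc (&cache->ptr, size) != CUDA_SUCCESS)
    {
      pthread_mutex_unlock (&cache->lock);
      return 0;  /* Caller reports the allocation failure.  */
    }

  cache->size = size;
  return cache->ptr;
}

[As described above, the posted patch additionally caps how large a
block is retained (the SOFTSTACK_CACHE_LIMIT define), frees an
oversized cached block from GOMP_OFFLOAD_alloc before allocating new
memory rather than at kernel-launch time, and frees the cached block
and retries when an allocation runs out of memory.]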