Date: Tue, 8 Dec 2020 01:13:48 +0000
From: Julian Brown
To: Alexander Monakov
CC: Jakub Jelinek, Thomas Schwinge, Tom de Vries
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Message-ID: <20201208011348.41209b59@squid.athome>
In-Reply-To: <20201113205454.65f1539d@squid.athome>
References: <20201026141448.109041-1-julian@codesourcery.com>
	<20201026142634.GI7080@tucnak>
	<20201113205454.65f1539d@squid.athome>
Organization: Mentor Graphics

Ping?

Thanks,

Julian

On Fri, 13 Nov 2020 20:54:54 +0000
Julian Brown wrote:

> Hi Alexander,
> 
> Thanks for the review! Comments below.
> 
> On Tue, 10 Nov 2020 00:32:36 +0300
> Alexander Monakov wrote:
> 
> > On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> > 
> > > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > > > This patch adds caching for the stack block allocated for
> > > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > > optimisation -- we observed an average 11% or so performance
> > > > improvement with this patch across a set of accelerated GPU
> > > > benchmarks on one machine (results vary according to individual
> > > > benchmark and with hardware used).
> > 
> > In this patch you're folding two changes together: reuse of
> > allocated stacks and removing one host-device synchronization. Why
> > is that?
> > Can you report performance change separately for each change (and
> > split out the patches)?
> 
> An accident of the development process of the patch, really -- the
> idea for removing the post-kernel-launch synchronisation came from the
> OpenACC side, and adapting it to OpenMP meant the stacks had to remain
> allocated after the return of the GOMP_OFFLOAD_run function.
> 
> > > > A given kernel launch will reuse the stack block from the
> > > > previous launch if it is large enough, else it is freed and
> > > > reallocated. A slight caveat is that memory will not be freed
> > > > until the device is closed, so e.g. if code is using highly
> > > > variable launch geometries and large amounts of GPU RAM, you
> > > > might run out of resources slightly quicker with this patch.
> > > > 
> > > > Another way this patch gains performance is by omitting the
> > > > synchronisation at the end of an OpenMP offload kernel launch --
> > > > it's safe for the GPU and CPU to continue executing in parallel
> > > > at that point, because e.g. copies-back from the device will be
> > > > synchronised properly with kernel completion anyway.
> > 
> > I don't think this explanation is sufficient. My understanding is
> > that OpenMP forbids the host to proceed asynchronously after the
> > target construct unless it is a 'target nowait' construct. This may
> > be observable if there's a printf in the target region for example
> > (or if it accesses memory via host pointers).
> > 
> > So this really needs to be a separate patch with more explanation
> > why this is okay (if it is okay).
> 
> As long as the offload kernel only touches GPU memory and does not
> have any CPU-visible side effects (like the printf you mentioned -- I
> hadn't really considered that, oops!), it's probably OK.
> 
> But anyway, the benefit obtained on OpenMP code (the same set of
> benchmarks run before) of omitting the synchronisation at the end of
> GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the
> stacks caching, and miss out the synchronisation removal for now. (It
> might still be something worth considering later, perhaps, as long as
> we can show some given kernel doesn't use printf or access memory via
> host pointers -- I guess the former might be easier than the latter. I
> have observed the equivalent OpenACC patch provide a significant boost
> on some benchmarks, so there's probably something that could be gained
> on the OpenMP side too.)
> 
> The benefit with the attached patch -- just stacks caching, no
> synchronisation removal -- is about 12% on the same set of benchmarks
> as before. Results are a little noisy on the machine I'm benchmarking
> on, so this isn't necessarily proof that the synchronisation removal
> is harmful for performance!
> 
> > > > In turn, the last part necessitates a change to the way
> > > > "(perhaps abort was called)" errors are detected and reported.
> > 
> > As already mentioned using callbacks is problematic. Plus, I'm sure
> > the way you lock out other threads is a performance loss when
> > multiple threads have target regions: even though they will not run
> > concurrently on the GPU, you still want to allow host threads to
> > submit GPU jobs while the GPU is occupied.
> > 
> > I would suggest to have a small pool (up to 3 entries perhaps) of
> > stacks. Then you can arrange reuse without totally serializing host
> > threads on target regions.
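[For illustration, a pool along the lines suggested above might look
roughly like the sketch below. This is hypothetical code rather than
anything from the posted patch: the entry count, the names and the
trylock scheme are assumptions made only to show the shape of the idea
and the extra bookkeeping it implies.]

/* Hypothetical sketch of the suggested stacks pool -- not code from the
   posted patch.  Each entry is an independently lockable soft-stacks
   block, so several host threads can launch kernels without serialising
   on a single cached block.  The mutexes are assumed to be initialised
   when the device is opened.  */

#include <stddef.h>
#include <pthread.h>
#include <cuda.h>

#define STACK_POOL_ENTRIES 3

struct stack_pool_entry
{
  pthread_mutex_t lock;   /* Held while this block is used by a launch.  */
  CUdeviceptr ptr;        /* Device allocation, or 0 if none yet.  */
  size_t size;            /* Size of the allocation in bytes.  */
};

static struct stack_pool_entry stack_pool[STACK_POOL_ENTRIES];

/* Grab any currently-free entry, growing its allocation if the requested
   SIZE does not fit; if every entry is busy, block on the first one.
   The caller unlocks the entry's mutex once the kernel has finished.  */

static struct stack_pool_entry *
stack_pool_acquire (size_t size)
{
  struct stack_pool_entry *entry = NULL;

  /* Try to find a free entry without blocking.  */
  for (int i = 0; i < STACK_POOL_ENTRIES && !entry; i++)
    if (pthread_mutex_trylock (&stack_pool[i].lock) == 0)
      entry = &stack_pool[i];

  /* All entries busy: wait for the first one.  */
  if (!entry)
    {
      entry = &stack_pool[0];
      pthread_mutex_lock (&entry->lock);
    }

  /* Grow this entry's block if the requested size does not fit.  */
  if (entry->size < size)
    {
      if (entry->ptr)
	cuMemFree (entry->ptr);
      entry->ptr = 0;
      entry->size = 0;
      if (cuMemAlloc (&entry->ptr, size) != CUDA_SUCCESS)
	{
	  pthread_mutex_unlock (&entry->lock);
	  return NULL;
	}
      entry->size = size;
    }

  return entry;
}

[Note that each entry retains its own soft-stacks block, which is exactly
the memory-footprint concern raised in the reply below.]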
> 
> I'm really wary of the additional complexity of adding a stack pool,
> and the memory allocation/freeing code paths in CUDA appear to be so
> slow that we get a benefit with this patch even when the GPU stream
> has to wait for the CPU to unlock the stacks block. Also, for large
> GPU launches, the size of the soft-stacks block isn't really trivial
> (I've seen something like 50MB on the hardware I'm using, with default
> options), and multiplying that by 3 could start to eat into the GPU
> heap memory for "useful data" quite significantly.
> 
> Consider the attached (probably not amazingly-written) microbenchmark.
> It spawns 8 threads which each launch lots of OpenMP kernels
> performing some trivial work, then joins the threads and checks the
> results. As a baseline, with the "FEWER_KERNELS" parameters set (256
> kernel launches over 8 threads), this gives us over 5 runs:
> 
> real    3m55.375s
> user    7m14.192s
> sys     0m30.148s
> 
> real    3m54.487s
> user    7m6.775s
> sys     0m34.678s
> 
> real    3m54.633s
> user    7m20.381s
> sys     0m30.620s
> 
> real    3m54.992s
> user    7m12.464s
> sys     0m29.610s
> 
> real    3m55.471s
> user    7m14.342s
> sys     0m29.815s
> 
> With a version of the attached patch, we instead get:
> 
> real    3m53.404s
> user    3m39.869s
> sys     0m16.149s
> 
> real    3m54.713s
> user    3m41.018s
> sys     0m16.129s
> 
> real    3m55.242s
> user    3m55.148s
> sys     0m17.130s
> 
> real    3m55.374s
> user    3m40.411s
> sys     0m15.818s
> 
> real    3m55.189s
> user    3m40.144s
> sys     0m15.846s
> 
> That is: real time is about the same, but user/sys time are reduced.
> 
> Without FEWER_KERNELS (1048576 kernel launches over 8 threads), the
> baseline is:
> 
> real    12m29.975s
> user    24m2.244s
> sys     8m8.153s
> 
> real    12m15.391s
> user    23m51.018s
> sys     8m0.809s
> 
> real    12m5.424s
> user    23m38.585s
> sys     7m47.714s
> 
> real    12m10.456s
> user    23m51.691s
> sys     7m54.324s
> 
> real    12m37.735s
> user    24m19.671s
> sys     8m15.752s
> 
> And with the patch, we get:
> 
> real    4m42.600s
> user    16m14.593s
> sys     0m40.444s
> 
> real    4m43.579s
> user    15m33.805s
> sys     0m38.537s
> 
> real    4m42.211s
> user    16m32.926s
> sys     0m40.271s
> 
> real    4m44.256s
> user    15m49.290s
> sys     0m39.116s
> 
> real    4m42.013s
> user    15m39.447s
> sys     0m38.517s
> 
> Real, user and sys time are all dramatically less. So I'd suggest that
> the attached patch is an improvement over the status quo, even if we
> could experiment with the stacks pool idea as a further improvement
> later on.
> 
> The attached patch also implements a size limit for retention of the
> soft-stack block -- freeing it before allocating more memory, rather
> than at the start of a kernel launch, so bigger blocks can still be
> shared between kernel launches if there's no memory allocation between
> them. It also tries freeing smaller cached soft-stack blocks and
> retrying memory allocation in out-of-memory situations.
> 
> Re-tested with offloading to NVPTX. OK for trunk?
> 
> Thanks,
> 
> Julian
> 
> ChangeLog
> 
> 2020-11-13  Julian Brown
> 
> libgomp/
> 	* plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
> 	(struct ptx_device): Add omp_stacks struct.
> 	(nvptx_open_device): Initialise cached-stacks housekeeping info.
> 	(nvptx_close_device): Free cached stacks block and mutex.
> 	(nvptx_stacks_free): New function.
> 	(nvptx_alloc): Add SUPPRESS_ERRORS parameter.
> 	(GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks
> 	block.
> 	(nvptx_stacks_alloc): Rename to...
> 	(nvptx_stacks_acquire): This. Cache stacks block between runs if
> 	same size or smaller is required.
> 	(nvptx_stacks_free): Remove.
> 	(GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks
> 	block during kernel execution.
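[Since the patch itself is attached rather than inlined, the following
is a rough sketch of the caching scheme the ChangeLog above describes.
It is illustrative only: the structure layout, names and error handling
are simplified assumptions, and the real plugin code goes through its
own wrappers around the CUDA driver API rather than calling it
directly.]

/* Illustrative sketch only -- not the posted patch.  One soft-stacks
   block is cached per device: a launch reuses it when the requested
   size fits, otherwise the block is freed and reallocated.  The mutex
   and fields are assumed to be initialised when the device is opened;
   the mutex is taken here and only released by the caller once the
   kernel (and any dependent copies back from the device) has
   completed.  */

#include <stddef.h>
#include <pthread.h>
#include <cuda.h>

struct omp_stacks_cache
{
  pthread_mutex_t lock;   /* Held while the stacks block is in use.  */
  CUdeviceptr ptr;        /* Cached device allocation, or 0 if none.  */
  size_t size;            /* Size of the cached allocation in bytes.  */
};

static CUdeviceptr
stacks_acquire (struct omp_stacks_cache *cache, size_t size)
{
  pthread_mutex_lock (&cache->lock);

  /* Fast path: reuse the block left over from a previous launch.  */
  if (cache->ptr && cache->size >= size)
    return cache->ptr;

  /* Otherwise drop the old block (if any) and allocate a bigger one.  */
  if (cache->ptr)
    cuMemFree (cache->ptr);
  cache->ptr = 0;
  cache->size = 0;

  if (cuMemAlloc (&cache->ptr, size) != CUDA_SUCCESS)
    {
      pthread_mutex_unlock (&cache->lock);
      return 0;  /* Caller reports the allocation failure.  */
    }

  cache->size = size;
  return cache->ptr;
}

[As described above, the posted patch additionally caps how large a
block is retained (the SOFTSTACK_CACHE_LIMIT define), frees an
oversized cached block from GOMP_OFFLOAD_alloc before allocating new
memory rather than at kernel-launch time, and frees the cached block
and retries when an allocation runs out of memory.]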