Date: Tue, 27 Oct 2020 13:17:37 +0000
From: Julian Brown
To: gcc-patches@gcc.gnu.org
CC: Jakub Jelinek, Thomas Schwinge, Tom de Vries
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Message-ID: <20201027131737.05873b02@squid.athome>
In-Reply-To: <20201026141448.109041-1-julian@codesourcery.com>
References: <20201026141448.109041-1-julian@codesourcery.com>

(Apologies if threading is broken, for some reason I didn't receive this
reply directly!)

On Mon Oct 26 14:26:34 GMT 2020, Jakub Jelinek wrote:
> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX.  This is a performance optimisation --
> > we observed an average 11% or so performance improvement with this
> > patch across a set of accelerated GPU benchmarks on one machine
> > (results vary according to individual benchmark and with hardware
> > used).
> >
> > A given kernel launch will reuse the stack block from the previous
> > launch if it is large enough, else it is freed and reallocated.  A
> > slight caveat is that memory will not be freed until the device is
> > closed, so e.g. if code is using highly variable launch geometries
> > and large amounts of GPU RAM, you might run out of resources slightly
> > quicker with this patch.
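For reference, the reuse policy quoted above boils down to something like
the following -- a rough sketch with made-up names, not the actual plugin
code:

#include <cuda.h>
#include <stddef.h>

/* Per-device cached stack block, assumed to be protected by the device
   lock.  Illustrative only.  */

struct stacks_cache
{
  CUdeviceptr block;	/* Cached stack block, or 0 if none.  */
  size_t size;		/* Size of the cached block in bytes.  */
};

/* Return a stack block of at least NEEDED bytes, reusing the cached block
   when it is large enough, else freeing it and allocating a new one.  */

static CUdeviceptr
get_stacks_block (struct stacks_cache *cache, size_t needed)
{
  if (cache->block != 0 && cache->size >= needed)
    return cache->block;

  if (cache->block != 0)
    cuMemFree (cache->block);

  if (cuMemAlloc (&cache->block, needed) != CUDA_SUCCESS)
    {
      cache->block = 0;
      cache->size = 0;
      return 0;
    }

  cache->size = needed;
  return cache->block;
}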
> >
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch -- it's
> > safe for the GPU and CPU to continue executing in parallel at that
> > point, because e.g. copies-back from the device will be synchronised
> > properly with kernel completion anyway.
> >
> > In turn, the last part necessitates a change to the way "(perhaps
> > abort was called)" errors are detected and reported.
> >
> > Tested with offloading to NVPTX.  OK for mainline?
>
> I'm afraid I don't know the plugin nor CUDA well enough to review this
> properly (therefore I'd like to hear from Thomas, Tom and/or Alexander).
> Anyway, just two questions: wouldn't it make sense to add some upper
> bound limit over which it wouldn't cache the stacks, so that it would
> cache most of the time for normal programs, but one really excessive
> kernel followed by many normal ones wouldn't result in memory
> allocation failures?

Yes, that might work -- another idea is to free the stacks and then retry
if a memory allocation fails, though that might lead to worse
fragmentation, perhaps.  For the upper-bound idea we'd need to pick a
sensible maximum limit.  Something like 16MB, maybe?  Or
user-controllable, or some fraction of the GPU's total memory?  (A rough
sketch of what I have in mind is appended at the end of this mail.)

> And, in which context are cuStreamAddCallback-registered callbacks run?
> E.g. if it is inside of an asynchronous interrupt, using locking in
> there might not be the best thing to do.

The cuStreamAddCallback API is documented here:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483

We're quite limited in what we can do in the callback function, since
"Callbacks must not make any CUDA API calls".  So what *can* a callback
function do?  It is mentioned that the callback function's execution will
"pause" the stream it is logically running on.  So can we get deadlock,
e.g. if multiple host threads are launching offload kernels
simultaneously?  I don't think so, but I don't know how to prove it!
(A sketch of the registration is also appended below.)

Thanks,

Julian
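To make the upper-bound and free-and-retry ideas above a bit more
concrete, here is a rough sketch -- the 16MB limit and all the names are
placeholders, not a concrete proposal:

#include <cuda.h>
#include <stddef.h>

/* Placeholder cap on the size of stack block we are willing to keep
   cached; the real value (fixed, user-controllable, or a fraction of
   the GPU's total memory) is still to be decided.  */
#define STACKS_CACHE_LIMIT (16 * 1024 * 1024)

/* Allocate SIZE bytes for stacks, freeing the cached block (*CACHED,
   *CACHED_SIZE) and retrying once if the first attempt runs out of
   memory.  */

static CUresult
alloc_stacks_with_retry (CUdeviceptr *out, size_t size,
			 CUdeviceptr *cached, size_t *cached_size)
{
  CUresult r = cuMemAlloc (out, size);

  if (r == CUDA_ERROR_OUT_OF_MEMORY && *cached != 0)
    {
      cuMemFree (*cached);
      *cached = 0;
      *cached_size = 0;
      r = cuMemAlloc (out, size);
    }

  return r;
}

/* Only keep BLOCK around for reuse by the next launch if it is not
   excessively large; otherwise free it straight away.  */

static void
maybe_cache_stacks (CUdeviceptr block, size_t size,
		    CUdeviceptr *cached, size_t *cached_size)
{
  if (size <= STACKS_CACHE_LIMIT)
    {
      *cached = block;
      *cached_size = size;
    }
  else
    {
      cuMemFree (block);
      *cached = 0;
      *cached_size = 0;
    }
}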
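And for completeness, the kind of callback registration under discussion
looks roughly like this -- again a sketch with hypothetical names; note
that the callback only records the launch status under a lock and makes
no CUDA API calls, as the documentation requires:

#include <cuda.h>
#include <pthread.h>

/* Hypothetical per-launch completion record.  */

struct kernel_completion
{
  pthread_mutex_t lock;
  int finished;
  CUresult status;	/* Non-success if e.g. abort was called on the device.  */
};

/* Runs once the kernel launched on STREAM (and all preceding work on that
   stream) has completed.  It must not call back into the CUDA API.  */

static void CUDA_CB
kernel_finished_callback (CUstream stream, CUresult status, void *data)
{
  struct kernel_completion *c = (struct kernel_completion *) data;

  (void) stream;	/* Unused.  */

  pthread_mutex_lock (&c->lock);
  c->finished = 1;
  c->status = status;
  pthread_mutex_unlock (&c->lock);
}

/* Registered immediately after the kernel launch, e.g.:

     cuStreamAddCallback (stream, kernel_finished_callback, &completion, 0);

   after which the host thread can continue without synchronising; later
   stream-ordered copies back from the device still wait for the kernel
   to finish.  */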