Date: Tue, 5 Jan 2021 12:13:59 +0000
From: Julian Brown
To: Jakub Jelinek
CC: Alexander Monakov, Thomas Schwinge, Tom de Vries
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Message-ID: <20210105121359.5c08cba2@squid.athome>
In-Reply-To: <20201215231648.07759647@squid.athome>
References: <20201026141448.109041-1-julian@codesourcery.com>
 <20201026142634.GI7080@tucnak>
 <20201113205454.65f1539d@squid.athome>
 <20201208011348.41209b59@squid.athome>
 <20201215133913.32520253@squid.athome>
 <20201215134940.GT3788@tucnak>
 <20201215164938.22e5477c@squid.athome>
 <20201215170036.GV3788@tucnak>
 <20201215231648.07759647@squid.athome>
Organization: Mentor Graphics
List-Id: Gcc-patches mailing list <gcc-patches@gcc.gnu.org>

Hi Jakub,

Just to check, does my reply below address your concerns -- particularly
with regards to the current usage of CUDA streams serializing kernel
executions from different host threads? Given that situation, and the
observed speed improvement with OpenMP offloading to NVPTX with the
patch, I'm not sure how much sense it makes to do anything more
sophisticated than this -- especially without a test case that
demonstrates a performance regression (or an exacerbated out-of-memory
condition) with the patch.

Thanks,

Julian

On Tue, 15 Dec 2020 23:16:48 +0000
Julian Brown wrote:

> On Tue, 15 Dec 2020 18:00:36 +0100
> Jakub Jelinek wrote:
>
> > On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:
> > > > Do you need to hold the omp_stacks.lock across the entire
> > > > offloading? Doesn't that serialize all offloading kernels to
> > > > the same device? I mean, can't the lock be taken just shortly
> > > > at the start to either acquire the cached stacks or allocate a
> > > > fresh stack, and then at the end to put the stack back into the
> > > > cache?
> > >
> > > I think you're suggesting something like what Alexander mentioned
> > > -- a pool of cached stacks blocks in case the single, locked block
> > > is contested. Obviously at present kernel launches are serialised
> > > on the target anyway, so it's a question of whether having the
> > > device wait for the host to unlock the stacks block (i.e. a
> > > context switch, FSVO context switch), or allocating a new stacks
> > > block, is quicker. I think the numbers posted in the parent email
> > > show that memory allocation is so slow that just waiting for the
> > > lock wins. I'm wary of adding unnecessary complication, especially
> > > if it'll only be exercised in already hard-to-debug cases (i.e.
> > > lots of threads)!
> >
> > I'm not suggesting to have multiple stacks, on the contrary.  I've
> > suggested to do the caching only if at most one host thread is
> > offloading to the device.
> >
> > If one uses
> > #pragma omp parallel num_threads(3)
> > {
> >   #pragma omp target
> >   ...
> > }
> > then I don't see what would previously prevent the concurrent
> > offloading, yes, we take the device lock during gomp_map_vars and
> > again during gomp_unmap_vars, but don't hold it across the
> > offloading in between.
>
> I still don't think I quite understand what you're getting at.
>
> We only implement synchronous launches for OpenMP on NVPTX at present,
> and those all use the default CUDA runtime driver stream. Only one
> kernel executes on the hardware at once, even if launched from
> different host threads. The serialisation isn't due to the device lock
> being held, but by the queueing semantics of the underlying API.
>
> > > Does target-side memory allocation call back into the plugin's
> > > GOMP_OFFLOAD_alloc? I'm not sure how that works. If not,
> > > target-side memory allocation shouldn't be affected, I don't
> > > think?
> >
> > Again, I'm not suggesting that it should, but what I'm saying is
> > that if a target region ends but some other host tasks are doing
> > target regions to the same device concurrently with that, or if
> > there are async targets in flight, we shouldn't try to cache the
> > stack, but free it right away, because the other target regions
> > might need to malloc larger amounts of memory and fail because of
> > the caching.
>
> I'm assuming you're not suggesting fundamentally changing APIs or
> anything to determine if we're launching target regions from multiple
> threads at once, but instead that we try to detect the condition
> dynamically in the plugin?
>
> So, would kernel launch look something like this? (Excuse
> pseudo-code-isms!)
>
> void GOMP_OFFLOAD_run (...)
> {
>   bool used_cache;
>
>   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>   if (ptx_dev->omp_stacks.usage_count > 0)
>     {
>       cuCtxSynchronize ();
>       nvptx_stacks_free (ptx_dev);
>       ...allocate fresh stack, no caching...
>       used_cache = false;
>     }
>   else
>     {
>       /* Allocate or re-use cached stacks, and then...  */
>       ptx_dev->omp_stacks.usage_count++;
>       used_cache = true;
>     }
>   pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>
>   /* Launch kernel.  */
>
>   if (used_cache)
>     cuStreamAddCallback (..., /* callback that does: */
>       pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>       ptx_dev->omp_stacks.usage_count--;
>       pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>     );
>   else
>     {
>       pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>       /* Free uncached stack.  */
>       pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>     }
> }
>
> This seems like it'd be rather fragile to me, and would offer some
> benefit perhaps only if a previous cached stacks block was much larger
> than the one required for some given later launch. It wouldn't allow
> any additional parallelism on the target, I don't think.
>
> Is that sort-of what you meant?
>
> Oh, or perhaps something more like checking cuStreamQuery at the end
> of the kernel launch to see if more work (...from other threads) is
> outstanding on the same queue? I think that only usefully returns
> CUDA_SUCCESS or CUDA_ERROR_NOT_READY, so I'm not sure if that'd help.
>
> Thanks for clarification (& apologies for being slow!),
>
> Julian