Date: Tue, 5 Jan 2021 12:13:59 +0000
From: Julian Brown
To: Jakub Jelinek
CC: Alexander Monakov, Thomas Schwinge, Tom de Vries
Subject: Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Message-ID: <20210105121359.5c08cba2@squid.athome>
In-Reply-To: <20201215231648.07759647@squid.athome>
References: <20201026141448.109041-1-julian@codesourcery.com>
 <20201026142634.GI7080@tucnak>
 <20201113205454.65f1539d@squid.athome>
 <20201208011348.41209b59@squid.athome>
 <20201215133913.32520253@squid.athome>
 <20201215134940.GT3788@tucnak>
 <20201215164938.22e5477c@squid.athome>
 <20201215170036.GV3788@tucnak>
 <20201215231648.07759647@squid.athome>
Organization: Mentor Graphics
List-Id: Gcc-patches mailing list <gcc-patches@gcc.gnu.org>

Hi Jakub,

Just to check, does my reply below address your concerns -- particularly
with regards to the current usage of CUDA streams serializing kernel
executions from different host threads? Given that situation, and the
observed speed improvement with OpenMP offloading to NVPTX with the
patch, I'm not sure how much sense it makes to do anything more
sophisticated than this -- especially without a test case that
demonstrates a performance regression (or an exacerbated out-of-memory
condition) with the patch.

Thanks,

Julian

On Tue, 15 Dec 2020 23:16:48 +0000
Julian Brown wrote:

> On Tue, 15 Dec 2020 18:00:36 +0100
> Jakub Jelinek wrote:
>
> > On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:
> > > > Do you need to hold the omp_stacks.lock across the entire
> > > > offloading? Doesn't that serialize all offloading kernels to
> > > > the same device? I mean, can't the lock be taken just shortly
> > > > at the start to either acquire the cached stacks or allocate a
> > > > fresh stack, and then at the end to put the stack back into the
> > > > cache?
> > >
> > > I think you're suggesting something like what Alexander mentioned
> > > -- a pool of cached stacks blocks in case the single, locked block
> > > is contested. Obviously at present kernel launches are serialised
> > > on the target anyway, so it's a question of whether having the
> > > device wait for the host to unlock the stacks block (i.e. a
> > > context switch, FSVO context switch), or allocating a new stacks
> > > block, is quicker. I think the numbers posted in the parent email
> > > show that memory allocation is so slow that just waiting for the
> > > lock wins. I'm wary of adding unnecessary complication, especially
> > > if it'll only be exercised in already hard-to-debug cases (i.e.
> > > lots of threads)!
> >
> > I'm not suggesting to have multiple stacks, on the contrary.  I've
> > suggested to do the caching only if at most one host thread is
> > offloading to the device.
> >
> > If one uses
> > #pragma omp parallel num_threads(3)
> > {
> >   #pragma omp target
> >   ...
> > }
> > then I don't see what would previously prevent the concurrent
> > offloading, yes, we take the device lock during gomp_map_vars and
> > again during gomp_unmap_vars, but don't hold it across the
> > offloading in between.
>
> I still don't think I quite understand what you're getting at.
>
> We only implement synchronous launches for OpenMP on NVPTX at present,
> and those all use the default CUDA runtime driver stream. Only one
> kernel executes on the hardware at once, even if launched from
> different host threads. The serialisation isn't due to the device lock
> being held, but by the queueing semantics of the underlying API.
>
> > > Does target-side memory allocation call back into the plugin's
> > > GOMP_OFFLOAD_alloc? I'm not sure how that works. If not,
> > > target-side memory allocation shouldn't be affected, I don't
> > > think?
> >
> > Again, I'm not suggesting that it should, but what I'm saying is
> > that if a target region ends but some other host tasks are doing
> > target regions to the same device concurrently with that, or if
> > there are async targets in flight, we shouldn't try to cache the
> > stack, but free it right away, because the other target regions
> > might need to malloc larger amounts of memory and fail because of
> > the caching.
>
> I'm assuming you're not suggesting fundamentally changing APIs or
> anything to determine if we're launching target regions from multiple
> threads at once, but instead that we try to detect the condition
> dynamically in the plugin?
>
> So, would kernel launch look something like this? (Excuse
> pseudo-code-isms!)
>
> void GOMP_OFFLOAD_run (...)
> {
>   bool used_cache;
>
>   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>   if (ptx_dev->omp_stacks.usage_count > 0)
>     {
>       cuCtxSynchronize ();
>       nvptx_stacks_free (ptx_dev);
>       ...allocate fresh stack, no caching...
>       used_cache = false;
>     }
>   else
>     {
>       /* Allocate or re-use cached stacks, and then...  */
>       ptx_dev->omp_stacks.usage_count++;
>       used_cache = true;
>     }
>   pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>
>   /* Launch kernel.  */
>
>   if (used_cache)
>     cuStreamAddCallback (..., /* callback that does: */
>       pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>       ptx_dev->omp_stacks.usage_count--;
>       pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>     );
>   else
>     {
>       pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>       /* Free uncached stack.  */
>       pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>     }
> }
>
> This seems like it'd be rather fragile to me, and would offer some
> benefit perhaps only if a previous cached stacks block was much larger
> than the one required for some given later launch. It wouldn't allow
> any additional parallelism on the target, I don't think.
>
> Is that sort-of what you meant?
>
> Oh, or perhaps something more like checking cuStreamQuery at the end
> of the kernel launch to see if more work (...from other threads) is
> outstanding on the same queue? I think that only usefully returns
> CUDA_SUCCESS or CUDA_ERROR_NOT_READY, so I'm not sure if that'd help.
>
> Thanks for clarification (& apologies for being slow!),
>
> Julian