Subject: Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts
From: Chung-Lin Tang <ChungLin_Tang@mentor.com>
To: Thomas Schwinge, Chung-Lin Tang
Date: Mon, 17 Dec 2018 11:03:00 -0000
Message-ID: <0c99f730-3cd6-d43a-fb71-dbdf6bcf26b5@mentor.com>

On 2018/12/14 10:56 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!
>
> On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang wrote:
>> --- a/libgomp/oacc-async.c
>> +++ b/libgomp/oacc-async.c
>> +attribute_hidden struct goacc_asyncqueue *
>> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
>> +{
>> +  /* The special value acc_async_noval (-1) maps to the thread-specific
>> +     default async stream.  */
>> +  if (async == acc_async_noval)
>> +    async = thr->default_async;
>> +
>> +  if (async == acc_async_sync)
>> +    return NULL;
>> +
>> +  if (async < 0)
>> +    gomp_fatal ("bad async %d", async);
>> +
>> +  struct gomp_device_descr *dev = thr->dev;
>> +
>> +  if (!create
>> +      && (async >= dev->openacc.async.nasyncqueue
>> +          || !dev->openacc.async.asyncqueue[async]))
>> +    return NULL;
>> +
>
> Doesn't this last block also have to be included in the lock you're
> taking below?

Good catch, I'll revise this.

> -  gomp_mutex_lock (&dev->openacc.async.lock);
>    if (id >= dev->openacc.async.nasyncqueue)
>      {
>        int diff = id + 1 - dev->openacc.async.nasyncqueue;
> +      // TODO gomp_realloc might call "gomp_fatal" with "&dev->openacc.async.lock" locked.  Might cause deadlock?
>        dev->openacc.async.asyncqueue
>          = gomp_realloc (dev->openacc.async.asyncqueue,
>                          sizeof (goacc_aq) * (id + 1));
> @@ -105,16 +109,23 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
>
>    if (!dev->openacc.async.asyncqueue[id])
>      {
> +      //TODO We have "&dev->openacc.async.lock" locked here, and if "openacc.async.construct_func" calls "GOMP_PLUGIN_fatal" (via "CUDA_CALL_ASSERT", for example), that might cause deadlock?
> +      //TODO Change the interface to emit an error in the plugin, but then "return NULL", and we catch that here, unlock, and bail out?
>        dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func ();
>
> +  //TODO Are these safe to call, or might this cause deadlock if something's locked?
>    CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
>    CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
>    CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
>
> @@ -1413,6 +1416,7 @@ static void
>  cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
>  {
>    if (res != CUDA_SUCCESS)
> +    //TODO Is this safe to call, or might this cause deadlock if something's locked?
>      GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));

The deadlocks we see from GOMP_PLUGIN_fatal() inside the plugin happen
when we hold the struct gomp_device_descr's *device* lock, which the
atexit device shutdown handlers also acquire; that is where the
deadlock comes from.

I don't think this is the case for the OpenACC entry points that grab
the openacc.async.* hooks, though I can audit them again if deemed
necessary.
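For reference, the "emit the error in the plugin, return NULL, catch
it here" idea could look roughly like the following on the libgomp
side. Untested sketch only: it assumes a changed plugin contract
where construct_func reports the failure itself (e.g. via
GOMP_PLUGIN_error) and returns NULL, rather than calling
GOMP_PLUGIN_fatal while we hold the lock.

/* In lookup_goacc_asyncqueue, with "&dev->openacc.async.lock" held.  */
if (!dev->openacc.async.asyncqueue[id])
  {
    struct goacc_asyncqueue *aq = dev->openacc.async.construct_func ();
    if (!aq)
      {
	/* The plugin has already reported the error; drop the lock
	   before bailing out, so that gomp_fatal's exit path (which
	   may run the atexit device shutdown handlers) cannot
	   deadlock on it.  */
	gomp_mutex_unlock (&dev->openacc.async.lock);
	gomp_fatal ("failed to construct async queue %d", id);
      }
    dev->openacc.async.asyncqueue[id] = aq;
  }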
> But then, I wonder if we couldn't skip all that locking, if we moved the
> "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?
>
>      struct {
> +      //TODO Why do these live in the "device" data structure, and not in the "per-thread" data structure?
> +      //TODO Aren't they meant to be separate per thread?
> +      //TODO That is, as far as I remember right now, OpenACC explicitly states that an asyncqueue doesn't entail any synchronization between different host threads.
> +      //TODO Verify OpenACC.
> +      //TODO With that moved into "goacc_thread", we could get rid of all the locking needed here?
>        /* Once created and put into the "active" list, asyncqueues are then never
>           destructed and removed from the "active" list, other than if the TODO
>           device is shut down.  */
>
> At this point, I will again (as in that other email) state that my
> understanding of OpenACC is that an async queue does not entail any
> inter-thread synchronization, so it would seem reasonable that all async
> queues are separate per thread.

OpenACC 2.6 specification, section 2.16.1 "async clause" (page 68,
line 2029):

"If there are two or more host threads executing and sharing the same
accelerator device, two asynchronous operations with the same
async-value will be enqueued on the same activity queue."

That said, I recall that most (if not all) of the synchronization
operations and behavior are defined with respect to operations of the
local host thread only, so the spec's mention of interaction with
other host threads here may be moot: there is no meaningful way to
synchronize with the OpenACC activity of other host threads (though
correct me if I've forgotten some API).

Also, CUDA streams do not seem to support local-thread-operation-only
synchronization. I remember this was an issue in the old
implementation inside the nvptx plugin as well, and we had hacks to
somewhat alleviate it (i.e., treating streams as "single"- or
"multi"-threaded).

Well, this is another issue that we might want to bring up with the
OpenACC committee :)

I agree that if async queues were simply thread-local, things would
be much simpler.
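To illustrate, a thread-local variant of lookup_goacc_asyncqueue
could look roughly like this. A hypothetical sketch only: the
per-thread "asyncqueue"/"nasyncqueue" fields do not exist today, and
queue cleanup on thread exit is not addressed here.

/* Hypothetical: the asyncqueue table lives in goacc_thread, so only
   the owning host thread ever touches it and no lock is needed.  */
attribute_hidden struct goacc_asyncqueue *
lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
{
  if (async == acc_async_noval)
    async = thr->default_async;
  if (async == acc_async_sync)
    return NULL;
  if (async < 0)
    gomp_fatal ("bad async %d", async);

  if (async >= thr->nasyncqueue)
    {
      if (!create)
	return NULL;
      /* Grow the thread-private table; no lock is held, so gomp_realloc
	 failing via gomp_fatal can no longer deadlock on one.  */
      thr->asyncqueue
	= gomp_realloc (thr->asyncqueue,
			sizeof (struct goacc_asyncqueue *) * (async + 1));
      memset (&thr->asyncqueue[thr->nasyncqueue], 0,
	      sizeof (struct goacc_asyncqueue *)
	      * (async + 1 - thr->nasyncqueue));
      thr->nasyncqueue = async + 1;
    }

  if (!thr->asyncqueue[async])
    {
      if (!create)
	return NULL;
      thr->asyncqueue[async] = thr->dev->openacc.async.construct_func ();
    }

  return thr->asyncqueue[async];
}

Thanks,
Chung-Lin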