From: Thomas Schwinge
To: Chung-Lin Tang
CC:
Subject: Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts
In-Reply-To: <0c99f730-3cd6-d43a-fb71-dbdf6bcf26b5@mentor.com>
References: <12319572-dd02-c946-f2b9-9d047be9c707@mentor.com> <0c99f730-3cd6-d43a-fb71-dbdf6bcf26b5@mentor.com>
Date: Mon, 17 Dec 2018 14:32:00 -0000

Hi Chung-Lin!

On Mon, 17 Dec 2018 19:03:12 +0800, Chung-Lin Tang wrote:
> On 2018/12/14 10:56 PM, Thomas Schwinge wrote:
> > +  //TODO Are these safe to call, or might this cause deadlock if something's locked?
> >    CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
> >    CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
> >    CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
> > @@ -1413,6 +1416,7 @@ static void
> >  cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
> >  {
> >    if (res != CUDA_SUCCESS)
> > +    //TODO Is this safe to call, or might this cause deadlock if something's locked?
> >      GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
>
> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal()
> is that we hold the struct gomp_device_descr's *device* lock, which is also
> acquired when we execute the atexit device shutdown handlers, hence the
> deadlock.
>
> I don't think this is the case for the OpenACC entry points that go through
> the openacc.async.* hooks,

Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
some structural changes, to have plugin functions call the non-terminating
"GOMP_PLUGIN_error" and return some error, instead of calling
"GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's another TODO
item for later, separately...  Or, if that's actually the case, that this
has been fixed in the way I described, then should these functions also be
changed accordingly: instead of "GOMP_PLUGIN_fatal", call
"GOMP_PLUGIN_error" and then return an error code?)

> though I can audit them again if deemed so.
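For reference, and to make sure we mean the same thing here, below is a
stand-alone model of that failure mode.  This is not libgomp code; all
names are made up for illustration.  It just shows why terminating from
inside the plugin while a lock is held can deadlock in the atexit shutdown
path, whereas reporting the error and returning it to a caller that doesn't
hold the lock cannot:

    /* Stand-alone sketch of the deadlock scenario; all names are
       illustrative only.  Build with "-pthread".  */

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stands in for the per-device lock discussed above.  */
    static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stands in for an atexit device shutdown handler, which needs the
       lock, too.  */
    static void
    shutdown_handler (void)
    {
      pthread_mutex_lock (&device_lock);
      /* ... shut down the device ...  */
      pthread_mutex_unlock (&device_lock);
    }

    /* "Fatal" style: aborting while the lock is held means that exit runs
       shutdown_handler, which then blocks on device_lock forever.  */
    void
    work_fatal_style (bool fail)
    {
      pthread_mutex_lock (&device_lock);
      if (fail)
        {
          fprintf (stderr, "fatal error\n");
          exit (EXIT_FAILURE);  /* Would deadlock in shutdown_handler:
                                   device_lock is still held.  */
        }
      pthread_mutex_unlock (&device_lock);
    }

    /* "Error" style: report, release the lock, and hand the failure back;
       only a caller that holds no such lock decides whether to abort.  */
    bool
    work_error_style (bool fail)
    {
      pthread_mutex_lock (&device_lock);
      if (fail)
        {
          fprintf (stderr, "error reported, not fatal here\n");
          pthread_mutex_unlock (&device_lock);
          return false;
        }
      pthread_mutex_unlock (&device_lock);
      return true;
    }

    int
    main (void)
    {
      atexit (shutdown_handler);
      if (!work_error_style (true))
        exit (EXIT_FAILURE);  /* Safe: device_lock has been released.  */
      return 0;
    }

That's of course only the mechanism; the real question remains which locks
are actually held at the points marked TODO above.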
My understanding had been that deadlock may happen if we're inside some of
these async/wait/serialize/synchronize functions, with "async" locked, then
run into an error, then libgomp prepares to abort, and at that time shuts
down the device, which will shut down the asyncqueues
("goacc_fini_asyncqueues"), which will again try to lock "async" -- which it
actually doesn't.  My misunderstanding, I guess?

> > But then, I wonder if we couldn't skip all that locking, if we moved the
> > "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?
> >
>
> >      struct {
> > +      //TODO Why do these live in the "device" data structure, and not in the "per-thread" data structure?
> > +      //TODO Aren't they meant to be separate per thread?
> > +      //TODO That is, as far as I remember right now, OpenACC explicitly states that an asyncqueue doesn't entail any synchronization between different host threads.
> > +      //TODO Verify OpenACC.
> > +      //TODO With that moved into "goacc_thread", we could get rid of all the locking needed here?
> >        /* Once created and put into the "active" list, asyncqueues are then never
> >           destructed and removed from the "active" list, other than if the TODO
> >           device is shut down.  */
> >
> > At this point, I will again (as in that other email) state that my
> > understanding of OpenACC is that an async queue does not entail any
> > inter-thread synchronization, so it would seem reasonable that all async
> > queues are separate per thread.
>
> OpenACC 2.6 specification, section 2.16.1 "async clause" (page 68, line 2029):
>
> "If there are two or more host threads executing and sharing the same
> accelerator device, two asynchronous operations with the same async-value
> will be enqueued on the same activity queue"

Right, but then, in the very next sentence, it goes on to state: "If the
threads are not synchronized with respect to each other, the operations may
be enqueued in either order and therefore may execute on the device in
either order".  So this, and given that:

> That said, I recall most (if not all) of the synchronization operations and
> behavior are defined to be with respect to operations of the local host
> thread only, so the spec mentioning interaction with other host threads
> here may be moot, as there's no meaningful way to synchronize with the
> OpenACC activity of other host threads (though correct me if I forgot some
> API)

..., I concluded that something must be wrong in the OpenACC 2.6, 2.16.1
"async clause" text, and no such (host-side) inter-thread synchronization
can be expected from OpenACC "async"/"wait".  I've also already got that on
my list of things to clarify with the OpenACC technical committee, later on.

> Also, CUDA streams do not seem to support local-thread-operation-only
> synchronization.  I remember this was an issue in the old implementation
> inside the nvptx plugin as well, and we had hacks to somewhat alleviate it
> (i.e. marking streams "single" or "multi" threaded)

Right.

> Well, another issue that we might want to bring up to the OpenACC
> committee :)
> I agree that if async queue spaces were simply thread-local then things
> would be much simpler.

OK, so you agree with that, good.  And, no problem foreseeable about simply
moving the asyncqueues into "goacc_thread" -- and removing the "async" lock?

Grüße
 Thomas
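P.S.  To make sure we're talking about the same thing regarding that last
question, here's a rough sketch of the data-structure change I have in
mind.  The struct and field names are only illustrative, not the actual
"libgomp.h"/"oacc-int.h" contents:

    /* Placeholder declarations, so the sketch is self-contained.  */
    struct goacc_asyncqueue;
    typedef int gomp_mutex_t;

    /* Roughly the current arrangement: the asyncqueue table hangs off the
       per-device data structure, is shared by all host threads, and thus
       needs a lock.  */
    struct device_data_sketch
    {
      struct
      {
        gomp_mutex_t lock;      /* Guards the table below.  */
        int nasyncqueue;
        struct goacc_asyncqueue **asyncqueue;  /* Indexed by async value.  */
      } async;
    };

    /* What I'm suggesting: the table lives in the per-thread data
       structure ("goacc_thread"), is only ever touched by its own host
       thread, and thus needs no lock.  */
    struct goacc_thread_sketch
    {
      /* ... existing per-thread fields ...  */
      int nasyncqueue;
      struct goacc_asyncqueue **asyncqueue;  /* Indexed by async value.  */
    };

With that, each host thread's async-value N maps to that thread's own
underlying queue, and, per the 2.16.1 discussion above, any cross-thread
ordering remains the user's responsibility.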