From: Thomas Schwinge
To: Chung-Lin Tang
CC:
Subject: Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts
In-Reply-To: <0c99f730-3cd6-d43a-fb71-dbdf6bcf26b5@mentor.com>
References: <12319572-dd02-c946-f2b9-9d047be9c707@mentor.com> <0c99f730-3cd6-d43a-fb71-dbdf6bcf26b5@mentor.com>
Date: Mon, 17 Dec 2018 14:32:00 -0000

Hi Chung-Lin!

On Mon, 17 Dec 2018 19:03:12 +0800, Chung-Lin Tang wrote:
> On 2018/12/14 10:56 PM, Thomas Schwinge wrote:
> > +  //TODO Are these safe to call, or might this cause deadlock if something's locked?
> >    CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
> >    CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
> >    CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
> > @@ -1413,6 +1416,7 @@ static void
> >  cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
> >  {
> >    if (res != CUDA_SUCCESS)
> > +    //TODO Is this safe to call, or might this cause deadlock if something's locked?
> >      GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
>
> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal()
> is that we hold the struct gomp_device_descr's *device* lock, which is also
> acquired when we execute the atexit device shutdown handlers, hence the
> deadlock.
>
> I don't think this is the case for the OpenACC entry points that go through
> the openacc.async.* hooks,

Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
some structural changes, to have plugin functions call the non-terminating
"GOMP_PLUGIN_error" and return some error, instead of calling
"GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's another TODO
item for later, separately...  Or, if that's actually the case, that this
has been fixed in the way I described, then should these functions also be
changed accordingly: instead of "GOMP_PLUGIN_fatal", call
"GOMP_PLUGIN_error" and then return an error code?)

> though I can audit them again if deemed so.
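For reference, and to make sure we mean the same thing here, below is a
stand-alone model of that failure mode.  This is not libgomp code; all
names are made up for illustration.  It just shows why terminating from
inside the plugin while a lock is held can deadlock in the atexit shutdown
path, whereas reporting the error and returning it to a caller that doesn't
hold the lock cannot:

    /* Stand-alone sketch of the deadlock scenario; all names are
       illustrative only.  Build with "-pthread".  */

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stands in for the per-device lock discussed above.  */
    static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stands in for an atexit device shutdown handler, which needs the
       lock, too.  */
    static void
    shutdown_handler (void)
    {
      pthread_mutex_lock (&device_lock);
      /* ... shut down the device ...  */
      pthread_mutex_unlock (&device_lock);
    }

    /* "Fatal" style: aborting while the lock is held means that exit runs
       shutdown_handler, which then blocks on device_lock forever.  */
    void
    work_fatal_style (bool fail)
    {
      pthread_mutex_lock (&device_lock);
      if (fail)
        {
          fprintf (stderr, "fatal error\n");
          exit (EXIT_FAILURE);  /* Would deadlock in shutdown_handler:
                                   device_lock is still held.  */
        }
      pthread_mutex_unlock (&device_lock);
    }

    /* "Error" style: report, release the lock, and hand the failure back;
       only a caller that holds no such lock decides whether to abort.  */
    bool
    work_error_style (bool fail)
    {
      pthread_mutex_lock (&device_lock);
      if (fail)
        {
          fprintf (stderr, "error reported, not fatal here\n");
          pthread_mutex_unlock (&device_lock);
          return false;
        }
      pthread_mutex_unlock (&device_lock);
      return true;
    }

    int
    main (void)
    {
      atexit (shutdown_handler);
      if (!work_error_style (true))
        exit (EXIT_FAILURE);  /* Safe: device_lock has been released.  */
      return 0;
    }

That's of course only the mechanism; the real question remains which locks
are actually held at the points marked TODO above.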
My understanding had been that deadlock may happen if we're inside some of
these async/wait/serialize/synchronize functions, with "async" locked, then
run into an error, then libgomp prepares to abort, and at that time shuts
down the device, which will shut down the asyncqueues
("goacc_fini_asyncqueues"), which will again try to lock "async" -- which it
actually doesn't.  My misunderstanding, I guess?

> > But then, I wonder if we couldn't skip all that locking, if we moved the
> > "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?
> >
>
> >      struct {
> > +      //TODO Why do these live in the "device" data structure, and not in the "per-thread" data structure?
> > +      //TODO Aren't they meant to be separate per thread?
> > +      //TODO That is, as far as I remember right now, OpenACC explicitly states that an asyncqueue doesn't entail any synchronization between different host threads.
> > +      //TODO Verify OpenACC.
> > +      //TODO With that moved into "goacc_thread", we could get rid of all the locking needed here?
> >        /* Once created and put into the "active" list, asyncqueues are then never
> >           destructed and removed from the "active" list, other than if the TODO
> >           device is shut down.  */
> >
> > At this point, I will again (as in that other email) state that my
> > understanding of OpenACC is that an async queue does not entail any
> > inter-thread synchronization, so it would seem reasonable that all async
> > queues are separate per thread.
>
> OpenACC 2.6 specification, section 2.16.1 "async clause" (page 68, line 2029):
>
> "If there are two or more host threads executing and sharing the same
> accelerator device, two asynchronous operations with the same async-value
> will be enqueued on the same activity queue"

Right, but then, in the very next sentence, it goes on to state: "If the
threads are not synchronized with respect to each other, the operations may
be enqueued in either order and therefore may execute on the device in
either order".  So this, and given that:

> That said, I recall most (if not all) of the synchronization operations and
> behavior are defined to be with respect to operations of the local host
> thread only, so the spec mentioning interaction with other host threads
> here may be moot, as there's no meaningful way to synchronize with the
> OpenACC activity of other host threads (though correct me if I forgot some
> API)

..., I concluded that something must be wrong in the OpenACC 2.6, 2.16.1
"async clause" text, and no such (host-side) inter-thread synchronization
can be expected from OpenACC "async"/"wait".  I've also already got that on
my list of things to clarify with the OpenACC technical committee, later on.

> Also, CUDA streams do not seem to support local-thread-operation-only
> synchronization.  I remember this was an issue in the old implementation
> inside the nvptx plugin as well, and we had hacks to somewhat alleviate it
> (i.e. marking streams "single" or "multi" threaded)

Right.

> Well, another issue that we might want to bring up to the OpenACC
> committee :)
> I agree that if async queue spaces were simply thread-local then things
> would be much simpler.

OK, so you agree with that, good.  And, no problem foreseeable about simply
moving the asyncqueues into "goacc_thread" -- and removing the "async" lock?

Grüße
 Thomas
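P.S.  To make sure we're talking about the same thing regarding that last
question, here's a rough sketch of the data-structure change I have in
mind.  The struct and field names are only illustrative, not the actual
"libgomp.h"/"oacc-int.h" contents:

    /* Placeholder declarations, so the sketch is self-contained.  */
    struct goacc_asyncqueue;
    typedef int gomp_mutex_t;

    /* Roughly the current arrangement: the asyncqueue table hangs off the
       per-device data structure, is shared by all host threads, and thus
       needs a lock.  */
    struct device_data_sketch
    {
      struct
      {
        gomp_mutex_t lock;      /* Guards the table below.  */
        int nasyncqueue;
        struct goacc_asyncqueue **asyncqueue;  /* Indexed by async value.  */
      } async;
    };

    /* What I'm suggesting: the table lives in the per-thread data
       structure ("goacc_thread"), is only ever touched by its own host
       thread, and thus needs no lock.  */
    struct goacc_thread_sketch
    {
      /* ... existing per-thread fields ...  */
      int nasyncqueue;
      struct goacc_asyncqueue **asyncqueue;  /* Indexed by async value.  */
    };

With that, each host thread's async-value N maps to that thread's own
underlying queue, and, per the 2.16.1 discussion above, any cross-thread
ordering remains the user's responsibility.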