From: Thomas Schwinge
To: Chung-Lin Tang
Subject: Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts
Date: Tue, 18 Dec 2018 11:44:00 -0000

Hi Chung-Lin!

On Tue, 18 Dec 2018 18:02:54 +0800, Chung-Lin Tang wrote:
> On 2018/12/17 10:32 PM, Thomas Schwinge wrote:
> >> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is when we hold the
> >> struct gomp_device_descr's *device* lock, which is also acquired when we execute atexit device shutdown handlers, hence the deadlock.
> >>
> >> I don't think this is the case for the OpenACC entry points that grab at the openacc.async.* hooks,
> > Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
> > some structural changes, to have plugin functions call the
> > non-terminating "GOMP_PLUGIN_error" and return some error, instead of
> > calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's
> > another TODO item for later, separately...  Or, if that's actually the
> > case, and this has been fixed in the way I described, then should these
> > functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal",
> > call "GOMP_PLUGIN_error" and then return an error code?)
> 
> You remembered correctly, although...
> 
> >> though I can audit them again if deemed so.
> > My understanding had been that deadlock may happen if we're inside some
> > of these async/wait/serialize/synchronize functions, with "async" locked,
> > then run into an error, then libgomp prepares to abort, and at that time
> > shuts down the device, which will shut down the asyncqueues
> > ("goacc_fini_asyncqueues"), which will again try to lock "async" -- which
> > it actually doesn't.  My misunderstanding, I guess?
> 
> ...at least now, you can see that goacc_fini_asyncqueues() does not attempt to
> acquire devicep->openacc.async.lock when doing cleanup.
> 
> Come to think of it, that might be a bug there. :P

Heh, I wondered about that, too.  ;-)
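(For the record, the kind of locked cleanup I'd naively expect there would
look roughly like the following.  This is just a sketch against simplified
stand-in structures -- the field names and the queue-table layout are
assumptions for illustration, not the actual libgomp declarations; in the
real code this data lives under "devicep->openacc.async.*", and the
destruct hook is "openacc.async.destruct_func".)

/* Sketch only: a simplified stand-in for struct gomp_device_descr's
   "openacc.async" data (lock, table of asyncqueues, destruct hook).
   Field names and the table layout are assumed, for illustration.  */

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct goacc_asyncqueue;	/* Opaque; owned by the plugin.  */

struct goacc_async_sketch
{
  pthread_mutex_t lock;		/* Cf. openacc.async.lock.  */
  struct goacc_asyncqueue **asyncqueue;	/* Queue table (assumed layout).  */
  int nasyncqueue;
  /* Cf. openacc.async.destruct_func.  */
  bool (*destruct_func) (struct goacc_asyncqueue *);
};

/* Destruct all asyncqueues while holding the async lock, so that this
   cleanup can't race with a concurrent user of the queue table.  */

static void
fini_asyncqueues_sketch (struct goacc_async_sketch *async)
{
  pthread_mutex_lock (&async->lock);
  for (int i = 0; i < async->nasyncqueue; i++)
    if (async->asyncqueue[i] != NULL)
      {
	async->destruct_func (async->asyncqueue[i]);
	async->asyncqueue[i] = NULL;
      }
  free (async->asyncqueue);
  async->asyncqueue = NULL;
  async->nasyncqueue = 0;
  pthread_mutex_unlock (&async->lock);
}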
An asyncqueue as returned by "lookup_goacc_asyncqueue" itself is not
locked (and I suppose it shouldn't be, because that would be "too
much"?), so it may -- at least (only?) in a specially constructed test
case -- happen that an asyncqueue gets destructed
("goacc_fini_asyncqueues") while it's still in use?  (Don't know how the
CUDA Driver library thinks of that, for example.  Though, probably, that
scenario can only happen if the device used by a certain host thread is
shut down while an "async" operation is still running.)

But, can we easily avoid that issue by calling
"openacc.async.synchronize_func" before "openacc.async.destruct_func"
(or, have the latter do that internally)?  Just have to make sure that
any such synchronization then doesn't raise (potentially) nested
"GOMP_PLUGIN_fatal" calls.  Hence the TODO comment I added in my "into
async re-work: locking concerns" commit, before the
"openacc.async.destruct_func" call: "Can/should/must we "synchronize"
here (how?), so as to make sure that no other operation on this
asyncqueue is going on while/after we've destructed it here?"

Probably an error-ignoring "cuStreamSynchronize" call before
"cuStreamDestroy" would be reasonable?
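Roughly, I'm thinking of something like the following in the plugin's
queue-destruct path -- just a sketch of the idea, not the actual nvptx
plugin code; the function name is made up, and the CUresult return
values are deliberately ignored so that a failure during shutdown can't
raise another, nested "GOMP_PLUGIN_fatal":

#include <cuda.h>

/* Sketch only: drain a CUDA stream (ignoring any error) before
   destroying it, using the plain CUDA Driver API.  */

static void
destruct_stream_sketch (CUstream stream)
{
  /* Wait for any still-pending work on the stream; deliberately ignore
     the result, so that an error here doesn't turn into a nested
     fatal-error path during shutdown.  */
  (void) cuStreamSynchronize (stream);

  /* Only then tear the stream down.  */
  (void) cuStreamDestroy (stream);
}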
Oh, and don't we have another problem...  Consider an "acc_shutdown" run
by host thread 1, while another host thread 2 continues to use the
device-wide queues.  That "acc_shutdown" will call "gomp_fini_device",
which will call "goacc_fini_asyncqueues", which will happily destruct
the whole device-wide "async" data.  Just taking the "async" lock before
doing that won't solve the problem, as host thread 2 is supposed to
continue using the existing queues.  Reference counting required?

Anyway: I'm not asking you to fix that now.  "Fortunately", we're not at
all properly implementing OpenACC usage in the context of multiple host
threads (such as created by OpenMP or the pthreads interface), so I'm
just noting that issue now, to be resolved later (as part of our
internal tracker issues CSTS-110 or CSTS-115).

Anyway:

> >> "If there are two or more host threads executing and sharing the same accelerator device,
> >> two asynchronous operations with the same async-value will be enqueued on the same activity queue"
> > Right, but then, in the very next sentence, it goes on to state: "If the
> > threads are not synchronized with respect to each other, the operations
> > may be enqueued in either order and therefore may execute on the device
> > in either order".  So this, and given that:
> 
> I actually didn't care much about that next sentence, since it's just stating the obvious :)

;-)

> It also seems to imply that the multiple host threads are enqueuing operations to the same async queue, hence further
> corroborating that queues are device-wide, not thread-wide.

OK, that's your (certainly valid) interpretation; mine was to make our
life simpler:

> >> That said, I recall most (if not all) of the synchronization operations and behavior are all
> >> defined to be with respect to operations of the local host thread only, so the spec mentioning interaction with
> >> other host threads here may be moot, as there's no meaningful way to synchronize with
> >> the OpenACC activity of other host threads (though correct me if I forgot some API)
> > ..., I concluded something must be wrong in the OpenACC 2.6,
> > 2.16.1. "async clause" text, and no such (host-side) inter-thread
> > synchronization can be expected from OpenACC "async"/"wait".  I've also
> > already got that on my list of things to clarify with the OpenACC
> > technical committee, later on.

... so that we could simply avoid all that locking, by moving all
"async" stuff into "goacc_thread" instead of keeping it device-wide.

But you're raising a valid point here:

> I just remembered, there does seem to be one area where the device- vs. thread-wide interpretation will be visible:
> when using acc_get/set_cuda_stream().  Supposedly, given the specification's device-wide queue/stream model,
> different host threads should access the same CUDA stream when using acc_get/set_cuda_stream().
> This will break if we made async queues thread-local.

That's either (a) correct (in your interpretation of the device-wide
queue model), or (b) another OpenACC specification documentation issue
(in the host thread-local queue model).  ;-|

> >> Also, CUDA streams do not seem to support local-thread-operation-only synchronization.
> >> I remember this was an issue in the old implementation inside the nvptx plugin as well, and we
> >> had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" threaded)
> > Right.
> >
> >> Well, another issue that we might want to bring up to the OpenACC committee :)
> >> I agree that if async queue spaces were simply thread-local then things would be much simpler.
> > OK, so you agree with that, good.
> >
> > And, no problem foreseeable about simply moving the asyncqueues into
> > "goacc_thread" -- and removing the "async" lock?
> 
> I think we should still try to solve the potential deadlock problems, and stay close to the current
> implementation just for now.

OK.  Again, I was just trying to avoid that work.

> We can ask the committee for further guidance later.

ACK.

Oh, and I have another (possible?) interpretation of the OpenACC 2.6
specification: if the queues themselves indeed are device-wide, then
maybe the "active" queues list is private per host thread?  Because,
don't all the "test/wait all" operations (which are the ones that work
on the "active" queues list) have this "local (host) thread only"
comment?  Is that the intention of the specification after all?

I'll leave it to you whether you want to look into that at this point --
if that might simplify some of the locking required?


Grüße
 Thomas