From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 50142 invoked by alias); 28 Dec 2018 12:07:45 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 50128 invoked by uid 89); 28 Dec 2018 12:07:45 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-11.9 required=5.0 tests=BAYES_00,GIT_PATCH_2,GIT_PATCH_3,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=destroy, locked, destroying, requiring X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 28 Dec 2018 12:07:41 +0000 Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202]) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1gcqvT-0002Fw-25 from Thomas_Schwinge@mentor.com for gcc-patches@gcc.gnu.org; Fri, 28 Dec 2018 04:07:39 -0800 Received: from svr-orw-mbx-03.mgc.mentorg.com (147.34.90.203) by svr-orw-mbx-02.mgc.mentorg.com (147.34.90.202) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Fri, 28 Dec 2018 04:07:36 -0800 Received: from tftp-cs (147.34.91.1) by svr-orw-mbx-03.mgc.mentorg.com (147.34.90.203) with Microsoft SMTP Server id 15.0.1320.4 via Frontend Transport; Fri, 28 Dec 2018 04:07:36 -0800 Received: by tftp-cs (Postfix, from userid 49978) id F0F9AC220A; Fri, 28 Dec 2018 04:07:35 -0800 (PST) From: Thomas Schwinge To: Chung-Lin Tang CC: Subject: Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v3) In-Reply-To: <9d8a319f-1955-3c1d-74b2-75c3815943ba@mentor.com> References: <12319572-dd02-c946-f2b9-9d047be9c707@mentor.com> <34db59a1-0446-1720-9a2e-b6e54f52edb2@mentor.com> <9d8a319f-1955-3c1d-74b2-75c3815943ba@mentor.com> User-Agent: Notmuch/0.9-125-g4686d11 (http://notmuchmail.org) Emacs/25.2.2 (x86_64-pc-linux-gnu) Date: Fri, 28 Dec 2018 14:52:00 -0000 Message-ID: <874laxluri.fsf@euler.schwinge.homeip.net> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-SW-Source: 2018-12/txt/msg01729.txt.bz2 Hi Chung-Lin! On Sat, 22 Dec 2018 00:04:56 +0800, Chung-Lin Tang wrote: > On 2018/12/19 5:03 AM, Thomas Schwinge wrote: > > On Tue, 18 Dec 2018 23:06:38 +0800, Chung-Lin Tang wrote: > >> this part includes some of the lookup_goacc_asyncqueue fixes we talked= about. > >> I am still thinking about how the queue lock problem should really be = solved, so regard > >> this patch as just fixing some of the problems. >=20 > This is my solution to the queue lock stuff we talked about. I've only at= tached a diff to > be applied atop of the existing changes, though that may actually be easi= er to review. >=20 > Note that this is still in testing, which means it hasn't been tested :P,= but I'm > posting to see if you have time to give it a look before the holidays. >=20 > Having the need for a lock on the async queues is quite irritating, espec= ially > when the structure needed for managing them is quite simple. > > Therefore, lets do away the need for locks entirely. OK, if that's easily possible, fine. Though, I won't object to the current "lock" version, if done properly in all the relevant places, as pointed out. (And, as that was an open issue, as far as I remember: I think its fine to have "aq"s not protected by the async lock, because they will only be destroyed by device shutdown, and at that point, all async operations must be synchronized with the local (host) thread, so the are then "aq"s idle.) > This patch makes the asyncqueue part of the device->openacc.async managed= by lock-free > atomic operations; almost all of the complexity is contained in lookup_go= acc_asyncqueue(), > so it should be not too complex. A descriptor and the queue array is allo= cated/exchanged > atomically when more storage is required, while in the common case a simp= le lookup is enough. > The fact that we manage asyncqueues by only expanding and never destroyin= g asyncqueues > during the device lifetime also simplifies many things. >=20 > The current implementation may be not that optimized and clunky in some c= ases, but I think > this should be the way to implement what should be a fairly simple asyncq= ueue array and active > list. OK conceptually, but from the (very quick only!) look that I had, I did not completely understand the data structure you're adding there, and don't you also have to use atomic instructions for reading the lock-free data structure ("aqdesc" etc.)? > I'll update more on the status as testing proceeds. >=20 > (and about the other corners you noticed in the last mail, I'll get to th= at later...) OK, thanks. Gr=C3=BC=C3=9Fe Thomas > >> --- libgomp/oacc-async.c (revision 267226) > >> +++ libgomp/oacc-async.c (working copy) > >=20 > >> +attribute_hidden struct goacc_asyncqueue * > >> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int a= sync) > >> +{ > >> +[...] > >> + /* Link new async queue into active list. */ > >> + goacc_aq_list n =3D gomp_malloc (sizeof (struct goacc_asyncqueu= e_list)); > >> + n->aq =3D dev->openacc.async.asyncqueue[async]; > >> + n->next =3D dev->openacc.async.active; > >> + dev->openacc.async.active =3D n; > >> + } > >> + gomp_mutex_unlock (&dev->openacc.async.lock); > >=20 > > You still need to keep "async" locked during... > >=20 > >> + return dev->openacc.async.asyncqueue[async]; > >=20 > > ... this dereference. > >=20 > >> +} > --- trunk-orig/libgomp/libgomp.h 2018-12-20 23:24:20.040375724 +0800 > +++ trunk-work/libgomp/libgomp.h 2018-12-21 23:32:47.938628954 +0800 > @@ -937,6 +937,13 @@ >=20=20 > #include "splay-tree.h" >=20=20 > +struct goacc_asyncqueue_desc > +{ > + int nasyncqueue; > + struct goacc_asyncqueue **asyncqueue; > + struct goacc_asyncqueue_list *active; > +}; > + > typedef struct acc_dispatch_t > { > /* This is a linked list of data mapped using the > @@ -955,10 +962,7 @@ > *destroy_thread_data_func; >=20=20=20=20 > struct { > - gomp_mutex_t lock; > - int nasyncqueue; > - struct goacc_asyncqueue **asyncqueue; > - struct goacc_asyncqueue_list *active; > + struct goacc_asyncqueue_desc *aqdesc; >=20=20 > __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func; > __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func; > diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c > --- trunk-orig/libgomp/oacc-async.c 2018-12-18 22:19:51.923102938 +0800 > +++ trunk-work/libgomp/oacc-async.c 2018-12-21 23:40:36.088528890 +0800 > @@ -70,45 +70,84 @@ >=20=20 > struct gomp_device_descr *dev =3D thr->dev; >=20=20 > - gomp_mutex_lock (&dev->openacc.async.lock); > + struct goacc_asyncqueue_desc *aqdesc =3D dev->openacc.async.aqdesc; >=20=20 > - if (!create > - && (async >=3D dev->openacc.async.nasyncqueue > - || !dev->openacc.async.asyncqueue[async])) > - { > - gomp_mutex_unlock (&dev->openacc.async.lock); > - return NULL; > - } > + if (!create) > + return async < aqdesc->nasyncqueue ? aqdesc->asyncqueue[async] : NUL= L; >=20=20 > - if (async >=3D dev->openacc.async.nasyncqueue) > - { > - int diff =3D async + 1 - dev->openacc.async.nasyncqueue; > - dev->openacc.async.asyncqueue > - =3D gomp_realloc (dev->openacc.async.asyncqueue, > - sizeof (goacc_aq) * (async + 1)); > - memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncq= ueue, > - 0, sizeof (goacc_aq) * diff); > - dev->openacc.async.nasyncqueue =3D async + 1; > - } > + if (async < aqdesc->nasyncqueue && aqdesc->asyncqueue[async]) > + return aqdesc->asyncqueue[async]; >=20=20 > - if (!dev->openacc.async.asyncqueue[async]) > - { > - dev->openacc.async.asyncqueue[async] =3D dev->openacc.async.constr= uct_func (); > + struct goacc_asyncqueue *new_aq =3D dev->openacc.async.construct_func = (); > + const int inc_size =3D 8; > + int new_sz =3D async + 1 + inc_size; > + new_sz =3D (new_sz + inc_size - 1) & ~(inc_size - 1); > + > + struct goacc_asyncqueue **new_aqvec =3D > + gomp_malloc_cleared (sizeof (goacc_aq) * new_sz); > + new_aqvec[async] =3D new_aq; > + > + struct goacc_asyncqueue_desc *ndesc =3D > + gomp_malloc (sizeof (struct goacc_asyncqueue_desc)); > + ndesc->asyncqueue =3D new_aqvec; > + ndesc->nasyncqueue =3D new_sz; > + > + goacc_aq_list nl =3D gomp_malloc (sizeof (struct goacc_asyncqueue_list= )); > + nl->aq =3D new_aq; > + ndesc->active =3D nl; > + > + /* Loop to atomically create and add a new async queue, without using = a lock, > + though may possibly requiring creating and replacing with a new des= criptor. > + This may likely not be as optimized as it can be, although simplifi= es > + other handling of async queues. */ > + do { > + int curr_num_aqs =3D aqdesc->nasyncqueue; > +=20=20=20=20 > + if (async < curr_num_aqs) > + { > + /* Someone else expanded the asyncqeue array to enough size, free the > + allocated resources. */ > + free (ndesc->asyncqueue); > + free (ndesc); > + > + /* Try to set the new aq to the right place again. */ > + struct goacc_asyncqueue_desc *nullq =3D NULL; > + if (__atomic_compare_exchange (&aqdesc->asyncqueue[async], > + &nullq, &new_aq, false, > + __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) > + { > + /* Push new asyncqueue to front of active list. */ > + nl->next =3D aqdesc->active; > + while (!__atomic_compare_exchange (&aqdesc->active, &nl->next, &nl, > + true, __ATOMIC_ACQ_REL, > + __ATOMIC_ACQUIRE)) > + ; > + } > + else > + { > + /* Okay, someone else already created 'async', free/destroy > + everything allocated and return existing queue directly. */ > + if (!dev->openacc.async.destruct_func (new_aq)) > + gomp_fatal ("error in async queue creation"); > + free (nl); > + } > + return aqdesc->asyncqueue[async]; > + } >=20=20 > - if (!dev->openacc.async.asyncqueue[async]) > - { > - gomp_mutex_unlock (&dev->openacc.async.lock); > - gomp_fatal ("async %d creation failed", async); > - } > -=20=20=20=20=20=20 > - /* Link new async queue into active list. */ > - goacc_aq_list n =3D gomp_malloc (sizeof (struct goacc_asyncqueue_l= ist)); > - n->aq =3D dev->openacc.async.asyncqueue[async]; > - n->next =3D dev->openacc.async.active; > - dev->openacc.async.active =3D n; > - } > - gomp_mutex_unlock (&dev->openacc.async.lock); > - return dev->openacc.async.asyncqueue[async]; > + /* Copy current asyncqueue contents*/ > + memcpy (ndesc->asyncqueue, aqdesc->asyncqueue, > + sizeof (struct goacc_asyncqueue *) * curr_num_aqs); > + ndesc->active->next =3D aqdesc->active; > + > + } while (!__atomic_compare_exchange (&dev->openacc.async.aqdesc, > + &aqdesc, &ndesc, false, > + __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)); > + > + /* Free the old stale descriptor and resources in aqdesc. */ > + free (aqdesc->asyncqueue); > + free (aqdesc); > + > + return dev->openacc.async.aqdesc->asyncqueue[async]; > } >=20=20 > attribute_hidden struct goacc_asyncqueue * > @@ -139,14 +178,12 @@ > struct goacc_thread *thr =3D get_goacc_thread (); >=20=20 > int ret =3D 1; > - gomp_mutex_lock (&thr->dev->openacc.async.lock); > - for (goacc_aq_list l =3D thr->dev->openacc.async.active; l; l =3D l->n= ext) > + for (goacc_aq_list l =3D thr->dev->openacc.async.aqdesc->active; l; l = =3D l->next) > if (!thr->dev->openacc.async.test_func (l->aq)) > { > ret =3D 0; > break; > } > - gomp_mutex_unlock (&thr->dev->openacc.async.lock); > return ret; > } >=20=20 > @@ -194,10 +231,8 @@ > { > struct gomp_device_descr *dev =3D get_goacc_thread_device (); >=20=20 > - gomp_mutex_lock (&dev->openacc.async.lock); > - for (goacc_aq_list l =3D dev->openacc.async.active; l; l =3D l->next) > + for (goacc_aq_list l =3D dev->openacc.async.aqdesc->active; l; l =3D l= ->next) > dev->openacc.async.synchronize_func (l->aq); > - gomp_mutex_unlock (&dev->openacc.async.lock); > } >=20=20 > /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_= all. */ > @@ -221,14 +256,12 @@ >=20=20 > goacc_aq waiting_queue =3D lookup_goacc_asyncqueue (thr, true, async); >=20=20 > - gomp_mutex_lock (&thr->dev->openacc.async.lock); > - for (goacc_aq_list l =3D thr->dev->openacc.async.active; l; l =3D l->n= ext) > + for (goacc_aq_list l =3D thr->dev->openacc.async.aqdesc->active; l; l = =3D l->next) > { > thr->dev->openacc.async.synchronize_func (l->aq); > if (waiting_queue) > thr->dev->openacc.async.serialize_func (l->aq, waiting_queue); > } > - gomp_mutex_unlock (&thr->dev->openacc.async.lock); > } >=20=20 > int > @@ -261,30 +294,27 @@ > attribute_hidden void > goacc_init_asyncqueues (struct gomp_device_descr *devicep) > { > - gomp_mutex_init (&devicep->openacc.async.lock); > - devicep->openacc.async.nasyncqueue =3D 0; > - devicep->openacc.async.asyncqueue =3D NULL; > - devicep->openacc.async.active =3D NULL; > + devicep->openacc.async.aqdesc =3D > + gomp_malloc_cleared (sizeof (struct goacc_asyncqueue_desc)); > } >=20=20 > attribute_hidden bool > goacc_fini_asyncqueues (struct gomp_device_descr *devicep) > { > + struct goacc_asyncqueue_desc *aqdesc =3D devicep->openacc.async.aqdesc; > bool ret =3D true; > - if (devicep->openacc.async.nasyncqueue > 0) > + > + if (aqdesc->nasyncqueue > 0) > { > goacc_aq_list next; > - for (goacc_aq_list l =3D devicep->openacc.async.active; l; l =3D n= ext) > + for (goacc_aq_list l =3D aqdesc->active; l; l =3D next) > { > ret &=3D devicep->openacc.async.destruct_func (l->aq); > next =3D l->next; > free (l); > } > - free (devicep->openacc.async.asyncqueue); > - devicep->openacc.async.nasyncqueue =3D 0; > - devicep->openacc.async.asyncqueue =3D NULL; > - devicep->openacc.async.active =3D NULL; > + free (aqdesc->asyncqueue); > } > - gomp_mutex_destroy (&devicep->openacc.async.lock); > + free (aqdesc); > return ret; > } > diff -ru trunk-orig/libgomp/oacc-cuda.c trunk-work/libgomp/oacc-cuda.c > --- trunk-orig/libgomp/oacc-cuda.c 2018-12-18 18:09:08.601913507 +0800 > +++ trunk-work/libgomp/oacc-cuda.c 2018-12-21 23:34:10.024738834 +0800 > @@ -87,9 +87,7 @@ > if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func) > { > goacc_aq aq =3D get_goacc_asyncqueue (async); > - gomp_mutex_lock (&thr->dev->openacc.async.lock); > ret =3D thr->dev->openacc.cuda.set_stream_func (aq, stream); > - gomp_mutex_unlock (&thr->dev->openacc.async.lock); > } >=20=20 > return ret;