From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from esa3.mentor.iphmx.com (esa3.mentor.iphmx.com [68.232.137.180]) by sourceware.org (Postfix) with ESMTPS id 1169C3851C00 for ; Tue, 29 Jun 2021 23:42:25 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 1169C3851C00 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com IronPort-SDR: CqGOCJ+VDjT7YfYmvOEbnjo+hpWDzBUPhxpOe33PbEJh0dpAGLLX8FoBMn7RU2cAfvtHj/WtIF TPQbbCzYZUGjHRIjEykvoqzJfdTCrLUyC9XFd0RCjO+pIJ4KdblLbxO90jvKrHT55xiC4oC+9d Beq+bSuwvDxQWVnvEndYB4JzP61zyknfXlk7rPk1bKfKxIJy7QCSeIPhwoxiJy1HWE2OOudqbz V4071aRhomivQdUkRHqLwX6QNaUZ9Q8pZ7XyXHsJXAqa6CMSYnDzaZer5tbbpI4qnAhvz1ANrv Cjg= X-IronPort-AV: E=Sophos;i="5.83,310,1616486400"; d="scan'208";a="62949882" Received: from orw-gwy-01-in.mentorg.com ([192.94.38.165]) by esa3.mentor.iphmx.com with ESMTP; 29 Jun 2021 15:42:25 -0800 IronPort-SDR: CBI5l5yAkio0vye0dqQ7eKH08QuL3SsG0uVl+ZvsMOIcuUMaBvvAGwXiAtYsS0otODYuYgtxzj tvYRMGVHzYmR7aIZBy92ulYJqlkNZj8CvrtKKY+BiC0sq02InzNmM1OJYGfHtoPyZzltCt5uwJ pEaj6thiSrgu3ryYWFw97+sW2P/B+L3ulKAkrT5vPgPAP9BJXChQdhJaxqwOwRq8FXyFvgLR0w 4s/wHu13bTs2PoYDzNfWSpKsk0jEbZlzZ8urfbcqNWp9VkQTFP+/y7WHoG/RYbeyzmI4XY5dI1 XqM= From: Julian Brown To: CC: Thomas Schwinge , Jakub Jelinek , Chung-Lin Tang Subject: [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime Date: Tue, 29 Jun 2021 16:42:03 -0700 Message-ID: <560623585b89e577c9e76d8b53939569f49064fa.1624987598.git.julian@codesourcery.com> X-Mailer: git-send-email 2.29.2 In-Reply-To: References: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain X-Originating-IP: [137.202.0.90] X-ClientProxiedBy: SVR-IES-MBX-04.mgc.mentorg.com (139.181.222.4) To SVR-IES-MBX-04.mgc.mentorg.com (139.181.222.4) X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, KAM_DMARC_STATUS, KAM_SHORT, SPF_HELO_PASS, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Jun 2021 23:42:28 -0000 This patch fixes several places in libgomp/target.c where "ephemeral" data (on the stack or in temporary heap locations) may be used as the source of an asynchronous host-to-device copy that may not complete before the host data disappears. Versions of the patch have been posted several times before, but this one (at Chung-Lin Tang's prior suggesion, IIRC) moves all logic into target.c rather than pushing it out to each target plugin. An existing, but flawed, workaround for this problem in the AMD GCN libgomp offloading plugin is currently present on mainline, and was posted for the og9 branch here: https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg00901.html and previous versions of this patch were posted here (for mainline/og9): https://gcc.gnu.org/legacy-ml/gcc-patches/2019-11/msg01482.html https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01026.html This patch exposes a problem with OpenACC profiling support that is fixed by the next patch in the series. The acc_prof-parallel-1.c test is XFAILed for now. Tested with offloading to AMD GCN. OK? Julian 2021-06-29 Julian Brown libgomp/ * libgomp.h (gomp_copy_host2dev): Update prototype. (memcpy_tofrom_device, update_dev_host): Add new argument to gomp_copy_host2dev (false). * plugin/plugin-gcn.c (struct copy_data): Remove free_src field. (copy_data): Don't free free_src. (queue_push_copy): Remove free_src handling. (GOMP_OFFLOAD_dev2dev): Update call to queue_push_copy. (GOMP_OFFLOAD_openacc_async_host2dev): Remove source-data snapshotting. (GOMP_OFFLOAD_openacc_async_dev2host): Update call to queue_push_copy. * target.c (goacc_device_copy_async): Remove. (gomp_copy_host2dev): Add EPHEMERAL parameter. Snapshot source data when true, and set up deferred freeing of temporary buffer. (gomp_copy_dev2host): Inline device-to-host copy handling instead of calling goacc_device_copy_async. (gomp_map_vars_existing): Update calls to gomp_copy_host2dev with appropriate ephemeral argument. (gomp_map_pointer, gomp_attach_pointer, gomp_detach_pointer, gomp_update): Likewise. (gomp_map_vars_internal): Likewise. Don't use coalescing buffer for async copies. * testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c: XFAIL for now. --- libgomp/libgomp.h | 2 +- libgomp/oacc-mem.c | 4 +- libgomp/plugin/plugin-gcn.c | 20 +--- libgomp/target.c | 111 +++++++++++------- .../acc_prof-parallel-1.c | 2 + 5 files changed, 77 insertions(+), 62 deletions(-) diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index 8d25dc8e2a8..e8901da1069 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -1226,7 +1226,7 @@ extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *, struct gomp_coalesce_buf; extern void gomp_copy_host2dev (struct gomp_device_descr *, struct goacc_asyncqueue *, void *, const void *, - size_t, struct gomp_coalesce_buf *); + size_t, bool, struct gomp_coalesce_buf *); extern void gomp_copy_dev2host (struct gomp_device_descr *, struct goacc_asyncqueue *, void *, const void *, size_t); diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c index c21508f3739..5988db0b886 100644 --- a/libgomp/oacc-mem.c +++ b/libgomp/oacc-mem.c @@ -202,7 +202,7 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async, if (from) gomp_copy_dev2host (thr->dev, aq, h, d, s); else - gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL); + gomp_copy_host2dev (thr->dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL); if (profiling_p) { @@ -874,7 +874,7 @@ update_dev_host (int is_dev, void *h, size_t s, int async) goacc_aq aq = get_goacc_asyncqueue (async); if (is_dev) - gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL); + gomp_copy_host2dev (acc_dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL); else gomp_copy_dev2host (acc_dev, aq, h, d, s); diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c index cfed42a2d4d..98da48b77cb 100644 --- a/libgomp/plugin/plugin-gcn.c +++ b/libgomp/plugin/plugin-gcn.c @@ -292,7 +292,6 @@ struct copy_data void *dst; const void *src; size_t len; - bool free_src; struct goacc_asyncqueue *aq; }; @@ -2914,8 +2913,6 @@ copy_data (void *data_) data->aq->agent->device_id, data->aq->id, data->len, data->src, data->dst); hsa_memory_copy_wrapper (data->dst, data->src, data->len); - if (data->free_src) - free ((void *) data->src); free (data); } @@ -2934,7 +2931,7 @@ gomp_offload_free (void *ptr) static void queue_push_copy (struct goacc_asyncqueue *aq, void *dst, const void *src, - size_t len, bool free_src) + size_t len) { if (DEBUG_QUEUES) GCN_DEBUG ("queue_push_copy %d:%d: %zu bytes from (%p) to (%p)\n", @@ -2944,7 +2941,6 @@ queue_push_copy (struct goacc_asyncqueue *aq, void *dst, const void *src, data->dst = dst; data->src = src; data->len = len; - data->free_src = free_src; data->aq = aq; queue_push_callback (aq, copy_data, data); } @@ -3646,7 +3642,7 @@ GOMP_OFFLOAD_dev2dev (int device, void *dst, const void *src, size_t n) { struct agent_info *agent = get_agent_info (device); maybe_init_omp_async (agent); - queue_push_copy (agent->omp_async_queue, dst, src, n, false); + queue_push_copy (agent->omp_async_queue, dst, src, n); return true; } @@ -3916,15 +3912,7 @@ GOMP_OFFLOAD_openacc_async_host2dev (int device, void *dst, const void *src, { struct agent_info *agent = get_agent_info (device); assert (agent == aq->agent); - /* The source data does not necessarily remain live until the deferred - copy happens. Taking a snapshot of the data here avoids reading - uninitialised data later, but means that (a) data is copied twice and - (b) modifications to the copied data between the "spawning" point of - the asynchronous kernel and when it is executed will not be seen. - But, that is probably correct. */ - void *src_copy = GOMP_PLUGIN_malloc (n); - memcpy (src_copy, src, n); - queue_push_copy (aq, dst, src_copy, n, true); + queue_push_copy (aq, dst, src, n); return true; } @@ -3936,7 +3924,7 @@ GOMP_OFFLOAD_openacc_async_dev2host (int device, void *dst, const void *src, { struct agent_info *agent = get_agent_info (device); assert (agent == aq->agent); - queue_push_copy (aq, dst, src, n, false); + queue_push_copy (aq, dst, src, n); return true; } diff --git a/libgomp/target.c b/libgomp/target.c index bb09d501dd6..5e4a80a67e1 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -208,22 +208,6 @@ gomp_device_copy (struct gomp_device_descr *devicep, } } -static inline void -goacc_device_copy_async (struct gomp_device_descr *devicep, - bool (*copy_func) (int, void *, const void *, size_t, - struct goacc_asyncqueue *), - const char *dst, void *dstaddr, - const char *src, const void *srcaddr, - size_t size, struct goacc_asyncqueue *aq) -{ - if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq)) - { - gomp_mutex_unlock (&devicep->lock); - gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed", - src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size); - } -} - /* Infrastructure for coalescing adjacent or nearly adjacent (in device addresses) host to device memory transfers. */ @@ -317,11 +301,18 @@ gomp_to_device_kind_p (int kind) } } +/* Copy host memory to an offload device. In asynchronous mode (if AQ is + non-NULL), when the source data is stack or may otherwise be deallocated + before the asynchronous copy takes place, EPHEMERAL must be passed as + TRUE. The CBUF isn't used for non-ephemeral asynchronous copies, because + the host data might not be computed yet (by an earlier asynchronous compute + region). */ + attribute_hidden void gomp_copy_host2dev (struct gomp_device_descr *devicep, struct goacc_asyncqueue *aq, void *d, const void *h, size_t sz, - struct gomp_coalesce_buf *cbuf) + bool ephemeral, struct gomp_coalesce_buf *cbuf) { if (cbuf) { @@ -349,8 +340,30 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep, } } if (__builtin_expect (aq != NULL, 0)) - goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func, - "dev", d, "host", h, sz, aq); + { + void *srcdata = (void *) h; + if (ephemeral) + { + /* We're queueing up an asynchronous copy from data that may + disappear before the transfer takes place (i.e. because it is a + stack local in a function that is no longer executing). Make a + copy of the data into a temporary buffer in those cases. */ + void *tmpbuf = gomp_malloc (sz); + memcpy (tmpbuf, h, sz); + srcdata = tmpbuf; + } + if (!devicep->openacc.async.host2dev_func (devicep->target_id, d, + srcdata, sz, aq)) + { + gomp_mutex_unlock (&devicep->lock); + gomp_fatal ("Copying of host object [%p..%p) to dev object [%p..%p) " + "failed", h, h + sz, d, d + sz); + } + /* Free any temporary buffer created above once the transfer has + completed. */ + if (srcdata != h) + devicep->openacc.async.queue_callback_func (aq, free, srcdata); + } else gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz); } @@ -361,8 +374,15 @@ gomp_copy_dev2host (struct gomp_device_descr *devicep, void *h, const void *d, size_t sz) { if (__builtin_expect (aq != NULL, 0)) - goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func, - "host", h, "dev", d, sz, aq); + { + if (!devicep->openacc.async.dev2host_func (devicep->target_id, h, d, sz, + aq)) + { + gomp_mutex_unlock (&devicep->lock); + gomp_fatal ("Copying of dev object [%p..%p) to host object [%p..%p) " + "failed", d, d + sz, h, h + sz); + } + } else gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz); } @@ -521,7 +541,7 @@ gomp_map_vars_existing (struct gomp_device_descr *devicep, (void *) (oldn->tgt->tgt_start + oldn->tgt_offset + newn->host_start - oldn->host_start), (void *) newn->host_start, - newn->host_end - newn->host_start, cbuf); + newn->host_end - newn->host_start, false, cbuf); gomp_increment_refcount (oldn, refcount_set); } @@ -548,8 +568,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq, cur_node.tgt_offset = (uintptr_t) NULL; gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset), - (void *) &cur_node.tgt_offset, - sizeof (void *), cbuf); + (void *) &cur_node.tgt_offset, sizeof (void *), + true, cbuf); return; } /* Add bias to the pointer value. */ @@ -569,7 +589,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq, to initialize the pointer with. */ cur_node.tgt_offset -= bias; gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset), - (void *) &cur_node.tgt_offset, sizeof (void *), cbuf); + (void *) &cur_node.tgt_offset, sizeof (void *), true, + cbuf); } static void @@ -702,7 +723,7 @@ gomp_attach_pointer (struct gomp_device_descr *devicep, (void *) (n->tgt->tgt_start + n->tgt_offset), (void *) data); gomp_copy_host2dev (devicep, aq, (void *) devptr, (void *) &data, - sizeof (void *), cbufp); + sizeof (void *), true, cbufp); } else gomp_debug (1, "%s: attach count for %p -> %u\n", __FUNCTION__, @@ -755,7 +776,7 @@ gomp_detach_pointer (struct gomp_device_descr *devicep, (void *) target); gomp_copy_host2dev (devicep, aq, (void *) devptr, (void *) &target, - sizeof (void *), cbufp); + sizeof (void *), true, cbufp); } else gomp_debug (1, "%s: attach count for %p -> %u\n", __FUNCTION__, @@ -927,8 +948,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, for (i = first; i <= last; i++) { tgt->list[i].key = NULL; - if (gomp_to_device_kind_p (get_kind (short_mapkind, kinds, i) - & typemask)) + if (!aq + && gomp_to_device_kind_p (get_kind (short_mapkind, kinds, + i) & typemask)) gomp_coalesce_buf_add (&cbuf, tgt_size - cur_node.host_end + (uintptr_t) hostaddrs[i], @@ -969,8 +991,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, if (tgt_align < align) tgt_align = align; tgt_size = (tgt_size + align - 1) & ~(align - 1); - gomp_coalesce_buf_add (&cbuf, tgt_size, - cur_node.host_end - cur_node.host_start); + if (!aq) + gomp_coalesce_buf_add (&cbuf, tgt_size, + cur_node.host_end - cur_node.host_start); tgt_size += cur_node.host_end - cur_node.host_start; has_firstprivate = true; continue; @@ -1063,7 +1086,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, if (tgt_align < align) tgt_align = align; tgt_size = (tgt_size + align - 1) & ~(align - 1); - if (gomp_to_device_kind_p (kind & typemask)) + if (!aq && gomp_to_device_kind_p (kind & typemask)) gomp_coalesce_buf_add (&cbuf, tgt_size, cur_node.host_end - cur_node.host_start); tgt_size += cur_node.host_end - cur_node.host_start; @@ -1218,7 +1241,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, len = sizes[i]; gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + tgt_size), - (void *) hostaddrs[i], len, cbufp); + (void *) hostaddrs[i], len, false, cbufp); tgt_size += len; continue; case GOMP_MAP_FIRSTPRIVATE_INT: @@ -1307,12 +1330,11 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, if (cur_node.tgt_offset) cur_node.tgt_offset -= sizes[i]; gomp_copy_host2dev (devicep, aq, - (void *) (n->tgt->tgt_start - + n->tgt_offset + (void *) (n->tgt->tgt_start + n->tgt_offset + cur_node.host_start - n->host_start), (void *) &cur_node.tgt_offset, - sizeof (void *), cbufp); + sizeof (void *), true, cbufp); cur_node.tgt_offset = n->tgt->tgt_start + n->tgt_offset + cur_node.host_start - n->host_start; continue; @@ -1450,7 +1472,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, (void *) (tgt->tgt_start + k->tgt_offset), (void *) k->host_start, - k->host_end - k->host_start, cbufp); + k->host_end - k->host_start, false, + cbufp); break; case GOMP_MAP_POINTER: gomp_map_pointer (tgt, aq, @@ -1462,7 +1485,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, (void *) (tgt->tgt_start + k->tgt_offset), (void *) k->host_start, - k->host_end - k->host_start, cbufp); + k->host_end - k->host_start, false, + cbufp); tgt->list[i].has_null_ptr_assoc = false; for (j = i + 1; j < mapnum; j++) @@ -1525,7 +1549,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, (void *) (tgt->tgt_start + k->tgt_offset), (void *) k->host_start, - sizeof (void *), cbufp); + sizeof (void *), false, cbufp); break; default: gomp_mutex_unlock (&devicep->lock); @@ -1541,7 +1565,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, /* We intentionally do not use coalescing here, as it's not data allocated by the current call to this function. */ gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset, - &tgt_addr, sizeof (void *), NULL); + &tgt_addr, sizeof (void *), true, NULL); } array++; } @@ -1556,7 +1580,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + i * sizeof (void *)), (void *) &cur_node.tgt_offset, sizeof (void *), - cbufp); + true, cbufp); } } @@ -1568,7 +1592,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, (void *) (tgt->tgt_start + cbuf.chunks[c].start), (char *) cbuf.buf + (cbuf.chunks[c].start - cbuf.chunks[0].start), - cbuf.chunks[c].end - cbuf.chunks[c].start, NULL); + cbuf.chunks[c].end - cbuf.chunks[c].start, true, + NULL); free (cbuf.buf); cbuf.buf = NULL; cbufp = NULL; @@ -1892,7 +1917,7 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs, if (GOMP_MAP_COPY_TO_P (kind & typemask)) gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size, - NULL); + false, NULL); if (GOMP_MAP_COPY_FROM_P (kind & typemask)) gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size); } diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c index a5e9ab3f936..dc1807c6ce4 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c @@ -1,3 +1,5 @@ +/* { dg-xfail-run-if "Async profiling bug" { *-*-* } } */ + /* Test dispatch of events to callbacks. */ #undef NDEBUG -- 2.29.2