public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
@ 2015-04-15 14:26 Dominique Dhumieres
  0 siblings, 0 replies; 30+ messages in thread
From: Dominique Dhumieres @ 2015-04-15 14:26 UTC (permalink / raw)
  To: julian; +Cc: gcc-patches, jakub, iverbin

> (Not asking for review just yet, JFYI.)

This is not a review! but the patch fixes PR65742.

Dominique

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-08 14:59                                                         ` Ilya Verbin
  2015-04-08 16:14                                                           ` Julian Brown
@ 2015-04-14 14:15                                                           ` Julian Brown
  1 sibling, 0 replies; 30+ messages in thread
From: Julian Brown @ 2015-04-14 14:15 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Jakub Jelinek, Thomas Schwinge, gcc-patches, Kirill Yukhin

[-- Attachment #1: Type: text/plain, Size: 1944 bytes --]

On Wed, 8 Apr 2015 17:58:56 +0300
Ilya Verbin <iverbin@gmail.com> wrote:

> On Wed, Apr 08, 2015 at 15:31:42 +0100, Julian Brown wrote:
> > This version is mostly the same as the last posted version but has a
> > tweak in GOACC_parallel to account for the new splay tree
> > arrangement for target functions:
> > 
> > -      tgt_fn = (void (*)) tgt_fn_key->tgt->tgt_start;
> > +      tgt_fn = (void (*)) tgt_fn_key->tgt_offset;
> > 
> > Have there been any other changes I might have missed?
> 
> No.
> 
> > It passes libgomp testing on NVPTX. OK?
> 
> Have you tested it with disabled offloading?
> 
> I see several regressions:
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c
> -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c
> -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test

I think there may be multiple issues here. The attached patch addresses
one -- acc_device_type not distinguishing between "offloaded" and host
code with the host_nonshm plugin.

The other problem is that it appears that the ACC_DEVICE_TYPE
environment variable is not getting set properly on the target for (any
of) the OpenACC tests: this means a lot of the time the "wrong" plugin
is being tested, and means that the above tests (and several others)
still fail. That will apparently need some more engineering (on our
part).

(Not asking for review just yet, JFYI.)

Julian

ChangeLog

    libgomp/
    * oacc-init.c (acc_on_device): Check whether we're in an offloaded
    region for host_nonshm plugin.
    * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_parallel): Set
    nonshm_exec flag in thread-local data.
    (GOMP_OFFLOAD_openacc_create_thread_data): Allocate thread-local
    data for host_nonshm plugin.
    (+GOMP_OFFLOAD_openacc_destroy_thread_data): Free thread-local data
    for host_nonshm plugin.
    * plugin/plugin-host.h: New.

[-- Attachment #2: nonshm-acc-on-device-1.diff --]
[-- Type: text/x-patch, Size: 3949 bytes --]

Index: libgomp/oacc-init.c
===================================================================
--- libgomp/oacc-init.c	(revision 221922)
+++ libgomp/oacc-init.c	(working copy)
@@ -29,6 +29,7 @@
 #include "libgomp.h"
 #include "oacc-int.h"
 #include "openacc.h"
+#include "plugin/plugin-host.h"
 #include <assert.h>
 #include <stdlib.h>
 #include <strings.h>
@@ -548,7 +549,14 @@ ialias (acc_set_device_num)
 int
 acc_on_device (acc_device_t dev)
 {
-  if (acc_get_device_type () == acc_device_host_nonshm)
+  struct goacc_thread *thr = goacc_thread ();
+
+  /* We only want to appear to be the "host_nonshm" plugin from "offloaded"
+     code -- i.e. within a parallel region.  Test a flag set by the
+     openacc_parallel hook of the host_nonshm plugin to determine that.  */
+  if (acc_get_device_type () == acc_device_host_nonshm
+      && thr && thr->target_tls
+      && ((struct nonshm_thread *)thr->target_tls)->nonshm_exec)
     return dev == acc_device_host_nonshm || dev == acc_device_not_host;
 
   /* Just rely on the compiler builtin.  */
Index: libgomp/plugin/plugin-host.c
===================================================================
--- libgomp/plugin/plugin-host.c	(revision 221922)
+++ libgomp/plugin/plugin-host.c	(working copy)
@@ -44,6 +44,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
+#include <stdbool.h>
 
 #ifdef HOST_NONSHM_PLUGIN
 #define STATIC
@@ -55,6 +56,10 @@
 #define SELF "host: "
 #endif
 
+#ifdef HOST_NONSHM_PLUGIN
+#include "plugin-host.h"
+#endif
+
 STATIC const char *
 GOMP_OFFLOAD_get_name (void)
 {
@@ -174,7 +179,10 @@ GOMP_OFFLOAD_openacc_parallel (void (*fn
 			       void *targ_mem_desc __attribute__ ((unused)))
 {
 #ifdef HOST_NONSHM_PLUGIN
+  struct nonshm_thread *thd = GOMP_PLUGIN_acc_thread ();
+  thd->nonshm_exec = true;
   fn (devaddrs);
+  thd->nonshm_exec = false;
 #else
   fn (hostaddrs);
 #endif
@@ -232,11 +240,20 @@ STATIC void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord
 					 __attribute__ ((unused)))
 {
+#ifdef HOST_NONSHM_PLUGIN
+  struct nonshm_thread *thd
+    = GOMP_PLUGIN_malloc (sizeof (struct nonshm_thread));
+  thd->nonshm_exec = false;
+  return thd;
+#else
   return NULL;
+#endif
 }
 
 STATIC void
-GOMP_OFFLOAD_openacc_destroy_thread_data (void *tls_data
-					  __attribute__ ((unused)))
+GOMP_OFFLOAD_openacc_destroy_thread_data (void *tls_data)
 {
+#ifdef HOST_NONSHM_PLUGIN
+  free (tls_data);
+#endif
 }
Index: libgomp/plugin/plugin-host.h
===================================================================
--- libgomp/plugin/plugin-host.h	(revision 0)
+++ libgomp/plugin/plugin-host.h	(revision 0)
@@ -0,0 +1,37 @@
+/* OpenACC Runtime Library: acc_device_host, acc_device_host_nonshm.
+
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+   Contributed by Mentor Embedded.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef PLUGIN_HOST_H
+#define PLUGIN_HOST_H
+
+struct nonshm_thread
+{
+  bool nonshm_exec;
+};
+
+#endif

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-08 14:59                                                         ` Ilya Verbin
@ 2015-04-08 16:14                                                           ` Julian Brown
  2015-04-14 14:15                                                           ` Julian Brown
  1 sibling, 0 replies; 30+ messages in thread
From: Julian Brown @ 2015-04-08 16:14 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Jakub Jelinek, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Wed, 8 Apr 2015 17:58:56 +0300
Ilya Verbin <iverbin@gmail.com> wrote:

> Have you tested it with disabled offloading?
> 
> I see several regressions:
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c
> -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c
> -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test

No -- thanks for the note. I've committed the patch now, but I'll try
to get to looking at these in the next day or two (it's probably
something relatively minor, I guess).

Julian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-08 14:32                                                       ` Julian Brown
  2015-04-08 14:34                                                         ` Jakub Jelinek
@ 2015-04-08 14:59                                                         ` Ilya Verbin
  2015-04-08 16:14                                                           ` Julian Brown
  2015-04-14 14:15                                                           ` Julian Brown
  1 sibling, 2 replies; 30+ messages in thread
From: Ilya Verbin @ 2015-04-08 14:59 UTC (permalink / raw)
  To: Julian Brown; +Cc: Jakub Jelinek, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Wed, Apr 08, 2015 at 15:31:42 +0100, Julian Brown wrote:
> This version is mostly the same as the last posted version but has a
> tweak in GOACC_parallel to account for the new splay tree arrangement
> for target functions:
> 
> -      tgt_fn = (void (*)) tgt_fn_key->tgt->tgt_start;
> +      tgt_fn = (void (*)) tgt_fn_key->tgt_offset;
> 
> Have there been any other changes I might have missed?

No.

> It passes libgomp testing on NVPTX. OK?

Have you tested it with disabled offloading?

I see several regressions:
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test

  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-08 14:32                                                       ` Julian Brown
@ 2015-04-08 14:34                                                         ` Jakub Jelinek
  2015-04-08 14:59                                                         ` Ilya Verbin
  1 sibling, 0 replies; 30+ messages in thread
From: Jakub Jelinek @ 2015-04-08 14:34 UTC (permalink / raw)
  To: Julian Brown; +Cc: Ilya Verbin, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Wed, Apr 08, 2015 at 03:31:42PM +0100, Julian Brown wrote:
> It passes libgomp testing on NVPTX. OK?

Please write a proper ChangeLog entry for it.
Ok with that.

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-07 15:26                                                     ` Jakub Jelinek
@ 2015-04-08 14:32                                                       ` Julian Brown
  2015-04-08 14:34                                                         ` Jakub Jelinek
  2015-04-08 14:59                                                         ` Ilya Verbin
  0 siblings, 2 replies; 30+ messages in thread
From: Julian Brown @ 2015-04-08 14:32 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Ilya Verbin, Thomas Schwinge, gcc-patches, Kirill Yukhin

[-- Attachment #1: Type: text/plain, Size: 927 bytes --]

On Tue, 7 Apr 2015 17:26:45 +0200
Jakub Jelinek <jakub@redhat.com> wrote:

> On Mon, Apr 06, 2015 at 03:45:57PM +0300, Ilya Verbin wrote:
> > On Wed, Apr 01, 2015 at 15:20:25 +0200, Jakub Jelinek wrote:
> > > LGTM with proper ChangeLog entry.
> > 
> > I've commited this patch into trunk.
> > 
> > Julian, you probably want to update the nvptx plugin.
> 
> Note that as the number of P1s without posted fixes is now zero, it is
> likely RC1 will be done this week, so if you want nvptx working in
> GCC 5, please post a fix as soon as possible.

This version is mostly the same as the last posted version but has a
tweak in GOACC_parallel to account for the new splay tree arrangement
for target functions:

-      tgt_fn = (void (*)) tgt_fn_key->tgt->tgt_start;
+      tgt_fn = (void (*)) tgt_fn_key->tgt_offset;

Have there been any other changes I might have missed?

It passes libgomp testing on NVPTX. OK?

Thanks,

Julian

[-- Attachment #2: nvptx-load-unload-6.diff --]
[-- Type: text/x-patch, Size: 47121 bytes --]

commit ac06b5e25e170061bb9855b9ea4b8e5696816bf1
Author: Julian Brown <julian@codesourcery.com>
Date:   Tue Apr 7 09:23:58 2015 -0700

    NVPTX load/unload and init-rework patch.

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index 02c44b6..dbc68bc 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -839,6 +839,7 @@ process (FILE *in, FILE *out)
 {
   const char *input = read_file (in);
   Token *tok = tokenize (input);
+  unsigned int nvars = 0, nfuncs = 0;
 
   do
     tok = parse_file (tok);
@@ -850,16 +851,17 @@ process (FILE *in, FILE *out)
   write_stmts (out, rev_stmts (fns));
   fprintf (out, ";\n\n");
   fprintf (out, "static const char *var_mappings[] = {\n");
-  for (id_map *id = var_ids; id; id = id->next)
+  for (id_map *id = var_ids; id; id = id->next, nvars++)
     fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
   fprintf (out, "};\n\n");
   fprintf (out, "static const char *func_mappings[] = {\n");
-  for (id_map *id = func_ids; id; id = id->next)
+  for (id_map *id = func_ids; id; id = id->next, nfuncs++)
     fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
   fprintf (out, "};\n\n");
 
   fprintf (out, "static const void *target_data[] = {\n");
-  fprintf (out, "  ptx_code, var_mappings, func_mappings\n");
+  fprintf (out, "  ptx_code, (void*) %u, var_mappings, (void*) %u, "
+		"func_mappings\n", nvars, nfuncs);
   fprintf (out, "};\n\n");
 
   fprintf (out, "extern void GOMP_offload_register (const void *, int, void *);\n");
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a1d42c5..5272f01 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -655,9 +655,6 @@ struct target_mem_desc {
   /* Corresponding target device descriptor.  */
   struct gomp_device_descr *device_descr;
 
-  /* Memory mapping info for the thread that created this descriptor.  */
-  struct splay_tree_s *mem_map;
-
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
   splay_tree_key list[];
@@ -691,18 +688,6 @@ typedef struct acc_dispatch_t
   /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
   struct target_mem_desc *data_environ;
 
-  /* Extra information required for a device instance by a given target.  */
-  /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
-  void *target_data;
-
-  /* Open or close a device instance.  */
-  void *(*open_device_func) (int n);
-  int (*close_device_func) (void *h);
-
-  /* Set or get the device number.  */
-  int (*get_device_num_func) (void);
-  void (*set_device_num_func) (int);
-
   /* Execute.  */
   void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *,
 		     unsigned short *, int, int, int, int, void *);
@@ -720,7 +705,7 @@ typedef struct acc_dispatch_t
   void (*async_set_async_func) (int);
 
   /* Create/destroy TLS data.  */
-  void *(*create_thread_data_func) (void *);
+  void *(*create_thread_data_func) (int);
   void (*destroy_thread_data_func) (void *);
 
   /* NVIDIA target specific routines.  */
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 08b7c5e..1f5827e 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -26,7 +26,7 @@
    see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
    <http://www.gnu.org/licenses/>.  */
 
-
+#include <assert.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
@@ -37,13 +37,23 @@ acc_async_test (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  return base_dev->openacc.async_test_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev->openacc.async_test_func (async);
 }
 
 int
 acc_async_test_all (void)
 {
-  return base_dev->openacc.async_test_all_func ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev->openacc.async_test_all_func ();
 }
 
 void
@@ -52,19 +62,34 @@ acc_wait (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  base_dev->openacc.async_wait_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_func (async);
 }
 
 void
 acc_wait_async (int async1, int async2)
 {
-  base_dev->openacc.async_wait_async_func (async1, async2);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_async_func (async1, async2);
 }
 
 void
 acc_wait_all (void)
 {
-  base_dev->openacc.async_wait_all_func ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_all_func ();
 }
 
 void
@@ -73,5 +98,10 @@ acc_wait_all_async (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  base_dev->openacc.async_wait_all_async_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_all_async_func (async);
 }
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index c8ef376..4aab422 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -34,51 +34,53 @@
 void *
 acc_get_current_cuda_device (void)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
-    p = base_dev->openacc.cuda.get_current_device_func ();
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
+    return thr->dev->openacc.cuda.get_current_device_func ();
 
-  return p;
+  return NULL;
 }
 
 void *
 acc_get_current_cuda_context (void)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev && base_dev->openacc.cuda.get_current_context_func)
-    p = base_dev->openacc.cuda.get_current_context_func ();
-
-  return p;
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_context_func)
+    return thr->dev->openacc.cuda.get_current_context_func ();
+ 
+  return NULL;
 }
 
 void *
 acc_get_cuda_stream (int async)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
   if (async < 0)
-    return p;
-
-  if (base_dev && base_dev->openacc.cuda.get_stream_func)
-    p = base_dev->openacc.cuda.get_stream_func (async);
+    return NULL;
 
-  return p;
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
+    return thr->dev->openacc.cuda.get_stream_func (async);
+ 
+  return NULL;
 }
 
 int
 acc_set_cuda_stream (int async, void *stream)
 {
-  int s = -1;
+  struct goacc_thread *thr;
 
   if (async < 0 || stream == NULL)
     return 0;
 
   goacc_lazy_initialize ();
 
-  if (base_dev && base_dev->openacc.cuda.set_stream_func)
-    s = base_dev->openacc.cuda.set_stream_func (async, stream);
+  thr = goacc_thread ();
+
+  if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
+    return thr->dev->openacc.cuda.set_stream_func (async, stream);
 
-  return s;
+  return -1;
 }
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index e4756b6..6dcdbf3 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -53,16 +53,9 @@ static struct gomp_device_descr host_dispatch =
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.root = NULL,
     .is_initialized = false,
 
     .openacc = {
-      .open_device_func = GOMP_OFFLOAD_openacc_open_device,
-      .close_device_func = GOMP_OFFLOAD_openacc_close_device,
-
-      .get_device_num_func = GOMP_OFFLOAD_openacc_get_device_num,
-      .set_device_num_func = GOMP_OFFLOAD_openacc_set_device_num,
-
       .exec_func = GOMP_OFFLOAD_openacc_parallel,
 
       .register_async_cleanup_func
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 1e0243e..dc40fb6 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -37,14 +37,13 @@
 
 static gomp_mutex_t acc_device_lock;
 
-/* The dispatch table for the current accelerator device.  This is global, so
-   you can only have one type of device open at any given time in a program.
-   This is the "base" device in that several devices that use the same
-   dispatch table may be active concurrently: this one (the "zeroth") is used
-   for overall initialisation/shutdown, and other instances -- not necessarily
-   including this one -- may be opened and closed once the base device has
-   been initialized.  */
-struct gomp_device_descr *base_dev;
+/* A cached version of the dispatcher for the global "current" accelerator type,
+   e.g. used as the default when creating new host threads.  This is the
+   device-type equivalent of goacc_device_num (which specifies which device to
+   use out of potentially several of the same type).  If there are several
+   devices of a given type, this points at the first one.  */
+
+static struct gomp_device_descr *cached_base_dev = NULL;
 
 #if defined HAVE_TLS || defined USE_EMUTLS
 __thread struct goacc_thread *goacc_tls_data;
@@ -53,9 +52,6 @@ pthread_key_t goacc_tls_key;
 #endif
 static pthread_key_t goacc_cleanup_key;
 
-/* Current dispatcher, and how it was initialized */
-static acc_device_t init_key = _ACC_device_hwm;
-
 static struct goacc_thread *goacc_threads;
 static gomp_mutex_t goacc_thread_lock;
 
@@ -94,6 +90,21 @@ get_openacc_name (const char *name)
     return name;
 }
 
+static const char *
+name_of_acc_device_t (enum acc_device_t type)
+{
+  switch (type)
+    {
+    case acc_device_none: return "none";
+    case acc_device_default: return "default";
+    case acc_device_host: return "host";
+    case acc_device_host_nonshm: return "host_nonshm";
+    case acc_device_not_host: return "not_host";
+    case acc_device_nvidia: return "nvidia";
+    default: gomp_fatal ("unknown device type %u", (unsigned) type);
+    }
+}
+
 static struct gomp_device_descr *
 resolve_device (acc_device_t d)
 {
@@ -159,22 +170,87 @@ resolve_device (acc_device_t d)
 static struct gomp_device_descr *
 acc_init_1 (acc_device_t d)
 {
-  struct gomp_device_descr *acc_dev;
+  struct gomp_device_descr *base_dev, *acc_dev;
+  int ndevs;
 
-  acc_dev = resolve_device (d);
+  base_dev = resolve_device (d);
+
+  ndevs = base_dev->get_num_devices_func ();
+
+  if (!base_dev || ndevs <= 0 || goacc_device_num >= ndevs)
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
 
-  if (!acc_dev || acc_dev->get_num_devices_func () <= 0)
-    gomp_fatal ("device %u not supported", (unsigned)d);
+  acc_dev = &base_dev[goacc_device_num];
 
   if (acc_dev->is_initialized)
     gomp_fatal ("device already active");
 
-  /* We need to remember what we were intialized as, to check shutdown etc.  */
-  init_key = d;
-
   gomp_init_device (acc_dev);
 
-  return acc_dev;
+  return base_dev;
+}
+
+static void
+acc_shutdown_1 (acc_device_t d)
+{
+  struct gomp_device_descr *base_dev;
+  struct goacc_thread *walk;
+  int ndevs, i;
+  bool devices_active = false;
+
+  /* Get the base device for this device type.  */
+  base_dev = resolve_device (d);
+
+  if (!base_dev)
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
+
+  gomp_mutex_lock (&goacc_thread_lock);
+
+  /* Free target-specific TLS data and close all devices.  */
+  for (walk = goacc_threads; walk != NULL; walk = walk->next)
+    {
+      if (walk->target_tls)
+	base_dev->openacc.destroy_thread_data_func (walk->target_tls);
+
+      walk->target_tls = NULL;
+
+      /* This would mean the user is shutting down OpenACC in the middle of an
+         "acc data" pragma.  Likely not intentional.  */
+      if (walk->mapped_data)
+	gomp_fatal ("shutdown in 'acc data' region");
+
+      /* Similarly, if this happens then user code has done something weird.  */
+      if (walk->saved_bound_dev)
+        gomp_fatal ("shutdown during host fallback");
+
+      if (walk->dev)
+	{
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (&walk->dev->mem_map);
+	  gomp_mutex_unlock (&walk->dev->lock);
+
+	  walk->dev = NULL;
+	  walk->base_dev = NULL;
+	}
+    }
+
+  gomp_mutex_unlock (&goacc_thread_lock);
+
+  ndevs = base_dev->get_num_devices_func ();
+
+  /* Close all the devices of this type that have been opened.  */
+  for (i = 0; i < ndevs; i++)
+    {
+      struct gomp_device_descr *acc_dev = &base_dev[i];
+      if (acc_dev->is_initialized)
+        {
+	  devices_active = true;
+	  gomp_fini_device (acc_dev);
+	}
+    }
+
+  if (!devices_active)
+    gomp_fatal ("no device initialized");
 }
 
 static struct goacc_thread *
@@ -207,9 +283,11 @@ goacc_destroy_thread (void *data)
 
   if (thr)
     {
-      if (base_dev && thr->target_tls)
+      struct gomp_device_descr *acc_dev = thr->dev;
+
+      if (acc_dev && thr->target_tls)
 	{
-	  base_dev->openacc.destroy_thread_data_func (thr->target_tls);
+	  acc_dev->openacc.destroy_thread_data_func (thr->target_tls);
 	  thr->target_tls = NULL;
 	}
 
@@ -236,53 +314,49 @@ goacc_destroy_thread (void *data)
   gomp_mutex_unlock (&goacc_thread_lock);
 }
 
-/* Open the ORD'th device of the currently-active type (base_dev must be
-   initialised before calling).  If ORD is < 0, open the default-numbered
-   device (set by the ACC_DEVICE_NUM environment variable or a call to
-   acc_set_device_num), or leave any currently-opened device as is.  "Opening"
-   consists of calling the device's open_device_func hook, and setting up
-   thread-local data (maybe allocating, then initializing with information
-   pertaining to the newly-opened or previously-opened device).  */
+/* Use the ORD'th device instance for the current host thread (or -1 for the
+   current global default).  The device (and the runtime) must be initialised
+   before calling this function.  */
 
-static void
-lazy_open (int ord)
+void
+goacc_attach_host_thread_to_device (int ord)
 {
   struct goacc_thread *thr = goacc_thread ();
-  struct gomp_device_descr *acc_dev;
-
-  if (thr && thr->dev)
-    {
-      assert (ord < 0 || ord == thr->dev->target_id);
-      return;
-    }
-
-  assert (base_dev);
-
+  struct gomp_device_descr *acc_dev = NULL, *base_dev = NULL;
+  int num_devices;
+  
+  if (thr && thr->dev && (thr->dev->target_id == ord || ord < 0))
+    return;
+  
   if (ord < 0)
     ord = goacc_device_num;
-
-  /* The OpenACC 2.0 spec leaves the runtime's behaviour when an out-of-range
-     device is requested as implementation-defined (4.2 ACC_DEVICE_NUM).
-     We choose to raise an error in such a case.  */
-  if (ord >= base_dev->get_num_devices_func ())
-    gomp_fatal ("device %u does not exist", ord);
-
+  
+  /* Decide which type of device to use.  If the current thread has a device
+     type already (e.g. set by acc_set_device_type), use that, else use the
+     global default.  */
+  if (thr && thr->base_dev)
+    base_dev = thr->base_dev;
+  else
+    {
+      assert (cached_base_dev);
+      base_dev = cached_base_dev;
+    }
+  
+  num_devices = base_dev->get_num_devices_func ();
+  if (num_devices <= 0 || ord >= num_devices)
+    gomp_fatal ("device %u out of range", ord);
+  
   if (!thr)
     thr = goacc_new_thread ();
-
-  acc_dev = thr->dev = &base_dev[ord];
-
-  assert (acc_dev->target_id == ord);
-
+  
+  thr->base_dev = base_dev;
+  thr->dev = acc_dev = &base_dev[ord];
   thr->saved_bound_dev = NULL;
   thr->mapped_data = NULL;
-
-  if (!acc_dev->openacc.target_data)
-    acc_dev->openacc.target_data = acc_dev->openacc.open_device_func (ord);
-
+  
   thr->target_tls
-    = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
-
+    = acc_dev->openacc.create_thread_data_func (ord);
+  
   acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
@@ -292,74 +366,20 @@ lazy_open (int ord)
 void
 acc_init (acc_device_t d)
 {
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
 
-  base_dev = acc_init_1 (d);
-
-  lazy_open (-1);
+  cached_base_dev = acc_init_1 (d);
 
   gomp_mutex_unlock (&acc_device_lock);
+  
+  goacc_attach_host_thread_to_device (-1);
 }
 
 ialias (acc_init)
 
-static void
-acc_shutdown_1 (acc_device_t d)
-{
-  struct goacc_thread *walk;
-
-  /* We don't check whether d matches the actual device found, because
-     OpenACC 2.0 (3.2.12) says the parameters to the init and this
-     call must match (for the shutdown call anyway, it's silent on
-     others).  */
-
-  if (!base_dev)
-    gomp_fatal ("no device initialized");
-  if (d != init_key)
-    gomp_fatal ("device %u(%u) is initialized",
-		(unsigned) init_key, (unsigned) base_dev->type);
-
-  gomp_mutex_lock (&goacc_thread_lock);
-
-  /* Free target-specific TLS data and close all devices.  */
-  for (walk = goacc_threads; walk != NULL; walk = walk->next)
-    {
-      if (walk->target_tls)
-	base_dev->openacc.destroy_thread_data_func (walk->target_tls);
-
-      walk->target_tls = NULL;
-
-      /* This would mean the user is shutting down OpenACC in the middle of an
-         "acc data" pragma.  Likely not intentional.  */
-      if (walk->mapped_data)
-	gomp_fatal ("shutdown in 'acc data' region");
-
-      if (walk->dev)
-	{
-	  void *target_data = walk->dev->openacc.target_data;
-	  if (walk->dev->openacc.close_device_func (target_data) < 0)
-	    gomp_fatal ("failed to close device");
-
-	  walk->dev->openacc.target_data = target_data = NULL;
-
-	  gomp_mutex_lock (&walk->dev->lock);
-	  gomp_free_memmap (&walk->dev->mem_map);
-	  gomp_mutex_unlock (&walk->dev->lock);
-
-	  walk->dev = NULL;
-	}
-    }
-
-  gomp_mutex_unlock (&goacc_thread_lock);
-
-  gomp_fini_device (base_dev);
-
-  base_dev = NULL;
-}
-
 void
 acc_shutdown (acc_device_t d)
 {
@@ -372,59 +392,16 @@ acc_shutdown (acc_device_t d)
 
 ialias (acc_shutdown)
 
-/* This function is called after plugins have been initialized.  It deals with
-   the "base" device, and is used to prepare the runtime for dealing with a
-   number of such devices (as implemented by some particular plugin).  If the
-   argument device type D matches a previous call to the function, return the
-   current base device, else shut the old device down and re-initialize with
-   the new device type.  */
-
-static struct gomp_device_descr *
-lazy_init (acc_device_t d)
-{
-  if (base_dev)
-    {
-      /* Re-initializing the same device, do nothing.  */
-      if (d == init_key)
-	return base_dev;
-
-      acc_shutdown_1 (init_key);
-    }
-
-  assert (!base_dev);
-
-  return acc_init_1 (d);
-}
-
-/* Ensure that plugins are loaded, initialize and open the (default-numbered)
-   device.  */
-
-static void
-lazy_init_and_open (acc_device_t d)
-{
-  if (!base_dev)
-    gomp_init_targets_once ();
-
-  gomp_mutex_lock (&acc_device_lock);
-
-  base_dev = lazy_init (d);
-
-  lazy_open (-1);
-
-  gomp_mutex_unlock (&acc_device_lock);
-}
-
 int
 acc_get_num_devices (acc_device_t d)
 {
   int n = 0;
-  const struct gomp_device_descr *acc_dev;
+  struct gomp_device_descr *acc_dev;
 
   if (d == acc_device_none)
     return 0;
 
-  if (!base_dev)
-    gomp_init_targets_once ();
+  gomp_init_targets_once ();
 
   acc_dev = resolve_device (d);
   if (!acc_dev)
@@ -439,10 +416,39 @@ acc_get_num_devices (acc_device_t d)
 
 ialias (acc_get_num_devices)
 
+/* Set the device type for the current thread only (using the current global
+   default device number), initialising that device if necessary.  Also set the
+   default device type for new threads to D.  */
+
 void
 acc_set_device_type (acc_device_t d)
 {
-  lazy_init_and_open (d);
+  struct gomp_device_descr *base_dev, *acc_dev;
+  struct goacc_thread *thr = goacc_thread ();
+
+  gomp_mutex_lock (&acc_device_lock);
+
+  if (!cached_base_dev)
+    gomp_init_targets_once ();
+
+  cached_base_dev = base_dev = resolve_device (d);
+  acc_dev = &base_dev[goacc_device_num];
+
+  if (!acc_dev->is_initialized)
+    gomp_init_device (acc_dev);
+
+  gomp_mutex_unlock (&acc_device_lock);
+
+  /* We're changing device type: invalidate the current thread's dev and
+     base_dev pointers.  */
+  if (thr && thr->base_dev != base_dev)
+    {
+      thr->base_dev = thr->dev = NULL;
+      if (thr->mapped_data)
+        gomp_fatal ("acc_set_device_type in 'acc data' region");
+    }
+
+  goacc_attach_host_thread_to_device (-1);
 }
 
 ialias (acc_set_device_type)
@@ -451,10 +457,11 @@ acc_device_t
 acc_get_device_type (void)
 {
   acc_device_t res = acc_device_none;
-  const struct gomp_device_descr *dev;
+  struct gomp_device_descr *dev;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev)
-    res = acc_device_type (base_dev->type);
+  if (thr && thr->base_dev)
+    res = acc_device_type (thr->base_dev->type);
   else
     {
       gomp_init_targets_once ();
@@ -475,78 +482,65 @@ int
 acc_get_device_num (acc_device_t d)
 {
   const struct gomp_device_descr *dev;
-  int num;
+  struct goacc_thread *thr = goacc_thread ();
 
   if (d >= _ACC_device_hwm)
     gomp_fatal ("device %u out of range", (unsigned)d);
 
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
   dev = resolve_device (d);
   if (!dev)
-    gomp_fatal ("no devices of type %u", d);
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
 
-  /* We might not have called lazy_open for this host thread yet, in which case
-     the get_device_num_func hook will return -1.  */
-  num = dev->openacc.get_device_num_func ();
-  if (num < 0)
-    num = goacc_device_num;
+  if (thr && thr->base_dev == dev && thr->dev)
+    return thr->dev->target_id;
 
-  return num;
+  return goacc_device_num;
 }
 
 ialias (acc_get_device_num)
 
 void
-acc_set_device_num (int n, acc_device_t d)
+acc_set_device_num (int ord, acc_device_t d)
 {
-  const struct gomp_device_descr *dev;
+  struct gomp_device_descr *base_dev, *acc_dev;
   int num_devices;
 
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
-  if ((int) d == 0)
-    {
-      int i;
-
-      /* A device setting of zero sets all device types on the system to use
-         the Nth instance of that device type.  Only attempt it for initialized
-	 devices though.  */
-      for (i = acc_device_not_host + 1; i < _ACC_device_hwm; i++)
-        {
-	  dev = resolve_device (d);
-	  if (dev && dev->is_initialized)
-	    dev->openacc.set_device_num_func (n);
-	}
+  if (ord < 0)
+    ord = goacc_device_num;
 
-      /* ...and for future calls to acc_init/acc_set_device_type, etc.  */
-      goacc_device_num = n;
-    }
+  if ((int) d == 0)
+    /* Set whatever device is being used by the current host thread to use
+       device instance ORD.  It's unclear if this is supposed to affect other
+       host threads too (OpenACC 2.0 (3.2.4) acc_set_device_num).  */
+    goacc_attach_host_thread_to_device (ord);
   else
     {
-      struct goacc_thread *thr = goacc_thread ();
-
       gomp_mutex_lock (&acc_device_lock);
 
-      base_dev = lazy_init (d);
+      cached_base_dev = base_dev = resolve_device (d);
 
       num_devices = base_dev->get_num_devices_func ();
 
-      if (n >= num_devices)
-        gomp_fatal ("device %u out of range", n);
+      if (ord >= num_devices)
+        gomp_fatal ("device %u out of range", ord);
 
-      /* If we're changing the device number, de-associate this thread with
-	 the device (but don't close the device, since it may be in use by
-	 other threads).  */
-      if (thr && thr->dev && n != thr->dev->target_id)
-	thr->dev = NULL;
+      acc_dev = &base_dev[ord];
 
-      lazy_open (n);
+      if (!acc_dev->is_initialized)
+        gomp_init_device (acc_dev);
 
       gomp_mutex_unlock (&acc_device_lock);
+
+      goacc_attach_host_thread_to_device (ord);
     }
+  
+  goacc_device_num = ord;
 }
 
 ialias (acc_set_device_num)
@@ -554,10 +548,7 @@ ialias (acc_set_device_num)
 int
 acc_on_device (acc_device_t dev)
 {
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (thr && thr->dev
-      && acc_device_type (thr->dev->type) == acc_device_host_nonshm)
+  if (acc_get_device_type () == acc_device_host_nonshm)
     return dev == acc_device_host_nonshm || dev == acc_device_not_host;
 
   /* Just rely on the compiler builtin.  */
@@ -577,7 +568,7 @@ goacc_runtime_initialize (void)
 
   pthread_key_create (&goacc_cleanup_key, goacc_destroy_thread);
 
-  base_dev = NULL;
+  cached_base_dev = NULL;
 
   goacc_threads = NULL;
   gomp_mutex_init (&goacc_thread_lock);
@@ -606,9 +597,8 @@ goacc_restore_bind (void)
 }
 
 /* This is called from any OpenACC support function that may need to implicitly
-   initialize the libgomp runtime.  On exit all such initialization will have
-   been done, and both the global ACC_dev and the per-host-thread ACC_memmap
-   pointers will be valid.  */
+   initialize the libgomp runtime, either globally or from a new host thread. 
+   On exit "goacc_thread" will return a valid & populated thread block.  */
 
 attribute_hidden void
 goacc_lazy_initialize (void)
@@ -618,12 +608,8 @@ goacc_lazy_initialize (void)
   if (thr && thr->dev)
     return;
 
-  if (!base_dev)
-    lazy_init_and_open (acc_device_default);
+  if (!cached_base_dev)
+    acc_init (acc_device_default);
   else
-    {
-      gomp_mutex_lock (&acc_device_lock);
-      lazy_open (-1);
-      gomp_mutex_unlock (&acc_device_lock);
-    }
+    goacc_attach_host_thread_to_device (-1);
 }
diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h
index 85619c8..0ace737 100644
--- a/libgomp/oacc-int.h
+++ b/libgomp/oacc-int.h
@@ -56,6 +56,9 @@ acc_device_type (enum offload_target_type type)
 
 struct goacc_thread
 {
+  /* The base device for the current thread.  */
+  struct gomp_device_descr *base_dev;
+
   /* The device for the current thread.  */
   struct gomp_device_descr *dev;
 
@@ -89,10 +92,7 @@ goacc_thread (void)
 #endif
 
 void goacc_register (struct gomp_device_descr *) __GOACC_NOTHROW;
-
-/* Current dispatcher.  */
-extern struct gomp_device_descr *base_dev;
-
+void goacc_attach_host_thread_to_device (int);
 void goacc_runtime_initialize (void);
 void goacc_save_and_set_bind (acc_device_t);
 void goacc_restore_bind (void);
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index fdc82e6..89ef5fc 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -107,7 +107,9 @@ acc_malloc (size_t s)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  return base_dev->alloc_func (thr->dev->target_id, s);
+  assert (thr->dev);
+
+  return thr->dev->alloc_func (thr->dev->target_id, s);
 }
 
 /* OpenACC 2.0a (3.2.16) doesn't specify what to do in the event
@@ -122,6 +124,8 @@ acc_free (void *d)
   if (!d)
     return;
 
+  assert (thr && thr->dev);
+
   /* We don't have to call lazy open here, as the ptr value must have
      been returned by acc_malloc.  It's not permitted to pass NULL in
      (unless you got that null from acc_malloc).  */
@@ -134,7 +138,7 @@ acc_free (void *d)
      acc_unmap_data ((void *)(k->host_start + offset));
    }
 
-  base_dev->free_func (thr->dev->target_id, d);
+  thr->dev->free_func (thr->dev->target_id, d);
 }
 
 void
@@ -144,7 +148,9 @@ acc_memcpy_to_device (void *d, void *h, size_t s)
      been obtained from a routine that did that.  */
   struct goacc_thread *thr = goacc_thread ();
 
-  base_dev->host2dev_func (thr->dev->target_id, d, h, s);
+  assert (thr && thr->dev);
+
+  thr->dev->host2dev_func (thr->dev->target_id, d, h, s);
 }
 
 void
@@ -154,7 +160,9 @@ acc_memcpy_from_device (void *h, void *d, size_t s)
      been obtained from a routine that did that.  */
   struct goacc_thread *thr = goacc_thread ();
 
-  base_dev->dev2host_func (thr->dev->target_id, h, d, s);
+  assert (thr && thr->dev);
+
+  thr->dev->dev2host_func (thr->dev->target_id, h, d, s);
 }
 
 /* Return the device pointer that corresponds to host data H.  Or NULL
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 563f9bb..d899946 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -49,32 +49,6 @@ find_pset (int pos, size_t mapnum, unsigned short *kinds)
   return kind == GOMP_MAP_TO_PSET;
 }
 
-
-/* Ensure that the target device for DEVICE_TYPE is initialised (and that
-   plugins have been loaded if appropriate).  The ACC_dev variable for the
-   current thread will be set appropriately for the given device type on
-   return.  */
-
-attribute_hidden void
-select_acc_device (int device_type)
-{
-  goacc_lazy_initialize ();
-
-  if (device_type == GOMP_DEVICE_HOST_FALLBACK)
-    return;
-
-  if (device_type == acc_device_none)
-    device_type = acc_device_host;
-
-  if (device_type >= 0)
-    {
-      /* NOTE: this will go badly if the surrounding data environment is set up
-         to use a different device type.  We'll just have to trust that users
-	 know what they're doing...  */
-      acc_set_device_type (device_type);
-    }
-}
-
 static void goacc_wait (int async, int num_waits, va_list ap);
 
 void
@@ -111,7 +85,7 @@ GOACC_parallel (int device, void (*fn) (void *),
 	      __FUNCTION__, (unsigned long) mapnum, hostaddrs, sizes, kinds,
 	      async);
 #endif
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   thr = goacc_thread ();
   acc_dev = thr->dev;
@@ -151,7 +125,7 @@ GOACC_parallel (int device, void (*fn) (void *),
       if (tgt_fn_key == NULL)
 	gomp_fatal ("target function wasn't mapped");
 
-      tgt_fn = (void (*)) tgt_fn_key->tgt->tgt_start;
+      tgt_fn = (void (*)) tgt_fn_key->tgt_offset;
     }
   else
     tgt_fn = (void (*)) fn;
@@ -195,7 +169,7 @@ GOACC_data_start (int device, size_t mapnum,
 	      __FUNCTION__, (unsigned long) mapnum, hostaddrs, sizes, kinds);
 #endif
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
@@ -242,7 +216,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
   bool data_enter = false;
   size_t i;
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   thr = goacc_thread ();
   acc_dev = thr->dev;
@@ -429,7 +403,7 @@ GOACC_update (int device, size_t mapnum,
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
   size_t i;
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index bc60f72..1faf5bc 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -119,31 +119,6 @@ GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
 }
 
 STATIC void *
-GOMP_OFFLOAD_openacc_open_device (int n)
-{
-  return (void *) (intptr_t) n;
-}
-
-STATIC int
-GOMP_OFFLOAD_openacc_close_device (void *hnd)
-{
-  return 0;
-}
-
-STATIC int
-GOMP_OFFLOAD_openacc_get_device_num (void)
-{
-  return 0;
-}
-
-STATIC void
-GOMP_OFFLOAD_openacc_set_device_num (int n)
-{
-  if (n > 0)
-    GOMP (fatal) ("device number %u out of range for host execution", n);
-}
-
-STATIC void *
 GOMP_OFFLOAD_alloc (int n __attribute__ ((unused)), size_t s)
 {
   return GOMP (malloc) (s);
@@ -254,7 +229,7 @@ GOMP_OFFLOAD_openacc_async_wait_all_async (int async __attribute__ ((unused)))
 }
 
 STATIC void *
-GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data
+GOMP_OFFLOAD_openacc_create_thread_data (int ord
 					 __attribute__ ((unused)))
 {
   return NULL;
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 483cb75..583ec87 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -133,7 +133,8 @@ struct targ_fn_descriptor
   const char *name;
 };
 
-static bool ptx_inited = false;
+static unsigned int instantiated_devices = 0;
+static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
 struct ptx_stream
 {
@@ -331,9 +332,21 @@ struct ptx_event
   struct ptx_event *next;
 };
 
+struct ptx_image_data
+{
+  void *target_data;
+  CUmodule module;
+  struct ptx_image_data *next;
+};
+
 static pthread_mutex_t ptx_event_lock;
 static struct ptx_event *ptx_events;
 
+static struct ptx_device **ptx_devices;
+
+static struct ptx_image_data *ptx_images = NULL;
+static pthread_mutex_t ptx_image_lock = PTHREAD_MUTEX_INITIALIZER;
+
 #define _XSTR(s) _STR(s)
 #define _STR(s) #s
 
@@ -450,8 +463,8 @@ fini_streams_for_device (struct ptx_device *ptx_dev)
       struct ptx_stream *s = ptx_dev->active_streams;
       ptx_dev->active_streams = ptx_dev->active_streams->next;
 
-      cuStreamDestroy (s->stream);
       map_fini (s);
+      cuStreamDestroy (s->stream);
       free (s);
     }
 
@@ -575,21 +588,21 @@ select_stream_for_async (int async, pthread_t thread, bool create,
   return stream;
 }
 
-static int nvptx_get_num_devices (void);
-
-/* Initialize the device.  */
-static int
+/* Initialize the device.  Return TRUE on success, else FALSE.  PTX_DEV_LOCK
+   should be locked on entry and remains locked on exit.  */
+static bool
 nvptx_init (void)
 {
   CUresult r;
   int rc;
+  int ndevs;
 
-  if (ptx_inited)
-    return nvptx_get_num_devices ();
+  if (instantiated_devices != 0)
+    return true;
 
   rc = verify_device_library ();
   if (rc < 0)
-    return -1;
+    return false;
 
   r = cuInit (0);
   if (r != CUDA_SUCCESS)
@@ -599,22 +612,64 @@ nvptx_init (void)
 
   pthread_mutex_init (&ptx_event_lock, NULL);
 
-  ptx_inited = true;
+  r = cuDeviceGetCount (&ndevs);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetCount error: %s", cuda_error (r));
 
-  return nvptx_get_num_devices ();
+  ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
+					    * ndevs);
+
+  return true;
 }
 
+/* Select the N'th PTX device for the current host thread.  The device must
+   have been previously opened before calling this function.  */
+
 static void
-nvptx_fini (void)
+nvptx_attach_host_thread_to_device (int n)
 {
-  ptx_inited = false;
+  CUdevice dev;
+  CUresult r;
+  struct ptx_device *ptx_dev;
+  CUcontext thd_ctx;
+
+  r = cuCtxGetDevice (&dev);
+  if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
+    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+
+  if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
+    return;
+  else
+    {
+      CUcontext old_ctx;
+
+      ptx_dev = ptx_devices[n];
+      assert (ptx_dev);
+
+      r = cuCtxGetCurrent (&thd_ctx);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
+
+      /* We don't necessarily have a current context (e.g. if it has been
+         destroyed.  Pop it if we do though.  */
+      if (thd_ctx != NULL)
+	{
+	  r = cuCtxPopCurrent (&old_ctx);
+	  if (r != CUDA_SUCCESS)
+            GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
+	}
+
+      r = cuCtxPushCurrent (ptx_dev->ctx);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuCtxPushCurrent error: %s", cuda_error (r));
+    }
 }
 
-static void *
+static struct ptx_device *
 nvptx_open_device (int n)
 {
   struct ptx_device *ptx_dev;
-  CUdevice dev;
+  CUdevice dev, ctx_dev;
   CUresult r;
   int async_engines, pi;
 
@@ -628,6 +683,21 @@ nvptx_open_device (int n)
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
 
+  r = cuCtxGetDevice (&ctx_dev);
+  if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
+    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+  
+  if (r != CUDA_ERROR_INVALID_CONTEXT && ctx_dev != dev)
+    {
+      /* The current host thread has an active context for a different device.
+         Detach it.  */
+      CUcontext old_ctx;
+      
+      r = cuCtxPopCurrent (&old_ctx);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
+    }
+
   r = cuCtxGetCurrent (&ptx_dev->ctx);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
@@ -678,17 +748,16 @@ nvptx_open_device (int n)
 
   init_streams_for_device (ptx_dev, async_engines);
 
-  return (void *) ptx_dev;
+  return ptx_dev;
 }
 
-static int
-nvptx_close_device (void *targ_data)
+static void
+nvptx_close_device (struct ptx_device *ptx_dev)
 {
   CUresult r;
-  struct ptx_device *ptx_dev = targ_data;
 
   if (!ptx_dev)
-    return 0;
+    return;
 
   fini_streams_for_device (ptx_dev);
 
@@ -700,8 +769,6 @@ nvptx_close_device (void *targ_data)
     }
 
   free (ptx_dev);
-
-  return 0;
 }
 
 static int
@@ -714,7 +781,7 @@ nvptx_get_num_devices (void)
      order to enumerate available devices, but CUDA API routines can't be used
      until cuInit has been called.  Just call it now (but don't yet do any
      further initialization).  */
-  if (!ptx_inited)
+  if (instantiated_devices == 0)
     cuInit (0);
 
   r = cuDeviceGetCount (&n);
@@ -1507,64 +1574,84 @@ GOMP_OFFLOAD_get_num_devices (void)
   return nvptx_get_num_devices ();
 }
 
-static void **kernel_target_data;
-static void **kernel_host_table;
-
 void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_data)
+GOMP_OFFLOAD_init_device (int n)
 {
-  kernel_target_data = target_data;
-  kernel_host_table = host_table;
-}
+  pthread_mutex_lock (&ptx_dev_lock);
 
-void
-GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
-{
-  (void) nvptx_init ();
+  if (!nvptx_init () || ptx_devices[n] != NULL)
+    {
+      pthread_mutex_unlock (&ptx_dev_lock);
+      return;
+    }
+
+  ptx_devices[n] = nvptx_open_device (n);
+  instantiated_devices++;
+
+  pthread_mutex_unlock (&ptx_dev_lock);
 }
 
 void
-GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
+GOMP_OFFLOAD_fini_device (int n)
 {
-  nvptx_fini ();
+  pthread_mutex_lock (&ptx_dev_lock);
+
+  if (ptx_devices[n] != NULL)
+    {
+      nvptx_attach_host_thread_to_device (n);
+      nvptx_close_device (ptx_devices[n]);
+      ptx_devices[n] = NULL;
+      instantiated_devices--;
+    }
+
+  pthread_mutex_unlock (&ptx_dev_lock);
 }
 
 int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **tablep)
+GOMP_OFFLOAD_load_image (int ord, void *target_data,
+			 struct addr_pair **target_table)
 {
   CUmodule module;
-  void **fn_table;
-  char **fn_names;
-  int fn_entries, i;
+  char **fn_names, **var_names;
+  unsigned int fn_entries, var_entries, i, j;
   CUresult r;
   struct targ_fn_descriptor *targ_fns;
+  void **img_header = (void **) target_data;
+  struct ptx_image_data *new_image;
 
-  if (nvptx_init () <= 0)
-    return 0;
+  GOMP_OFFLOAD_init_device (ord);
 
-  /* This isn't an error, because an image may legitimately have no offloaded
-     regions and so will not call GOMP_offload_register.  */
-  if (kernel_target_data == NULL)
-    return 0;
+  nvptx_attach_host_thread_to_device (ord);
+
+  link_ptx (&module, img_header[0]);
 
-  link_ptx (&module, kernel_target_data[0]);
+  pthread_mutex_lock (&ptx_image_lock);
+  new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
+  new_image->target_data = target_data;
+  new_image->module = module;
+  new_image->next = ptx_images;
+  ptx_images = new_image;
+  pthread_mutex_unlock (&ptx_image_lock);
 
-  /* kernel_target_data[0] -> ptx code
-     kernel_target_data[1] -> variable mappings
-     kernel_target_data[2] -> array of kernel names in ascii
+  /* The mkoffload utility emits a table of pointers/integers at the start of
+     each offload image:
 
-     kernel_host_table[0] -> start of function addresses (__offload_func_table)
-     kernel_host_table[1] -> end of function addresses (__offload_funcs_end)
+     img_header[0] -> ptx code
+     img_header[1] -> number of variables
+     img_header[2] -> array of variable names (pointers to strings)
+     img_header[3] -> number of kernels
+     img_header[4] -> array of kernel names (pointers to strings)
 
      The array of kernel names and the functions addresses form a
      one-to-one correspondence.  */
 
-  fn_table = kernel_host_table[0];
-  fn_names = (char **) kernel_target_data[2];
-  fn_entries = (kernel_host_table[1] - kernel_host_table[0]) / sizeof (void *);
+  var_entries = (uintptr_t) img_header[1];
+  var_names = (char **) img_header[2];
+  fn_entries = (uintptr_t) img_header[3];
+  fn_names = (char **) img_header[4];
 
-  *tablep = GOMP_PLUGIN_malloc (sizeof (struct mapping_table) * fn_entries);
+  *target_table = GOMP_PLUGIN_malloc (sizeof (struct addr_pair)
+				      * (fn_entries + var_entries));
   targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor)
 				 * fn_entries);
 
@@ -1579,38 +1666,86 @@ GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
       targ_fns[i].fn = function;
       targ_fns[i].name = (const char *) fn_names[i];
 
-      (*tablep)[i].host_start = (uintptr_t) fn_table[i];
-      (*tablep)[i].host_end = (*tablep)[i].host_start + 1;
-      (*tablep)[i].tgt_start = (uintptr_t) &targ_fns[i];
-      (*tablep)[i].tgt_end = (*tablep)[i].tgt_start + 1;
+      (*target_table)[i].start = (uintptr_t) &targ_fns[i];
+      (*target_table)[i].end = (*target_table)[i].start + 1;
     }
 
-  return fn_entries;
+  for (j = 0; j < var_entries; j++, i++)
+    {
+      CUdeviceptr var;
+      size_t bytes;
+
+      r = cuModuleGetGlobal (&var, &bytes, module, var_names[j]);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+
+      (*target_table)[i].start = (uintptr_t) var;
+      (*target_table)[i].end = (*target_table)[i].start + bytes;
+    }
+
+  return i;
+}
+
+void
+GOMP_OFFLOAD_unload_image (int tid __attribute__((unused)), void *target_data)
+{
+  void **img_header = (void **) target_data;
+  struct targ_fn_descriptor *targ_fns
+    = (struct targ_fn_descriptor *) img_header[0];
+  struct ptx_image_data *image, *prev = NULL, *newhd = NULL;
+
+  free (targ_fns);
+
+  pthread_mutex_lock (&ptx_image_lock);
+  for (image = ptx_images; image != NULL;)
+    {
+      struct ptx_image_data *next = image->next;
+
+      if (image->target_data == target_data)
+	{
+	  cuModuleUnload (image->module);
+	  free (image);
+	  if (prev)
+	    prev->next = next;
+	}
+      else
+	{
+	  prev = image;
+	  if (!newhd)
+	    newhd = image;
+	}
+
+      image = next;
+    }
+  ptx_images = newhd;
+  pthread_mutex_unlock (&ptx_image_lock);
 }
 
 void *
-GOMP_OFFLOAD_alloc (int n __attribute__ ((unused)), size_t size)
+GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_alloc (size);
 }
 
 void
-GOMP_OFFLOAD_free (int n __attribute__ ((unused)), void *ptr)
+GOMP_OFFLOAD_free (int ord, void *ptr)
 {
+  nvptx_attach_host_thread_to_device (ord);
   nvptx_free (ptr);
 }
 
 void *
-GOMP_OFFLOAD_dev2host (int ord __attribute__ ((unused)), void *dst,
-		       const void *src, size_t n)
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_dev2host (dst, src, n);
 }
 
 void *
-GOMP_OFFLOAD_host2dev (int ord __attribute__ ((unused)), void *dst,
-		       const void *src, size_t n)
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_host2dev (dst, src, n);
 }
 
@@ -1627,45 +1762,6 @@ GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), size_t mapnum,
 	    num_workers, vector_length, async, targ_mem_desc);
 }
 
-void *
-GOMP_OFFLOAD_openacc_open_device (int n)
-{
-  return nvptx_open_device (n);
-}
-
-int
-GOMP_OFFLOAD_openacc_close_device (void *h)
-{
-  return nvptx_close_device (h);
-}
-
-void
-GOMP_OFFLOAD_openacc_set_device_num (int n)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  assert (n >= 0);
-
-  if (!nvthd->ptx_dev || nvthd->ptx_dev->ord != n)
-    (void) nvptx_open_device (n);
-}
-
-/* This can be called before the device is "opened" for the current thread, in
-   which case we can't tell which device number should be returned.  We don't
-   actually want to open the device here, so just return -1 and let the caller
-   (oacc-init.c:acc_get_device_num) handle it.  */
-
-int
-GOMP_OFFLOAD_openacc_get_device_num (void)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (nvthd && nvthd->ptx_dev)
-    return nvthd->ptx_dev->ord;
-  else
-    return -1;
-}
-
 void
 GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc)
 {
@@ -1729,14 +1825,18 @@ GOMP_OFFLOAD_openacc_async_set_async (int async)
 }
 
 void *
-GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data)
+GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
-  struct ptx_device *ptx_dev = (struct ptx_device *) targ_data;
+  struct ptx_device *ptx_dev;
   struct nvptx_thread *nvthd
     = GOMP_PLUGIN_malloc (sizeof (struct nvptx_thread));
   CUresult r;
   CUcontext thd_ctx;
 
+  ptx_dev = ptx_devices[ord];
+
+  assert (ptx_dev);
+
   r = cuCtxGetCurrent (&thd_ctx);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
diff --git a/libgomp/target.c b/libgomp/target.c
index dfe7fb9..d8da783 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -178,7 +178,6 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mem_map;
 
   if (mapnum == 0)
     return tgt;
@@ -597,7 +596,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (tgt->mem_map, k);
+	splay_tree_remove (&devicep->mem_map, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -1159,10 +1158,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
     {
       optional_present = optional_total = 0;
       DLSYM_OPT (openacc.exec, openacc_parallel);
-      DLSYM_OPT (openacc.open_device, openacc_open_device);
-      DLSYM_OPT (openacc.close_device, openacc_close_device);
-      DLSYM_OPT (openacc.get_device_num, openacc_get_device_num);
-      DLSYM_OPT (openacc.set_device_num, openacc_set_device_num);
       DLSYM_OPT (openacc.register_async_cleanup,
 		 openacc_register_async_cleanup);
       DLSYM_OPT (openacc.async_test, openacc_async_test);
@@ -1271,7 +1266,6 @@ gomp_target_init (void)
 		current_device.mem_map.root = NULL;
 		current_device.is_initialized = false;
 		current_device.openacc.data_environ = NULL;
-		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
index 84045db..a4cf7f2 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
@@ -58,7 +58,7 @@ main (int argc, char **argv)
       acc_set_device_num (1, (acc_device_t) 0);
 
       devnum = acc_get_device_num (devtype);
-      if (devnum != 0)
+      if (devnum != 1)
 	abort ();
   }
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-06 12:46                                                   ` Ilya Verbin
@ 2015-04-07 15:26                                                     ` Jakub Jelinek
  2015-04-08 14:32                                                       ` Julian Brown
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2015-04-07 15:26 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Julian Brown, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Mon, Apr 06, 2015 at 03:45:57PM +0300, Ilya Verbin wrote:
> On Wed, Apr 01, 2015 at 15:20:25 +0200, Jakub Jelinek wrote:
> > LGTM with proper ChangeLog entry.
> 
> I've commited this patch into trunk.
> 
> Julian, you probably want to update the nvptx plugin.

Note that as the number of P1s without posted fixes is now zero, it is
likely RC1 will be done this week, so if you want nvptx working in GCC 5,
please post a fix as soon as possible.

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-01 13:20                                                 ` Jakub Jelinek
  2015-04-01 17:26                                                   ` Ilya Verbin
@ 2015-04-06 12:46                                                   ` Ilya Verbin
  2015-04-07 15:26                                                     ` Jakub Jelinek
  1 sibling, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-04-06 12:46 UTC (permalink / raw)
  To: Julian Brown, Thomas Schwinge; +Cc: Jakub Jelinek, gcc-patches, Kirill Yukhin

On Wed, Apr 01, 2015 at 15:20:25 +0200, Jakub Jelinek wrote:
> LGTM with proper ChangeLog entry.

I've commited this patch into trunk.

Julian, you probably want to update the nvptx plugin.


gcc/
	* config/i386/intelmic-mkoffload.c (generate_host_descr_file): Call
	GOMP_offload_unregister from the destructor.
libgomp/
	* libgomp-plugin.h (struct mapping_table): Replace with addr_pair.
	* libgomp.h (struct gomp_memory_mapping): Remove.
	(struct target_mem_desc): Change type of mem_map from
	gomp_memory_mapping * to splay_tree_s *.
	(struct gomp_device_descr): Remove register_image_func, get_table_func.
	Add load_image_func, unload_image_func.
	Change type of mem_map from gomp_memory_mapping to splay_tree_s.
	Remove offload_regions_registered.
	(gomp_init_tables): Remove.
	(gomp_free_memmap): Change type of argument from gomp_memory_mapping *
	to splay_tree_s *.
	* libgomp.map (GOMP_4.0.1): Add GOMP_offload_unregister.
	* oacc-host.c (host_dispatch): Do not initialize register_image_func,
	get_table_func, mem_map.is_initialized, mem_map.splay_tree.root,
	offload_regions_registered.
	Initialize load_image_func, unload_image_func, mem_map.root.
	(goacc_host_init): Do not initialize host_dispatch.mem_map.lock.
	* oacc-init.c (lazy_open): Don't call gomp_init_tables.
	(acc_shutdown_1): Use dev's lock and splay_tree instead of mem_map's.
	* oacc-mem.c (lookup_host): Get gomp_device_descr *dev instead of
	gomp_memory_mapping *.  Use dev's lock and splay_tree.
	(lookup_dev): Use dev's lock.
	(acc_deviceptr): Pass dev to lookup_host instead of mem_map.
	(acc_is_present): Likewise.
	(acc_map_data): Likewise.
	(acc_unmap_data): Likewise.  Use dev's lock.
	(present_create_copy): Likewise.
	(delete_copyout): Pass dev to lookup_host instead of mem_map.
	(update_dev_host): Likewise.
	(gomp_acc_remove_pointer): Likewise.  Use dev's lock.
	* oacc-parallel.c (GOACC_parallel): Use dev's lock and splay_tree.
	* plugin/plugin-host.c (GOMP_OFFLOAD_register_image): Remove.
	(GOMP_OFFLOAD_get_table): Remove
	(GOMP_OFFLOAD_load_image): New function.
	(GOMP_OFFLOAD_unload_image): New function.
	* target.c (register_lock): New mutex for offload image registration.
	(num_devices): Do not guard with PLUGIN_SUPPORT.
	(gomp_realloc_unlock): New static function.
	(gomp_map_vars_existing): Add device descriptor argument.  Unlock mutex
	before gomp_fatal.
	(gomp_map_vars): Use dev's lock and splay_tree instead of mem_map's.
	Pass devicep to gomp_map_vars_existing.  Unlock mutex before gomp_fatal.
	(gomp_copy_from_async): Use dev's lock and splay_tree instead of
	mem_map's.
	(gomp_unmap_vars): Likewise.
	(gomp_update): Remove gomp_memory_mapping argument.  Use dev's lock and
	splay_tree instead of mm's.  Unlock mutex before gomp_fatal.
	(gomp_offload_image_to_device): New static function.
	(GOMP_offload_register): Add mutex lock.
	Call gomp_offload_image_to_device for all initialized devices.
	Replace gomp_realloc with gomp_realloc_unlock.
	(GOMP_offload_unregister): New function.
	(gomp_init_tables): Replace with gomp_init_device.  Replace a call to
	get_table_func from the plugin with calls to init_device_func and
	gomp_offload_image_to_device.
	(gomp_free_memmap): Change type of argument from gomp_memory_mapping *
	to splay_tree_s *.
	(GOMP_target): Do not call gomp_init_tables.  Use dev's lock and
	splay_tree instead of mem_map's.  Unlock mutex before gomp_fatal.
	(GOMP_target_data): Do not call gomp_init_tables.
	(GOMP_target_update): Likewise.  Remove argument from gomp_update.
	(gomp_load_plugin_for_device): Replace register_image and get_table
	with load_image and unload_image in DLSYM ().
	(gomp_register_images_for_device): Remove function.
	(gomp_target_init): Do not initialize current_device.mem_map.*,
	current_device.offload_regions_registered.
	Remove call to gomp_register_images_for_device.
	Do not free offload_images and num_offload_images.
liboffloadmic/
	* plugin/libgomp-plugin-intelmic.cpp: Include map.
	(AddrVect, DevAddrVect, ImgDevAddrMap): New typedefs.
	(num_devices, num_images, address_table): New static vars.
	(num_libraries, lib_descrs): Remove static vars.
	(set_mic_lib_path): Rename to ...
	(init): ... this.  Allocate address_table and get num_devices.
	(GOMP_OFFLOAD_get_num_devices): return num_devices.
	(load_lib_and_get_table): Remove static function.
	(offload_image): New static function.
	(GOMP_OFFLOAD_get_table): Remove function.
	(GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_unload_image): New functions.

  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-01 13:20                                                 ` Jakub Jelinek
@ 2015-04-01 17:26                                                   ` Ilya Verbin
  2015-04-06 12:46                                                   ` Ilya Verbin
  1 sibling, 0 replies; 30+ messages in thread
From: Ilya Verbin @ 2015-04-01 17:26 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Julian Brown, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Wed, Apr 01, 2015 at 15:20:25 +0200, Jakub Jelinek wrote:
> On Wed, Apr 01, 2015 at 04:14:05PM +0300, Ilya Verbin wrote:
> > I was worried about the following scenario:
> > 1. Thread 1 in GOMP_target locks device 1.
> > 2. Thread 2 in GOMP_target locks device 2 and calls gomp_fatal.
> > 3. GOMP_offload_unregister will wait for device 1, even device 2 is unlocked.
> 
> How is that different from
> 1. Thread 1 in GOMP_target locks device 1.
> 2. Thread 2 calls exit.
> ?  I mean when you unlock the device and register locks if you own them
> before gomp_fatal.

Yeah, it's the same situation.

> > Here is patch, which unlocks proper mutexes in the caller, as you suggested.
> > make check-target-libgomp passed.
> 
> LGTM with proper ChangeLog entry.

When should I commit it into trunk?  Without the corresponding PTX part,
offloading to PTX will not work.

  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-01 13:14                                               ` Ilya Verbin
@ 2015-04-01 13:20                                                 ` Jakub Jelinek
  2015-04-01 17:26                                                   ` Ilya Verbin
  2015-04-06 12:46                                                   ` Ilya Verbin
  0 siblings, 2 replies; 30+ messages in thread
From: Jakub Jelinek @ 2015-04-01 13:20 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Julian Brown, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Wed, Apr 01, 2015 at 04:14:05PM +0300, Ilya Verbin wrote:
> On Wed, Apr 01, 2015 at 07:21:47 +0200, Jakub Jelinek wrote:
> > On Wed, Apr 01, 2015 at 02:53:28AM +0300, Ilya Verbin wrote:
> > > +/* Similar to gomp_fatal, but release mutexes before.  */
> > > +
> > > +static void
> > > +gomp_fatal_unlock (const char *fmt, ...)
> > > +{
> > > +  int i;
> > > +  va_list list;
> > > +
> > > +  for (i = 0; i < num_devices; i++)
> > > +    gomp_mutex_unlock (&devices[i].lock);
> > 
> > This is wrong.  Calling gomp_mutex_unlock on a lock that you don't have
> > locked is undefined behavior.
> > You really should unlock it in the caller which should be aware which 0/1/2
> > locks it holds.
> 
> I was worried about the following scenario:
> 1. Thread 1 in GOMP_target locks device 1.
> 2. Thread 2 in GOMP_target locks device 2 and calls gomp_fatal.
> 3. GOMP_offload_unregister will wait for device 1, even device 2 is unlocked.

How is that different from
1. Thread 1 in GOMP_target locks device 1.
2. Thread 2 calls exit.
?  I mean when you unlock the device and register locks if you own them
before gomp_fatal.

> Anyway, it was a bad idea to unlock mutexes from non-owner thread.
> 
> Here is patch, which unlocks proper mutexes in the caller, as you suggested.
> make check-target-libgomp passed.

LGTM with proper ChangeLog entry.

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-04-01  5:21                                             ` Jakub Jelinek
@ 2015-04-01 13:14                                               ` Ilya Verbin
  2015-04-01 13:20                                                 ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-04-01 13:14 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Julian Brown, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Wed, Apr 01, 2015 at 07:21:47 +0200, Jakub Jelinek wrote:
> On Wed, Apr 01, 2015 at 02:53:28AM +0300, Ilya Verbin wrote:
> > +/* Similar to gomp_fatal, but release mutexes before.  */
> > +
> > +static void
> > +gomp_fatal_unlock (const char *fmt, ...)
> > +{
> > +  int i;
> > +  va_list list;
> > +
> > +  for (i = 0; i < num_devices; i++)
> > +    gomp_mutex_unlock (&devices[i].lock);
> 
> This is wrong.  Calling gomp_mutex_unlock on a lock that you don't have
> locked is undefined behavior.
> You really should unlock it in the caller which should be aware which 0/1/2
> locks it holds.

I was worried about the following scenario:
1. Thread 1 in GOMP_target locks device 1.
2. Thread 2 in GOMP_target locks device 2 and calls gomp_fatal.
3. GOMP_offload_unregister will wait for device 1, even device 2 is unlocked.
Anyway, it was a bad idea to unlock mutexes from non-owner thread.

Here is patch, which unlocks proper mutexes in the caller, as you suggested.
make check-target-libgomp passed.


diff --git a/gcc/config/i386/intelmic-mkoffload.c b/gcc/config/i386/intelmic-mkoffload.c
index f93007c..e101f93 100644
--- a/gcc/config/i386/intelmic-mkoffload.c
+++ b/gcc/config/i386/intelmic-mkoffload.c
@@ -350,14 +350,24 @@ generate_host_descr_file (const char *host_compiler)
 	   "#ifdef __cplusplus\n"
 	   "extern \"C\"\n"
 	   "#endif\n"
-	   "void GOMP_offload_register (void *, int, void *);\n\n"
+	   "void GOMP_offload_register (void *, int, void *);\n"
+	   "void GOMP_offload_unregister (void *, int, void *);\n\n"
 
 	   "__attribute__((constructor))\n"
 	   "static void\n"
 	   "init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
+	   "}\n\n", GOMP_DEVICE_INTEL_MIC);
+
+  fprintf (src_file,
+	   "__attribute__((destructor))\n"
+	   "static void\n"
+	   "fini (void)\n"
+	   "{\n"
+	   "  GOMP_offload_unregister (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
 	   "}\n", GOMP_DEVICE_INTEL_MIC);
+
   fclose (src_file);
 
   unsigned new_argc = 0;
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index d9cbff5..1072ae4 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -51,14 +51,12 @@ enum offload_target_type
   OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
 };
 
-/* Auxiliary struct, used for transferring a host-target address range mapping
-   from plugin to libgomp.  */
-struct mapping_table
+/* Auxiliary struct, used for transferring pairs of addresses from plugin
+   to libgomp.  */
+struct addr_pair
 {
-  uintptr_t host_start;
-  uintptr_t host_end;
-  uintptr_t tgt_start;
-  uintptr_t tgt_end;
+  uintptr_t start;
+  uintptr_t end;
 };
 
 /* Miscellaneous functions.  */
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3089401..a1d42c5 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -224,7 +224,6 @@ struct gomp_team_state
 };
 
 struct target_mem_desc;
-struct gomp_memory_mapping;
 
 /* These are the OpenMP 4.0 Internal Control Variables described in
    section 2.3.1.  Those described as having one copy per task are
@@ -657,7 +656,7 @@ struct target_mem_desc {
   struct gomp_device_descr *device_descr;
 
   /* Memory mapping info for the thread that created this descriptor.  */
-  struct gomp_memory_mapping *mem_map;
+  struct splay_tree_s *mem_map;
 
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
@@ -683,20 +682,6 @@ struct splay_tree_key_s {
 
 #include "splay-tree.h"
 
-/* Information about mapped memory regions (per device/context).  */
-
-struct gomp_memory_mapping
-{
-  /* Mutex for operating with the splay tree and other shared structures.  */
-  gomp_mutex_t lock;
-
-  /* True when tables have been added to this memory map.  */
-  bool is_initialized;
-
-  /* Splay tree containing information about mapped memory regions.  */
-  struct splay_tree_s splay_tree;
-};
-
 typedef struct acc_dispatch_t
 {
   /* This is a linked list of data mapped using the
@@ -773,19 +758,18 @@ struct gomp_device_descr
   unsigned int (*get_caps_func) (void);
   int (*get_type_func) (void);
   int (*get_num_devices_func) (void);
-  void (*register_image_func) (void *, void *);
   void (*init_device_func) (int);
   void (*fini_device_func) (int);
-  int (*get_table_func) (int, struct mapping_table **);
+  int (*load_image_func) (int, void *, struct addr_pair **);
+  void (*unload_image_func) (int, void *);
   void *(*alloc_func) (int, size_t);
   void (*free_func) (int, void *);
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
 
-  /* Memory-mapping info for this device instance.  */
-  /* Uses a separate lock.  */
-  struct gomp_memory_mapping mem_map;
+  /* Splay tree containing information about mapped memory regions.  */
+  struct splay_tree_s mem_map;
 
   /* Mutex for the mutable data.  */
   gomp_mutex_t lock;
@@ -793,9 +777,6 @@ struct gomp_device_descr
   /* Set to true when device is initialized.  */
   bool is_initialized;
 
-  /* True when offload regions have been registered with this device.  */
-  bool offload_regions_registered;
-
   /* OpenACC-specific data and functions.  */
   /* This is mutable because of its mutable data_environ and target_data
      members.  */
@@ -811,9 +792,7 @@ extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
 extern void gomp_copy_from_async (struct target_mem_desc *);
 extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
-extern void gomp_init_tables (struct gomp_device_descr *,
-			      struct gomp_memory_mapping *);
-extern void gomp_free_memmap (struct gomp_memory_mapping *);
+extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_fini_device (struct gomp_device_descr *);
 
 /* work.c */
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index f44174e..2b2b953 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -231,6 +231,7 @@ GOMP_4.0 {
 GOMP_4.0.1 {
   global:
 	GOMP_offload_register;
+	GOMP_offload_unregister;
 } GOMP_4.0;
 
 OACC_2.0 {
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 6aeb1e7..e4756b6 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -43,20 +43,18 @@ static struct gomp_device_descr host_dispatch =
     .get_caps_func = GOMP_OFFLOAD_get_caps,
     .get_type_func = GOMP_OFFLOAD_get_type,
     .get_num_devices_func = GOMP_OFFLOAD_get_num_devices,
-    .register_image_func = GOMP_OFFLOAD_register_image,
     .init_device_func = GOMP_OFFLOAD_init_device,
     .fini_device_func = GOMP_OFFLOAD_fini_device,
-    .get_table_func = GOMP_OFFLOAD_get_table,
+    .load_image_func = GOMP_OFFLOAD_load_image,
+    .unload_image_func = GOMP_OFFLOAD_unload_image,
     .alloc_func = GOMP_OFFLOAD_alloc,
     .free_func = GOMP_OFFLOAD_free,
     .dev2host_func = GOMP_OFFLOAD_dev2host,
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.is_initialized = false,
-    .mem_map.splay_tree.root = NULL,
+    .mem_map.root = NULL,
     .is_initialized = false,
-    .offload_regions_registered = false,
 
     .openacc = {
       .open_device_func = GOMP_OFFLOAD_openacc_open_device,
@@ -94,7 +92,6 @@ static struct gomp_device_descr host_dispatch =
 static __attribute__ ((constructor))
 void goacc_host_init (void)
 {
-  gomp_mutex_init (&host_dispatch.mem_map.lock);
   gomp_mutex_init (&host_dispatch.lock);
   goacc_register (&host_dispatch);
 }
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 166eb55..1e0243e 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -284,12 +284,6 @@ lazy_open (int ord)
     = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
 
   acc_dev->openacc.async_set_async_func (acc_async_sync);
-
-  struct gomp_memory_mapping *mem_map = &acc_dev->mem_map;
-  gomp_mutex_lock (&mem_map->lock);
-  if (!mem_map->is_initialized)
-    gomp_init_tables (acc_dev, mem_map);
-  gomp_mutex_unlock (&mem_map->lock);
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
@@ -351,10 +345,9 @@ acc_shutdown_1 (acc_device_t d)
 
 	  walk->dev->openacc.target_data = target_data = NULL;
 
-	  struct gomp_memory_mapping *mem_map = &walk->dev->mem_map;
-	  gomp_mutex_lock (&mem_map->lock);
-	  gomp_free_memmap (mem_map);
-	  gomp_mutex_unlock (&mem_map->lock);
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (&walk->dev->mem_map);
+	  gomp_mutex_unlock (&walk->dev->lock);
 
 	  walk->dev = NULL;
 	}
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 0096d51..fdc82e6 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -38,7 +38,7 @@
 /* Return block containing [H->S), or NULL if not contained.  */
 
 static splay_tree_key
-lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
+lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
 {
   struct splay_tree_key_s node;
   splay_tree_key key;
@@ -46,11 +46,9 @@ lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
   node.host_start = (uintptr_t) h;
   node.host_end = (uintptr_t) h + s;
 
-  gomp_mutex_lock (&mem_map->lock);
-
-  key = splay_tree_lookup (&mem_map->splay_tree, &node);
-
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_lock (&dev->lock);
+  key = splay_tree_lookup (&dev->mem_map, &node);
+  gomp_mutex_unlock (&dev->lock);
 
   return key;
 }
@@ -65,14 +63,11 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
 {
   int i;
   struct target_mem_desc *t;
-  struct gomp_memory_mapping *mem_map;
 
   if (!tgt)
     return NULL;
 
-  mem_map = tgt->mem_map;
-
-  gomp_mutex_lock (&mem_map->lock);
+  gomp_mutex_lock (&tgt->device_descr->lock);
 
   for (t = tgt; t != NULL; t = t->prev)
     {
@@ -80,7 +75,7 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
         break;
     }
 
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_unlock (&tgt->device_descr->lock);
 
   if (!t)
     return NULL;
@@ -176,7 +171,7 @@ acc_deviceptr (void *h)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  n = lookup_host (&thr->dev->mem_map, h, 1);
+  n = lookup_host (thr->dev, h, 1);
 
   if (!n)
     return NULL;
@@ -229,7 +224,7 @@ acc_is_present (void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   if (n && ((uintptr_t)h < n->host_start
 	    || (uintptr_t)h + s > n->host_end
@@ -271,7 +266,7 @@ acc_map_data (void *h, void *d, size_t s)
 	gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map",
                     (void *)h, (int)s, (void *)d, (int)s);
 
-      if (lookup_host (&acc_dev->mem_map, h, s))
+      if (lookup_host (acc_dev, h, s))
 	gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h,
 		    (int)s);
 
@@ -296,7 +291,7 @@ acc_unmap_data (void *h)
   /* No need to call lazy open, as the address must have been mapped.  */
 
   size_t host_size;
-  splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1);
+  splay_tree_key n = lookup_host (acc_dev, h, 1);
   struct target_mem_desc *t;
 
   if (!n)
@@ -320,7 +315,7 @@ acc_unmap_data (void *h)
       t->tgt_end = 0;
       t->to_free = 0;
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
 	   tp = t, t = t->prev)
@@ -334,7 +329,7 @@ acc_unmap_data (void *h)
 	    break;
 	  }
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   gomp_unmap_vars (t, true);
@@ -358,7 +353,7 @@ present_create_copy (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
   if (n)
     {
       /* Present. */
@@ -389,13 +384,13 @@ present_create_copy (unsigned f, void *h, size_t s)
       tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
 			   false);
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
       tgt->prev = acc_dev->openacc.data_environ;
       acc_dev->openacc.data_environ = tgt;
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   return d;
@@ -436,7 +431,7 @@ delete_copyout (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -479,7 +474,7 @@ update_dev_host (int is_dev, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -532,7 +527,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   struct target_mem_desc *t;
   int minrefs = (mapnum == 1) ? 2 : 3;
 
-  n = lookup_host (&acc_dev->mem_map, h, 1);
+  n = lookup_host (acc_dev, h, 1);
 
   if (!n)
     gomp_fatal ("%p is not a mapped block", (void *)h);
@@ -543,7 +538,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
 
   struct target_mem_desc *tp;
 
-  gomp_mutex_lock (&acc_dev->mem_map.lock);
+  gomp_mutex_lock (&acc_dev->lock);
 
   if (t->refcount == minrefs)
     {
@@ -570,7 +565,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   if (force_copyfrom)
     t->list[0]->copy_from = 1;
 
-  gomp_mutex_unlock (&acc_dev->mem_map.lock);
+  gomp_mutex_unlock (&acc_dev->lock);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 0c74f54..563f9bb 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -144,9 +144,9 @@ GOACC_parallel (int device, void (*fn) (void *),
     {
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
-      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k);
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
+      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map, &k);
+      gomp_mutex_unlock (&acc_dev->lock);
 
       if (tgt_fn_key == NULL)
 	gomp_fatal ("target function wasn't mapped");
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index ebf7f11..bc60f72 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -95,12 +95,6 @@ GOMP_OFFLOAD_get_num_devices (void)
 }
 
 STATIC void
-GOMP_OFFLOAD_register_image (void *host_table __attribute__ ((unused)),
-			     void *target_data __attribute__ ((unused)))
-{
-}
-
-STATIC void
 GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
 {
 }
@@ -111,12 +105,19 @@ GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
 }
 
 STATIC int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **table __attribute__ ((unused)))
+GOMP_OFFLOAD_load_image (int n __attribute__ ((unused)),
+			 void *i __attribute__ ((unused)),
+			 struct addr_pair **r __attribute__ ((unused)))
 {
   return 0;
 }
 
+STATIC void
+GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
+			   void *i __attribute__ ((unused)))
+{
+}
+
 STATIC void *
 GOMP_OFFLOAD_openacc_open_device (int n)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index c5dda3f..dfe7fb9 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -49,6 +49,9 @@ static void gomp_target_init (void);
 /* The whole initialization code for offloading plugins is only run one.  */
 static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT;
 
+/* Mutex for offload image registration.  */
+static gomp_mutex_t register_lock;
+
 /* This structure describes an offload image.
    It contains type of the target device, pointer to host table descriptor, and
    pointer to target data.  */
@@ -67,14 +70,26 @@ static int num_offload_images;
 /* Array of descriptors for all available devices.  */
 static struct gomp_device_descr *devices;
 
-#ifdef PLUGIN_SUPPORT
 /* Total number of available devices.  */
 static int num_devices;
-#endif
 
 /* Number of GOMP_OFFLOAD_CAP_OPENMP_400 devices.  */
 static int num_devices_openmp;
 
+/* Similar to gomp_realloc, but release register_lock before gomp_fatal.  */
+
+static void *
+gomp_realloc_unlock (void *old, size_t size)
+{
+  void *ret = realloc (old, size);
+  if (ret == NULL)
+    {
+      gomp_mutex_unlock (&register_lock);
+      gomp_fatal ("Out of memory allocating %lu bytes", (unsigned long) size);
+    }
+  return ret;
+}
+
 /* The comparison function.  */
 
 attribute_hidden int
@@ -125,16 +140,19 @@ resolve_device (int device_id)
    Helper function of gomp_map_vars.  */
 
 static inline void
-gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn,
-			unsigned char kind)
+gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn,
+			splay_tree_key newn, unsigned char kind)
 {
   if ((kind & GOMP_MAP_FLAG_FORCE)
       || oldn->host_start > newn->host_start
       || oldn->host_end < newn->host_end)
-    gomp_fatal ("Trying to map into device [%p..%p) object when "
-		"[%p..%p) is already mapped",
-		(void *) newn->host_start, (void *) newn->host_end,
-		(void *) oldn->host_start, (void *) oldn->host_end);
+    {
+      gomp_mutex_unlock (&devicep->lock);
+      gomp_fatal ("Trying to map into device [%p..%p) object when "
+		  "[%p..%p) is already mapped",
+		  (void *) newn->host_start, (void *) newn->host_end,
+		  (void *) oldn->host_start, (void *) oldn->host_end);
+    }
   oldn->refcount++;
 }
 
@@ -153,14 +171,14 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   const int rshift = is_openacc ? 8 : 3;
   const int typemask = is_openacc ? 0xff : 0x7;
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
+  struct splay_tree_s *mem_map = &devicep->mem_map;
   struct splay_tree_key_s cur_node;
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mm;
+  tgt->mem_map = mem_map;
 
   if (mapnum == 0)
     return tgt;
@@ -174,7 +192,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       tgt_size = mapnum * sizeof (void *);
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < mapnum; i++)
     {
@@ -189,11 +207,11 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	cur_node.host_end = cur_node.host_start + sizes[i];
       else
 	cur_node.host_end = cur_node.host_start + sizeof (void *);
-      splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+      splay_tree_key n = splay_tree_lookup (mem_map, &cur_node);
       if (n)
 	{
 	  tgt->list[i] = n;
-	  gomp_map_vars_existing (n, &cur_node, kind & typemask);
+	  gomp_map_vars_existing (devicep, n, &cur_node, kind & typemask);
 	}
       else
 	{
@@ -228,7 +246,10 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   if (devaddrs)
     {
       if (mapnum != 1)
-        gomp_fatal ("unexpected aggregation");
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  gomp_fatal ("unexpected aggregation");
+	}
       tgt->to_free = devaddrs[0];
       tgt->tgt_start = (uintptr_t) tgt->to_free;
       tgt->tgt_end = tgt->tgt_start + sizes[0];
@@ -274,11 +295,11 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      k->host_end = k->host_start + sizes[i];
 	    else
 	      k->host_end = k->host_start + sizeof (void *);
-	    splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k);
+	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n)
 	      {
 		tgt->list[i] = n;
-		gomp_map_vars_existing (n, k, kind & typemask);
+		gomp_map_vars_existing (devicep, n, k, kind & typemask);
 	      }
 	    else
 	      {
@@ -294,7 +315,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		tgt->refcount++;
 		array->left = NULL;
 		array->right = NULL;
-		splay_tree_insert (&mm->splay_tree, array);
+		splay_tree_insert (mem_map, array);
 		switch (kind & typemask)
 		  {
 		  case GOMP_MAP_ALLOC:
@@ -332,22 +353,25 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    /* Add bias to the pointer value.  */
 		    cur_node.host_start += sizes[i];
 		    cur_node.host_end = cur_node.host_start + 1;
-		    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+		    n = splay_tree_lookup (mem_map, &cur_node);
 		    if (n == NULL)
 		      {
 			/* Could be possibly zero size array section.  */
 			cur_node.host_end--;
-			n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			n = splay_tree_lookup (mem_map, &cur_node);
 			if (n == NULL)
 			  {
 			    cur_node.host_start--;
-			    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			    n = splay_tree_lookup (mem_map, &cur_node);
 			    cur_node.host_start++;
 			  }
 		      }
 		    if (n == NULL)
-		      gomp_fatal ("Pointer target of array section "
-				  "wasn't mapped");
+		      {
+			gomp_mutex_unlock (&devicep->lock);
+			gomp_fatal ("Pointer target of array section "
+				    "wasn't mapped");
+		      }
 		    cur_node.host_start -= n->host_start;
 		    cur_node.tgt_offset = n->tgt->tgt_start + n->tgt_offset
 					  + cur_node.host_start;
@@ -400,24 +424,25 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			  /* Add bias to the pointer value.  */
 			  cur_node.host_start += sizes[j];
 			  cur_node.host_end = cur_node.host_start + 1;
-			  n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			  n = splay_tree_lookup (mem_map, &cur_node);
 			  if (n == NULL)
 			    {
 			      /* Could be possibly zero size array section.  */
 			      cur_node.host_end--;
-			      n = splay_tree_lookup (&mm->splay_tree,
-						     &cur_node);
+			      n = splay_tree_lookup (mem_map, &cur_node);
 			      if (n == NULL)
 				{
 				  cur_node.host_start--;
-				  n = splay_tree_lookup (&mm->splay_tree,
-							 &cur_node);
+				  n = splay_tree_lookup (mem_map, &cur_node);
 				  cur_node.host_start++;
 				}
 			    }
 			  if (n == NULL)
-			    gomp_fatal ("Pointer target of array section "
-					"wasn't mapped");
+			    {
+			      gomp_mutex_unlock (&devicep->lock);
+			      gomp_fatal ("Pointer target of array section "
+					  "wasn't mapped");
+			    }
 			  cur_node.host_start -= n->host_start;
 			  cur_node.tgt_offset = n->tgt->tgt_start
 						+ n->tgt_offset
@@ -441,6 +466,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		      /* We already looked up the memory region above and it
 			 was missing.  */
 		      size_t size = k->host_end - k->host_start;
+		      gomp_mutex_unlock (&devicep->lock);
 #ifdef HAVE_INTTYPES_H
 		      gomp_fatal ("present clause: !acc_is_present (%p, "
 				  "%"PRIu64" (0x%"PRIx64"))",
@@ -463,6 +489,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 					    sizeof (void *));
 		    break;
 		  default:
+		    gomp_mutex_unlock (&devicep->lock);
 		    gomp_fatal ("%s: unhandled kind 0x%.2x", __FUNCTION__,
 				kind);
 		  }
@@ -489,7 +516,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	}
     }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
   return tgt;
 }
 
@@ -514,10 +541,9 @@ attribute_hidden void
 gomp_copy_from_async (struct target_mem_desc *tgt)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
   size_t i;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < tgt->list_count; i++)
     if (tgt->list[i] == NULL)
@@ -536,7 +562,7 @@ gomp_copy_from_async (struct target_mem_desc *tgt)
 				  k->host_end - k->host_start);
       }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 /* Unmap variables described by TGT.  If DO_COPYFROM is true, copy relevant
@@ -547,7 +573,6 @@ attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
 
   if (tgt->list_count == 0)
     {
@@ -555,7 +580,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
       return;
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   size_t i;
   for (i = 0; i < tgt->list_count; i++)
@@ -572,7 +597,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (&mm->splay_tree, k);
+	splay_tree_remove (tgt->mem_map, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -584,13 +609,12 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
   else
     gomp_unmap_tgt (tgt);
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 static void
-gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
-	     size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds,
-	     bool is_openacc)
+gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
+	     size_t *sizes, void *kinds, bool is_openacc)
 {
   size_t i;
   struct splay_tree_key_s cur_node;
@@ -602,25 +626,27 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
   if (mapnum == 0)
     return;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
   for (i = 0; i < mapnum; i++)
     if (sizes[i])
       {
 	cur_node.host_start = (uintptr_t) hostaddrs[i];
 	cur_node.host_end = cur_node.host_start + sizes[i];
-	splay_tree_key n = splay_tree_lookup (&mm->splay_tree,
-					      &cur_node);
+	splay_tree_key n = splay_tree_lookup (&devicep->mem_map, &cur_node);
 	if (n)
 	  {
 	    int kind = get_kind (is_openacc, kinds, i);
 	    if (n->host_start > cur_node.host_start
 		|| n->host_end < cur_node.host_end)
-	      gomp_fatal ("Trying to update [%p..%p) object when"
-			  "only [%p..%p) is mapped",
-			  (void *) cur_node.host_start,
-			  (void *) cur_node.host_end,
-			  (void *) n->host_start,
-			  (void *) n->host_end);
+	      {
+		gomp_mutex_unlock (&devicep->lock);
+		gomp_fatal ("Trying to update [%p..%p) object when "
+			    "only [%p..%p) is mapped",
+			    (void *) cur_node.host_start,
+			    (void *) cur_node.host_end,
+			    (void *) n->host_start,
+			    (void *) n->host_end);
+	      }
 	    if (GOMP_MAP_COPY_TO_P (kind & typemask))
 	      devicep->host2dev_func (devicep->target_id,
 				      (void *) (n->tgt->tgt_start
@@ -639,14 +665,106 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
 				      cur_node.host_end - cur_node.host_start);
 	  }
 	else
-	  gomp_fatal ("Trying to update [%p..%p) object that is not mapped",
-		      (void *) cur_node.host_start,
-		      (void *) cur_node.host_end);
+	  {
+	    gomp_mutex_unlock (&devicep->lock);
+	    gomp_fatal ("Trying to update [%p..%p) object that is not mapped",
+			(void *) cur_node.host_start,
+			(void *) cur_node.host_end);
+	  }
       }
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
+}
+
+/* Load image pointed by TARGET_DATA to the device, specified by DEVICEP.
+   And insert to splay tree the mapping between addresses from HOST_TABLE and
+   from loaded target image.  */
+
+static void
+gomp_offload_image_to_device (struct gomp_device_descr *devicep,
+			      void *host_table, void *target_data,
+			      bool is_register_lock)
+{
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  /* Load image to device and get target addresses for the image.  */
+  struct addr_pair *target_table = NULL;
+  int i, num_target_entries
+    = devicep->load_image_func (devicep->target_id, target_data, &target_table);
+
+  if (num_target_entries != num_funcs + num_vars)
+    {
+      gomp_mutex_unlock (&devicep->lock);
+      if (is_register_lock)
+	gomp_mutex_unlock (&register_lock);
+      gomp_fatal ("Can't map target functions or variables");
+    }
+
+  /* Insert host-target address mapping into splay tree.  */
+  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
+  tgt->array = gomp_malloc ((num_funcs + num_vars) * sizeof (*tgt->array));
+  tgt->refcount = 1;
+  tgt->tgt_start = 0;
+  tgt->tgt_end = 0;
+  tgt->to_free = NULL;
+  tgt->prev = NULL;
+  tgt->list_count = 0;
+  tgt->device_descr = devicep;
+  splay_tree_node array = tgt->array;
+
+  for (i = 0; i < num_funcs; i++)
+    {
+      splay_tree_key k = &array->key;
+      k->host_start = (uintptr_t) host_func_table[i];
+      k->host_end = k->host_start + 1;
+      k->tgt = tgt;
+      k->tgt_offset = target_table[i].start;
+      k->refcount = 1;
+      k->async_refcount = 0;
+      k->copy_from = false;
+      array->left = NULL;
+      array->right = NULL;
+      splay_tree_insert (&devicep->mem_map, array);
+      array++;
+    }
+
+  for (i = 0; i < num_vars; i++)
+    {
+      struct addr_pair *target_var = &target_table[num_funcs + i];
+      if (target_var->end - target_var->start
+	  != (uintptr_t) host_var_table[i * 2 + 1])
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  if (is_register_lock)
+	    gomp_mutex_unlock (&register_lock);
+	  gomp_fatal ("Can't map target variables (size mismatch)");
+	}
+
+      splay_tree_key k = &array->key;
+      k->host_start = (uintptr_t) host_var_table[i * 2];
+      k->host_end = k->host_start + (uintptr_t) host_var_table[i * 2 + 1];
+      k->tgt = tgt;
+      k->tgt_offset = target_var->start;
+      k->refcount = 1;
+      k->async_refcount = 0;
+      k->copy_from = false;
+      array->left = NULL;
+      array->right = NULL;
+      splay_tree_insert (&devicep->mem_map, array);
+      array++;
+    }
+
+  free (target_table);
 }
 
-/* This function should be called from every offload image.
+/* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
    the target, and TARGET_DATA needed by target plugin.  */
 
@@ -654,83 +772,152 @@ void
 GOMP_offload_register (void *host_table, enum offload_target_type target_type,
 		       void *target_data)
 {
-  offload_images = gomp_realloc (offload_images,
-				 (num_offload_images + 1)
-				 * sizeof (struct offload_image_descr));
+  int i;
+  gomp_mutex_lock (&register_lock);
+
+  /* Load image to all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type == target_type && devicep->is_initialized)
+	gomp_offload_image_to_device (devicep, host_table, target_data, true);
+      gomp_mutex_unlock (&devicep->lock);
+    }
 
+  /* Insert image to array of pending images.  */
+  offload_images
+    = gomp_realloc_unlock (offload_images,
+			   (num_offload_images + 1)
+			   * sizeof (struct offload_image_descr));
   offload_images[num_offload_images].type = target_type;
   offload_images[num_offload_images].host_table = host_table;
   offload_images[num_offload_images].target_data = target_data;
 
   num_offload_images++;
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* This function initializes the target device, specified by DEVICEP.  DEVICEP
-   must be locked on entry, and remains locked on return.  */
+/* This function should be called from every offload image while unloading.
+   It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
+   the target, and TARGET_DATA needed by target plugin.  */
 
-attribute_hidden void
-gomp_init_device (struct gomp_device_descr *devicep)
+void
+GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
+			 void *target_data)
 {
-  devicep->init_device_func (devicep->target_id);
-  devicep->is_initialized = true;
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+  int i;
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  gomp_mutex_lock (&register_lock);
+
+  /* Unload image from all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      int j;
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type != target_type || !devicep->is_initialized)
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  continue;
+	}
+
+      devicep->unload_image_func (devicep->target_id, target_data);
+
+      /* Remove mapping from splay tree.  */
+      struct splay_tree_key_s k;
+      splay_tree_key node = NULL;
+      if (num_funcs > 0)
+	{
+	  k.host_start = (uintptr_t) host_func_table[0];
+	  k.host_end = k.host_start + 1;
+	  node = splay_tree_lookup (&devicep->mem_map, &k);
+	}
+      else if (num_vars > 0)
+	{
+	  k.host_start = (uintptr_t) host_var_table[0];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[1];
+	  node = splay_tree_lookup (&devicep->mem_map, &k);
+	}
+
+      for (j = 0; j < num_funcs; j++)
+	{
+	  k.host_start = (uintptr_t) host_func_table[j];
+	  k.host_end = k.host_start + 1;
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+
+      for (j = 0; j < num_vars; j++)
+	{
+	  k.host_start = (uintptr_t) host_var_table[j * 2];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[j * 2 + 1];
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+
+      if (node)
+	{
+	  free (node->tgt);
+	  free (node);
+	}
+
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* Remove image from array of pending images.  */
+  for (i = 0; i < num_offload_images; i++)
+    if (offload_images[i].target_data == target_data)
+      {
+	offload_images[i] = offload_images[--num_offload_images];
+	break;
+      }
+
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* Initialize address mapping tables.  MM must be locked on entry, and remains
-   locked on return.  */
+/* This function initializes the target device, specified by DEVICEP.  DEVICEP
+   must be locked on entry, and remains locked on return.  */
 
 attribute_hidden void
-gomp_init_tables (struct gomp_device_descr *devicep,
-		  struct gomp_memory_mapping *mm)
+gomp_init_device (struct gomp_device_descr *devicep)
 {
-  /* Get address mapping table for device.  */
-  struct mapping_table *table = NULL;
-  int num_entries = devicep->get_table_func (devicep->target_id, &table);
-
-  /* Insert host-target address mapping into dev_splay_tree.  */
   int i;
-  for (i = 0; i < num_entries; i++)
+  devicep->init_device_func (devicep->target_id);
+
+  /* Load to device all images registered by the moment.  */
+  for (i = 0; i < num_offload_images; i++)
     {
-      struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
-      tgt->refcount = 1;
-      tgt->array = gomp_malloc (sizeof (*tgt->array));
-      tgt->tgt_start = table[i].tgt_start;
-      tgt->tgt_end = table[i].tgt_end;
-      tgt->to_free = NULL;
-      tgt->list_count = 0;
-      tgt->device_descr = devicep;
-      splay_tree_node node = tgt->array;
-      splay_tree_key k = &node->key;
-      k->host_start = table[i].host_start;
-      k->host_end = table[i].host_end;
-      k->tgt_offset = 0;
-      k->refcount = 1;
-      k->copy_from = false;
-      k->tgt = tgt;
-      node->left = NULL;
-      node->right = NULL;
-      splay_tree_insert (&mm->splay_tree, node);
+      struct offload_image_descr *image = &offload_images[i];
+      if (image->type == devicep->type)
+	gomp_offload_image_to_device (devicep, image->host_table,
+				      image->target_data, false);
     }
 
-  free (table);
-  mm->is_initialized = true;
+  devicep->is_initialized = true;
 }
 
 /* Free address mapping tables.  MM must be locked on entry, and remains locked
    on return.  */
 
 attribute_hidden void
-gomp_free_memmap (struct gomp_memory_mapping *mm)
+gomp_free_memmap (struct splay_tree_s *mem_map)
 {
-  while (mm->splay_tree.root)
+  while (mem_map->root)
     {
-      struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt;
+      struct target_mem_desc *tgt = mem_map->root->key.tgt;
 
-      splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key);
+      splay_tree_remove (mem_map, &mem_map->root->key);
       free (tgt->array);
       free (tgt);
     }
-
-  mm->is_initialized = false;
 }
 
 /* This function de-initializes the target device, specified by DEVICEP.
@@ -791,22 +978,19 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
     fn_addr = (void *) fn;
   else
     {
-      struct gomp_memory_mapping *mm = &devicep->mem_map;
-      gomp_mutex_lock (&mm->lock);
-
-      if (!mm->is_initialized)
-	gomp_init_tables (devicep, mm);
-
+      gomp_mutex_lock (&devicep->lock);
       struct splay_tree_key_s k;
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      splay_tree_key tgt_fn = splay_tree_lookup (&mm->splay_tree, &k);
+      splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
       if (tgt_fn == NULL)
-	gomp_fatal ("Target function wasn't mapped");
-
-      gomp_mutex_unlock (&mm->lock);
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  gomp_fatal ("Target function wasn't mapped");
+	}
+      gomp_mutex_unlock (&devicep->lock);
 
-      fn_addr = (void *) tgt_fn->tgt->tgt_start;
+      fn_addr = (void *) tgt_fn->tgt_offset;
     }
 
   struct target_mem_desc *tgt_vars
@@ -856,12 +1040,6 @@ GOMP_target_data (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
   struct target_mem_desc *tgt
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     false);
@@ -897,13 +1075,7 @@ GOMP_target_update (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
-  gomp_update (devicep, mm, mapnum, hostaddrs, sizes, kinds, false);
+  gomp_update (devicep, mapnum, hostaddrs, sizes, kinds, false);
 }
 
 void
@@ -972,10 +1144,10 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (get_caps);
   DLSYM (get_type);
   DLSYM (get_num_devices);
-  DLSYM (register_image);
   DLSYM (init_device);
   DLSYM (fini_device);
-  DLSYM (get_table);
+  DLSYM (load_image);
+  DLSYM (unload_image);
   DLSYM (alloc);
   DLSYM (free);
   DLSYM (dev2host);
@@ -1038,22 +1210,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   return err == NULL;
 }
 
-/* This function adds a compatible offload image IMAGE to an accelerator device
-   DEVICE.  DEVICE must be locked on entry, and remains locked on return.  */
-
-static void
-gomp_register_image_for_device (struct gomp_device_descr *device,
-				struct offload_image_descr *image)
-{
-  if (!device->offload_regions_registered
-      && (device->type == image->type
-	  || device->type == OFFLOAD_TARGET_TYPE_HOST))
-    {
-      device->register_image_func (image->host_table, image->target_data);
-      device->offload_regions_registered = true;
-    }
-}
-
 /* This function initializes the runtime needed for offloading.
    It parses the list of offload targets and tries to load the plugins for
    these targets.  On return, the variables NUM_DEVICES and NUM_DEVICES_OPENMP
@@ -1112,17 +1268,14 @@ gomp_target_init (void)
 		current_device.name = current_device.get_name_func ();
 		/* current_device.capabilities has already been set.  */
 		current_device.type = current_device.get_type_func ();
-		current_device.mem_map.is_initialized = false;
-		current_device.mem_map.splay_tree.root = NULL;
+		current_device.mem_map.root = NULL;
 		current_device.is_initialized = false;
-		current_device.offload_regions_registered = false;
 		current_device.openacc.data_environ = NULL;
 		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
 		    devices[num_devices] = current_device;
-		    gomp_mutex_init (&devices[num_devices].mem_map.lock);
 		    gomp_mutex_init (&devices[num_devices].lock);
 		    num_devices++;
 		  }
@@ -1157,21 +1310,12 @@ gomp_target_init (void)
 
   for (i = 0; i < num_devices; i++)
     {
-      int j;
-
-      for (j = 0; j < num_offload_images; j++)
-	gomp_register_image_for_device (&devices[i], &offload_images[j]);
-
       /* The 'devices' array can be moved (by the realloc call) until we have
 	 found all the plugins, so registering with the OpenACC runtime (which
 	 takes a copy of the pointer argument) must be delayed until now.  */
       if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
 	goacc_register (&devices[i]);
     }
-
-  free (offload_images);
-  offload_images = NULL;
-  num_offload_images = 0;
 }
 
 #else /* PLUGIN_SUPPORT */
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index 3e7a958..a2d61b1 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -34,6 +34,7 @@
 #include <string.h>
 #include <utility>
 #include <vector>
+#include <map>
 #include "libgomp-plugin.h"
 #include "compiler_if_host.h"
 #include "main_target_image.h"
@@ -53,6 +54,29 @@ fprintf (stderr, "\n");					    \
 #endif
 
 
+/* Start/end addresses of functions and global variables on a device.  */
+typedef std::vector<addr_pair> AddrVect;
+
+/* Addresses for one image and all devices.  */
+typedef std::vector<AddrVect> DevAddrVect;
+
+/* Addresses for all images and all devices.  */
+typedef std::map<void *, DevAddrVect> ImgDevAddrMap;
+
+
+/* Total number of available devices.  */
+static int num_devices;
+
+/* Total number of shared libraries with offloading to Intel MIC.  */
+static int num_images;
+
+/* Two dimensional array: one key is a pointer to image,
+   second key is number of device.  Contains a vector of pointer pairs.  */
+static ImgDevAddrMap *address_table;
+
+/* Thread-safe registration of the main image.  */
+static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
+
 static VarDesc vd_host2tgt = {
   { 1, 1 },		      /* dst, src			      */
   { 1, 0 },		      /* in, out			      */
@@ -90,28 +114,17 @@ static VarDesc vd_tgt2host = {
 };
 
 
-/* Total number of shared libraries with offloading to Intel MIC.  */
-static int num_libraries;
-
-/* Pointers to the descriptors, containing pointers to host-side tables and to
-   target images.  */
-static std::vector< std::pair<void *, void *> > lib_descrs;
-
-/* Thread-safe registration of the main image.  */
-static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
-
-
 /* Add path specified in LD_LIBRARY_PATH to MIC_LD_LIBRARY_PATH, which is
    required by liboffloadmic.  */
 __attribute__((constructor))
 static void
-set_mic_lib_path (void)
+init (void)
 {
   const char *ld_lib_path = getenv (LD_LIBRARY_PATH_ENV);
   const char *mic_lib_path = getenv (MIC_LD_LIBRARY_PATH_ENV);
 
   if (!ld_lib_path)
-    return;
+    goto out;
 
   if (!mic_lib_path)
     setenv (MIC_LD_LIBRARY_PATH_ENV, ld_lib_path, 1);
@@ -133,6 +146,10 @@ set_mic_lib_path (void)
       if (!use_alloca)
 	free (mic_lib_path_new);
     }
+
+out:
+  address_table = new ImgDevAddrMap;
+  num_devices = _Offload_number_of_devices ();
 }
 
 extern "C" const char *
@@ -162,18 +179,8 @@ GOMP_OFFLOAD_get_type (void)
 extern "C" int
 GOMP_OFFLOAD_get_num_devices (void)
 {
-  int res = _Offload_number_of_devices ();
-  TRACE ("(): return %d", res);
-  return res;
-}
-
-/* This should be called from every shared library with offloading.  */
-extern "C" void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_image)
-{
-  TRACE ("(host_table = %p, target_image = %p)", host_table, target_image);
-  lib_descrs.push_back (std::make_pair (host_table, target_image));
-  num_libraries++;
+  TRACE ("(): return %d", num_devices);
+  return num_devices;
 }
 
 static void
@@ -196,7 +203,8 @@ register_main_image ()
   __offload_register_image (&main_target_image);
 }
 
-/* Load offload_target_main on target.  */
+/* liboffloadmic loads and runs offload_target_main on all available devices
+   during a first call to offload ().  */
 extern "C" void
 GOMP_OFFLOAD_init_device (int device)
 {
@@ -243,9 +251,11 @@ get_target_table (int device, int &num_funcs, int &num_vars, void **&table)
     }
 }
 
+/* Offload TARGET_IMAGE to all available devices and fill address_table with
+   corresponding target addresses.  */
+
 static void
-load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
-			int &table_size)
+offload_image (void *target_image)
 {
   struct TargetImage {
     int64_t size;
@@ -254,19 +264,11 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     char data[];
   } __attribute__ ((packed));
 
-  void ***host_table_descr = (void ***) lib_descrs[lib_num].first;
-  void **host_func_start = host_table_descr[0];
-  void **host_func_end   = host_table_descr[1];
-  void **host_var_start  = host_table_descr[2];
-  void **host_var_end    = host_table_descr[3];
+  void *image_start = ((void **) target_image)[0];
+  void *image_end   = ((void **) target_image)[1];
 
-  void **target_image_descr = (void **) lib_descrs[lib_num].second;
-  void *image_start = target_image_descr[0];
-  void *image_end   = target_image_descr[1];
-
-  TRACE ("() host_table_descr { %p, %p, %p, %p }", host_func_start,
-	 host_func_end, host_var_start, host_var_end);
-  TRACE ("() target_image_descr { %p, %p }", image_start, image_end);
+  TRACE ("(target_image = %p { %p, %p })",
+	 target_image, image_start, image_end);
 
   int64_t image_size = (uintptr_t) image_end - (uintptr_t) image_start;
   TargetImage *image
@@ -279,94 +281,87 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     }
 
   image->size = image_size;
-  sprintf (image->name, "lib%010d.so", lib_num);
+  sprintf (image->name, "lib%010d.so", num_images++);
   memcpy (image->data, image_start, image->size);
 
   TRACE ("() __offload_register_image %s { %p, %d }",
 	 image->name, image_start, image->size);
   __offload_register_image (image);
 
-  int tgt_num_funcs = 0;
-  int tgt_num_vars = 0;
-  void **tgt_table = NULL;
-  get_target_table (device, tgt_num_funcs, tgt_num_vars, tgt_table);
-  free (image);
-
-  /* The func table contains only addresses, the var table contains addresses
-     and corresponding sizes.  */
-  int host_num_funcs = host_func_end - host_func_start;
-  int host_num_vars  = (host_var_end - host_var_start) / 2;
-  TRACE ("() host_num_funcs = %d, tgt_num_funcs = %d",
-	 host_num_funcs, tgt_num_funcs);
-  TRACE ("() host_num_vars = %d, tgt_num_vars = %d",
-	 host_num_vars, tgt_num_vars);
-  if (host_num_funcs != tgt_num_funcs)
+  /* Receive tables for target_image from all devices.  */
+  DevAddrVect dev_table;
+  for (int dev = 0; dev < num_devices; dev++)
     {
-      fprintf (stderr, "%s: Can't map target functions\n", __FILE__);
-      exit (1);
-    }
-  if (host_num_vars != tgt_num_vars)
-    {
-      fprintf (stderr, "%s: Can't map target variables\n", __FILE__);
-      exit (1);
-    }
+      int num_funcs = 0;
+      int num_vars = 0;
+      void **table = NULL;
 
-  table = (mapping_table *) realloc (table, (table_size + host_num_funcs
-					     + host_num_vars)
-					    * sizeof (mapping_table));
-  if (table == NULL)
-    {
-      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
-      exit (1);
-    }
+      get_target_table (dev, num_funcs, num_vars, table);
 
-  for (int i = 0; i < host_num_funcs; i++)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_func_start[i];
-      t.host_end = t.host_start + 1;
-      t.tgt_start = (uintptr_t) tgt_table[i];
-      t.tgt_end = t.tgt_start + 1;
-
-      TRACE ("() lib %d, func %d:\t0x%llx -- 0x%llx",
-	     lib_num, i, t.host_start, t.tgt_start);
-
-      table[table_size++] = t;
-    }
+      AddrVect curr_dev_table;
 
-  for (int i = 0; i < host_num_vars * 2; i += 2)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_var_start[i];
-      t.host_end = t.host_start + (uintptr_t) host_var_start[i+1];
-      t.tgt_start = (uintptr_t) tgt_table[tgt_num_funcs+i];
-      t.tgt_end = t.tgt_start + (uintptr_t) tgt_table[tgt_num_funcs+i+1];
+      for (int i = 0; i < num_funcs; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[i];
+	  tgt_addr.end = tgt_addr.start + 1;
+	  TRACE ("() func %d:\t0x%llx..0x%llx", i,
+		 tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      TRACE ("() lib %d, var %d:\t0x%llx (%d) -- 0x%llx (%d)", lib_num, i/2,
-	     t.host_start, t.host_end - t.host_start,
-	     t.tgt_start, t.tgt_end - t.tgt_start);
+      for (int i = 0; i < num_vars; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[num_funcs+i*2];
+	  tgt_addr.end = tgt_addr.start + (uintptr_t) table[num_funcs+i*2+1];
+	  TRACE ("() var %d:\t0x%llx..0x%llx", i, tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      table[table_size++] = t;
+      dev_table.push_back (curr_dev_table);
     }
 
-  delete [] tgt_table;
+  address_table->insert (std::make_pair (target_image, dev_table));
+
+  free (image);
 }
 
 extern "C" int
-GOMP_OFFLOAD_get_table (int device, void *result)
+GOMP_OFFLOAD_load_image (int device, void *target_image, addr_pair **result)
 {
-  TRACE ("(num_libraries = %d)", num_libraries);
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
 
-  mapping_table *table = NULL;
-  int table_size = 0;
+  /* If target_image is already present in address_table, then there is no need
+     to offload it.  */
+  if (address_table->count (target_image) == 0)
+    offload_image (target_image);
 
-  for (int i = 0; i < num_libraries; i++)
-    load_lib_and_get_table (device, i, table, table_size);
+  AddrVect *curr_dev_table = &(*address_table)[target_image][device];
+  int table_size = curr_dev_table->size ();
+  addr_pair *table = (addr_pair *) malloc (table_size * sizeof (addr_pair));
+  if (table == NULL)
+    {
+      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
+      exit (1);
+    }
 
-  *(void **) result = table;
+  std::copy (curr_dev_table->begin (), curr_dev_table->end (), table);
+  *result = table;
   return table_size;
 }
 
+extern "C" void
+GOMP_OFFLOAD_unload_image (int device, void *target_image)
+{
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
+
+  /* TODO: Currently liboffloadmic doesn't support __offload_unregister_image
+     for libraries.  */
+
+  address_table->erase (target_image);
+}
+
 extern "C" void *
 GOMP_OFFLOAD_alloc (int device, size_t size)
 {


  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-31 23:53                                           ` Ilya Verbin
@ 2015-04-01  5:21                                             ` Jakub Jelinek
  2015-04-01 13:14                                               ` Ilya Verbin
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2015-04-01  5:21 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Julian Brown, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Wed, Apr 01, 2015 at 02:53:28AM +0300, Ilya Verbin wrote:
> +/* Similar to gomp_fatal, but release mutexes before.  */
> +
> +static void
> +gomp_fatal_unlock (const char *fmt, ...)
> +{
> +  int i;
> +  va_list list;
> +
> +  for (i = 0; i < num_devices; i++)
> +    gomp_mutex_unlock (&devices[i].lock);

This is wrong.  Calling gomp_mutex_unlock on a lock that you don't have
locked is undefined behavior.
You really should unlock it in the caller which should be aware which 0/1/2
locks it holds.

> +  gomp_mutex_unlock (&register_lock);

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-31 16:10                                         ` Ilya Verbin
@ 2015-03-31 23:53                                           ` Ilya Verbin
  2015-04-01  5:21                                             ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-03-31 23:53 UTC (permalink / raw)
  To: Jakub Jelinek, Julian Brown; +Cc: Thomas Schwinge, gcc-patches, Kirill Yukhin

On Tue, Mar 31, 2015 at 19:10:36 +0300, Ilya Verbin wrote:
> Ok, thanks for the clarification!  Here is the new patch with variables.
> 
> Unfortunately I see 4 fails in make check-target-libgomp with PTX patch applied
> on top, but with disabled offloading to PTX.
> Julian, have you seen them?  All other tests passed with intelmic emul.
> 
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> 
> acc_on_device-1.c aborts here:
>   /* Offloaded.  */
> #pragma acc parallel
>   {
>     if (acc_on_device (acc_device_none))
>       abort ();

And here is the next version with fixed potential deadlock in
GOMP_offload_unregister.  make check-target-libgomp also passed.
(but with PTX patch make check-target-libgomp has several fails mentioned above)


diff --git a/gcc/config/i386/intelmic-mkoffload.c b/gcc/config/i386/intelmic-mkoffload.c
index f93007c..e101f93 100644
--- a/gcc/config/i386/intelmic-mkoffload.c
+++ b/gcc/config/i386/intelmic-mkoffload.c
@@ -350,14 +350,24 @@ generate_host_descr_file (const char *host_compiler)
 	   "#ifdef __cplusplus\n"
 	   "extern \"C\"\n"
 	   "#endif\n"
-	   "void GOMP_offload_register (void *, int, void *);\n\n"
+	   "void GOMP_offload_register (void *, int, void *);\n"
+	   "void GOMP_offload_unregister (void *, int, void *);\n\n"
 
 	   "__attribute__((constructor))\n"
 	   "static void\n"
 	   "init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
+	   "}\n\n", GOMP_DEVICE_INTEL_MIC);
+
+  fprintf (src_file,
+	   "__attribute__((destructor))\n"
+	   "static void\n"
+	   "fini (void)\n"
+	   "{\n"
+	   "  GOMP_offload_unregister (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
 	   "}\n", GOMP_DEVICE_INTEL_MIC);
+
   fclose (src_file);
 
   unsigned new_argc = 0;
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index d9cbff5..1072ae4 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -51,14 +51,12 @@ enum offload_target_type
   OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
 };
 
-/* Auxiliary struct, used for transferring a host-target address range mapping
-   from plugin to libgomp.  */
-struct mapping_table
+/* Auxiliary struct, used for transferring pairs of addresses from plugin
+   to libgomp.  */
+struct addr_pair
 {
-  uintptr_t host_start;
-  uintptr_t host_end;
-  uintptr_t tgt_start;
-  uintptr_t tgt_end;
+  uintptr_t start;
+  uintptr_t end;
 };
 
 /* Miscellaneous functions.  */
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3089401..a1d42c5 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -224,7 +224,6 @@ struct gomp_team_state
 };
 
 struct target_mem_desc;
-struct gomp_memory_mapping;
 
 /* These are the OpenMP 4.0 Internal Control Variables described in
    section 2.3.1.  Those described as having one copy per task are
@@ -657,7 +656,7 @@ struct target_mem_desc {
   struct gomp_device_descr *device_descr;
 
   /* Memory mapping info for the thread that created this descriptor.  */
-  struct gomp_memory_mapping *mem_map;
+  struct splay_tree_s *mem_map;
 
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
@@ -683,20 +682,6 @@ struct splay_tree_key_s {
 
 #include "splay-tree.h"
 
-/* Information about mapped memory regions (per device/context).  */
-
-struct gomp_memory_mapping
-{
-  /* Mutex for operating with the splay tree and other shared structures.  */
-  gomp_mutex_t lock;
-
-  /* True when tables have been added to this memory map.  */
-  bool is_initialized;
-
-  /* Splay tree containing information about mapped memory regions.  */
-  struct splay_tree_s splay_tree;
-};
-
 typedef struct acc_dispatch_t
 {
   /* This is a linked list of data mapped using the
@@ -773,19 +758,18 @@ struct gomp_device_descr
   unsigned int (*get_caps_func) (void);
   int (*get_type_func) (void);
   int (*get_num_devices_func) (void);
-  void (*register_image_func) (void *, void *);
   void (*init_device_func) (int);
   void (*fini_device_func) (int);
-  int (*get_table_func) (int, struct mapping_table **);
+  int (*load_image_func) (int, void *, struct addr_pair **);
+  void (*unload_image_func) (int, void *);
   void *(*alloc_func) (int, size_t);
   void (*free_func) (int, void *);
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
 
-  /* Memory-mapping info for this device instance.  */
-  /* Uses a separate lock.  */
-  struct gomp_memory_mapping mem_map;
+  /* Splay tree containing information about mapped memory regions.  */
+  struct splay_tree_s mem_map;
 
   /* Mutex for the mutable data.  */
   gomp_mutex_t lock;
@@ -793,9 +777,6 @@ struct gomp_device_descr
   /* Set to true when device is initialized.  */
   bool is_initialized;
 
-  /* True when offload regions have been registered with this device.  */
-  bool offload_regions_registered;
-
   /* OpenACC-specific data and functions.  */
   /* This is mutable because of its mutable data_environ and target_data
      members.  */
@@ -811,9 +792,7 @@ extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
 extern void gomp_copy_from_async (struct target_mem_desc *);
 extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
-extern void gomp_init_tables (struct gomp_device_descr *,
-			      struct gomp_memory_mapping *);
-extern void gomp_free_memmap (struct gomp_memory_mapping *);
+extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_fini_device (struct gomp_device_descr *);
 
 /* work.c */
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index f44174e..2b2b953 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -231,6 +231,7 @@ GOMP_4.0 {
 GOMP_4.0.1 {
   global:
 	GOMP_offload_register;
+	GOMP_offload_unregister;
 } GOMP_4.0;
 
 OACC_2.0 {
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 6aeb1e7..e4756b6 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -43,20 +43,18 @@ static struct gomp_device_descr host_dispatch =
     .get_caps_func = GOMP_OFFLOAD_get_caps,
     .get_type_func = GOMP_OFFLOAD_get_type,
     .get_num_devices_func = GOMP_OFFLOAD_get_num_devices,
-    .register_image_func = GOMP_OFFLOAD_register_image,
     .init_device_func = GOMP_OFFLOAD_init_device,
     .fini_device_func = GOMP_OFFLOAD_fini_device,
-    .get_table_func = GOMP_OFFLOAD_get_table,
+    .load_image_func = GOMP_OFFLOAD_load_image,
+    .unload_image_func = GOMP_OFFLOAD_unload_image,
     .alloc_func = GOMP_OFFLOAD_alloc,
     .free_func = GOMP_OFFLOAD_free,
     .dev2host_func = GOMP_OFFLOAD_dev2host,
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.is_initialized = false,
-    .mem_map.splay_tree.root = NULL,
+    .mem_map.root = NULL,
     .is_initialized = false,
-    .offload_regions_registered = false,
 
     .openacc = {
       .open_device_func = GOMP_OFFLOAD_openacc_open_device,
@@ -94,7 +92,6 @@ static struct gomp_device_descr host_dispatch =
 static __attribute__ ((constructor))
 void goacc_host_init (void)
 {
-  gomp_mutex_init (&host_dispatch.mem_map.lock);
   gomp_mutex_init (&host_dispatch.lock);
   goacc_register (&host_dispatch);
 }
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 166eb55..1e0243e 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -284,12 +284,6 @@ lazy_open (int ord)
     = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
 
   acc_dev->openacc.async_set_async_func (acc_async_sync);
-
-  struct gomp_memory_mapping *mem_map = &acc_dev->mem_map;
-  gomp_mutex_lock (&mem_map->lock);
-  if (!mem_map->is_initialized)
-    gomp_init_tables (acc_dev, mem_map);
-  gomp_mutex_unlock (&mem_map->lock);
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
@@ -351,10 +345,9 @@ acc_shutdown_1 (acc_device_t d)
 
 	  walk->dev->openacc.target_data = target_data = NULL;
 
-	  struct gomp_memory_mapping *mem_map = &walk->dev->mem_map;
-	  gomp_mutex_lock (&mem_map->lock);
-	  gomp_free_memmap (mem_map);
-	  gomp_mutex_unlock (&mem_map->lock);
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (&walk->dev->mem_map);
+	  gomp_mutex_unlock (&walk->dev->lock);
 
 	  walk->dev = NULL;
 	}
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 0096d51..fdc82e6 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -38,7 +38,7 @@
 /* Return block containing [H->S), or NULL if not contained.  */
 
 static splay_tree_key
-lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
+lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
 {
   struct splay_tree_key_s node;
   splay_tree_key key;
@@ -46,11 +46,9 @@ lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
   node.host_start = (uintptr_t) h;
   node.host_end = (uintptr_t) h + s;
 
-  gomp_mutex_lock (&mem_map->lock);
-
-  key = splay_tree_lookup (&mem_map->splay_tree, &node);
-
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_lock (&dev->lock);
+  key = splay_tree_lookup (&dev->mem_map, &node);
+  gomp_mutex_unlock (&dev->lock);
 
   return key;
 }
@@ -65,14 +63,11 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
 {
   int i;
   struct target_mem_desc *t;
-  struct gomp_memory_mapping *mem_map;
 
   if (!tgt)
     return NULL;
 
-  mem_map = tgt->mem_map;
-
-  gomp_mutex_lock (&mem_map->lock);
+  gomp_mutex_lock (&tgt->device_descr->lock);
 
   for (t = tgt; t != NULL; t = t->prev)
     {
@@ -80,7 +75,7 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
         break;
     }
 
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_unlock (&tgt->device_descr->lock);
 
   if (!t)
     return NULL;
@@ -176,7 +171,7 @@ acc_deviceptr (void *h)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  n = lookup_host (&thr->dev->mem_map, h, 1);
+  n = lookup_host (thr->dev, h, 1);
 
   if (!n)
     return NULL;
@@ -229,7 +224,7 @@ acc_is_present (void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   if (n && ((uintptr_t)h < n->host_start
 	    || (uintptr_t)h + s > n->host_end
@@ -271,7 +266,7 @@ acc_map_data (void *h, void *d, size_t s)
 	gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map",
                     (void *)h, (int)s, (void *)d, (int)s);
 
-      if (lookup_host (&acc_dev->mem_map, h, s))
+      if (lookup_host (acc_dev, h, s))
 	gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h,
 		    (int)s);
 
@@ -296,7 +291,7 @@ acc_unmap_data (void *h)
   /* No need to call lazy open, as the address must have been mapped.  */
 
   size_t host_size;
-  splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1);
+  splay_tree_key n = lookup_host (acc_dev, h, 1);
   struct target_mem_desc *t;
 
   if (!n)
@@ -320,7 +315,7 @@ acc_unmap_data (void *h)
       t->tgt_end = 0;
       t->to_free = 0;
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
 	   tp = t, t = t->prev)
@@ -334,7 +329,7 @@ acc_unmap_data (void *h)
 	    break;
 	  }
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   gomp_unmap_vars (t, true);
@@ -358,7 +353,7 @@ present_create_copy (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
   if (n)
     {
       /* Present. */
@@ -389,13 +384,13 @@ present_create_copy (unsigned f, void *h, size_t s)
       tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
 			   false);
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
       tgt->prev = acc_dev->openacc.data_environ;
       acc_dev->openacc.data_environ = tgt;
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   return d;
@@ -436,7 +431,7 @@ delete_copyout (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -479,7 +474,7 @@ update_dev_host (int is_dev, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -532,7 +527,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   struct target_mem_desc *t;
   int minrefs = (mapnum == 1) ? 2 : 3;
 
-  n = lookup_host (&acc_dev->mem_map, h, 1);
+  n = lookup_host (acc_dev, h, 1);
 
   if (!n)
     gomp_fatal ("%p is not a mapped block", (void *)h);
@@ -543,7 +538,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
 
   struct target_mem_desc *tp;
 
-  gomp_mutex_lock (&acc_dev->mem_map.lock);
+  gomp_mutex_lock (&acc_dev->lock);
 
   if (t->refcount == minrefs)
     {
@@ -570,7 +565,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   if (force_copyfrom)
     t->list[0]->copy_from = 1;
 
-  gomp_mutex_unlock (&acc_dev->mem_map.lock);
+  gomp_mutex_unlock (&acc_dev->lock);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 0c74f54..563f9bb 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -144,9 +144,9 @@ GOACC_parallel (int device, void (*fn) (void *),
     {
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
-      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k);
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
+      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map, &k);
+      gomp_mutex_unlock (&acc_dev->lock);
 
       if (tgt_fn_key == NULL)
 	gomp_fatal ("target function wasn't mapped");
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index ebf7f11..bc60f72 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -95,12 +95,6 @@ GOMP_OFFLOAD_get_num_devices (void)
 }
 
 STATIC void
-GOMP_OFFLOAD_register_image (void *host_table __attribute__ ((unused)),
-			     void *target_data __attribute__ ((unused)))
-{
-}
-
-STATIC void
 GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
 {
 }
@@ -111,12 +105,19 @@ GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
 }
 
 STATIC int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **table __attribute__ ((unused)))
+GOMP_OFFLOAD_load_image (int n __attribute__ ((unused)),
+			 void *i __attribute__ ((unused)),
+			 struct addr_pair **r __attribute__ ((unused)))
 {
   return 0;
 }
 
+STATIC void
+GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
+			   void *i __attribute__ ((unused)))
+{
+}
+
 STATIC void *
 GOMP_OFFLOAD_openacc_open_device (int n)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index c5dda3f..fd9ba6d 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -49,6 +49,9 @@ static void gomp_target_init (void);
 /* The whole initialization code for offloading plugins is only run one.  */
 static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT;
 
+/* Mutex for offload image registration.  */
+static gomp_mutex_t register_lock;
+
 /* This structure describes an offload image.
    It contains type of the target device, pointer to host table descriptor, and
    pointer to target data.  */
@@ -67,14 +70,29 @@ static int num_offload_images;
 /* Array of descriptors for all available devices.  */
 static struct gomp_device_descr *devices;
 
-#ifdef PLUGIN_SUPPORT
 /* Total number of available devices.  */
 static int num_devices;
-#endif
 
 /* Number of GOMP_OFFLOAD_CAP_OPENMP_400 devices.  */
 static int num_devices_openmp;
 
+/* Similar to gomp_fatal, but release mutexes before.  */
+
+static void
+gomp_fatal_unlock (const char *fmt, ...)
+{
+  int i;
+  va_list list;
+
+  for (i = 0; i < num_devices; i++)
+    gomp_mutex_unlock (&devices[i].lock);
+  gomp_mutex_unlock (&register_lock);
+
+  va_start (list, fmt);
+  gomp_vfatal (fmt, list);
+  va_end (list);
+}
+
 /* The comparison function.  */
 
 attribute_hidden int
@@ -131,10 +149,10 @@ gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn,
   if ((kind & GOMP_MAP_FLAG_FORCE)
       || oldn->host_start > newn->host_start
       || oldn->host_end < newn->host_end)
-    gomp_fatal ("Trying to map into device [%p..%p) object when "
-		"[%p..%p) is already mapped",
-		(void *) newn->host_start, (void *) newn->host_end,
-		(void *) oldn->host_start, (void *) oldn->host_end);
+    gomp_fatal_unlock ("Trying to map into device [%p..%p) object when "
+		       "[%p..%p) is already mapped",
+		       (void *) newn->host_start, (void *) newn->host_end,
+		       (void *) oldn->host_start, (void *) oldn->host_end);
   oldn->refcount++;
 }
 
@@ -153,14 +171,14 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   const int rshift = is_openacc ? 8 : 3;
   const int typemask = is_openacc ? 0xff : 0x7;
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
+  struct splay_tree_s *mem_map = &devicep->mem_map;
   struct splay_tree_key_s cur_node;
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mm;
+  tgt->mem_map = mem_map;
 
   if (mapnum == 0)
     return tgt;
@@ -174,7 +192,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       tgt_size = mapnum * sizeof (void *);
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < mapnum; i++)
     {
@@ -189,7 +207,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	cur_node.host_end = cur_node.host_start + sizes[i];
       else
 	cur_node.host_end = cur_node.host_start + sizeof (void *);
-      splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+      splay_tree_key n = splay_tree_lookup (mem_map, &cur_node);
       if (n)
 	{
 	  tgt->list[i] = n;
@@ -228,7 +246,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   if (devaddrs)
     {
       if (mapnum != 1)
-        gomp_fatal ("unexpected aggregation");
+        gomp_fatal_unlock ("unexpected aggregation");
       tgt->to_free = devaddrs[0];
       tgt->tgt_start = (uintptr_t) tgt->to_free;
       tgt->tgt_end = tgt->tgt_start + sizes[0];
@@ -274,7 +292,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      k->host_end = k->host_start + sizes[i];
 	    else
 	      k->host_end = k->host_start + sizeof (void *);
-	    splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k);
+	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n)
 	      {
 		tgt->list[i] = n;
@@ -294,7 +312,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		tgt->refcount++;
 		array->left = NULL;
 		array->right = NULL;
-		splay_tree_insert (&mm->splay_tree, array);
+		splay_tree_insert (mem_map, array);
 		switch (kind & typemask)
 		  {
 		  case GOMP_MAP_ALLOC:
@@ -332,22 +350,22 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    /* Add bias to the pointer value.  */
 		    cur_node.host_start += sizes[i];
 		    cur_node.host_end = cur_node.host_start + 1;
-		    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+		    n = splay_tree_lookup (mem_map, &cur_node);
 		    if (n == NULL)
 		      {
 			/* Could be possibly zero size array section.  */
 			cur_node.host_end--;
-			n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			n = splay_tree_lookup (mem_map, &cur_node);
 			if (n == NULL)
 			  {
 			    cur_node.host_start--;
-			    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			    n = splay_tree_lookup (mem_map, &cur_node);
 			    cur_node.host_start++;
 			  }
 		      }
 		    if (n == NULL)
-		      gomp_fatal ("Pointer target of array section "
-				  "wasn't mapped");
+		      gomp_fatal_unlock ("Pointer target of array section "
+					 "wasn't mapped");
 		    cur_node.host_start -= n->host_start;
 		    cur_node.tgt_offset = n->tgt->tgt_start + n->tgt_offset
 					  + cur_node.host_start;
@@ -400,24 +418,22 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			  /* Add bias to the pointer value.  */
 			  cur_node.host_start += sizes[j];
 			  cur_node.host_end = cur_node.host_start + 1;
-			  n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			  n = splay_tree_lookup (mem_map, &cur_node);
 			  if (n == NULL)
 			    {
 			      /* Could be possibly zero size array section.  */
 			      cur_node.host_end--;
-			      n = splay_tree_lookup (&mm->splay_tree,
-						     &cur_node);
+			      n = splay_tree_lookup (mem_map, &cur_node);
 			      if (n == NULL)
 				{
 				  cur_node.host_start--;
-				  n = splay_tree_lookup (&mm->splay_tree,
-							 &cur_node);
+				  n = splay_tree_lookup (mem_map, &cur_node);
 				  cur_node.host_start++;
 				}
 			    }
 			  if (n == NULL)
-			    gomp_fatal ("Pointer target of array section "
-					"wasn't mapped");
+			    gomp_fatal_unlock ("Pointer target of array section"
+					       " wasn't mapped");
 			  cur_node.host_start -= n->host_start;
 			  cur_node.tgt_offset = n->tgt->tgt_start
 						+ n->tgt_offset
@@ -442,14 +458,15 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			 was missing.  */
 		      size_t size = k->host_end - k->host_start;
 #ifdef HAVE_INTTYPES_H
-		      gomp_fatal ("present clause: !acc_is_present (%p, "
-				  "%"PRIu64" (0x%"PRIx64"))",
-				  (void *) k->host_start,
-				  (uint64_t) size, (uint64_t) size);
+		      gomp_fatal_unlock ("present clause: !acc_is_present (%p, "
+					 "%"PRIu64" (0x%"PRIx64"))",
+					 (void *) k->host_start,
+					 (uint64_t) size, (uint64_t) size);
 #else
-		      gomp_fatal ("present clause: !acc_is_present (%p, "
-				  "%lu (0x%lx))", (void *) k->host_start,
-				  (unsigned long) size, (unsigned long) size);
+		      gomp_fatal_unlock ("present clause: !acc_is_present (%p, "
+					 "%lu (0x%lx))", (void *) k->host_start,
+					 (unsigned long) size,
+					 (unsigned long) size);
 #endif
 		    }
 		    break;
@@ -463,8 +480,8 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 					    sizeof (void *));
 		    break;
 		  default:
-		    gomp_fatal ("%s: unhandled kind 0x%.2x", __FUNCTION__,
-				kind);
+		    gomp_fatal_unlock ("%s: unhandled kind 0x%.2x",
+				       __FUNCTION__, kind);
 		  }
 		array++;
 	      }
@@ -489,7 +506,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	}
     }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
   return tgt;
 }
 
@@ -514,10 +531,9 @@ attribute_hidden void
 gomp_copy_from_async (struct target_mem_desc *tgt)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
   size_t i;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < tgt->list_count; i++)
     if (tgt->list[i] == NULL)
@@ -536,7 +552,7 @@ gomp_copy_from_async (struct target_mem_desc *tgt)
 				  k->host_end - k->host_start);
       }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 /* Unmap variables described by TGT.  If DO_COPYFROM is true, copy relevant
@@ -547,7 +563,6 @@ attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
 
   if (tgt->list_count == 0)
     {
@@ -555,7 +570,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
       return;
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   size_t i;
   for (i = 0; i < tgt->list_count; i++)
@@ -572,7 +587,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (&mm->splay_tree, k);
+	splay_tree_remove (tgt->mem_map, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -584,13 +599,12 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
   else
     gomp_unmap_tgt (tgt);
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 static void
-gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
-	     size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds,
-	     bool is_openacc)
+gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
+	     size_t *sizes, void *kinds, bool is_openacc)
 {
   size_t i;
   struct splay_tree_key_s cur_node;
@@ -602,25 +616,24 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
   if (mapnum == 0)
     return;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
   for (i = 0; i < mapnum; i++)
     if (sizes[i])
       {
 	cur_node.host_start = (uintptr_t) hostaddrs[i];
 	cur_node.host_end = cur_node.host_start + sizes[i];
-	splay_tree_key n = splay_tree_lookup (&mm->splay_tree,
-					      &cur_node);
+	splay_tree_key n = splay_tree_lookup (&devicep->mem_map, &cur_node);
 	if (n)
 	  {
 	    int kind = get_kind (is_openacc, kinds, i);
 	    if (n->host_start > cur_node.host_start
 		|| n->host_end < cur_node.host_end)
-	      gomp_fatal ("Trying to update [%p..%p) object when"
-			  "only [%p..%p) is mapped",
-			  (void *) cur_node.host_start,
-			  (void *) cur_node.host_end,
-			  (void *) n->host_start,
-			  (void *) n->host_end);
+	      gomp_fatal_unlock ("Trying to update [%p..%p) object when "
+				 "only [%p..%p) is mapped",
+				 (void *) cur_node.host_start,
+				 (void *) cur_node.host_end,
+				 (void *) n->host_start,
+				 (void *) n->host_end);
 	    if (GOMP_MAP_COPY_TO_P (kind & typemask))
 	      devicep->host2dev_func (devicep->target_id,
 				      (void *) (n->tgt->tgt_start
@@ -639,14 +652,92 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
 				      cur_node.host_end - cur_node.host_start);
 	  }
 	else
-	  gomp_fatal ("Trying to update [%p..%p) object that is not mapped",
-		      (void *) cur_node.host_start,
-		      (void *) cur_node.host_end);
+	  gomp_fatal_unlock ("Trying to update [%p..%p) object that is not "
+			     "mapped", (void *) cur_node.host_start,
+			     (void *) cur_node.host_end);
       }
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
-/* This function should be called from every offload image.
+/* Load image pointed by TARGET_DATA to the device, specified by DEVICEP.
+   And insert to splay tree the mapping between addresses from HOST_TABLE and
+   from loaded target image.  */
+
+static void
+gomp_offload_image_to_device (struct gomp_device_descr *devicep,
+			      void *host_table, void *target_data)
+{
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  /* Load image to device and get target addresses for the image.  */
+  struct addr_pair *target_table = NULL;
+  int i, num_target_entries
+    = devicep->load_image_func (devicep->target_id, target_data, &target_table);
+
+  if (num_target_entries != num_funcs + num_vars)
+    gomp_fatal_unlock ("Can't map target functions or variables");
+
+  /* Insert host-target address mapping into splay tree.  */
+  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
+  tgt->array = gomp_malloc ((num_funcs + num_vars) * sizeof (*tgt->array));
+  tgt->refcount = 1;
+  tgt->tgt_start = 0;
+  tgt->tgt_end = 0;
+  tgt->to_free = NULL;
+  tgt->prev = NULL;
+  tgt->list_count = 0;
+  tgt->device_descr = devicep;
+  splay_tree_node array = tgt->array;
+
+  for (i = 0; i < num_funcs; i++)
+    {
+      splay_tree_key k = &array->key;
+      k->host_start = (uintptr_t) host_func_table[i];
+      k->host_end = k->host_start + 1;
+      k->tgt = tgt;
+      k->tgt_offset = target_table[i].start;
+      k->refcount = 1;
+      k->async_refcount = 0;
+      k->copy_from = false;
+      array->left = NULL;
+      array->right = NULL;
+      splay_tree_insert (&devicep->mem_map, array);
+      array++;
+    }
+
+  for (i = 0; i < num_vars; i++)
+    {
+      struct addr_pair *target_var = &target_table[num_funcs + i];
+      if (target_var->end - target_var->start
+	  != (uintptr_t) host_var_table[i * 2 + 1])
+	gomp_fatal_unlock ("Can't map target variables (size mismatch)");
+
+      splay_tree_key k = &array->key;
+      k->host_start = (uintptr_t) host_var_table[i * 2];
+      k->host_end = k->host_start + (uintptr_t) host_var_table[i * 2 + 1];
+      k->tgt = tgt;
+      k->tgt_offset = target_var->start;
+      k->refcount = 1;
+      k->async_refcount = 0;
+      k->copy_from = false;
+      array->left = NULL;
+      array->right = NULL;
+      splay_tree_insert (&devicep->mem_map, array);
+      array++;
+    }
+
+  free (target_table);
+}
+
+/* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
    the target, and TARGET_DATA needed by target plugin.  */
 
@@ -654,6 +745,20 @@ void
 GOMP_offload_register (void *host_table, enum offload_target_type target_type,
 		       void *target_data)
 {
+  int i;
+  gomp_mutex_lock (&register_lock);
+
+  /* Load image to all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type == target_type && devicep->is_initialized)
+	gomp_offload_image_to_device (devicep, host_table, target_data);
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* Insert image to array of pending images.  */
   offload_images = gomp_realloc (offload_images,
 				 (num_offload_images + 1)
 				 * sizeof (struct offload_image_descr));
@@ -663,74 +768,129 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type,
   offload_images[num_offload_images].target_data = target_data;
 
   num_offload_images++;
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* This function initializes the target device, specified by DEVICEP.  DEVICEP
-   must be locked on entry, and remains locked on return.  */
+/* This function should be called from every offload image while unloading.
+   It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
+   the target, and TARGET_DATA needed by target plugin.  */
 
-attribute_hidden void
-gomp_init_device (struct gomp_device_descr *devicep)
+void
+GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
+			 void *target_data)
 {
-  devicep->init_device_func (devicep->target_id);
-  devicep->is_initialized = true;
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+  int i;
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  gomp_mutex_lock (&register_lock);
+
+  /* Unload image from all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      int j;
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type != target_type || !devicep->is_initialized)
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  continue;
+	}
+
+      devicep->unload_image_func (devicep->target_id, target_data);
+
+      /* Remove mapping from splay tree.  */
+      struct splay_tree_key_s k;
+      splay_tree_key node = NULL;
+      if (num_funcs > 0)
+	{
+	  k.host_start = (uintptr_t) host_func_table[0];
+	  k.host_end = k.host_start + 1;
+	  node = splay_tree_lookup (&devicep->mem_map, &k);
+	}
+      else if (num_vars > 0)
+	{
+	  k.host_start = (uintptr_t) host_var_table[0];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[1];
+	  node = splay_tree_lookup (&devicep->mem_map, &k);
+	}
+
+      for (j = 0; j < num_funcs; j++)
+	{
+	  k.host_start = (uintptr_t) host_func_table[j];
+	  k.host_end = k.host_start + 1;
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+
+      for (j = 0; j < num_vars; j++)
+	{
+	  k.host_start = (uintptr_t) host_var_table[j * 2];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[j * 2 + 1];
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+
+      if (node)
+	{
+	  free (node->tgt);
+	  free (node);
+	}
+
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* Remove image from array of pending images.  */
+  for (i = 0; i < num_offload_images; i++)
+    if (offload_images[i].target_data == target_data)
+      {
+	offload_images[i] = offload_images[--num_offload_images];
+	break;
+      }
+
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* Initialize address mapping tables.  MM must be locked on entry, and remains
-   locked on return.  */
+/* This function initializes the target device, specified by DEVICEP.  DEVICEP
+   must be locked on entry, and remains locked on return.  */
 
 attribute_hidden void
-gomp_init_tables (struct gomp_device_descr *devicep,
-		  struct gomp_memory_mapping *mm)
+gomp_init_device (struct gomp_device_descr *devicep)
 {
-  /* Get address mapping table for device.  */
-  struct mapping_table *table = NULL;
-  int num_entries = devicep->get_table_func (devicep->target_id, &table);
-
-  /* Insert host-target address mapping into dev_splay_tree.  */
   int i;
-  for (i = 0; i < num_entries; i++)
+  devicep->init_device_func (devicep->target_id);
+
+  /* Load to device all images registered by the moment.  */
+  for (i = 0; i < num_offload_images; i++)
     {
-      struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
-      tgt->refcount = 1;
-      tgt->array = gomp_malloc (sizeof (*tgt->array));
-      tgt->tgt_start = table[i].tgt_start;
-      tgt->tgt_end = table[i].tgt_end;
-      tgt->to_free = NULL;
-      tgt->list_count = 0;
-      tgt->device_descr = devicep;
-      splay_tree_node node = tgt->array;
-      splay_tree_key k = &node->key;
-      k->host_start = table[i].host_start;
-      k->host_end = table[i].host_end;
-      k->tgt_offset = 0;
-      k->refcount = 1;
-      k->copy_from = false;
-      k->tgt = tgt;
-      node->left = NULL;
-      node->right = NULL;
-      splay_tree_insert (&mm->splay_tree, node);
+      struct offload_image_descr *image = &offload_images[i];
+      if (image->type == devicep->type)
+	gomp_offload_image_to_device (devicep, image->host_table,
+				      image->target_data);
     }
 
-  free (table);
-  mm->is_initialized = true;
+  devicep->is_initialized = true;
 }
 
 /* Free address mapping tables.  MM must be locked on entry, and remains locked
    on return.  */
 
 attribute_hidden void
-gomp_free_memmap (struct gomp_memory_mapping *mm)
+gomp_free_memmap (struct splay_tree_s *mem_map)
 {
-  while (mm->splay_tree.root)
+  while (mem_map->root)
     {
-      struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt;
+      struct target_mem_desc *tgt = mem_map->root->key.tgt;
 
-      splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key);
+      splay_tree_remove (mem_map, &mem_map->root->key);
       free (tgt->array);
       free (tgt);
     }
-
-  mm->is_initialized = false;
 }
 
 /* This function de-initializes the target device, specified by DEVICEP.
@@ -791,22 +951,17 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
     fn_addr = (void *) fn;
   else
     {
-      struct gomp_memory_mapping *mm = &devicep->mem_map;
-      gomp_mutex_lock (&mm->lock);
-
-      if (!mm->is_initialized)
-	gomp_init_tables (devicep, mm);
-
+      gomp_mutex_lock (&devicep->lock);
       struct splay_tree_key_s k;
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      splay_tree_key tgt_fn = splay_tree_lookup (&mm->splay_tree, &k);
+      splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
       if (tgt_fn == NULL)
-	gomp_fatal ("Target function wasn't mapped");
+	gomp_fatal_unlock ("Target function wasn't mapped");
 
-      gomp_mutex_unlock (&mm->lock);
+      gomp_mutex_unlock (&devicep->lock);
 
-      fn_addr = (void *) tgt_fn->tgt->tgt_start;
+      fn_addr = (void *) tgt_fn->tgt_offset;
     }
 
   struct target_mem_desc *tgt_vars
@@ -856,12 +1011,6 @@ GOMP_target_data (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
   struct target_mem_desc *tgt
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     false);
@@ -897,13 +1046,7 @@ GOMP_target_update (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
-  gomp_update (devicep, mm, mapnum, hostaddrs, sizes, kinds, false);
+  gomp_update (devicep, mapnum, hostaddrs, sizes, kinds, false);
 }
 
 void
@@ -972,10 +1115,10 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (get_caps);
   DLSYM (get_type);
   DLSYM (get_num_devices);
-  DLSYM (register_image);
   DLSYM (init_device);
   DLSYM (fini_device);
-  DLSYM (get_table);
+  DLSYM (load_image);
+  DLSYM (unload_image);
   DLSYM (alloc);
   DLSYM (free);
   DLSYM (dev2host);
@@ -1038,22 +1181,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   return err == NULL;
 }
 
-/* This function adds a compatible offload image IMAGE to an accelerator device
-   DEVICE.  DEVICE must be locked on entry, and remains locked on return.  */
-
-static void
-gomp_register_image_for_device (struct gomp_device_descr *device,
-				struct offload_image_descr *image)
-{
-  if (!device->offload_regions_registered
-      && (device->type == image->type
-	  || device->type == OFFLOAD_TARGET_TYPE_HOST))
-    {
-      device->register_image_func (image->host_table, image->target_data);
-      device->offload_regions_registered = true;
-    }
-}
-
 /* This function initializes the runtime needed for offloading.
    It parses the list of offload targets and tries to load the plugins for
    these targets.  On return, the variables NUM_DEVICES and NUM_DEVICES_OPENMP
@@ -1112,17 +1239,14 @@ gomp_target_init (void)
 		current_device.name = current_device.get_name_func ();
 		/* current_device.capabilities has already been set.  */
 		current_device.type = current_device.get_type_func ();
-		current_device.mem_map.is_initialized = false;
-		current_device.mem_map.splay_tree.root = NULL;
+		current_device.mem_map.root = NULL;
 		current_device.is_initialized = false;
-		current_device.offload_regions_registered = false;
 		current_device.openacc.data_environ = NULL;
 		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
 		    devices[num_devices] = current_device;
-		    gomp_mutex_init (&devices[num_devices].mem_map.lock);
 		    gomp_mutex_init (&devices[num_devices].lock);
 		    num_devices++;
 		  }
@@ -1157,21 +1281,12 @@ gomp_target_init (void)
 
   for (i = 0; i < num_devices; i++)
     {
-      int j;
-
-      for (j = 0; j < num_offload_images; j++)
-	gomp_register_image_for_device (&devices[i], &offload_images[j]);
-
       /* The 'devices' array can be moved (by the realloc call) until we have
 	 found all the plugins, so registering with the OpenACC runtime (which
 	 takes a copy of the pointer argument) must be delayed until now.  */
       if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
 	goacc_register (&devices[i]);
     }
-
-  free (offload_images);
-  offload_images = NULL;
-  num_offload_images = 0;
 }
 
 #else /* PLUGIN_SUPPORT */
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index 3e7a958..a2d61b1 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -34,6 +34,7 @@
 #include <string.h>
 #include <utility>
 #include <vector>
+#include <map>
 #include "libgomp-plugin.h"
 #include "compiler_if_host.h"
 #include "main_target_image.h"
@@ -53,6 +54,29 @@ fprintf (stderr, "\n");					    \
 #endif
 
 
+/* Start/end addresses of functions and global variables on a device.  */
+typedef std::vector<addr_pair> AddrVect;
+
+/* Addresses for one image and all devices.  */
+typedef std::vector<AddrVect> DevAddrVect;
+
+/* Addresses for all images and all devices.  */
+typedef std::map<void *, DevAddrVect> ImgDevAddrMap;
+
+
+/* Total number of available devices.  */
+static int num_devices;
+
+/* Total number of shared libraries with offloading to Intel MIC.  */
+static int num_images;
+
+/* Two dimensional array: one key is a pointer to image,
+   second key is number of device.  Contains a vector of pointer pairs.  */
+static ImgDevAddrMap *address_table;
+
+/* Thread-safe registration of the main image.  */
+static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
+
 static VarDesc vd_host2tgt = {
   { 1, 1 },		      /* dst, src			      */
   { 1, 0 },		      /* in, out			      */
@@ -90,28 +114,17 @@ static VarDesc vd_tgt2host = {
 };
 
 
-/* Total number of shared libraries with offloading to Intel MIC.  */
-static int num_libraries;
-
-/* Pointers to the descriptors, containing pointers to host-side tables and to
-   target images.  */
-static std::vector< std::pair<void *, void *> > lib_descrs;
-
-/* Thread-safe registration of the main image.  */
-static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
-
-
 /* Add path specified in LD_LIBRARY_PATH to MIC_LD_LIBRARY_PATH, which is
    required by liboffloadmic.  */
 __attribute__((constructor))
 static void
-set_mic_lib_path (void)
+init (void)
 {
   const char *ld_lib_path = getenv (LD_LIBRARY_PATH_ENV);
   const char *mic_lib_path = getenv (MIC_LD_LIBRARY_PATH_ENV);
 
   if (!ld_lib_path)
-    return;
+    goto out;
 
   if (!mic_lib_path)
     setenv (MIC_LD_LIBRARY_PATH_ENV, ld_lib_path, 1);
@@ -133,6 +146,10 @@ set_mic_lib_path (void)
       if (!use_alloca)
 	free (mic_lib_path_new);
     }
+
+out:
+  address_table = new ImgDevAddrMap;
+  num_devices = _Offload_number_of_devices ();
 }
 
 extern "C" const char *
@@ -162,18 +179,8 @@ GOMP_OFFLOAD_get_type (void)
 extern "C" int
 GOMP_OFFLOAD_get_num_devices (void)
 {
-  int res = _Offload_number_of_devices ();
-  TRACE ("(): return %d", res);
-  return res;
-}
-
-/* This should be called from every shared library with offloading.  */
-extern "C" void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_image)
-{
-  TRACE ("(host_table = %p, target_image = %p)", host_table, target_image);
-  lib_descrs.push_back (std::make_pair (host_table, target_image));
-  num_libraries++;
+  TRACE ("(): return %d", num_devices);
+  return num_devices;
 }
 
 static void
@@ -196,7 +203,8 @@ register_main_image ()
   __offload_register_image (&main_target_image);
 }
 
-/* Load offload_target_main on target.  */
+/* liboffloadmic loads and runs offload_target_main on all available devices
+   during a first call to offload ().  */
 extern "C" void
 GOMP_OFFLOAD_init_device (int device)
 {
@@ -243,9 +251,11 @@ get_target_table (int device, int &num_funcs, int &num_vars, void **&table)
     }
 }
 
+/* Offload TARGET_IMAGE to all available devices and fill address_table with
+   corresponding target addresses.  */
+
 static void
-load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
-			int &table_size)
+offload_image (void *target_image)
 {
   struct TargetImage {
     int64_t size;
@@ -254,19 +264,11 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     char data[];
   } __attribute__ ((packed));
 
-  void ***host_table_descr = (void ***) lib_descrs[lib_num].first;
-  void **host_func_start = host_table_descr[0];
-  void **host_func_end   = host_table_descr[1];
-  void **host_var_start  = host_table_descr[2];
-  void **host_var_end    = host_table_descr[3];
+  void *image_start = ((void **) target_image)[0];
+  void *image_end   = ((void **) target_image)[1];
 
-  void **target_image_descr = (void **) lib_descrs[lib_num].second;
-  void *image_start = target_image_descr[0];
-  void *image_end   = target_image_descr[1];
-
-  TRACE ("() host_table_descr { %p, %p, %p, %p }", host_func_start,
-	 host_func_end, host_var_start, host_var_end);
-  TRACE ("() target_image_descr { %p, %p }", image_start, image_end);
+  TRACE ("(target_image = %p { %p, %p })",
+	 target_image, image_start, image_end);
 
   int64_t image_size = (uintptr_t) image_end - (uintptr_t) image_start;
   TargetImage *image
@@ -279,94 +281,87 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     }
 
   image->size = image_size;
-  sprintf (image->name, "lib%010d.so", lib_num);
+  sprintf (image->name, "lib%010d.so", num_images++);
   memcpy (image->data, image_start, image->size);
 
   TRACE ("() __offload_register_image %s { %p, %d }",
 	 image->name, image_start, image->size);
   __offload_register_image (image);
 
-  int tgt_num_funcs = 0;
-  int tgt_num_vars = 0;
-  void **tgt_table = NULL;
-  get_target_table (device, tgt_num_funcs, tgt_num_vars, tgt_table);
-  free (image);
-
-  /* The func table contains only addresses, the var table contains addresses
-     and corresponding sizes.  */
-  int host_num_funcs = host_func_end - host_func_start;
-  int host_num_vars  = (host_var_end - host_var_start) / 2;
-  TRACE ("() host_num_funcs = %d, tgt_num_funcs = %d",
-	 host_num_funcs, tgt_num_funcs);
-  TRACE ("() host_num_vars = %d, tgt_num_vars = %d",
-	 host_num_vars, tgt_num_vars);
-  if (host_num_funcs != tgt_num_funcs)
+  /* Receive tables for target_image from all devices.  */
+  DevAddrVect dev_table;
+  for (int dev = 0; dev < num_devices; dev++)
     {
-      fprintf (stderr, "%s: Can't map target functions\n", __FILE__);
-      exit (1);
-    }
-  if (host_num_vars != tgt_num_vars)
-    {
-      fprintf (stderr, "%s: Can't map target variables\n", __FILE__);
-      exit (1);
-    }
+      int num_funcs = 0;
+      int num_vars = 0;
+      void **table = NULL;
 
-  table = (mapping_table *) realloc (table, (table_size + host_num_funcs
-					     + host_num_vars)
-					    * sizeof (mapping_table));
-  if (table == NULL)
-    {
-      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
-      exit (1);
-    }
+      get_target_table (dev, num_funcs, num_vars, table);
 
-  for (int i = 0; i < host_num_funcs; i++)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_func_start[i];
-      t.host_end = t.host_start + 1;
-      t.tgt_start = (uintptr_t) tgt_table[i];
-      t.tgt_end = t.tgt_start + 1;
-
-      TRACE ("() lib %d, func %d:\t0x%llx -- 0x%llx",
-	     lib_num, i, t.host_start, t.tgt_start);
-
-      table[table_size++] = t;
-    }
+      AddrVect curr_dev_table;
 
-  for (int i = 0; i < host_num_vars * 2; i += 2)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_var_start[i];
-      t.host_end = t.host_start + (uintptr_t) host_var_start[i+1];
-      t.tgt_start = (uintptr_t) tgt_table[tgt_num_funcs+i];
-      t.tgt_end = t.tgt_start + (uintptr_t) tgt_table[tgt_num_funcs+i+1];
+      for (int i = 0; i < num_funcs; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[i];
+	  tgt_addr.end = tgt_addr.start + 1;
+	  TRACE ("() func %d:\t0x%llx..0x%llx", i,
+		 tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      TRACE ("() lib %d, var %d:\t0x%llx (%d) -- 0x%llx (%d)", lib_num, i/2,
-	     t.host_start, t.host_end - t.host_start,
-	     t.tgt_start, t.tgt_end - t.tgt_start);
+      for (int i = 0; i < num_vars; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[num_funcs+i*2];
+	  tgt_addr.end = tgt_addr.start + (uintptr_t) table[num_funcs+i*2+1];
+	  TRACE ("() var %d:\t0x%llx..0x%llx", i, tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      table[table_size++] = t;
+      dev_table.push_back (curr_dev_table);
     }
 
-  delete [] tgt_table;
+  address_table->insert (std::make_pair (target_image, dev_table));
+
+  free (image);
 }
 
 extern "C" int
-GOMP_OFFLOAD_get_table (int device, void *result)
+GOMP_OFFLOAD_load_image (int device, void *target_image, addr_pair **result)
 {
-  TRACE ("(num_libraries = %d)", num_libraries);
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
 
-  mapping_table *table = NULL;
-  int table_size = 0;
+  /* If target_image is already present in address_table, then there is no need
+     to offload it.  */
+  if (address_table->count (target_image) == 0)
+    offload_image (target_image);
 
-  for (int i = 0; i < num_libraries; i++)
-    load_lib_and_get_table (device, i, table, table_size);
+  AddrVect *curr_dev_table = &(*address_table)[target_image][device];
+  int table_size = curr_dev_table->size ();
+  addr_pair *table = (addr_pair *) malloc (table_size * sizeof (addr_pair));
+  if (table == NULL)
+    {
+      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
+      exit (1);
+    }
 
-  *(void **) result = table;
+  std::copy (curr_dev_table->begin (), curr_dev_table->end (), table);
+  *result = table;
   return table_size;
 }
 
+extern "C" void
+GOMP_OFFLOAD_unload_image (int device, void *target_image)
+{
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
+
+  /* TODO: Currently liboffloadmic doesn't support __offload_unregister_image
+     for libraries.  */
+
+  address_table->erase (target_image);
+}
+
 extern "C" void *
 GOMP_OFFLOAD_alloc (int device, size_t size)
 {


  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-31 18:25                                     ` Ilya Verbin
@ 2015-03-31 19:06                                       ` Jakub Jelinek
  0 siblings, 0 replies; 30+ messages in thread
From: Jakub Jelinek @ 2015-03-31 19:06 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Thomas Schwinge, Julian Brown, gcc-patches, Kirill Yukhin

On Tue, Mar 31, 2015 at 09:25:26PM +0300, Ilya Verbin wrote:
> On Mon, Mar 30, 2015 at 18:42:02 +0200, Jakub Jelinek wrote:
> > Shouldn't either this function, or gomp_offload_image_to_device lock
> > also devicep->lock mutex and unlock at the end?
> > Where exactly I guess depends on if the devicep->* hook calls should be
> > guarded with the mutex or not.  If yes, it should be this function and
> > gomp_init_device.
> > 
> > > +      if (devicep->type != target_type || !devicep->is_initialized)
> > > +	continue;
> > > +
> > 
> > Similarly.
> 
> Oops, there is a deadlock.  E.g. if gomp_map_vars locks devicep->lock and then
> calls gomp_fatal, the destructors from .fini section are executed, so
> gomp_mutex_lock in GOMP_offload_unregister will wait for devicep->lock.

Thus perhaps before calling gomp_fatal you should release the device lock
(if held) and register_lock (ditto).

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-30 16:42                                   ` Jakub Jelinek
  2015-03-30 21:43                                     ` Julian Brown
  2015-03-31 12:52                                     ` Ilya Verbin
@ 2015-03-31 18:25                                     ` Ilya Verbin
  2015-03-31 19:06                                       ` Jakub Jelinek
  2 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-03-31 18:25 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Thomas Schwinge, Julian Brown, gcc-patches, Kirill Yukhin

On Mon, Mar 30, 2015 at 18:42:02 +0200, Jakub Jelinek wrote:
> Shouldn't either this function, or gomp_offload_image_to_device lock
> also devicep->lock mutex and unlock at the end?
> Where exactly I guess depends on if the devicep->* hook calls should be
> guarded with the mutex or not.  If yes, it should be this function and
> gomp_init_device.
> 
> > +      if (devicep->type != target_type || !devicep->is_initialized)
> > +	continue;
> > +
> 
> Similarly.

Oops, there is a deadlock.  E.g. if gomp_map_vars locks devicep->lock and then
calls gomp_fatal, the destructors from .fini section are executed, so
gomp_mutex_lock in GOMP_offload_unregister will wait for devicep->lock.

  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-31 13:08                                       ` Jakub Jelinek
@ 2015-03-31 16:10                                         ` Ilya Verbin
  2015-03-31 23:53                                           ` Ilya Verbin
  0 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-03-31 16:10 UTC (permalink / raw)
  To: Jakub Jelinek, Julian Brown; +Cc: Thomas Schwinge, gcc-patches, Kirill Yukhin

On Tue, Mar 31, 2015 at 15:07:58 +0200, Jakub Jelinek wrote:
> On Tue, Mar 31, 2015 at 03:52:06PM +0300, Ilya Verbin wrote:
> > > What is the reason to register and allocate these one at a time, rather than
> > > using one struct target_mem_desc with one tgt->array for all splay tree
> > > nodes registered from one image?
> > > Perhaps you would just use tgt_start of 0 and tgt_end of 0 too (to make it
> > > clear it is special) and just use tgt_offset relative to that (i.e.
> > > absolute), but having to malloc each node individually and having to malloc
> > > a target_mem_desc for each one sounds expensive.
> > > Everything is freed just once anyway, isn't it?
> > 
> > Here is WIP patch, does this look like what you suggested?  It works fine with
> > functions, however I'm not sure what to do with variables.  Will gomp_map_vars
> > work when tgt_start and tgt_end are equal to 0?
> 
> Can you explain what you are afraid of?  The mapped images (both their
> mapping and unmapping) are done in pairs, and in a valid program the
> addresses shouldn't be already mapped when the image is mapped in etc.
> So, for gomp_map_vars, the var allocations should just be the pre-existing
> mappings, i.e.
>       splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
>       if (n)
>         {
>           tgt->list[i] = n;
>           gomp_map_vars_existing (n, &cur_node, kind & typemask);
>         }
> case and
>   if (is_target)
>     {
>       for (i = 0; i < mapnum; i++)
>         {
>           if (tgt->list[i] == NULL)
>             cur_node.tgt_offset = (uintptr_t) NULL;
>           else
>             cur_node.tgt_offset = tgt->list[i]->tgt->tgt_start
>                                   + tgt->list[i]->tgt_offset;
>           /* FIXME: see above FIXME comment.  */
>           devicep->host2dev_func (devicep->target_id,
>                                   (void *) (tgt->tgt_start
>                                             + i * sizeof (void *)),
>                                   (void *) &cur_node.tgt_offset,
>                                   sizeof (void *));
>         }
>     }
> at the end.  tgt->list[i] will be non-NULL, tgt->list[i]->tgt->tgt_start
> will be 0, but tgt->list[i]->tgt_offset will be absolute and so should DTRT.

Ok, thanks for the clarification!  Here is the new patch with variables.

Unfortunately I see 4 fails in make check-target-libgomp with PTX patch applied
on top, but with disabled offloading to PTX.
Julian, have you seen them?  All other tests passed with intelmic emul.

FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test

acc_on_device-1.c aborts here:
  /* Offloaded.  */
#pragma acc parallel
  {
    if (acc_on_device (acc_device_none))
      abort ();


diff --git a/gcc/config/i386/intelmic-mkoffload.c b/gcc/config/i386/intelmic-mkoffload.c
index f93007c..e101f93 100644
--- a/gcc/config/i386/intelmic-mkoffload.c
+++ b/gcc/config/i386/intelmic-mkoffload.c
@@ -350,14 +350,24 @@ generate_host_descr_file (const char *host_compiler)
 	   "#ifdef __cplusplus\n"
 	   "extern \"C\"\n"
 	   "#endif\n"
-	   "void GOMP_offload_register (void *, int, void *);\n\n"
+	   "void GOMP_offload_register (void *, int, void *);\n"
+	   "void GOMP_offload_unregister (void *, int, void *);\n\n"
 
 	   "__attribute__((constructor))\n"
 	   "static void\n"
 	   "init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
+	   "}\n\n", GOMP_DEVICE_INTEL_MIC);
+
+  fprintf (src_file,
+	   "__attribute__((destructor))\n"
+	   "static void\n"
+	   "fini (void)\n"
+	   "{\n"
+	   "  GOMP_offload_unregister (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
 	   "}\n", GOMP_DEVICE_INTEL_MIC);
+
   fclose (src_file);
 
   unsigned new_argc = 0;
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index d9cbff5..1072ae4 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -51,14 +51,12 @@ enum offload_target_type
   OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
 };
 
-/* Auxiliary struct, used for transferring a host-target address range mapping
-   from plugin to libgomp.  */
-struct mapping_table
+/* Auxiliary struct, used for transferring pairs of addresses from plugin
+   to libgomp.  */
+struct addr_pair
 {
-  uintptr_t host_start;
-  uintptr_t host_end;
-  uintptr_t tgt_start;
-  uintptr_t tgt_end;
+  uintptr_t start;
+  uintptr_t end;
 };
 
 /* Miscellaneous functions.  */
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3089401..a1d42c5 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -224,7 +224,6 @@ struct gomp_team_state
 };
 
 struct target_mem_desc;
-struct gomp_memory_mapping;
 
 /* These are the OpenMP 4.0 Internal Control Variables described in
    section 2.3.1.  Those described as having one copy per task are
@@ -657,7 +656,7 @@ struct target_mem_desc {
   struct gomp_device_descr *device_descr;
 
   /* Memory mapping info for the thread that created this descriptor.  */
-  struct gomp_memory_mapping *mem_map;
+  struct splay_tree_s *mem_map;
 
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
@@ -683,20 +682,6 @@ struct splay_tree_key_s {
 
 #include "splay-tree.h"
 
-/* Information about mapped memory regions (per device/context).  */
-
-struct gomp_memory_mapping
-{
-  /* Mutex for operating with the splay tree and other shared structures.  */
-  gomp_mutex_t lock;
-
-  /* True when tables have been added to this memory map.  */
-  bool is_initialized;
-
-  /* Splay tree containing information about mapped memory regions.  */
-  struct splay_tree_s splay_tree;
-};
-
 typedef struct acc_dispatch_t
 {
   /* This is a linked list of data mapped using the
@@ -773,19 +758,18 @@ struct gomp_device_descr
   unsigned int (*get_caps_func) (void);
   int (*get_type_func) (void);
   int (*get_num_devices_func) (void);
-  void (*register_image_func) (void *, void *);
   void (*init_device_func) (int);
   void (*fini_device_func) (int);
-  int (*get_table_func) (int, struct mapping_table **);
+  int (*load_image_func) (int, void *, struct addr_pair **);
+  void (*unload_image_func) (int, void *);
   void *(*alloc_func) (int, size_t);
   void (*free_func) (int, void *);
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
 
-  /* Memory-mapping info for this device instance.  */
-  /* Uses a separate lock.  */
-  struct gomp_memory_mapping mem_map;
+  /* Splay tree containing information about mapped memory regions.  */
+  struct splay_tree_s mem_map;
 
   /* Mutex for the mutable data.  */
   gomp_mutex_t lock;
@@ -793,9 +777,6 @@ struct gomp_device_descr
   /* Set to true when device is initialized.  */
   bool is_initialized;
 
-  /* True when offload regions have been registered with this device.  */
-  bool offload_regions_registered;
-
   /* OpenACC-specific data and functions.  */
   /* This is mutable because of its mutable data_environ and target_data
      members.  */
@@ -811,9 +792,7 @@ extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
 extern void gomp_copy_from_async (struct target_mem_desc *);
 extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
-extern void gomp_init_tables (struct gomp_device_descr *,
-			      struct gomp_memory_mapping *);
-extern void gomp_free_memmap (struct gomp_memory_mapping *);
+extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_fini_device (struct gomp_device_descr *);
 
 /* work.c */
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index f44174e..2b2b953 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -231,6 +231,7 @@ GOMP_4.0 {
 GOMP_4.0.1 {
   global:
 	GOMP_offload_register;
+	GOMP_offload_unregister;
 } GOMP_4.0;
 
 OACC_2.0 {
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 6aeb1e7..e4756b6 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -43,20 +43,18 @@ static struct gomp_device_descr host_dispatch =
     .get_caps_func = GOMP_OFFLOAD_get_caps,
     .get_type_func = GOMP_OFFLOAD_get_type,
     .get_num_devices_func = GOMP_OFFLOAD_get_num_devices,
-    .register_image_func = GOMP_OFFLOAD_register_image,
     .init_device_func = GOMP_OFFLOAD_init_device,
     .fini_device_func = GOMP_OFFLOAD_fini_device,
-    .get_table_func = GOMP_OFFLOAD_get_table,
+    .load_image_func = GOMP_OFFLOAD_load_image,
+    .unload_image_func = GOMP_OFFLOAD_unload_image,
     .alloc_func = GOMP_OFFLOAD_alloc,
     .free_func = GOMP_OFFLOAD_free,
     .dev2host_func = GOMP_OFFLOAD_dev2host,
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.is_initialized = false,
-    .mem_map.splay_tree.root = NULL,
+    .mem_map.root = NULL,
     .is_initialized = false,
-    .offload_regions_registered = false,
 
     .openacc = {
       .open_device_func = GOMP_OFFLOAD_openacc_open_device,
@@ -94,7 +92,6 @@ static struct gomp_device_descr host_dispatch =
 static __attribute__ ((constructor))
 void goacc_host_init (void)
 {
-  gomp_mutex_init (&host_dispatch.mem_map.lock);
   gomp_mutex_init (&host_dispatch.lock);
   goacc_register (&host_dispatch);
 }
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 166eb55..1e0243e 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -284,12 +284,6 @@ lazy_open (int ord)
     = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
 
   acc_dev->openacc.async_set_async_func (acc_async_sync);
-
-  struct gomp_memory_mapping *mem_map = &acc_dev->mem_map;
-  gomp_mutex_lock (&mem_map->lock);
-  if (!mem_map->is_initialized)
-    gomp_init_tables (acc_dev, mem_map);
-  gomp_mutex_unlock (&mem_map->lock);
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
@@ -351,10 +345,9 @@ acc_shutdown_1 (acc_device_t d)
 
 	  walk->dev->openacc.target_data = target_data = NULL;
 
-	  struct gomp_memory_mapping *mem_map = &walk->dev->mem_map;
-	  gomp_mutex_lock (&mem_map->lock);
-	  gomp_free_memmap (mem_map);
-	  gomp_mutex_unlock (&mem_map->lock);
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (&walk->dev->mem_map);
+	  gomp_mutex_unlock (&walk->dev->lock);
 
 	  walk->dev = NULL;
 	}
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 0096d51..fdc82e6 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -38,7 +38,7 @@
 /* Return block containing [H->S), or NULL if not contained.  */
 
 static splay_tree_key
-lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
+lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
 {
   struct splay_tree_key_s node;
   splay_tree_key key;
@@ -46,11 +46,9 @@ lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
   node.host_start = (uintptr_t) h;
   node.host_end = (uintptr_t) h + s;
 
-  gomp_mutex_lock (&mem_map->lock);
-
-  key = splay_tree_lookup (&mem_map->splay_tree, &node);
-
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_lock (&dev->lock);
+  key = splay_tree_lookup (&dev->mem_map, &node);
+  gomp_mutex_unlock (&dev->lock);
 
   return key;
 }
@@ -65,14 +63,11 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
 {
   int i;
   struct target_mem_desc *t;
-  struct gomp_memory_mapping *mem_map;
 
   if (!tgt)
     return NULL;
 
-  mem_map = tgt->mem_map;
-
-  gomp_mutex_lock (&mem_map->lock);
+  gomp_mutex_lock (&tgt->device_descr->lock);
 
   for (t = tgt; t != NULL; t = t->prev)
     {
@@ -80,7 +75,7 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
         break;
     }
 
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_unlock (&tgt->device_descr->lock);
 
   if (!t)
     return NULL;
@@ -176,7 +171,7 @@ acc_deviceptr (void *h)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  n = lookup_host (&thr->dev->mem_map, h, 1);
+  n = lookup_host (thr->dev, h, 1);
 
   if (!n)
     return NULL;
@@ -229,7 +224,7 @@ acc_is_present (void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   if (n && ((uintptr_t)h < n->host_start
 	    || (uintptr_t)h + s > n->host_end
@@ -271,7 +266,7 @@ acc_map_data (void *h, void *d, size_t s)
 	gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map",
                     (void *)h, (int)s, (void *)d, (int)s);
 
-      if (lookup_host (&acc_dev->mem_map, h, s))
+      if (lookup_host (acc_dev, h, s))
 	gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h,
 		    (int)s);
 
@@ -296,7 +291,7 @@ acc_unmap_data (void *h)
   /* No need to call lazy open, as the address must have been mapped.  */
 
   size_t host_size;
-  splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1);
+  splay_tree_key n = lookup_host (acc_dev, h, 1);
   struct target_mem_desc *t;
 
   if (!n)
@@ -320,7 +315,7 @@ acc_unmap_data (void *h)
       t->tgt_end = 0;
       t->to_free = 0;
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
 	   tp = t, t = t->prev)
@@ -334,7 +329,7 @@ acc_unmap_data (void *h)
 	    break;
 	  }
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   gomp_unmap_vars (t, true);
@@ -358,7 +353,7 @@ present_create_copy (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
   if (n)
     {
       /* Present. */
@@ -389,13 +384,13 @@ present_create_copy (unsigned f, void *h, size_t s)
       tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
 			   false);
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
       tgt->prev = acc_dev->openacc.data_environ;
       acc_dev->openacc.data_environ = tgt;
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   return d;
@@ -436,7 +431,7 @@ delete_copyout (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -479,7 +474,7 @@ update_dev_host (int is_dev, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -532,7 +527,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   struct target_mem_desc *t;
   int minrefs = (mapnum == 1) ? 2 : 3;
 
-  n = lookup_host (&acc_dev->mem_map, h, 1);
+  n = lookup_host (acc_dev, h, 1);
 
   if (!n)
     gomp_fatal ("%p is not a mapped block", (void *)h);
@@ -543,7 +538,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
 
   struct target_mem_desc *tp;
 
-  gomp_mutex_lock (&acc_dev->mem_map.lock);
+  gomp_mutex_lock (&acc_dev->lock);
 
   if (t->refcount == minrefs)
     {
@@ -570,7 +565,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   if (force_copyfrom)
     t->list[0]->copy_from = 1;
 
-  gomp_mutex_unlock (&acc_dev->mem_map.lock);
+  gomp_mutex_unlock (&acc_dev->lock);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 0c74f54..563f9bb 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -144,9 +144,9 @@ GOACC_parallel (int device, void (*fn) (void *),
     {
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
-      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k);
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
+      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map, &k);
+      gomp_mutex_unlock (&acc_dev->lock);
 
       if (tgt_fn_key == NULL)
 	gomp_fatal ("target function wasn't mapped");
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index ebf7f11..bc60f72 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -95,12 +95,6 @@ GOMP_OFFLOAD_get_num_devices (void)
 }
 
 STATIC void
-GOMP_OFFLOAD_register_image (void *host_table __attribute__ ((unused)),
-			     void *target_data __attribute__ ((unused)))
-{
-}
-
-STATIC void
 GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
 {
 }
@@ -111,12 +105,19 @@ GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
 }
 
 STATIC int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **table __attribute__ ((unused)))
+GOMP_OFFLOAD_load_image (int n __attribute__ ((unused)),
+			 void *i __attribute__ ((unused)),
+			 struct addr_pair **r __attribute__ ((unused)))
 {
   return 0;
 }
 
+STATIC void
+GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
+			   void *i __attribute__ ((unused)))
+{
+}
+
 STATIC void *
 GOMP_OFFLOAD_openacc_open_device (int n)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index c5dda3f..1d08681 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -49,6 +49,9 @@ static void gomp_target_init (void);
 /* The whole initialization code for offloading plugins is only run one.  */
 static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT;
 
+/* Mutex for offload image registration.  */
+static gomp_mutex_t register_lock;
+
 /* This structure describes an offload image.
    It contains type of the target device, pointer to host table descriptor, and
    pointer to target data.  */
@@ -153,14 +156,14 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   const int rshift = is_openacc ? 8 : 3;
   const int typemask = is_openacc ? 0xff : 0x7;
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
+  struct splay_tree_s *mem_map = &devicep->mem_map;
   struct splay_tree_key_s cur_node;
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mm;
+  tgt->mem_map = mem_map;
 
   if (mapnum == 0)
     return tgt;
@@ -174,7 +177,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       tgt_size = mapnum * sizeof (void *);
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < mapnum; i++)
     {
@@ -189,7 +192,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	cur_node.host_end = cur_node.host_start + sizes[i];
       else
 	cur_node.host_end = cur_node.host_start + sizeof (void *);
-      splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+      splay_tree_key n = splay_tree_lookup (mem_map, &cur_node);
       if (n)
 	{
 	  tgt->list[i] = n;
@@ -274,7 +277,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      k->host_end = k->host_start + sizes[i];
 	    else
 	      k->host_end = k->host_start + sizeof (void *);
-	    splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k);
+	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n)
 	      {
 		tgt->list[i] = n;
@@ -294,7 +297,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		tgt->refcount++;
 		array->left = NULL;
 		array->right = NULL;
-		splay_tree_insert (&mm->splay_tree, array);
+		splay_tree_insert (mem_map, array);
 		switch (kind & typemask)
 		  {
 		  case GOMP_MAP_ALLOC:
@@ -332,16 +335,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    /* Add bias to the pointer value.  */
 		    cur_node.host_start += sizes[i];
 		    cur_node.host_end = cur_node.host_start + 1;
-		    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+		    n = splay_tree_lookup (mem_map, &cur_node);
 		    if (n == NULL)
 		      {
 			/* Could be possibly zero size array section.  */
 			cur_node.host_end--;
-			n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			n = splay_tree_lookup (mem_map, &cur_node);
 			if (n == NULL)
 			  {
 			    cur_node.host_start--;
-			    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			    n = splay_tree_lookup (mem_map, &cur_node);
 			    cur_node.host_start++;
 			  }
 		      }
@@ -400,18 +403,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			  /* Add bias to the pointer value.  */
 			  cur_node.host_start += sizes[j];
 			  cur_node.host_end = cur_node.host_start + 1;
-			  n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			  n = splay_tree_lookup (mem_map, &cur_node);
 			  if (n == NULL)
 			    {
 			      /* Could be possibly zero size array section.  */
 			      cur_node.host_end--;
-			      n = splay_tree_lookup (&mm->splay_tree,
-						     &cur_node);
+			      n = splay_tree_lookup (mem_map, &cur_node);
 			      if (n == NULL)
 				{
 				  cur_node.host_start--;
-				  n = splay_tree_lookup (&mm->splay_tree,
-							 &cur_node);
+				  n = splay_tree_lookup (mem_map, &cur_node);
 				  cur_node.host_start++;
 				}
 			    }
@@ -489,7 +490,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	}
     }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
   return tgt;
 }
 
@@ -514,10 +515,9 @@ attribute_hidden void
 gomp_copy_from_async (struct target_mem_desc *tgt)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
   size_t i;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < tgt->list_count; i++)
     if (tgt->list[i] == NULL)
@@ -536,7 +536,7 @@ gomp_copy_from_async (struct target_mem_desc *tgt)
 				  k->host_end - k->host_start);
       }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 /* Unmap variables described by TGT.  If DO_COPYFROM is true, copy relevant
@@ -547,7 +547,6 @@ attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
 
   if (tgt->list_count == 0)
     {
@@ -555,7 +554,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
       return;
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   size_t i;
   for (i = 0; i < tgt->list_count; i++)
@@ -572,7 +571,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (&mm->splay_tree, k);
+	splay_tree_remove (tgt->mem_map, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -584,13 +583,12 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
   else
     gomp_unmap_tgt (tgt);
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 static void
-gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
-	     size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds,
-	     bool is_openacc)
+gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
+	     size_t *sizes, void *kinds, bool is_openacc)
 {
   size_t i;
   struct splay_tree_key_s cur_node;
@@ -602,14 +600,13 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
   if (mapnum == 0)
     return;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
   for (i = 0; i < mapnum; i++)
     if (sizes[i])
       {
 	cur_node.host_start = (uintptr_t) hostaddrs[i];
 	cur_node.host_end = cur_node.host_start + sizes[i];
-	splay_tree_key n = splay_tree_lookup (&mm->splay_tree,
-					      &cur_node);
+	splay_tree_key n = splay_tree_lookup (&devicep->mem_map, &cur_node);
 	if (n)
 	  {
 	    int kind = get_kind (is_openacc, kinds, i);
@@ -643,10 +640,88 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
 		      (void *) cur_node.host_start,
 		      (void *) cur_node.host_end);
       }
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
+}
+
+/* Load image pointed by TARGET_DATA to the device, specified by DEVICEP.
+   And insert to splay tree the mapping between addresses from HOST_TABLE and
+   from loaded target image.  */
+
+static void
+gomp_offload_image_to_device (struct gomp_device_descr *devicep,
+			      void *host_table, void *target_data)
+{
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  /* Load image to device and get target addresses for the image.  */
+  struct addr_pair *target_table = NULL;
+  int i, num_target_entries
+    = devicep->load_image_func (devicep->target_id, target_data, &target_table);
+
+  if (num_target_entries != num_funcs + num_vars)
+    gomp_fatal ("Can't map target functions or variables");
+
+  /* Insert host-target address mapping into splay tree.  */
+  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
+  tgt->array = gomp_malloc ((num_funcs + num_vars) * sizeof (*tgt->array));
+  tgt->refcount = 1;
+  tgt->tgt_start = 0;
+  tgt->tgt_end = 0;
+  tgt->to_free = NULL;
+  tgt->prev = NULL;
+  tgt->list_count = 0;
+  tgt->device_descr = devicep;
+  splay_tree_node array = tgt->array;
+
+  for (i = 0; i < num_funcs; i++)
+    {
+      splay_tree_key k = &array->key;
+      k->host_start = (uintptr_t) host_func_table[i];
+      k->host_end = k->host_start + 1;
+      k->tgt = tgt;
+      k->tgt_offset = target_table[i].start;
+      k->refcount = 1;
+      k->async_refcount = 0;
+      k->copy_from = false;
+      array->left = NULL;
+      array->right = NULL;
+      splay_tree_insert (&devicep->mem_map, array);
+      array++;
+    }
+
+  for (i = 0; i < num_vars; i++)
+    {
+      struct addr_pair *target_var = &target_table[num_funcs + i];
+      if (target_var->end - target_var->start
+	  != (uintptr_t) host_var_table[i * 2 + 1])
+	gomp_fatal ("Can't map target variables (size mismatch)");
+
+      splay_tree_key k = &array->key;
+      k->host_start = (uintptr_t) host_var_table[i * 2];
+      k->host_end = k->host_start + (uintptr_t) host_var_table[i * 2 + 1];
+      k->tgt = tgt;
+      k->tgt_offset = target_var->start;
+      k->refcount = 1;
+      k->async_refcount = 0;
+      k->copy_from = false;
+      array->left = NULL;
+      array->right = NULL;
+      splay_tree_insert (&devicep->mem_map, array);
+      array++;
+    }
+
+  free (target_table);
 }
 
-/* This function should be called from every offload image.
+/* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
    the target, and TARGET_DATA needed by target plugin.  */
 
@@ -654,6 +729,20 @@ void
 GOMP_offload_register (void *host_table, enum offload_target_type target_type,
 		       void *target_data)
 {
+  int i;
+  gomp_mutex_lock (&register_lock);
+
+  /* Load image to all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type == target_type && devicep->is_initialized)
+	gomp_offload_image_to_device (devicep, host_table, target_data);
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* Insert image to array of pending images.  */
   offload_images = gomp_realloc (offload_images,
 				 (num_offload_images + 1)
 				 * sizeof (struct offload_image_descr));
@@ -663,74 +752,129 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type,
   offload_images[num_offload_images].target_data = target_data;
 
   num_offload_images++;
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* This function initializes the target device, specified by DEVICEP.  DEVICEP
-   must be locked on entry, and remains locked on return.  */
+/* This function should be called from every offload image while unloading.
+   It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
+   the target, and TARGET_DATA needed by target plugin.  */
 
-attribute_hidden void
-gomp_init_device (struct gomp_device_descr *devicep)
+void
+GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
+			 void *target_data)
 {
-  devicep->init_device_func (devicep->target_id);
-  devicep->is_initialized = true;
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+  int i;
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  gomp_mutex_lock (&register_lock);
+
+  /* Unload image from all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      int j;
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type != target_type || !devicep->is_initialized)
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  continue;
+	}
+
+      devicep->unload_image_func (devicep->target_id, target_data);
+
+      /* Remove mapping from splay tree.  */
+      struct splay_tree_key_s k;
+      splay_tree_key node = NULL;
+      if (num_funcs > 0)
+	{
+	  k.host_start = (uintptr_t) host_func_table[0];
+	  k.host_end = k.host_start + 1;
+	  node = splay_tree_lookup (&devicep->mem_map, &k);
+	}
+      else if (num_vars > 0)
+	{
+	  k.host_start = (uintptr_t) host_var_table[0];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[1];
+	  node = splay_tree_lookup (&devicep->mem_map, &k);
+	}
+
+      for (j = 0; j < num_funcs; j++)
+	{
+	  k.host_start = (uintptr_t) host_func_table[j];
+	  k.host_end = k.host_start + 1;
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+
+      for (j = 0; j < num_vars; j++)
+	{
+	  k.host_start = (uintptr_t) host_var_table[j * 2];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[j * 2 + 1];
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+
+      if (node)
+	{
+	  free (node->tgt);
+	  free (node);
+	}
+
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* Remove image from array of pending images.  */
+  for (i = 0; i < num_offload_images; i++)
+    if (offload_images[i].target_data == target_data)
+      {
+	offload_images[i] = offload_images[--num_offload_images];
+	break;
+      }
+
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* Initialize address mapping tables.  MM must be locked on entry, and remains
-   locked on return.  */
+/* This function initializes the target device, specified by DEVICEP.  DEVICEP
+   must be locked on entry, and remains locked on return.  */
 
 attribute_hidden void
-gomp_init_tables (struct gomp_device_descr *devicep,
-		  struct gomp_memory_mapping *mm)
+gomp_init_device (struct gomp_device_descr *devicep)
 {
-  /* Get address mapping table for device.  */
-  struct mapping_table *table = NULL;
-  int num_entries = devicep->get_table_func (devicep->target_id, &table);
-
-  /* Insert host-target address mapping into dev_splay_tree.  */
   int i;
-  for (i = 0; i < num_entries; i++)
+  devicep->init_device_func (devicep->target_id);
+
+  /* Load to device all images registered by the moment.  */
+  for (i = 0; i < num_offload_images; i++)
     {
-      struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
-      tgt->refcount = 1;
-      tgt->array = gomp_malloc (sizeof (*tgt->array));
-      tgt->tgt_start = table[i].tgt_start;
-      tgt->tgt_end = table[i].tgt_end;
-      tgt->to_free = NULL;
-      tgt->list_count = 0;
-      tgt->device_descr = devicep;
-      splay_tree_node node = tgt->array;
-      splay_tree_key k = &node->key;
-      k->host_start = table[i].host_start;
-      k->host_end = table[i].host_end;
-      k->tgt_offset = 0;
-      k->refcount = 1;
-      k->copy_from = false;
-      k->tgt = tgt;
-      node->left = NULL;
-      node->right = NULL;
-      splay_tree_insert (&mm->splay_tree, node);
+      struct offload_image_descr *image = &offload_images[i];
+      if (image->type == devicep->type)
+	gomp_offload_image_to_device (devicep, image->host_table,
+				      image->target_data);
     }
 
-  free (table);
-  mm->is_initialized = true;
+  devicep->is_initialized = true;
 }
 
 /* Free address mapping tables.  MM must be locked on entry, and remains locked
    on return.  */
 
 attribute_hidden void
-gomp_free_memmap (struct gomp_memory_mapping *mm)
+gomp_free_memmap (struct splay_tree_s *mem_map)
 {
-  while (mm->splay_tree.root)
+  while (mem_map->root)
     {
-      struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt;
+      struct target_mem_desc *tgt = mem_map->root->key.tgt;
 
-      splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key);
+      splay_tree_remove (mem_map, &mem_map->root->key);
       free (tgt->array);
       free (tgt);
     }
-
-  mm->is_initialized = false;
 }
 
 /* This function de-initializes the target device, specified by DEVICEP.
@@ -791,22 +935,17 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
     fn_addr = (void *) fn;
   else
     {
-      struct gomp_memory_mapping *mm = &devicep->mem_map;
-      gomp_mutex_lock (&mm->lock);
-
-      if (!mm->is_initialized)
-	gomp_init_tables (devicep, mm);
-
+      gomp_mutex_lock (&devicep->lock);
       struct splay_tree_key_s k;
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      splay_tree_key tgt_fn = splay_tree_lookup (&mm->splay_tree, &k);
+      splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
       if (tgt_fn == NULL)
 	gomp_fatal ("Target function wasn't mapped");
 
-      gomp_mutex_unlock (&mm->lock);
+      gomp_mutex_unlock (&devicep->lock);
 
-      fn_addr = (void *) tgt_fn->tgt->tgt_start;
+      fn_addr = (void *) tgt_fn->tgt_offset;
     }
 
   struct target_mem_desc *tgt_vars
@@ -856,12 +995,6 @@ GOMP_target_data (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
   struct target_mem_desc *tgt
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     false);
@@ -897,13 +1030,7 @@ GOMP_target_update (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
-  gomp_update (devicep, mm, mapnum, hostaddrs, sizes, kinds, false);
+  gomp_update (devicep, mapnum, hostaddrs, sizes, kinds, false);
 }
 
 void
@@ -972,10 +1099,10 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (get_caps);
   DLSYM (get_type);
   DLSYM (get_num_devices);
-  DLSYM (register_image);
   DLSYM (init_device);
   DLSYM (fini_device);
-  DLSYM (get_table);
+  DLSYM (load_image);
+  DLSYM (unload_image);
   DLSYM (alloc);
   DLSYM (free);
   DLSYM (dev2host);
@@ -1038,22 +1165,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   return err == NULL;
 }
 
-/* This function adds a compatible offload image IMAGE to an accelerator device
-   DEVICE.  DEVICE must be locked on entry, and remains locked on return.  */
-
-static void
-gomp_register_image_for_device (struct gomp_device_descr *device,
-				struct offload_image_descr *image)
-{
-  if (!device->offload_regions_registered
-      && (device->type == image->type
-	  || device->type == OFFLOAD_TARGET_TYPE_HOST))
-    {
-      device->register_image_func (image->host_table, image->target_data);
-      device->offload_regions_registered = true;
-    }
-}
-
 /* This function initializes the runtime needed for offloading.
    It parses the list of offload targets and tries to load the plugins for
    these targets.  On return, the variables NUM_DEVICES and NUM_DEVICES_OPENMP
@@ -1112,17 +1223,14 @@ gomp_target_init (void)
 		current_device.name = current_device.get_name_func ();
 		/* current_device.capabilities has already been set.  */
 		current_device.type = current_device.get_type_func ();
-		current_device.mem_map.is_initialized = false;
-		current_device.mem_map.splay_tree.root = NULL;
+		current_device.mem_map.root = NULL;
 		current_device.is_initialized = false;
-		current_device.offload_regions_registered = false;
 		current_device.openacc.data_environ = NULL;
 		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
 		    devices[num_devices] = current_device;
-		    gomp_mutex_init (&devices[num_devices].mem_map.lock);
 		    gomp_mutex_init (&devices[num_devices].lock);
 		    num_devices++;
 		  }
@@ -1157,21 +1265,12 @@ gomp_target_init (void)
 
   for (i = 0; i < num_devices; i++)
     {
-      int j;
-
-      for (j = 0; j < num_offload_images; j++)
-	gomp_register_image_for_device (&devices[i], &offload_images[j]);
-
       /* The 'devices' array can be moved (by the realloc call) until we have
 	 found all the plugins, so registering with the OpenACC runtime (which
 	 takes a copy of the pointer argument) must be delayed until now.  */
       if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
 	goacc_register (&devices[i]);
     }
-
-  free (offload_images);
-  offload_images = NULL;
-  num_offload_images = 0;
 }
 
 #else /* PLUGIN_SUPPORT */
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index 3e7a958..a2d61b1 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -34,6 +34,7 @@
 #include <string.h>
 #include <utility>
 #include <vector>
+#include <map>
 #include "libgomp-plugin.h"
 #include "compiler_if_host.h"
 #include "main_target_image.h"
@@ -53,6 +54,29 @@ fprintf (stderr, "\n");					    \
 #endif
 
 
+/* Start/end addresses of functions and global variables on a device.  */
+typedef std::vector<addr_pair> AddrVect;
+
+/* Addresses for one image and all devices.  */
+typedef std::vector<AddrVect> DevAddrVect;
+
+/* Addresses for all images and all devices.  */
+typedef std::map<void *, DevAddrVect> ImgDevAddrMap;
+
+
+/* Total number of available devices.  */
+static int num_devices;
+
+/* Total number of shared libraries with offloading to Intel MIC.  */
+static int num_images;
+
+/* Two dimensional array: one key is a pointer to image,
+   second key is number of device.  Contains a vector of pointer pairs.  */
+static ImgDevAddrMap *address_table;
+
+/* Thread-safe registration of the main image.  */
+static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
+
 static VarDesc vd_host2tgt = {
   { 1, 1 },		      /* dst, src			      */
   { 1, 0 },		      /* in, out			      */
@@ -90,28 +114,17 @@ static VarDesc vd_tgt2host = {
 };
 
 
-/* Total number of shared libraries with offloading to Intel MIC.  */
-static int num_libraries;
-
-/* Pointers to the descriptors, containing pointers to host-side tables and to
-   target images.  */
-static std::vector< std::pair<void *, void *> > lib_descrs;
-
-/* Thread-safe registration of the main image.  */
-static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
-
-
 /* Add path specified in LD_LIBRARY_PATH to MIC_LD_LIBRARY_PATH, which is
    required by liboffloadmic.  */
 __attribute__((constructor))
 static void
-set_mic_lib_path (void)
+init (void)
 {
   const char *ld_lib_path = getenv (LD_LIBRARY_PATH_ENV);
   const char *mic_lib_path = getenv (MIC_LD_LIBRARY_PATH_ENV);
 
   if (!ld_lib_path)
-    return;
+    goto out;
 
   if (!mic_lib_path)
     setenv (MIC_LD_LIBRARY_PATH_ENV, ld_lib_path, 1);
@@ -133,6 +146,10 @@ set_mic_lib_path (void)
       if (!use_alloca)
 	free (mic_lib_path_new);
     }
+
+out:
+  address_table = new ImgDevAddrMap;
+  num_devices = _Offload_number_of_devices ();
 }
 
 extern "C" const char *
@@ -162,18 +179,8 @@ GOMP_OFFLOAD_get_type (void)
 extern "C" int
 GOMP_OFFLOAD_get_num_devices (void)
 {
-  int res = _Offload_number_of_devices ();
-  TRACE ("(): return %d", res);
-  return res;
-}
-
-/* This should be called from every shared library with offloading.  */
-extern "C" void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_image)
-{
-  TRACE ("(host_table = %p, target_image = %p)", host_table, target_image);
-  lib_descrs.push_back (std::make_pair (host_table, target_image));
-  num_libraries++;
+  TRACE ("(): return %d", num_devices);
+  return num_devices;
 }
 
 static void
@@ -196,7 +203,8 @@ register_main_image ()
   __offload_register_image (&main_target_image);
 }
 
-/* Load offload_target_main on target.  */
+/* liboffloadmic loads and runs offload_target_main on all available devices
+   during a first call to offload ().  */
 extern "C" void
 GOMP_OFFLOAD_init_device (int device)
 {
@@ -243,9 +251,11 @@ get_target_table (int device, int &num_funcs, int &num_vars, void **&table)
     }
 }
 
+/* Offload TARGET_IMAGE to all available devices and fill address_table with
+   corresponding target addresses.  */
+
 static void
-load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
-			int &table_size)
+offload_image (void *target_image)
 {
   struct TargetImage {
     int64_t size;
@@ -254,19 +264,11 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     char data[];
   } __attribute__ ((packed));
 
-  void ***host_table_descr = (void ***) lib_descrs[lib_num].first;
-  void **host_func_start = host_table_descr[0];
-  void **host_func_end   = host_table_descr[1];
-  void **host_var_start  = host_table_descr[2];
-  void **host_var_end    = host_table_descr[3];
+  void *image_start = ((void **) target_image)[0];
+  void *image_end   = ((void **) target_image)[1];
 
-  void **target_image_descr = (void **) lib_descrs[lib_num].second;
-  void *image_start = target_image_descr[0];
-  void *image_end   = target_image_descr[1];
-
-  TRACE ("() host_table_descr { %p, %p, %p, %p }", host_func_start,
-	 host_func_end, host_var_start, host_var_end);
-  TRACE ("() target_image_descr { %p, %p }", image_start, image_end);
+  TRACE ("(target_image = %p { %p, %p })",
+	 target_image, image_start, image_end);
 
   int64_t image_size = (uintptr_t) image_end - (uintptr_t) image_start;
   TargetImage *image
@@ -279,94 +281,87 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     }
 
   image->size = image_size;
-  sprintf (image->name, "lib%010d.so", lib_num);
+  sprintf (image->name, "lib%010d.so", num_images++);
   memcpy (image->data, image_start, image->size);
 
   TRACE ("() __offload_register_image %s { %p, %d }",
 	 image->name, image_start, image->size);
   __offload_register_image (image);
 
-  int tgt_num_funcs = 0;
-  int tgt_num_vars = 0;
-  void **tgt_table = NULL;
-  get_target_table (device, tgt_num_funcs, tgt_num_vars, tgt_table);
-  free (image);
-
-  /* The func table contains only addresses, the var table contains addresses
-     and corresponding sizes.  */
-  int host_num_funcs = host_func_end - host_func_start;
-  int host_num_vars  = (host_var_end - host_var_start) / 2;
-  TRACE ("() host_num_funcs = %d, tgt_num_funcs = %d",
-	 host_num_funcs, tgt_num_funcs);
-  TRACE ("() host_num_vars = %d, tgt_num_vars = %d",
-	 host_num_vars, tgt_num_vars);
-  if (host_num_funcs != tgt_num_funcs)
+  /* Receive tables for target_image from all devices.  */
+  DevAddrVect dev_table;
+  for (int dev = 0; dev < num_devices; dev++)
     {
-      fprintf (stderr, "%s: Can't map target functions\n", __FILE__);
-      exit (1);
-    }
-  if (host_num_vars != tgt_num_vars)
-    {
-      fprintf (stderr, "%s: Can't map target variables\n", __FILE__);
-      exit (1);
-    }
+      int num_funcs = 0;
+      int num_vars = 0;
+      void **table = NULL;
 
-  table = (mapping_table *) realloc (table, (table_size + host_num_funcs
-					     + host_num_vars)
-					    * sizeof (mapping_table));
-  if (table == NULL)
-    {
-      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
-      exit (1);
-    }
+      get_target_table (dev, num_funcs, num_vars, table);
 
-  for (int i = 0; i < host_num_funcs; i++)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_func_start[i];
-      t.host_end = t.host_start + 1;
-      t.tgt_start = (uintptr_t) tgt_table[i];
-      t.tgt_end = t.tgt_start + 1;
-
-      TRACE ("() lib %d, func %d:\t0x%llx -- 0x%llx",
-	     lib_num, i, t.host_start, t.tgt_start);
-
-      table[table_size++] = t;
-    }
+      AddrVect curr_dev_table;
 
-  for (int i = 0; i < host_num_vars * 2; i += 2)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_var_start[i];
-      t.host_end = t.host_start + (uintptr_t) host_var_start[i+1];
-      t.tgt_start = (uintptr_t) tgt_table[tgt_num_funcs+i];
-      t.tgt_end = t.tgt_start + (uintptr_t) tgt_table[tgt_num_funcs+i+1];
+      for (int i = 0; i < num_funcs; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[i];
+	  tgt_addr.end = tgt_addr.start + 1;
+	  TRACE ("() func %d:\t0x%llx..0x%llx", i,
+		 tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      TRACE ("() lib %d, var %d:\t0x%llx (%d) -- 0x%llx (%d)", lib_num, i/2,
-	     t.host_start, t.host_end - t.host_start,
-	     t.tgt_start, t.tgt_end - t.tgt_start);
+      for (int i = 0; i < num_vars; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[num_funcs+i*2];
+	  tgt_addr.end = tgt_addr.start + (uintptr_t) table[num_funcs+i*2+1];
+	  TRACE ("() var %d:\t0x%llx..0x%llx", i, tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      table[table_size++] = t;
+      dev_table.push_back (curr_dev_table);
     }
 
-  delete [] tgt_table;
+  address_table->insert (std::make_pair (target_image, dev_table));
+
+  free (image);
 }
 
 extern "C" int
-GOMP_OFFLOAD_get_table (int device, void *result)
+GOMP_OFFLOAD_load_image (int device, void *target_image, addr_pair **result)
 {
-  TRACE ("(num_libraries = %d)", num_libraries);
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
 
-  mapping_table *table = NULL;
-  int table_size = 0;
+  /* If target_image is already present in address_table, then there is no need
+     to offload it.  */
+  if (address_table->count (target_image) == 0)
+    offload_image (target_image);
 
-  for (int i = 0; i < num_libraries; i++)
-    load_lib_and_get_table (device, i, table, table_size);
+  AddrVect *curr_dev_table = &(*address_table)[target_image][device];
+  int table_size = curr_dev_table->size ();
+  addr_pair *table = (addr_pair *) malloc (table_size * sizeof (addr_pair));
+  if (table == NULL)
+    {
+      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
+      exit (1);
+    }
 
-  *(void **) result = table;
+  std::copy (curr_dev_table->begin (), curr_dev_table->end (), table);
+  *result = table;
   return table_size;
 }
 
+extern "C" void
+GOMP_OFFLOAD_unload_image (int device, void *target_image)
+{
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
+
+  /* TODO: Currently liboffloadmic doesn't support __offload_unregister_image
+     for libraries.  */
+
+  address_table->erase (target_image);
+}
+
 extern "C" void *
 GOMP_OFFLOAD_alloc (int device, size_t size)
 {


  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-31 12:52                                     ` Ilya Verbin
@ 2015-03-31 13:08                                       ` Jakub Jelinek
  2015-03-31 16:10                                         ` Ilya Verbin
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2015-03-31 13:08 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Thomas Schwinge, Julian Brown, gcc-patches, Kirill Yukhin

On Tue, Mar 31, 2015 at 03:52:06PM +0300, Ilya Verbin wrote:
> > What is the reason to register and allocate these one at a time, rather than
> > using one struct target_mem_desc with one tgt->array for all splay tree
> > nodes registered from one image?
> > Perhaps you would just use tgt_start of 0 and tgt_end of 0 too (to make it
> > clear it is special) and just use tgt_offset relative to that (i.e.
> > absolute), but having to malloc each node individually and having to malloc
> > a target_mem_desc for each one sounds expensive.
> > Everything is freed just once anyway, isn't it?
> 
> Here is WIP patch, does this look like what you suggested?  It works fine with
> functions, however I'm not sure what to do with variables.  Will gomp_map_vars
> work when tgt_start and tgt_end are equal to 0?

Can you explain what you are afraid of?  The mapped images (both their
mapping and unmapping) are done in pairs, and in a valid program the
addresses shouldn't be already mapped when the image is mapped in etc.
So, for gomp_map_vars, the var allocations should just be the pre-existing
mappings, i.e.
      splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
      if (n)
        {
          tgt->list[i] = n;
          gomp_map_vars_existing (n, &cur_node, kind & typemask);
        }
case and
  if (is_target)
    {
      for (i = 0; i < mapnum; i++)
        {
          if (tgt->list[i] == NULL)
            cur_node.tgt_offset = (uintptr_t) NULL;
          else
            cur_node.tgt_offset = tgt->list[i]->tgt->tgt_start
                                  + tgt->list[i]->tgt_offset;
          /* FIXME: see above FIXME comment.  */
          devicep->host2dev_func (devicep->target_id,
                                  (void *) (tgt->tgt_start
                                            + i * sizeof (void *)),
                                  (void *) &cur_node.tgt_offset,
                                  sizeof (void *));
        }
    }
at the end.  tgt->list[i] will be non-NULL, tgt->list[i]->tgt->tgt_start
will be 0, but tgt->list[i]->tgt_offset will be absolute and so should DTRT.

> +  for (i = 0; i < num_vars; i++)
> +    {
> +      struct addr_pair host_addr;
> +      host_addr.start = (uintptr_t) host_var_table[i*2];
> +      host_addr.end = host_addr.start + (uintptr_t) host_var_table[i*2+1];

Formatting, spaces around + or *.  But, as said earlier, I don't see why
this wouldn't work for variables too.

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-30 16:42                                   ` Jakub Jelinek
  2015-03-30 21:43                                     ` Julian Brown
@ 2015-03-31 12:52                                     ` Ilya Verbin
  2015-03-31 13:08                                       ` Jakub Jelinek
  2015-03-31 18:25                                     ` Ilya Verbin
  2 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-03-31 12:52 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Thomas Schwinge, Julian Brown, gcc-patches, Kirill Yukhin

On Mon, Mar 30, 2015 at 22:42:51 +0100, Julian Brown wrote:
> On Mon, 30 Mar 2015 18:42:02 +0200
> Jakub Jelinek <jakub@redhat.com> wrote:
> > But the one Julian posted doesn't apply on top of your patch.
> > If there is any interdiff needed on top of your patch, can it be
> > posted against trunk + your patch?
> 
> Here's a version of my patch against trunk and Ilya's latest patch
> (hopefully!). Tests look OK (libgomp + PTX).

Thanks for rebasing!

On Mon, Mar 30, 2015 at 18:42:02 +0200, Jakub Jelinek wrote:
> > +/* Insert mapping of host -> target address pairs to splay tree.  */
> > +
> > +static void
> > +gomp_splay_tree_insert_mapping (struct gomp_device_descr *devicep,
> > +				struct addr_pair *host_addr,
> > +				struct addr_pair *tgt_addr)
> > +{
> > +  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
> > +  tgt->refcount = 1;
> > +  tgt->array = gomp_malloc (sizeof (*tgt->array));
> > +  tgt->tgt_start = tgt_addr->start;
> > +  tgt->tgt_end = tgt_addr->end;
> > +  tgt->to_free = NULL;
> > +  tgt->list_count = 0;
> > +  tgt->device_descr = devicep;
> > +  splay_tree_node node = tgt->array;
> > +  splay_tree_key k = &node->key;
> > +  k->host_start = host_addr->start;
> > +  k->host_end = host_addr->end;
> > +  k->tgt_offset = 0;
> > +  k->refcount = 1;
> > +  k->copy_from = false;
> > +  k->tgt = tgt;
> > +  node->left = NULL;
> > +  node->right = NULL;
> > +  splay_tree_insert (&devicep->mem_map, node);
> > +}
> 
> What is the reason to register and allocate these one at a time, rather than
> using one struct target_mem_desc with one tgt->array for all splay tree
> nodes registered from one image?
> Perhaps you would just use tgt_start of 0 and tgt_end of 0 too (to make it
> clear it is special) and just use tgt_offset relative to that (i.e.
> absolute), but having to malloc each node individually and having to malloc
> a target_mem_desc for each one sounds expensive.
> Everything is freed just once anyway, isn't it?

Here is WIP patch, does this look like what you suggested?  It works fine with
functions, however I'm not sure what to do with variables.  Will gomp_map_vars
work when tgt_start and tgt_end are equal to 0?

> > @@ -654,6 +727,18 @@ void
> >  GOMP_offload_register (void *host_table, enum offload_target_type target_type,
> >  		       void *target_data)
> >  {
> > +  int i;
> > +  gomp_mutex_lock (&register_lock);
> > +
> > +  /* Load image to all initialized devices.  */
> > +  for (i = 0; i < num_devices; i++)
> > +    {
> > +      struct gomp_device_descr *devicep = &devices[i];
> > +      if (devicep->type == target_type && devicep->is_initialized)
> > +	gomp_offload_image_to_device (devicep, host_table, target_data);
> 
> Shouldn't either this function, or gomp_offload_image_to_device lock
> also devicep->lock mutex and unlock at the end?
> Where exactly I guess depends on if the devicep->* hook calls should be
> guarded with the mutex or not.  If yes, it should be this function and
> gomp_init_device.
> 
> > +      if (devicep->type != target_type || !devicep->is_initialized)
> > +	continue;
> > +
> 
> Similarly.

I've added lock/unlock to GOMP_offload_register and GOMP_offload_unregister.
All calls to gomp_init_device were already guarded.

> > +      devicep->unload_image_func (devicep->target_id, target_data);
> > +
> > +      /* Remove mapping from splay tree.  */
> > +      for (j = 0; j < num_funcs; j++)
> > +	{
> > +	  struct splay_tree_key_s k;
> > +	  k.host_start = (uintptr_t) host_func_table[j];
> > +	  k.host_end = k.host_start + 1;
> > +	  splay_tree_remove (&devicep->mem_map, &k);
> > +	}
> > +
> > +      for (j = 0; j < num_vars; j++)
> > +	{
> > +	  struct splay_tree_key_s k;
> > +	  k.host_start = (uintptr_t) host_var_table[j*2];
> > +	  k.host_end = k.host_start + (uintptr_t) host_var_table[j*2+1];
> > +	  splay_tree_remove (&devicep->mem_map, &k);
> > +	}
> > +    }
> 
> Aren't you leaking here all the tgt->array and tgt allocations here?
> Though, if you change it to just two allocations (one tgt, one array),
> you'd need to free just once.

You're right.  I've fixed this for functions, variables are WIP.


diff --git a/gcc/config/i386/intelmic-mkoffload.c b/gcc/config/i386/intelmic-mkoffload.c
index f93007c..e101f93 100644
--- a/gcc/config/i386/intelmic-mkoffload.c
+++ b/gcc/config/i386/intelmic-mkoffload.c
@@ -350,14 +350,24 @@ generate_host_descr_file (const char *host_compiler)
 	   "#ifdef __cplusplus\n"
 	   "extern \"C\"\n"
 	   "#endif\n"
-	   "void GOMP_offload_register (void *, int, void *);\n\n"
+	   "void GOMP_offload_register (void *, int, void *);\n"
+	   "void GOMP_offload_unregister (void *, int, void *);\n\n"
 
 	   "__attribute__((constructor))\n"
 	   "static void\n"
 	   "init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
+	   "}\n\n", GOMP_DEVICE_INTEL_MIC);
+
+  fprintf (src_file,
+	   "__attribute__((destructor))\n"
+	   "static void\n"
+	   "fini (void)\n"
+	   "{\n"
+	   "  GOMP_offload_unregister (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
 	   "}\n", GOMP_DEVICE_INTEL_MIC);
+
   fclose (src_file);
 
   unsigned new_argc = 0;
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index d9cbff5..1072ae4 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -51,14 +51,12 @@ enum offload_target_type
   OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
 };
 
-/* Auxiliary struct, used for transferring a host-target address range mapping
-   from plugin to libgomp.  */
-struct mapping_table
+/* Auxiliary struct, used for transferring pairs of addresses from plugin
+   to libgomp.  */
+struct addr_pair
 {
-  uintptr_t host_start;
-  uintptr_t host_end;
-  uintptr_t tgt_start;
-  uintptr_t tgt_end;
+  uintptr_t start;
+  uintptr_t end;
 };
 
 /* Miscellaneous functions.  */
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3089401..a1d42c5 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -224,7 +224,6 @@ struct gomp_team_state
 };
 
 struct target_mem_desc;
-struct gomp_memory_mapping;
 
 /* These are the OpenMP 4.0 Internal Control Variables described in
    section 2.3.1.  Those described as having one copy per task are
@@ -657,7 +656,7 @@ struct target_mem_desc {
   struct gomp_device_descr *device_descr;
 
   /* Memory mapping info for the thread that created this descriptor.  */
-  struct gomp_memory_mapping *mem_map;
+  struct splay_tree_s *mem_map;
 
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
@@ -683,20 +682,6 @@ struct splay_tree_key_s {
 
 #include "splay-tree.h"
 
-/* Information about mapped memory regions (per device/context).  */
-
-struct gomp_memory_mapping
-{
-  /* Mutex for operating with the splay tree and other shared structures.  */
-  gomp_mutex_t lock;
-
-  /* True when tables have been added to this memory map.  */
-  bool is_initialized;
-
-  /* Splay tree containing information about mapped memory regions.  */
-  struct splay_tree_s splay_tree;
-};
-
 typedef struct acc_dispatch_t
 {
   /* This is a linked list of data mapped using the
@@ -773,19 +758,18 @@ struct gomp_device_descr
   unsigned int (*get_caps_func) (void);
   int (*get_type_func) (void);
   int (*get_num_devices_func) (void);
-  void (*register_image_func) (void *, void *);
   void (*init_device_func) (int);
   void (*fini_device_func) (int);
-  int (*get_table_func) (int, struct mapping_table **);
+  int (*load_image_func) (int, void *, struct addr_pair **);
+  void (*unload_image_func) (int, void *);
   void *(*alloc_func) (int, size_t);
   void (*free_func) (int, void *);
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
 
-  /* Memory-mapping info for this device instance.  */
-  /* Uses a separate lock.  */
-  struct gomp_memory_mapping mem_map;
+  /* Splay tree containing information about mapped memory regions.  */
+  struct splay_tree_s mem_map;
 
   /* Mutex for the mutable data.  */
   gomp_mutex_t lock;
@@ -793,9 +777,6 @@ struct gomp_device_descr
   /* Set to true when device is initialized.  */
   bool is_initialized;
 
-  /* True when offload regions have been registered with this device.  */
-  bool offload_regions_registered;
-
   /* OpenACC-specific data and functions.  */
   /* This is mutable because of its mutable data_environ and target_data
      members.  */
@@ -811,9 +792,7 @@ extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
 extern void gomp_copy_from_async (struct target_mem_desc *);
 extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
-extern void gomp_init_tables (struct gomp_device_descr *,
-			      struct gomp_memory_mapping *);
-extern void gomp_free_memmap (struct gomp_memory_mapping *);
+extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_fini_device (struct gomp_device_descr *);
 
 /* work.c */
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index f44174e..2b2b953 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -231,6 +231,7 @@ GOMP_4.0 {
 GOMP_4.0.1 {
   global:
 	GOMP_offload_register;
+	GOMP_offload_unregister;
 } GOMP_4.0;
 
 OACC_2.0 {
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 6aeb1e7..e4756b6 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -43,20 +43,18 @@ static struct gomp_device_descr host_dispatch =
     .get_caps_func = GOMP_OFFLOAD_get_caps,
     .get_type_func = GOMP_OFFLOAD_get_type,
     .get_num_devices_func = GOMP_OFFLOAD_get_num_devices,
-    .register_image_func = GOMP_OFFLOAD_register_image,
     .init_device_func = GOMP_OFFLOAD_init_device,
     .fini_device_func = GOMP_OFFLOAD_fini_device,
-    .get_table_func = GOMP_OFFLOAD_get_table,
+    .load_image_func = GOMP_OFFLOAD_load_image,
+    .unload_image_func = GOMP_OFFLOAD_unload_image,
     .alloc_func = GOMP_OFFLOAD_alloc,
     .free_func = GOMP_OFFLOAD_free,
     .dev2host_func = GOMP_OFFLOAD_dev2host,
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.is_initialized = false,
-    .mem_map.splay_tree.root = NULL,
+    .mem_map.root = NULL,
     .is_initialized = false,
-    .offload_regions_registered = false,
 
     .openacc = {
       .open_device_func = GOMP_OFFLOAD_openacc_open_device,
@@ -94,7 +92,6 @@ static struct gomp_device_descr host_dispatch =
 static __attribute__ ((constructor))
 void goacc_host_init (void)
 {
-  gomp_mutex_init (&host_dispatch.mem_map.lock);
   gomp_mutex_init (&host_dispatch.lock);
   goacc_register (&host_dispatch);
 }
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 166eb55..1e0243e 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -284,12 +284,6 @@ lazy_open (int ord)
     = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
 
   acc_dev->openacc.async_set_async_func (acc_async_sync);
-
-  struct gomp_memory_mapping *mem_map = &acc_dev->mem_map;
-  gomp_mutex_lock (&mem_map->lock);
-  if (!mem_map->is_initialized)
-    gomp_init_tables (acc_dev, mem_map);
-  gomp_mutex_unlock (&mem_map->lock);
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
@@ -351,10 +345,9 @@ acc_shutdown_1 (acc_device_t d)
 
 	  walk->dev->openacc.target_data = target_data = NULL;
 
-	  struct gomp_memory_mapping *mem_map = &walk->dev->mem_map;
-	  gomp_mutex_lock (&mem_map->lock);
-	  gomp_free_memmap (mem_map);
-	  gomp_mutex_unlock (&mem_map->lock);
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (&walk->dev->mem_map);
+	  gomp_mutex_unlock (&walk->dev->lock);
 
 	  walk->dev = NULL;
 	}
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 0096d51..fdc82e6 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -38,7 +38,7 @@
 /* Return block containing [H->S), or NULL if not contained.  */
 
 static splay_tree_key
-lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
+lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
 {
   struct splay_tree_key_s node;
   splay_tree_key key;
@@ -46,11 +46,9 @@ lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
   node.host_start = (uintptr_t) h;
   node.host_end = (uintptr_t) h + s;
 
-  gomp_mutex_lock (&mem_map->lock);
-
-  key = splay_tree_lookup (&mem_map->splay_tree, &node);
-
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_lock (&dev->lock);
+  key = splay_tree_lookup (&dev->mem_map, &node);
+  gomp_mutex_unlock (&dev->lock);
 
   return key;
 }
@@ -65,14 +63,11 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
 {
   int i;
   struct target_mem_desc *t;
-  struct gomp_memory_mapping *mem_map;
 
   if (!tgt)
     return NULL;
 
-  mem_map = tgt->mem_map;
-
-  gomp_mutex_lock (&mem_map->lock);
+  gomp_mutex_lock (&tgt->device_descr->lock);
 
   for (t = tgt; t != NULL; t = t->prev)
     {
@@ -80,7 +75,7 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
         break;
     }
 
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_unlock (&tgt->device_descr->lock);
 
   if (!t)
     return NULL;
@@ -176,7 +171,7 @@ acc_deviceptr (void *h)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  n = lookup_host (&thr->dev->mem_map, h, 1);
+  n = lookup_host (thr->dev, h, 1);
 
   if (!n)
     return NULL;
@@ -229,7 +224,7 @@ acc_is_present (void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   if (n && ((uintptr_t)h < n->host_start
 	    || (uintptr_t)h + s > n->host_end
@@ -271,7 +266,7 @@ acc_map_data (void *h, void *d, size_t s)
 	gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map",
                     (void *)h, (int)s, (void *)d, (int)s);
 
-      if (lookup_host (&acc_dev->mem_map, h, s))
+      if (lookup_host (acc_dev, h, s))
 	gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h,
 		    (int)s);
 
@@ -296,7 +291,7 @@ acc_unmap_data (void *h)
   /* No need to call lazy open, as the address must have been mapped.  */
 
   size_t host_size;
-  splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1);
+  splay_tree_key n = lookup_host (acc_dev, h, 1);
   struct target_mem_desc *t;
 
   if (!n)
@@ -320,7 +315,7 @@ acc_unmap_data (void *h)
       t->tgt_end = 0;
       t->to_free = 0;
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
 	   tp = t, t = t->prev)
@@ -334,7 +329,7 @@ acc_unmap_data (void *h)
 	    break;
 	  }
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   gomp_unmap_vars (t, true);
@@ -358,7 +353,7 @@ present_create_copy (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
   if (n)
     {
       /* Present. */
@@ -389,13 +384,13 @@ present_create_copy (unsigned f, void *h, size_t s)
       tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
 			   false);
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
       tgt->prev = acc_dev->openacc.data_environ;
       acc_dev->openacc.data_environ = tgt;
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   return d;
@@ -436,7 +431,7 @@ delete_copyout (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -479,7 +474,7 @@ update_dev_host (int is_dev, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -532,7 +527,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   struct target_mem_desc *t;
   int minrefs = (mapnum == 1) ? 2 : 3;
 
-  n = lookup_host (&acc_dev->mem_map, h, 1);
+  n = lookup_host (acc_dev, h, 1);
 
   if (!n)
     gomp_fatal ("%p is not a mapped block", (void *)h);
@@ -543,7 +538,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
 
   struct target_mem_desc *tp;
 
-  gomp_mutex_lock (&acc_dev->mem_map.lock);
+  gomp_mutex_lock (&acc_dev->lock);
 
   if (t->refcount == minrefs)
     {
@@ -570,7 +565,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   if (force_copyfrom)
     t->list[0]->copy_from = 1;
 
-  gomp_mutex_unlock (&acc_dev->mem_map.lock);
+  gomp_mutex_unlock (&acc_dev->lock);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 0c74f54..563f9bb 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -144,9 +144,9 @@ GOACC_parallel (int device, void (*fn) (void *),
     {
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
-      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k);
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
+      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map, &k);
+      gomp_mutex_unlock (&acc_dev->lock);
 
       if (tgt_fn_key == NULL)
 	gomp_fatal ("target function wasn't mapped");
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index ebf7f11..bc60f72 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -95,12 +95,6 @@ GOMP_OFFLOAD_get_num_devices (void)
 }
 
 STATIC void
-GOMP_OFFLOAD_register_image (void *host_table __attribute__ ((unused)),
-			     void *target_data __attribute__ ((unused)))
-{
-}
-
-STATIC void
 GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
 {
 }
@@ -111,12 +105,19 @@ GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
 }
 
 STATIC int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **table __attribute__ ((unused)))
+GOMP_OFFLOAD_load_image (int n __attribute__ ((unused)),
+			 void *i __attribute__ ((unused)),
+			 struct addr_pair **r __attribute__ ((unused)))
 {
   return 0;
 }
 
+STATIC void
+GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
+			   void *i __attribute__ ((unused)))
+{
+}
+
 STATIC void *
 GOMP_OFFLOAD_openacc_open_device (int n)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index c5dda3f..418a2f5 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -49,6 +49,9 @@ static void gomp_target_init (void);
 /* The whole initialization code for offloading plugins is only run one.  */
 static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT;
 
+/* Mutex for offload image registration.  */
+static gomp_mutex_t register_lock;
+
 /* This structure describes an offload image.
    It contains type of the target device, pointer to host table descriptor, and
    pointer to target data.  */
@@ -153,14 +156,14 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   const int rshift = is_openacc ? 8 : 3;
   const int typemask = is_openacc ? 0xff : 0x7;
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
+  struct splay_tree_s *mem_map = &devicep->mem_map;
   struct splay_tree_key_s cur_node;
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mm;
+  tgt->mem_map = mem_map;
 
   if (mapnum == 0)
     return tgt;
@@ -174,7 +177,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       tgt_size = mapnum * sizeof (void *);
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < mapnum; i++)
     {
@@ -189,7 +192,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	cur_node.host_end = cur_node.host_start + sizes[i];
       else
 	cur_node.host_end = cur_node.host_start + sizeof (void *);
-      splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+      splay_tree_key n = splay_tree_lookup (mem_map, &cur_node);
       if (n)
 	{
 	  tgt->list[i] = n;
@@ -274,7 +277,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      k->host_end = k->host_start + sizes[i];
 	    else
 	      k->host_end = k->host_start + sizeof (void *);
-	    splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k);
+	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n)
 	      {
 		tgt->list[i] = n;
@@ -294,7 +297,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		tgt->refcount++;
 		array->left = NULL;
 		array->right = NULL;
-		splay_tree_insert (&mm->splay_tree, array);
+		splay_tree_insert (mem_map, array);
 		switch (kind & typemask)
 		  {
 		  case GOMP_MAP_ALLOC:
@@ -332,16 +335,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    /* Add bias to the pointer value.  */
 		    cur_node.host_start += sizes[i];
 		    cur_node.host_end = cur_node.host_start + 1;
-		    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+		    n = splay_tree_lookup (mem_map, &cur_node);
 		    if (n == NULL)
 		      {
 			/* Could be possibly zero size array section.  */
 			cur_node.host_end--;
-			n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			n = splay_tree_lookup (mem_map, &cur_node);
 			if (n == NULL)
 			  {
 			    cur_node.host_start--;
-			    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			    n = splay_tree_lookup (mem_map, &cur_node);
 			    cur_node.host_start++;
 			  }
 		      }
@@ -400,18 +403,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			  /* Add bias to the pointer value.  */
 			  cur_node.host_start += sizes[j];
 			  cur_node.host_end = cur_node.host_start + 1;
-			  n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			  n = splay_tree_lookup (mem_map, &cur_node);
 			  if (n == NULL)
 			    {
 			      /* Could be possibly zero size array section.  */
 			      cur_node.host_end--;
-			      n = splay_tree_lookup (&mm->splay_tree,
-						     &cur_node);
+			      n = splay_tree_lookup (mem_map, &cur_node);
 			      if (n == NULL)
 				{
 				  cur_node.host_start--;
-				  n = splay_tree_lookup (&mm->splay_tree,
-							 &cur_node);
+				  n = splay_tree_lookup (mem_map, &cur_node);
 				  cur_node.host_start++;
 				}
 			    }
@@ -489,7 +490,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	}
     }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
   return tgt;
 }
 
@@ -514,10 +515,9 @@ attribute_hidden void
 gomp_copy_from_async (struct target_mem_desc *tgt)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
   size_t i;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < tgt->list_count; i++)
     if (tgt->list[i] == NULL)
@@ -536,7 +536,7 @@ gomp_copy_from_async (struct target_mem_desc *tgt)
 				  k->host_end - k->host_start);
       }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 /* Unmap variables described by TGT.  If DO_COPYFROM is true, copy relevant
@@ -547,7 +547,6 @@ attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
 
   if (tgt->list_count == 0)
     {
@@ -555,7 +554,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
       return;
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   size_t i;
   for (i = 0; i < tgt->list_count; i++)
@@ -572,7 +571,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (&mm->splay_tree, k);
+	splay_tree_remove (tgt->mem_map, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -584,13 +583,12 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
   else
     gomp_unmap_tgt (tgt);
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 static void
-gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
-	     size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds,
-	     bool is_openacc)
+gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
+	     size_t *sizes, void *kinds, bool is_openacc)
 {
   size_t i;
   struct splay_tree_key_s cur_node;
@@ -602,14 +600,13 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
   if (mapnum == 0)
     return;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
   for (i = 0; i < mapnum; i++)
     if (sizes[i])
       {
 	cur_node.host_start = (uintptr_t) hostaddrs[i];
 	cur_node.host_end = cur_node.host_start + sizes[i];
-	splay_tree_key n = splay_tree_lookup (&mm->splay_tree,
-					      &cur_node);
+	splay_tree_key n = splay_tree_lookup (&devicep->mem_map, &cur_node);
 	if (n)
 	  {
 	    int kind = get_kind (is_openacc, kinds, i);
@@ -643,10 +640,105 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
 		      (void *) cur_node.host_start,
 		      (void *) cur_node.host_end);
       }
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
+}
+
+
+/* Insert mapping of host -> target address pairs to splay tree.  */
+
+static void
+gomp_splay_tree_insert_mapping (struct gomp_device_descr *devicep,
+				struct addr_pair *host_addr,
+				struct addr_pair *tgt_addr)
+{
+  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
+  tgt->refcount = 1;
+  tgt->array = gomp_malloc (sizeof (*tgt->array));
+  tgt->tgt_start = tgt_addr->start;
+  tgt->tgt_end = tgt_addr->end;
+  tgt->to_free = NULL;
+  tgt->list_count = 0;
+  tgt->device_descr = devicep;
+  splay_tree_node node = tgt->array;
+  splay_tree_key k = &node->key;
+  k->host_start = host_addr->start;
+  k->host_end = host_addr->end;
+  k->tgt_offset = 0;
+  k->refcount = 1;
+  k->copy_from = false;
+  k->tgt = tgt;
+  node->left = NULL;
+  node->right = NULL;
+  splay_tree_insert (&devicep->mem_map, node);
+}
+
+/* Load image pointed by TARGET_DATA to the device, specified by DEVICEP.
+   And insert to splay tree the mapping between addresses from HOST_TABLE and
+   from loaded target image.  */
+
+static void
+gomp_offload_image_to_device (struct gomp_device_descr *devicep,
+			      void *host_table, void *target_data)
+{
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  /* Load image to device and get target addresses for the image.  */
+  struct addr_pair *target_table = NULL;
+  int i, num_target_entries
+    = devicep->load_image_func (devicep->target_id, target_data, &target_table);
+
+  if (num_target_entries != num_funcs + num_vars)
+    gomp_fatal ("Can't map target functions or variables");
+
+  /* Insert host-target address mapping into splay tree.  */
+  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
+  tgt->array = gomp_malloc (num_funcs * sizeof (*tgt->array));
+  tgt->refcount = 1;
+  tgt->tgt_start = 0;
+  tgt->tgt_end = 0;
+  tgt->to_free = NULL;
+  tgt->prev = NULL;
+  tgt->list_count = 0;
+  tgt->device_descr = devicep;
+
+  splay_tree_node array = tgt->array;
+  for (i = 0; i < num_funcs; i++)
+    {
+      splay_tree_key k = &array->key;
+      k->host_start = (uintptr_t) host_func_table[i];
+      k->host_end = k->host_start + 1;
+      k->tgt = tgt;
+      k->tgt_offset = target_table[i].start;
+      k->refcount = 1;
+      k->async_refcount = 0;
+      k->copy_from = false;
+      array->left = NULL;
+      array->right = NULL;
+      splay_tree_insert (&devicep->mem_map, array);
+      array++;
+    }
+
+  for (i = 0; i < num_vars; i++)
+    {
+      struct addr_pair host_addr;
+      host_addr.start = (uintptr_t) host_var_table[i*2];
+      host_addr.end = host_addr.start + (uintptr_t) host_var_table[i*2+1];
+      gomp_splay_tree_insert_mapping (devicep, &host_addr,
+				      &target_table[num_funcs+i]);
+    }
+
+  free (target_table);
 }
 
-/* This function should be called from every offload image.
+/* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
    the target, and TARGET_DATA needed by target plugin.  */
 
@@ -654,6 +746,20 @@ void
 GOMP_offload_register (void *host_table, enum offload_target_type target_type,
 		       void *target_data)
 {
+  int i;
+  gomp_mutex_lock (&register_lock);
+
+  /* Load image to all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type == target_type && devicep->is_initialized)
+	gomp_offload_image_to_device (devicep, host_table, target_data);
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* Insert image to array of pending images.  */
   offload_images = gomp_realloc (offload_images,
 				 (num_offload_images + 1)
 				 * sizeof (struct offload_image_descr));
@@ -663,74 +769,122 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type,
   offload_images[num_offload_images].target_data = target_data;
 
   num_offload_images++;
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* This function initializes the target device, specified by DEVICEP.  DEVICEP
-   must be locked on entry, and remains locked on return.  */
+/* This function should be called from every offload image while unloading.
+   It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
+   the target, and TARGET_DATA needed by target plugin.  */
 
-attribute_hidden void
-gomp_init_device (struct gomp_device_descr *devicep)
+void
+GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
+			 void *target_data)
 {
-  devicep->init_device_func (devicep->target_id);
-  devicep->is_initialized = true;
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+  int i;
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  gomp_mutex_lock (&register_lock);
+
+  /* Unload image from all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      int j;
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type != target_type || !devicep->is_initialized)
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  continue;
+	}
+
+      devicep->unload_image_func (devicep->target_id, target_data);
+
+      /* Remove mapping from splay tree.  */
+      struct splay_tree_key_s k;
+      splay_tree_key node = NULL;
+      if (num_funcs > 0)
+	{
+	  k.host_start = (uintptr_t) host_func_table[0];
+	  k.host_end = k.host_start + 1;
+	  node = splay_tree_lookup (&devicep->mem_map, &k);
+	}
+      for (j = 0; j < num_funcs; j++)
+	{
+	  k.host_start = (uintptr_t) host_func_table[j];
+	  k.host_end = k.host_start + 1;
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+      if (node)
+	{
+	  free (node->tgt);
+	  free (node);
+	}
+
+      for (j = 0; j < num_vars; j++)
+	{
+	  k.host_start = (uintptr_t) host_var_table[j*2];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[j*2+1];
+	  splay_tree_remove (&devicep->mem_map, &k);
+	  /* FIXME: free tgt->array and tgt.  */
+	}
+
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* Remove image from array of pending images.  */
+  for (i = 0; i < num_offload_images; i++)
+    if (offload_images[i].target_data == target_data)
+      {
+	offload_images[i] = offload_images[--num_offload_images];
+	break;
+      }
+
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* Initialize address mapping tables.  MM must be locked on entry, and remains
-   locked on return.  */
+/* This function initializes the target device, specified by DEVICEP.  DEVICEP
+   must be locked on entry, and remains locked on return.  */
 
 attribute_hidden void
-gomp_init_tables (struct gomp_device_descr *devicep,
-		  struct gomp_memory_mapping *mm)
+gomp_init_device (struct gomp_device_descr *devicep)
 {
-  /* Get address mapping table for device.  */
-  struct mapping_table *table = NULL;
-  int num_entries = devicep->get_table_func (devicep->target_id, &table);
-
-  /* Insert host-target address mapping into dev_splay_tree.  */
   int i;
-  for (i = 0; i < num_entries; i++)
+  devicep->init_device_func (devicep->target_id);
+
+  /* Load to device all images registered by the moment.  */
+  for (i = 0; i < num_offload_images; i++)
     {
-      struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
-      tgt->refcount = 1;
-      tgt->array = gomp_malloc (sizeof (*tgt->array));
-      tgt->tgt_start = table[i].tgt_start;
-      tgt->tgt_end = table[i].tgt_end;
-      tgt->to_free = NULL;
-      tgt->list_count = 0;
-      tgt->device_descr = devicep;
-      splay_tree_node node = tgt->array;
-      splay_tree_key k = &node->key;
-      k->host_start = table[i].host_start;
-      k->host_end = table[i].host_end;
-      k->tgt_offset = 0;
-      k->refcount = 1;
-      k->copy_from = false;
-      k->tgt = tgt;
-      node->left = NULL;
-      node->right = NULL;
-      splay_tree_insert (&mm->splay_tree, node);
+      struct offload_image_descr *image = &offload_images[i];
+      if (image->type == devicep->type)
+	gomp_offload_image_to_device (devicep, image->host_table,
+				      image->target_data);
     }
 
-  free (table);
-  mm->is_initialized = true;
+  devicep->is_initialized = true;
 }
 
 /* Free address mapping tables.  MM must be locked on entry, and remains locked
    on return.  */
 
 attribute_hidden void
-gomp_free_memmap (struct gomp_memory_mapping *mm)
+gomp_free_memmap (struct splay_tree_s *mem_map)
 {
-  while (mm->splay_tree.root)
+  while (mem_map->root)
     {
-      struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt;
+      struct target_mem_desc *tgt = mem_map->root->key.tgt;
 
-      splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key);
+      splay_tree_remove (mem_map, &mem_map->root->key);
       free (tgt->array);
       free (tgt);
     }
-
-  mm->is_initialized = false;
 }
 
 /* This function de-initializes the target device, specified by DEVICEP.
@@ -791,22 +945,17 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
     fn_addr = (void *) fn;
   else
     {
-      struct gomp_memory_mapping *mm = &devicep->mem_map;
-      gomp_mutex_lock (&mm->lock);
-
-      if (!mm->is_initialized)
-	gomp_init_tables (devicep, mm);
-
+      gomp_mutex_lock (&devicep->lock);
       struct splay_tree_key_s k;
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      splay_tree_key tgt_fn = splay_tree_lookup (&mm->splay_tree, &k);
+      splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
       if (tgt_fn == NULL)
 	gomp_fatal ("Target function wasn't mapped");
 
-      gomp_mutex_unlock (&mm->lock);
+      gomp_mutex_unlock (&devicep->lock);
 
-      fn_addr = (void *) tgt_fn->tgt->tgt_start;
+      fn_addr = (void *) tgt_fn->tgt_offset;
     }
 
   struct target_mem_desc *tgt_vars
@@ -856,12 +1005,6 @@ GOMP_target_data (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
   struct target_mem_desc *tgt
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     false);
@@ -897,13 +1040,7 @@ GOMP_target_update (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
-  gomp_update (devicep, mm, mapnum, hostaddrs, sizes, kinds, false);
+  gomp_update (devicep, mapnum, hostaddrs, sizes, kinds, false);
 }
 
 void
@@ -972,10 +1109,10 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (get_caps);
   DLSYM (get_type);
   DLSYM (get_num_devices);
-  DLSYM (register_image);
   DLSYM (init_device);
   DLSYM (fini_device);
-  DLSYM (get_table);
+  DLSYM (load_image);
+  DLSYM (unload_image);
   DLSYM (alloc);
   DLSYM (free);
   DLSYM (dev2host);
@@ -1038,22 +1175,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   return err == NULL;
 }
 
-/* This function adds a compatible offload image IMAGE to an accelerator device
-   DEVICE.  DEVICE must be locked on entry, and remains locked on return.  */
-
-static void
-gomp_register_image_for_device (struct gomp_device_descr *device,
-				struct offload_image_descr *image)
-{
-  if (!device->offload_regions_registered
-      && (device->type == image->type
-	  || device->type == OFFLOAD_TARGET_TYPE_HOST))
-    {
-      device->register_image_func (image->host_table, image->target_data);
-      device->offload_regions_registered = true;
-    }
-}
-
 /* This function initializes the runtime needed for offloading.
    It parses the list of offload targets and tries to load the plugins for
    these targets.  On return, the variables NUM_DEVICES and NUM_DEVICES_OPENMP
@@ -1112,17 +1233,14 @@ gomp_target_init (void)
 		current_device.name = current_device.get_name_func ();
 		/* current_device.capabilities has already been set.  */
 		current_device.type = current_device.get_type_func ();
-		current_device.mem_map.is_initialized = false;
-		current_device.mem_map.splay_tree.root = NULL;
+		current_device.mem_map.root = NULL;
 		current_device.is_initialized = false;
-		current_device.offload_regions_registered = false;
 		current_device.openacc.data_environ = NULL;
 		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
 		    devices[num_devices] = current_device;
-		    gomp_mutex_init (&devices[num_devices].mem_map.lock);
 		    gomp_mutex_init (&devices[num_devices].lock);
 		    num_devices++;
 		  }
@@ -1157,21 +1275,12 @@ gomp_target_init (void)
 
   for (i = 0; i < num_devices; i++)
     {
-      int j;
-
-      for (j = 0; j < num_offload_images; j++)
-	gomp_register_image_for_device (&devices[i], &offload_images[j]);
-
       /* The 'devices' array can be moved (by the realloc call) until we have
 	 found all the plugins, so registering with the OpenACC runtime (which
 	 takes a copy of the pointer argument) must be delayed until now.  */
       if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
 	goacc_register (&devices[i]);
     }
-
-  free (offload_images);
-  offload_images = NULL;
-  num_offload_images = 0;
 }
 
 #else /* PLUGIN_SUPPORT */
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index 3e7a958..a2d61b1 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -34,6 +34,7 @@
 #include <string.h>
 #include <utility>
 #include <vector>
+#include <map>
 #include "libgomp-plugin.h"
 #include "compiler_if_host.h"
 #include "main_target_image.h"
@@ -53,6 +54,29 @@ fprintf (stderr, "\n");					    \
 #endif
 
 
+/* Start/end addresses of functions and global variables on a device.  */
+typedef std::vector<addr_pair> AddrVect;
+
+/* Addresses for one image and all devices.  */
+typedef std::vector<AddrVect> DevAddrVect;
+
+/* Addresses for all images and all devices.  */
+typedef std::map<void *, DevAddrVect> ImgDevAddrMap;
+
+
+/* Total number of available devices.  */
+static int num_devices;
+
+/* Total number of shared libraries with offloading to Intel MIC.  */
+static int num_images;
+
+/* Two dimensional array: one key is a pointer to image,
+   second key is number of device.  Contains a vector of pointer pairs.  */
+static ImgDevAddrMap *address_table;
+
+/* Thread-safe registration of the main image.  */
+static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
+
 static VarDesc vd_host2tgt = {
   { 1, 1 },		      /* dst, src			      */
   { 1, 0 },		      /* in, out			      */
@@ -90,28 +114,17 @@ static VarDesc vd_tgt2host = {
 };
 
 
-/* Total number of shared libraries with offloading to Intel MIC.  */
-static int num_libraries;
-
-/* Pointers to the descriptors, containing pointers to host-side tables and to
-   target images.  */
-static std::vector< std::pair<void *, void *> > lib_descrs;
-
-/* Thread-safe registration of the main image.  */
-static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
-
-
 /* Add path specified in LD_LIBRARY_PATH to MIC_LD_LIBRARY_PATH, which is
    required by liboffloadmic.  */
 __attribute__((constructor))
 static void
-set_mic_lib_path (void)
+init (void)
 {
   const char *ld_lib_path = getenv (LD_LIBRARY_PATH_ENV);
   const char *mic_lib_path = getenv (MIC_LD_LIBRARY_PATH_ENV);
 
   if (!ld_lib_path)
-    return;
+    goto out;
 
   if (!mic_lib_path)
     setenv (MIC_LD_LIBRARY_PATH_ENV, ld_lib_path, 1);
@@ -133,6 +146,10 @@ set_mic_lib_path (void)
       if (!use_alloca)
 	free (mic_lib_path_new);
     }
+
+out:
+  address_table = new ImgDevAddrMap;
+  num_devices = _Offload_number_of_devices ();
 }
 
 extern "C" const char *
@@ -162,18 +179,8 @@ GOMP_OFFLOAD_get_type (void)
 extern "C" int
 GOMP_OFFLOAD_get_num_devices (void)
 {
-  int res = _Offload_number_of_devices ();
-  TRACE ("(): return %d", res);
-  return res;
-}
-
-/* This should be called from every shared library with offloading.  */
-extern "C" void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_image)
-{
-  TRACE ("(host_table = %p, target_image = %p)", host_table, target_image);
-  lib_descrs.push_back (std::make_pair (host_table, target_image));
-  num_libraries++;
+  TRACE ("(): return %d", num_devices);
+  return num_devices;
 }
 
 static void
@@ -196,7 +203,8 @@ register_main_image ()
   __offload_register_image (&main_target_image);
 }
 
-/* Load offload_target_main on target.  */
+/* liboffloadmic loads and runs offload_target_main on all available devices
+   during a first call to offload ().  */
 extern "C" void
 GOMP_OFFLOAD_init_device (int device)
 {
@@ -243,9 +251,11 @@ get_target_table (int device, int &num_funcs, int &num_vars, void **&table)
     }
 }
 
+/* Offload TARGET_IMAGE to all available devices and fill address_table with
+   corresponding target addresses.  */
+
 static void
-load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
-			int &table_size)
+offload_image (void *target_image)
 {
   struct TargetImage {
     int64_t size;
@@ -254,19 +264,11 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     char data[];
   } __attribute__ ((packed));
 
-  void ***host_table_descr = (void ***) lib_descrs[lib_num].first;
-  void **host_func_start = host_table_descr[0];
-  void **host_func_end   = host_table_descr[1];
-  void **host_var_start  = host_table_descr[2];
-  void **host_var_end    = host_table_descr[3];
+  void *image_start = ((void **) target_image)[0];
+  void *image_end   = ((void **) target_image)[1];
 
-  void **target_image_descr = (void **) lib_descrs[lib_num].second;
-  void *image_start = target_image_descr[0];
-  void *image_end   = target_image_descr[1];
-
-  TRACE ("() host_table_descr { %p, %p, %p, %p }", host_func_start,
-	 host_func_end, host_var_start, host_var_end);
-  TRACE ("() target_image_descr { %p, %p }", image_start, image_end);
+  TRACE ("(target_image = %p { %p, %p })",
+	 target_image, image_start, image_end);
 
   int64_t image_size = (uintptr_t) image_end - (uintptr_t) image_start;
   TargetImage *image
@@ -279,94 +281,87 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     }
 
   image->size = image_size;
-  sprintf (image->name, "lib%010d.so", lib_num);
+  sprintf (image->name, "lib%010d.so", num_images++);
   memcpy (image->data, image_start, image->size);
 
   TRACE ("() __offload_register_image %s { %p, %d }",
 	 image->name, image_start, image->size);
   __offload_register_image (image);
 
-  int tgt_num_funcs = 0;
-  int tgt_num_vars = 0;
-  void **tgt_table = NULL;
-  get_target_table (device, tgt_num_funcs, tgt_num_vars, tgt_table);
-  free (image);
-
-  /* The func table contains only addresses, the var table contains addresses
-     and corresponding sizes.  */
-  int host_num_funcs = host_func_end - host_func_start;
-  int host_num_vars  = (host_var_end - host_var_start) / 2;
-  TRACE ("() host_num_funcs = %d, tgt_num_funcs = %d",
-	 host_num_funcs, tgt_num_funcs);
-  TRACE ("() host_num_vars = %d, tgt_num_vars = %d",
-	 host_num_vars, tgt_num_vars);
-  if (host_num_funcs != tgt_num_funcs)
+  /* Receive tables for target_image from all devices.  */
+  DevAddrVect dev_table;
+  for (int dev = 0; dev < num_devices; dev++)
     {
-      fprintf (stderr, "%s: Can't map target functions\n", __FILE__);
-      exit (1);
-    }
-  if (host_num_vars != tgt_num_vars)
-    {
-      fprintf (stderr, "%s: Can't map target variables\n", __FILE__);
-      exit (1);
-    }
+      int num_funcs = 0;
+      int num_vars = 0;
+      void **table = NULL;
 
-  table = (mapping_table *) realloc (table, (table_size + host_num_funcs
-					     + host_num_vars)
-					    * sizeof (mapping_table));
-  if (table == NULL)
-    {
-      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
-      exit (1);
-    }
+      get_target_table (dev, num_funcs, num_vars, table);
 
-  for (int i = 0; i < host_num_funcs; i++)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_func_start[i];
-      t.host_end = t.host_start + 1;
-      t.tgt_start = (uintptr_t) tgt_table[i];
-      t.tgt_end = t.tgt_start + 1;
-
-      TRACE ("() lib %d, func %d:\t0x%llx -- 0x%llx",
-	     lib_num, i, t.host_start, t.tgt_start);
-
-      table[table_size++] = t;
-    }
+      AddrVect curr_dev_table;
 
-  for (int i = 0; i < host_num_vars * 2; i += 2)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_var_start[i];
-      t.host_end = t.host_start + (uintptr_t) host_var_start[i+1];
-      t.tgt_start = (uintptr_t) tgt_table[tgt_num_funcs+i];
-      t.tgt_end = t.tgt_start + (uintptr_t) tgt_table[tgt_num_funcs+i+1];
+      for (int i = 0; i < num_funcs; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[i];
+	  tgt_addr.end = tgt_addr.start + 1;
+	  TRACE ("() func %d:\t0x%llx..0x%llx", i,
+		 tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      TRACE ("() lib %d, var %d:\t0x%llx (%d) -- 0x%llx (%d)", lib_num, i/2,
-	     t.host_start, t.host_end - t.host_start,
-	     t.tgt_start, t.tgt_end - t.tgt_start);
+      for (int i = 0; i < num_vars; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[num_funcs+i*2];
+	  tgt_addr.end = tgt_addr.start + (uintptr_t) table[num_funcs+i*2+1];
+	  TRACE ("() var %d:\t0x%llx..0x%llx", i, tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      table[table_size++] = t;
+      dev_table.push_back (curr_dev_table);
     }
 
-  delete [] tgt_table;
+  address_table->insert (std::make_pair (target_image, dev_table));
+
+  free (image);
 }
 
 extern "C" int
-GOMP_OFFLOAD_get_table (int device, void *result)
+GOMP_OFFLOAD_load_image (int device, void *target_image, addr_pair **result)
 {
-  TRACE ("(num_libraries = %d)", num_libraries);
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
 
-  mapping_table *table = NULL;
-  int table_size = 0;
+  /* If target_image is already present in address_table, then there is no need
+     to offload it.  */
+  if (address_table->count (target_image) == 0)
+    offload_image (target_image);
 
-  for (int i = 0; i < num_libraries; i++)
-    load_lib_and_get_table (device, i, table, table_size);
+  AddrVect *curr_dev_table = &(*address_table)[target_image][device];
+  int table_size = curr_dev_table->size ();
+  addr_pair *table = (addr_pair *) malloc (table_size * sizeof (addr_pair));
+  if (table == NULL)
+    {
+      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
+      exit (1);
+    }
 
-  *(void **) result = table;
+  std::copy (curr_dev_table->begin (), curr_dev_table->end (), table);
+  *result = table;
   return table_size;
 }
 
+extern "C" void
+GOMP_OFFLOAD_unload_image (int device, void *target_image)
+{
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
+
+  /* TODO: Currently liboffloadmic doesn't support __offload_unregister_image
+     for libraries.  */
+
+  address_table->erase (target_image);
+}
+
 extern "C" void *
 GOMP_OFFLOAD_alloc (int device, size_t size)
 {


Thanks,
  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-30 16:42                                   ` Jakub Jelinek
@ 2015-03-30 21:43                                     ` Julian Brown
  2015-03-31 12:52                                     ` Ilya Verbin
  2015-03-31 18:25                                     ` Ilya Verbin
  2 siblings, 0 replies; 30+ messages in thread
From: Julian Brown @ 2015-03-30 21:43 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Ilya Verbin, Thomas Schwinge, gcc-patches, Kirill Yukhin

[-- Attachment #1: Type: text/plain, Size: 728 bytes --]

On Mon, 30 Mar 2015 18:42:02 +0200
Jakub Jelinek <jakub@redhat.com> wrote:

> On Thu, Mar 26, 2015 at 11:41:30PM +0300, Ilya Verbin wrote:
> > Here is the latest patch for libgomp and mic plugin.
> > make check-target-libgomp using intelmic emul passed.
> > Also I used a testcase from the attachment.
> 
> This applies cleanly.
> 
> > Latest ptx part is here, I guess:
> > https://gcc.gnu.org/ml/gcc-patches/2015-02/msg01407.html
> 
> But the one Julian posted doesn't apply on top of your patch.
> If there is any interdiff needed on top of your patch, can it be
> posted against trunk + your patch?

Here's a version of my patch against trunk and Ilya's latest patch
(hopefully!). Tests look OK (libgomp + PTX).

HTH,

Julian

[-- Attachment #2: nvptx-load-unload-5.diff --]
[-- Type: text/x-patch, Size: 46820 bytes --]

commit f203634ace786b5bb2fdce56f123f3fba236dda3
Author: Julian Brown <julian@codesourcery.com>
Date:   Mon Mar 30 14:37:53 2015 -0700

    nvptx load/unload support, init rework

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index 02c44b6..dbc68bc 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -839,6 +839,7 @@ process (FILE *in, FILE *out)
 {
   const char *input = read_file (in);
   Token *tok = tokenize (input);
+  unsigned int nvars = 0, nfuncs = 0;
 
   do
     tok = parse_file (tok);
@@ -850,16 +851,17 @@ process (FILE *in, FILE *out)
   write_stmts (out, rev_stmts (fns));
   fprintf (out, ";\n\n");
   fprintf (out, "static const char *var_mappings[] = {\n");
-  for (id_map *id = var_ids; id; id = id->next)
+  for (id_map *id = var_ids; id; id = id->next, nvars++)
     fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
   fprintf (out, "};\n\n");
   fprintf (out, "static const char *func_mappings[] = {\n");
-  for (id_map *id = func_ids; id; id = id->next)
+  for (id_map *id = func_ids; id; id = id->next, nfuncs++)
     fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
   fprintf (out, "};\n\n");
 
   fprintf (out, "static const void *target_data[] = {\n");
-  fprintf (out, "  ptx_code, var_mappings, func_mappings\n");
+  fprintf (out, "  ptx_code, (void*) %u, var_mappings, (void*) %u, "
+		"func_mappings\n", nvars, nfuncs);
   fprintf (out, "};\n\n");
 
   fprintf (out, "extern void GOMP_offload_register (const void *, int, void *);\n");
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a1d42c5..5272f01 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -655,9 +655,6 @@ struct target_mem_desc {
   /* Corresponding target device descriptor.  */
   struct gomp_device_descr *device_descr;
 
-  /* Memory mapping info for the thread that created this descriptor.  */
-  struct splay_tree_s *mem_map;
-
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
   splay_tree_key list[];
@@ -691,18 +688,6 @@ typedef struct acc_dispatch_t
   /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
   struct target_mem_desc *data_environ;
 
-  /* Extra information required for a device instance by a given target.  */
-  /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
-  void *target_data;
-
-  /* Open or close a device instance.  */
-  void *(*open_device_func) (int n);
-  int (*close_device_func) (void *h);
-
-  /* Set or get the device number.  */
-  int (*get_device_num_func) (void);
-  void (*set_device_num_func) (int);
-
   /* Execute.  */
   void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *,
 		     unsigned short *, int, int, int, int, void *);
@@ -720,7 +705,7 @@ typedef struct acc_dispatch_t
   void (*async_set_async_func) (int);
 
   /* Create/destroy TLS data.  */
-  void *(*create_thread_data_func) (void *);
+  void *(*create_thread_data_func) (int);
   void (*destroy_thread_data_func) (void *);
 
   /* NVIDIA target specific routines.  */
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 08b7c5e..1f5827e 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -26,7 +26,7 @@
    see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
    <http://www.gnu.org/licenses/>.  */
 
-
+#include <assert.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
@@ -37,13 +37,23 @@ acc_async_test (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  return base_dev->openacc.async_test_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev->openacc.async_test_func (async);
 }
 
 int
 acc_async_test_all (void)
 {
-  return base_dev->openacc.async_test_all_func ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev->openacc.async_test_all_func ();
 }
 
 void
@@ -52,19 +62,34 @@ acc_wait (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  base_dev->openacc.async_wait_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_func (async);
 }
 
 void
 acc_wait_async (int async1, int async2)
 {
-  base_dev->openacc.async_wait_async_func (async1, async2);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_async_func (async1, async2);
 }
 
 void
 acc_wait_all (void)
 {
-  base_dev->openacc.async_wait_all_func ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_all_func ();
 }
 
 void
@@ -73,5 +98,10 @@ acc_wait_all_async (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  base_dev->openacc.async_wait_all_async_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_all_async_func (async);
 }
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index c8ef376..4aab422 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -34,51 +34,53 @@
 void *
 acc_get_current_cuda_device (void)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
-    p = base_dev->openacc.cuda.get_current_device_func ();
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
+    return thr->dev->openacc.cuda.get_current_device_func ();
 
-  return p;
+  return NULL;
 }
 
 void *
 acc_get_current_cuda_context (void)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev && base_dev->openacc.cuda.get_current_context_func)
-    p = base_dev->openacc.cuda.get_current_context_func ();
-
-  return p;
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_context_func)
+    return thr->dev->openacc.cuda.get_current_context_func ();
+ 
+  return NULL;
 }
 
 void *
 acc_get_cuda_stream (int async)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
   if (async < 0)
-    return p;
-
-  if (base_dev && base_dev->openacc.cuda.get_stream_func)
-    p = base_dev->openacc.cuda.get_stream_func (async);
+    return NULL;
 
-  return p;
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
+    return thr->dev->openacc.cuda.get_stream_func (async);
+ 
+  return NULL;
 }
 
 int
 acc_set_cuda_stream (int async, void *stream)
 {
-  int s = -1;
+  struct goacc_thread *thr;
 
   if (async < 0 || stream == NULL)
     return 0;
 
   goacc_lazy_initialize ();
 
-  if (base_dev && base_dev->openacc.cuda.set_stream_func)
-    s = base_dev->openacc.cuda.set_stream_func (async, stream);
+  thr = goacc_thread ();
+
+  if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
+    return thr->dev->openacc.cuda.set_stream_func (async, stream);
 
-  return s;
+  return -1;
 }
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index e4756b6..6dcdbf3 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -53,16 +53,9 @@ static struct gomp_device_descr host_dispatch =
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.root = NULL,
     .is_initialized = false,
 
     .openacc = {
-      .open_device_func = GOMP_OFFLOAD_openacc_open_device,
-      .close_device_func = GOMP_OFFLOAD_openacc_close_device,
-
-      .get_device_num_func = GOMP_OFFLOAD_openacc_get_device_num,
-      .set_device_num_func = GOMP_OFFLOAD_openacc_set_device_num,
-
       .exec_func = GOMP_OFFLOAD_openacc_parallel,
 
       .register_async_cleanup_func
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 1e0243e..dc40fb6 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -37,14 +37,13 @@
 
 static gomp_mutex_t acc_device_lock;
 
-/* The dispatch table for the current accelerator device.  This is global, so
-   you can only have one type of device open at any given time in a program.
-   This is the "base" device in that several devices that use the same
-   dispatch table may be active concurrently: this one (the "zeroth") is used
-   for overall initialisation/shutdown, and other instances -- not necessarily
-   including this one -- may be opened and closed once the base device has
-   been initialized.  */
-struct gomp_device_descr *base_dev;
+/* A cached version of the dispatcher for the global "current" accelerator type,
+   e.g. used as the default when creating new host threads.  This is the
+   device-type equivalent of goacc_device_num (which specifies which device to
+   use out of potentially several of the same type).  If there are several
+   devices of a given type, this points at the first one.  */
+
+static struct gomp_device_descr *cached_base_dev = NULL;
 
 #if defined HAVE_TLS || defined USE_EMUTLS
 __thread struct goacc_thread *goacc_tls_data;
@@ -53,9 +52,6 @@ pthread_key_t goacc_tls_key;
 #endif
 static pthread_key_t goacc_cleanup_key;
 
-/* Current dispatcher, and how it was initialized */
-static acc_device_t init_key = _ACC_device_hwm;
-
 static struct goacc_thread *goacc_threads;
 static gomp_mutex_t goacc_thread_lock;
 
@@ -94,6 +90,21 @@ get_openacc_name (const char *name)
     return name;
 }
 
+static const char *
+name_of_acc_device_t (enum acc_device_t type)
+{
+  switch (type)
+    {
+    case acc_device_none: return "none";
+    case acc_device_default: return "default";
+    case acc_device_host: return "host";
+    case acc_device_host_nonshm: return "host_nonshm";
+    case acc_device_not_host: return "not_host";
+    case acc_device_nvidia: return "nvidia";
+    default: gomp_fatal ("unknown device type %u", (unsigned) type);
+    }
+}
+
 static struct gomp_device_descr *
 resolve_device (acc_device_t d)
 {
@@ -159,22 +170,87 @@ resolve_device (acc_device_t d)
 static struct gomp_device_descr *
 acc_init_1 (acc_device_t d)
 {
-  struct gomp_device_descr *acc_dev;
+  struct gomp_device_descr *base_dev, *acc_dev;
+  int ndevs;
 
-  acc_dev = resolve_device (d);
+  base_dev = resolve_device (d);
+
+  ndevs = base_dev->get_num_devices_func ();
+
+  if (!base_dev || ndevs <= 0 || goacc_device_num >= ndevs)
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
 
-  if (!acc_dev || acc_dev->get_num_devices_func () <= 0)
-    gomp_fatal ("device %u not supported", (unsigned)d);
+  acc_dev = &base_dev[goacc_device_num];
 
   if (acc_dev->is_initialized)
     gomp_fatal ("device already active");
 
-  /* We need to remember what we were intialized as, to check shutdown etc.  */
-  init_key = d;
-
   gomp_init_device (acc_dev);
 
-  return acc_dev;
+  return base_dev;
+}
+
+static void
+acc_shutdown_1 (acc_device_t d)
+{
+  struct gomp_device_descr *base_dev;
+  struct goacc_thread *walk;
+  int ndevs, i;
+  bool devices_active = false;
+
+  /* Get the base device for this device type.  */
+  base_dev = resolve_device (d);
+
+  if (!base_dev)
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
+
+  gomp_mutex_lock (&goacc_thread_lock);
+
+  /* Free target-specific TLS data and close all devices.  */
+  for (walk = goacc_threads; walk != NULL; walk = walk->next)
+    {
+      if (walk->target_tls)
+	base_dev->openacc.destroy_thread_data_func (walk->target_tls);
+
+      walk->target_tls = NULL;
+
+      /* This would mean the user is shutting down OpenACC in the middle of an
+         "acc data" pragma.  Likely not intentional.  */
+      if (walk->mapped_data)
+	gomp_fatal ("shutdown in 'acc data' region");
+
+      /* Similarly, if this happens then user code has done something weird.  */
+      if (walk->saved_bound_dev)
+        gomp_fatal ("shutdown during host fallback");
+
+      if (walk->dev)
+	{
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (&walk->dev->mem_map);
+	  gomp_mutex_unlock (&walk->dev->lock);
+
+	  walk->dev = NULL;
+	  walk->base_dev = NULL;
+	}
+    }
+
+  gomp_mutex_unlock (&goacc_thread_lock);
+
+  ndevs = base_dev->get_num_devices_func ();
+
+  /* Close all the devices of this type that have been opened.  */
+  for (i = 0; i < ndevs; i++)
+    {
+      struct gomp_device_descr *acc_dev = &base_dev[i];
+      if (acc_dev->is_initialized)
+        {
+	  devices_active = true;
+	  gomp_fini_device (acc_dev);
+	}
+    }
+
+  if (!devices_active)
+    gomp_fatal ("no device initialized");
 }
 
 static struct goacc_thread *
@@ -207,9 +283,11 @@ goacc_destroy_thread (void *data)
 
   if (thr)
     {
-      if (base_dev && thr->target_tls)
+      struct gomp_device_descr *acc_dev = thr->dev;
+
+      if (acc_dev && thr->target_tls)
 	{
-	  base_dev->openacc.destroy_thread_data_func (thr->target_tls);
+	  acc_dev->openacc.destroy_thread_data_func (thr->target_tls);
 	  thr->target_tls = NULL;
 	}
 
@@ -236,53 +314,49 @@ goacc_destroy_thread (void *data)
   gomp_mutex_unlock (&goacc_thread_lock);
 }
 
-/* Open the ORD'th device of the currently-active type (base_dev must be
-   initialised before calling).  If ORD is < 0, open the default-numbered
-   device (set by the ACC_DEVICE_NUM environment variable or a call to
-   acc_set_device_num), or leave any currently-opened device as is.  "Opening"
-   consists of calling the device's open_device_func hook, and setting up
-   thread-local data (maybe allocating, then initializing with information
-   pertaining to the newly-opened or previously-opened device).  */
+/* Use the ORD'th device instance for the current host thread (or -1 for the
+   current global default).  The device (and the runtime) must be initialised
+   before calling this function.  */
 
-static void
-lazy_open (int ord)
+void
+goacc_attach_host_thread_to_device (int ord)
 {
   struct goacc_thread *thr = goacc_thread ();
-  struct gomp_device_descr *acc_dev;
-
-  if (thr && thr->dev)
-    {
-      assert (ord < 0 || ord == thr->dev->target_id);
-      return;
-    }
-
-  assert (base_dev);
-
+  struct gomp_device_descr *acc_dev = NULL, *base_dev = NULL;
+  int num_devices;
+  
+  if (thr && thr->dev && (thr->dev->target_id == ord || ord < 0))
+    return;
+  
   if (ord < 0)
     ord = goacc_device_num;
-
-  /* The OpenACC 2.0 spec leaves the runtime's behaviour when an out-of-range
-     device is requested as implementation-defined (4.2 ACC_DEVICE_NUM).
-     We choose to raise an error in such a case.  */
-  if (ord >= base_dev->get_num_devices_func ())
-    gomp_fatal ("device %u does not exist", ord);
-
+  
+  /* Decide which type of device to use.  If the current thread has a device
+     type already (e.g. set by acc_set_device_type), use that, else use the
+     global default.  */
+  if (thr && thr->base_dev)
+    base_dev = thr->base_dev;
+  else
+    {
+      assert (cached_base_dev);
+      base_dev = cached_base_dev;
+    }
+  
+  num_devices = base_dev->get_num_devices_func ();
+  if (num_devices <= 0 || ord >= num_devices)
+    gomp_fatal ("device %u out of range", ord);
+  
   if (!thr)
     thr = goacc_new_thread ();
-
-  acc_dev = thr->dev = &base_dev[ord];
-
-  assert (acc_dev->target_id == ord);
-
+  
+  thr->base_dev = base_dev;
+  thr->dev = acc_dev = &base_dev[ord];
   thr->saved_bound_dev = NULL;
   thr->mapped_data = NULL;
-
-  if (!acc_dev->openacc.target_data)
-    acc_dev->openacc.target_data = acc_dev->openacc.open_device_func (ord);
-
+  
   thr->target_tls
-    = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
-
+    = acc_dev->openacc.create_thread_data_func (ord);
+  
   acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
@@ -292,74 +366,20 @@ lazy_open (int ord)
 void
 acc_init (acc_device_t d)
 {
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
 
-  base_dev = acc_init_1 (d);
-
-  lazy_open (-1);
+  cached_base_dev = acc_init_1 (d);
 
   gomp_mutex_unlock (&acc_device_lock);
+  
+  goacc_attach_host_thread_to_device (-1);
 }
 
 ialias (acc_init)
 
-static void
-acc_shutdown_1 (acc_device_t d)
-{
-  struct goacc_thread *walk;
-
-  /* We don't check whether d matches the actual device found, because
-     OpenACC 2.0 (3.2.12) says the parameters to the init and this
-     call must match (for the shutdown call anyway, it's silent on
-     others).  */
-
-  if (!base_dev)
-    gomp_fatal ("no device initialized");
-  if (d != init_key)
-    gomp_fatal ("device %u(%u) is initialized",
-		(unsigned) init_key, (unsigned) base_dev->type);
-
-  gomp_mutex_lock (&goacc_thread_lock);
-
-  /* Free target-specific TLS data and close all devices.  */
-  for (walk = goacc_threads; walk != NULL; walk = walk->next)
-    {
-      if (walk->target_tls)
-	base_dev->openacc.destroy_thread_data_func (walk->target_tls);
-
-      walk->target_tls = NULL;
-
-      /* This would mean the user is shutting down OpenACC in the middle of an
-         "acc data" pragma.  Likely not intentional.  */
-      if (walk->mapped_data)
-	gomp_fatal ("shutdown in 'acc data' region");
-
-      if (walk->dev)
-	{
-	  void *target_data = walk->dev->openacc.target_data;
-	  if (walk->dev->openacc.close_device_func (target_data) < 0)
-	    gomp_fatal ("failed to close device");
-
-	  walk->dev->openacc.target_data = target_data = NULL;
-
-	  gomp_mutex_lock (&walk->dev->lock);
-	  gomp_free_memmap (&walk->dev->mem_map);
-	  gomp_mutex_unlock (&walk->dev->lock);
-
-	  walk->dev = NULL;
-	}
-    }
-
-  gomp_mutex_unlock (&goacc_thread_lock);
-
-  gomp_fini_device (base_dev);
-
-  base_dev = NULL;
-}
-
 void
 acc_shutdown (acc_device_t d)
 {
@@ -372,59 +392,16 @@ acc_shutdown (acc_device_t d)
 
 ialias (acc_shutdown)
 
-/* This function is called after plugins have been initialized.  It deals with
-   the "base" device, and is used to prepare the runtime for dealing with a
-   number of such devices (as implemented by some particular plugin).  If the
-   argument device type D matches a previous call to the function, return the
-   current base device, else shut the old device down and re-initialize with
-   the new device type.  */
-
-static struct gomp_device_descr *
-lazy_init (acc_device_t d)
-{
-  if (base_dev)
-    {
-      /* Re-initializing the same device, do nothing.  */
-      if (d == init_key)
-	return base_dev;
-
-      acc_shutdown_1 (init_key);
-    }
-
-  assert (!base_dev);
-
-  return acc_init_1 (d);
-}
-
-/* Ensure that plugins are loaded, initialize and open the (default-numbered)
-   device.  */
-
-static void
-lazy_init_and_open (acc_device_t d)
-{
-  if (!base_dev)
-    gomp_init_targets_once ();
-
-  gomp_mutex_lock (&acc_device_lock);
-
-  base_dev = lazy_init (d);
-
-  lazy_open (-1);
-
-  gomp_mutex_unlock (&acc_device_lock);
-}
-
 int
 acc_get_num_devices (acc_device_t d)
 {
   int n = 0;
-  const struct gomp_device_descr *acc_dev;
+  struct gomp_device_descr *acc_dev;
 
   if (d == acc_device_none)
     return 0;
 
-  if (!base_dev)
-    gomp_init_targets_once ();
+  gomp_init_targets_once ();
 
   acc_dev = resolve_device (d);
   if (!acc_dev)
@@ -439,10 +416,39 @@ acc_get_num_devices (acc_device_t d)
 
 ialias (acc_get_num_devices)
 
+/* Set the device type for the current thread only (using the current global
+   default device number), initialising that device if necessary.  Also set the
+   default device type for new threads to D.  */
+
 void
 acc_set_device_type (acc_device_t d)
 {
-  lazy_init_and_open (d);
+  struct gomp_device_descr *base_dev, *acc_dev;
+  struct goacc_thread *thr = goacc_thread ();
+
+  gomp_mutex_lock (&acc_device_lock);
+
+  if (!cached_base_dev)
+    gomp_init_targets_once ();
+
+  cached_base_dev = base_dev = resolve_device (d);
+  acc_dev = &base_dev[goacc_device_num];
+
+  if (!acc_dev->is_initialized)
+    gomp_init_device (acc_dev);
+
+  gomp_mutex_unlock (&acc_device_lock);
+
+  /* We're changing device type: invalidate the current thread's dev and
+     base_dev pointers.  */
+  if (thr && thr->base_dev != base_dev)
+    {
+      thr->base_dev = thr->dev = NULL;
+      if (thr->mapped_data)
+        gomp_fatal ("acc_set_device_type in 'acc data' region");
+    }
+
+  goacc_attach_host_thread_to_device (-1);
 }
 
 ialias (acc_set_device_type)
@@ -451,10 +457,11 @@ acc_device_t
 acc_get_device_type (void)
 {
   acc_device_t res = acc_device_none;
-  const struct gomp_device_descr *dev;
+  struct gomp_device_descr *dev;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev)
-    res = acc_device_type (base_dev->type);
+  if (thr && thr->base_dev)
+    res = acc_device_type (thr->base_dev->type);
   else
     {
       gomp_init_targets_once ();
@@ -475,78 +482,65 @@ int
 acc_get_device_num (acc_device_t d)
 {
   const struct gomp_device_descr *dev;
-  int num;
+  struct goacc_thread *thr = goacc_thread ();
 
   if (d >= _ACC_device_hwm)
     gomp_fatal ("device %u out of range", (unsigned)d);
 
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
   dev = resolve_device (d);
   if (!dev)
-    gomp_fatal ("no devices of type %u", d);
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
 
-  /* We might not have called lazy_open for this host thread yet, in which case
-     the get_device_num_func hook will return -1.  */
-  num = dev->openacc.get_device_num_func ();
-  if (num < 0)
-    num = goacc_device_num;
+  if (thr && thr->base_dev == dev && thr->dev)
+    return thr->dev->target_id;
 
-  return num;
+  return goacc_device_num;
 }
 
 ialias (acc_get_device_num)
 
 void
-acc_set_device_num (int n, acc_device_t d)
+acc_set_device_num (int ord, acc_device_t d)
 {
-  const struct gomp_device_descr *dev;
+  struct gomp_device_descr *base_dev, *acc_dev;
   int num_devices;
 
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
-  if ((int) d == 0)
-    {
-      int i;
-
-      /* A device setting of zero sets all device types on the system to use
-         the Nth instance of that device type.  Only attempt it for initialized
-	 devices though.  */
-      for (i = acc_device_not_host + 1; i < _ACC_device_hwm; i++)
-        {
-	  dev = resolve_device (d);
-	  if (dev && dev->is_initialized)
-	    dev->openacc.set_device_num_func (n);
-	}
+  if (ord < 0)
+    ord = goacc_device_num;
 
-      /* ...and for future calls to acc_init/acc_set_device_type, etc.  */
-      goacc_device_num = n;
-    }
+  if ((int) d == 0)
+    /* Set whatever device is being used by the current host thread to use
+       device instance ORD.  It's unclear if this is supposed to affect other
+       host threads too (OpenACC 2.0 (3.2.4) acc_set_device_num).  */
+    goacc_attach_host_thread_to_device (ord);
   else
     {
-      struct goacc_thread *thr = goacc_thread ();
-
       gomp_mutex_lock (&acc_device_lock);
 
-      base_dev = lazy_init (d);
+      cached_base_dev = base_dev = resolve_device (d);
 
       num_devices = base_dev->get_num_devices_func ();
 
-      if (n >= num_devices)
-        gomp_fatal ("device %u out of range", n);
+      if (ord >= num_devices)
+        gomp_fatal ("device %u out of range", ord);
 
-      /* If we're changing the device number, de-associate this thread with
-	 the device (but don't close the device, since it may be in use by
-	 other threads).  */
-      if (thr && thr->dev && n != thr->dev->target_id)
-	thr->dev = NULL;
+      acc_dev = &base_dev[ord];
 
-      lazy_open (n);
+      if (!acc_dev->is_initialized)
+        gomp_init_device (acc_dev);
 
       gomp_mutex_unlock (&acc_device_lock);
+
+      goacc_attach_host_thread_to_device (ord);
     }
+  
+  goacc_device_num = ord;
 }
 
 ialias (acc_set_device_num)
@@ -554,10 +548,7 @@ ialias (acc_set_device_num)
 int
 acc_on_device (acc_device_t dev)
 {
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (thr && thr->dev
-      && acc_device_type (thr->dev->type) == acc_device_host_nonshm)
+  if (acc_get_device_type () == acc_device_host_nonshm)
     return dev == acc_device_host_nonshm || dev == acc_device_not_host;
 
   /* Just rely on the compiler builtin.  */
@@ -577,7 +568,7 @@ goacc_runtime_initialize (void)
 
   pthread_key_create (&goacc_cleanup_key, goacc_destroy_thread);
 
-  base_dev = NULL;
+  cached_base_dev = NULL;
 
   goacc_threads = NULL;
   gomp_mutex_init (&goacc_thread_lock);
@@ -606,9 +597,8 @@ goacc_restore_bind (void)
 }
 
 /* This is called from any OpenACC support function that may need to implicitly
-   initialize the libgomp runtime.  On exit all such initialization will have
-   been done, and both the global ACC_dev and the per-host-thread ACC_memmap
-   pointers will be valid.  */
+   initialize the libgomp runtime, either globally or from a new host thread. 
+   On exit "goacc_thread" will return a valid & populated thread block.  */
 
 attribute_hidden void
 goacc_lazy_initialize (void)
@@ -618,12 +608,8 @@ goacc_lazy_initialize (void)
   if (thr && thr->dev)
     return;
 
-  if (!base_dev)
-    lazy_init_and_open (acc_device_default);
+  if (!cached_base_dev)
+    acc_init (acc_device_default);
   else
-    {
-      gomp_mutex_lock (&acc_device_lock);
-      lazy_open (-1);
-      gomp_mutex_unlock (&acc_device_lock);
-    }
+    goacc_attach_host_thread_to_device (-1);
 }
diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h
index 85619c8..0ace737 100644
--- a/libgomp/oacc-int.h
+++ b/libgomp/oacc-int.h
@@ -56,6 +56,9 @@ acc_device_type (enum offload_target_type type)
 
 struct goacc_thread
 {
+  /* The base device for the current thread.  */
+  struct gomp_device_descr *base_dev;
+
   /* The device for the current thread.  */
   struct gomp_device_descr *dev;
 
@@ -89,10 +92,7 @@ goacc_thread (void)
 #endif
 
 void goacc_register (struct gomp_device_descr *) __GOACC_NOTHROW;
-
-/* Current dispatcher.  */
-extern struct gomp_device_descr *base_dev;
-
+void goacc_attach_host_thread_to_device (int);
 void goacc_runtime_initialize (void);
 void goacc_save_and_set_bind (acc_device_t);
 void goacc_restore_bind (void);
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index fdc82e6..89ef5fc 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -107,7 +107,9 @@ acc_malloc (size_t s)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  return base_dev->alloc_func (thr->dev->target_id, s);
+  assert (thr->dev);
+
+  return thr->dev->alloc_func (thr->dev->target_id, s);
 }
 
 /* OpenACC 2.0a (3.2.16) doesn't specify what to do in the event
@@ -122,6 +124,8 @@ acc_free (void *d)
   if (!d)
     return;
 
+  assert (thr && thr->dev);
+
   /* We don't have to call lazy open here, as the ptr value must have
      been returned by acc_malloc.  It's not permitted to pass NULL in
      (unless you got that null from acc_malloc).  */
@@ -134,7 +138,7 @@ acc_free (void *d)
      acc_unmap_data ((void *)(k->host_start + offset));
    }
 
-  base_dev->free_func (thr->dev->target_id, d);
+  thr->dev->free_func (thr->dev->target_id, d);
 }
 
 void
@@ -144,7 +148,9 @@ acc_memcpy_to_device (void *d, void *h, size_t s)
      been obtained from a routine that did that.  */
   struct goacc_thread *thr = goacc_thread ();
 
-  base_dev->host2dev_func (thr->dev->target_id, d, h, s);
+  assert (thr && thr->dev);
+
+  thr->dev->host2dev_func (thr->dev->target_id, d, h, s);
 }
 
 void
@@ -154,7 +160,9 @@ acc_memcpy_from_device (void *h, void *d, size_t s)
      been obtained from a routine that did that.  */
   struct goacc_thread *thr = goacc_thread ();
 
-  base_dev->dev2host_func (thr->dev->target_id, h, d, s);
+  assert (thr && thr->dev);
+
+  thr->dev->dev2host_func (thr->dev->target_id, h, d, s);
 }
 
 /* Return the device pointer that corresponds to host data H.  Or NULL
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 563f9bb..9729e12 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -49,32 +49,6 @@ find_pset (int pos, size_t mapnum, unsigned short *kinds)
   return kind == GOMP_MAP_TO_PSET;
 }
 
-
-/* Ensure that the target device for DEVICE_TYPE is initialised (and that
-   plugins have been loaded if appropriate).  The ACC_dev variable for the
-   current thread will be set appropriately for the given device type on
-   return.  */
-
-attribute_hidden void
-select_acc_device (int device_type)
-{
-  goacc_lazy_initialize ();
-
-  if (device_type == GOMP_DEVICE_HOST_FALLBACK)
-    return;
-
-  if (device_type == acc_device_none)
-    device_type = acc_device_host;
-
-  if (device_type >= 0)
-    {
-      /* NOTE: this will go badly if the surrounding data environment is set up
-         to use a different device type.  We'll just have to trust that users
-	 know what they're doing...  */
-      acc_set_device_type (device_type);
-    }
-}
-
 static void goacc_wait (int async, int num_waits, va_list ap);
 
 void
@@ -111,7 +85,7 @@ GOACC_parallel (int device, void (*fn) (void *),
 	      __FUNCTION__, (unsigned long) mapnum, hostaddrs, sizes, kinds,
 	      async);
 #endif
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   thr = goacc_thread ();
   acc_dev = thr->dev;
@@ -195,7 +169,7 @@ GOACC_data_start (int device, size_t mapnum,
 	      __FUNCTION__, (unsigned long) mapnum, hostaddrs, sizes, kinds);
 #endif
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
@@ -242,7 +216,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
   bool data_enter = false;
   size_t i;
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   thr = goacc_thread ();
   acc_dev = thr->dev;
@@ -429,7 +403,7 @@ GOACC_update (int device, size_t mapnum,
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
   size_t i;
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index bc60f72..1faf5bc 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -119,31 +119,6 @@ GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
 }
 
 STATIC void *
-GOMP_OFFLOAD_openacc_open_device (int n)
-{
-  return (void *) (intptr_t) n;
-}
-
-STATIC int
-GOMP_OFFLOAD_openacc_close_device (void *hnd)
-{
-  return 0;
-}
-
-STATIC int
-GOMP_OFFLOAD_openacc_get_device_num (void)
-{
-  return 0;
-}
-
-STATIC void
-GOMP_OFFLOAD_openacc_set_device_num (int n)
-{
-  if (n > 0)
-    GOMP (fatal) ("device number %u out of range for host execution", n);
-}
-
-STATIC void *
 GOMP_OFFLOAD_alloc (int n __attribute__ ((unused)), size_t s)
 {
   return GOMP (malloc) (s);
@@ -254,7 +229,7 @@ GOMP_OFFLOAD_openacc_async_wait_all_async (int async __attribute__ ((unused)))
 }
 
 STATIC void *
-GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data
+GOMP_OFFLOAD_openacc_create_thread_data (int ord
 					 __attribute__ ((unused)))
 {
   return NULL;
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 483cb75..583ec87 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -133,7 +133,8 @@ struct targ_fn_descriptor
   const char *name;
 };
 
-static bool ptx_inited = false;
+static unsigned int instantiated_devices = 0;
+static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
 struct ptx_stream
 {
@@ -331,9 +332,21 @@ struct ptx_event
   struct ptx_event *next;
 };
 
+struct ptx_image_data
+{
+  void *target_data;
+  CUmodule module;
+  struct ptx_image_data *next;
+};
+
 static pthread_mutex_t ptx_event_lock;
 static struct ptx_event *ptx_events;
 
+static struct ptx_device **ptx_devices;
+
+static struct ptx_image_data *ptx_images = NULL;
+static pthread_mutex_t ptx_image_lock = PTHREAD_MUTEX_INITIALIZER;
+
 #define _XSTR(s) _STR(s)
 #define _STR(s) #s
 
@@ -450,8 +463,8 @@ fini_streams_for_device (struct ptx_device *ptx_dev)
       struct ptx_stream *s = ptx_dev->active_streams;
       ptx_dev->active_streams = ptx_dev->active_streams->next;
 
-      cuStreamDestroy (s->stream);
       map_fini (s);
+      cuStreamDestroy (s->stream);
       free (s);
     }
 
@@ -575,21 +588,21 @@ select_stream_for_async (int async, pthread_t thread, bool create,
   return stream;
 }
 
-static int nvptx_get_num_devices (void);
-
-/* Initialize the device.  */
-static int
+/* Initialize the device.  Return TRUE on success, else FALSE.  PTX_DEV_LOCK
+   should be locked on entry and remains locked on exit.  */
+static bool
 nvptx_init (void)
 {
   CUresult r;
   int rc;
+  int ndevs;
 
-  if (ptx_inited)
-    return nvptx_get_num_devices ();
+  if (instantiated_devices != 0)
+    return true;
 
   rc = verify_device_library ();
   if (rc < 0)
-    return -1;
+    return false;
 
   r = cuInit (0);
   if (r != CUDA_SUCCESS)
@@ -599,22 +612,64 @@ nvptx_init (void)
 
   pthread_mutex_init (&ptx_event_lock, NULL);
 
-  ptx_inited = true;
+  r = cuDeviceGetCount (&ndevs);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetCount error: %s", cuda_error (r));
 
-  return nvptx_get_num_devices ();
+  ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
+					    * ndevs);
+
+  return true;
 }
 
+/* Select the N'th PTX device for the current host thread.  The device must
+   have been previously opened before calling this function.  */
+
 static void
-nvptx_fini (void)
+nvptx_attach_host_thread_to_device (int n)
 {
-  ptx_inited = false;
+  CUdevice dev;
+  CUresult r;
+  struct ptx_device *ptx_dev;
+  CUcontext thd_ctx;
+
+  r = cuCtxGetDevice (&dev);
+  if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
+    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+
+  if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
+    return;
+  else
+    {
+      CUcontext old_ctx;
+
+      ptx_dev = ptx_devices[n];
+      assert (ptx_dev);
+
+      r = cuCtxGetCurrent (&thd_ctx);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
+
+      /* We don't necessarily have a current context (e.g. if it has been
+         destroyed.  Pop it if we do though.  */
+      if (thd_ctx != NULL)
+	{
+	  r = cuCtxPopCurrent (&old_ctx);
+	  if (r != CUDA_SUCCESS)
+            GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
+	}
+
+      r = cuCtxPushCurrent (ptx_dev->ctx);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuCtxPushCurrent error: %s", cuda_error (r));
+    }
 }
 
-static void *
+static struct ptx_device *
 nvptx_open_device (int n)
 {
   struct ptx_device *ptx_dev;
-  CUdevice dev;
+  CUdevice dev, ctx_dev;
   CUresult r;
   int async_engines, pi;
 
@@ -628,6 +683,21 @@ nvptx_open_device (int n)
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
 
+  r = cuCtxGetDevice (&ctx_dev);
+  if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
+    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+  
+  if (r != CUDA_ERROR_INVALID_CONTEXT && ctx_dev != dev)
+    {
+      /* The current host thread has an active context for a different device.
+         Detach it.  */
+      CUcontext old_ctx;
+      
+      r = cuCtxPopCurrent (&old_ctx);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
+    }
+
   r = cuCtxGetCurrent (&ptx_dev->ctx);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
@@ -678,17 +748,16 @@ nvptx_open_device (int n)
 
   init_streams_for_device (ptx_dev, async_engines);
 
-  return (void *) ptx_dev;
+  return ptx_dev;
 }
 
-static int
-nvptx_close_device (void *targ_data)
+static void
+nvptx_close_device (struct ptx_device *ptx_dev)
 {
   CUresult r;
-  struct ptx_device *ptx_dev = targ_data;
 
   if (!ptx_dev)
-    return 0;
+    return;
 
   fini_streams_for_device (ptx_dev);
 
@@ -700,8 +769,6 @@ nvptx_close_device (void *targ_data)
     }
 
   free (ptx_dev);
-
-  return 0;
 }
 
 static int
@@ -714,7 +781,7 @@ nvptx_get_num_devices (void)
      order to enumerate available devices, but CUDA API routines can't be used
      until cuInit has been called.  Just call it now (but don't yet do any
      further initialization).  */
-  if (!ptx_inited)
+  if (instantiated_devices == 0)
     cuInit (0);
 
   r = cuDeviceGetCount (&n);
@@ -1507,64 +1574,84 @@ GOMP_OFFLOAD_get_num_devices (void)
   return nvptx_get_num_devices ();
 }
 
-static void **kernel_target_data;
-static void **kernel_host_table;
-
 void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_data)
+GOMP_OFFLOAD_init_device (int n)
 {
-  kernel_target_data = target_data;
-  kernel_host_table = host_table;
-}
+  pthread_mutex_lock (&ptx_dev_lock);
 
-void
-GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
-{
-  (void) nvptx_init ();
+  if (!nvptx_init () || ptx_devices[n] != NULL)
+    {
+      pthread_mutex_unlock (&ptx_dev_lock);
+      return;
+    }
+
+  ptx_devices[n] = nvptx_open_device (n);
+  instantiated_devices++;
+
+  pthread_mutex_unlock (&ptx_dev_lock);
 }
 
 void
-GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
+GOMP_OFFLOAD_fini_device (int n)
 {
-  nvptx_fini ();
+  pthread_mutex_lock (&ptx_dev_lock);
+
+  if (ptx_devices[n] != NULL)
+    {
+      nvptx_attach_host_thread_to_device (n);
+      nvptx_close_device (ptx_devices[n]);
+      ptx_devices[n] = NULL;
+      instantiated_devices--;
+    }
+
+  pthread_mutex_unlock (&ptx_dev_lock);
 }
 
 int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **tablep)
+GOMP_OFFLOAD_load_image (int ord, void *target_data,
+			 struct addr_pair **target_table)
 {
   CUmodule module;
-  void **fn_table;
-  char **fn_names;
-  int fn_entries, i;
+  char **fn_names, **var_names;
+  unsigned int fn_entries, var_entries, i, j;
   CUresult r;
   struct targ_fn_descriptor *targ_fns;
+  void **img_header = (void **) target_data;
+  struct ptx_image_data *new_image;
 
-  if (nvptx_init () <= 0)
-    return 0;
+  GOMP_OFFLOAD_init_device (ord);
 
-  /* This isn't an error, because an image may legitimately have no offloaded
-     regions and so will not call GOMP_offload_register.  */
-  if (kernel_target_data == NULL)
-    return 0;
+  nvptx_attach_host_thread_to_device (ord);
+
+  link_ptx (&module, img_header[0]);
 
-  link_ptx (&module, kernel_target_data[0]);
+  pthread_mutex_lock (&ptx_image_lock);
+  new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
+  new_image->target_data = target_data;
+  new_image->module = module;
+  new_image->next = ptx_images;
+  ptx_images = new_image;
+  pthread_mutex_unlock (&ptx_image_lock);
 
-  /* kernel_target_data[0] -> ptx code
-     kernel_target_data[1] -> variable mappings
-     kernel_target_data[2] -> array of kernel names in ascii
+  /* The mkoffload utility emits a table of pointers/integers at the start of
+     each offload image:
 
-     kernel_host_table[0] -> start of function addresses (__offload_func_table)
-     kernel_host_table[1] -> end of function addresses (__offload_funcs_end)
+     img_header[0] -> ptx code
+     img_header[1] -> number of variables
+     img_header[2] -> array of variable names (pointers to strings)
+     img_header[3] -> number of kernels
+     img_header[4] -> array of kernel names (pointers to strings)
 
      The array of kernel names and the functions addresses form a
      one-to-one correspondence.  */
 
-  fn_table = kernel_host_table[0];
-  fn_names = (char **) kernel_target_data[2];
-  fn_entries = (kernel_host_table[1] - kernel_host_table[0]) / sizeof (void *);
+  var_entries = (uintptr_t) img_header[1];
+  var_names = (char **) img_header[2];
+  fn_entries = (uintptr_t) img_header[3];
+  fn_names = (char **) img_header[4];
 
-  *tablep = GOMP_PLUGIN_malloc (sizeof (struct mapping_table) * fn_entries);
+  *target_table = GOMP_PLUGIN_malloc (sizeof (struct addr_pair)
+				      * (fn_entries + var_entries));
   targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor)
 				 * fn_entries);
 
@@ -1579,38 +1666,86 @@ GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
       targ_fns[i].fn = function;
       targ_fns[i].name = (const char *) fn_names[i];
 
-      (*tablep)[i].host_start = (uintptr_t) fn_table[i];
-      (*tablep)[i].host_end = (*tablep)[i].host_start + 1;
-      (*tablep)[i].tgt_start = (uintptr_t) &targ_fns[i];
-      (*tablep)[i].tgt_end = (*tablep)[i].tgt_start + 1;
+      (*target_table)[i].start = (uintptr_t) &targ_fns[i];
+      (*target_table)[i].end = (*target_table)[i].start + 1;
     }
 
-  return fn_entries;
+  for (j = 0; j < var_entries; j++, i++)
+    {
+      CUdeviceptr var;
+      size_t bytes;
+
+      r = cuModuleGetGlobal (&var, &bytes, module, var_names[j]);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+
+      (*target_table)[i].start = (uintptr_t) var;
+      (*target_table)[i].end = (*target_table)[i].start + bytes;
+    }
+
+  return i;
+}
+
+void
+GOMP_OFFLOAD_unload_image (int tid __attribute__((unused)), void *target_data)
+{
+  void **img_header = (void **) target_data;
+  struct targ_fn_descriptor *targ_fns
+    = (struct targ_fn_descriptor *) img_header[0];
+  struct ptx_image_data *image, *prev = NULL, *newhd = NULL;
+
+  free (targ_fns);
+
+  pthread_mutex_lock (&ptx_image_lock);
+  for (image = ptx_images; image != NULL;)
+    {
+      struct ptx_image_data *next = image->next;
+
+      if (image->target_data == target_data)
+	{
+	  cuModuleUnload (image->module);
+	  free (image);
+	  if (prev)
+	    prev->next = next;
+	}
+      else
+	{
+	  prev = image;
+	  if (!newhd)
+	    newhd = image;
+	}
+
+      image = next;
+    }
+  ptx_images = newhd;
+  pthread_mutex_unlock (&ptx_image_lock);
 }
 
 void *
-GOMP_OFFLOAD_alloc (int n __attribute__ ((unused)), size_t size)
+GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_alloc (size);
 }
 
 void
-GOMP_OFFLOAD_free (int n __attribute__ ((unused)), void *ptr)
+GOMP_OFFLOAD_free (int ord, void *ptr)
 {
+  nvptx_attach_host_thread_to_device (ord);
   nvptx_free (ptr);
 }
 
 void *
-GOMP_OFFLOAD_dev2host (int ord __attribute__ ((unused)), void *dst,
-		       const void *src, size_t n)
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_dev2host (dst, src, n);
 }
 
 void *
-GOMP_OFFLOAD_host2dev (int ord __attribute__ ((unused)), void *dst,
-		       const void *src, size_t n)
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_host2dev (dst, src, n);
 }
 
@@ -1627,45 +1762,6 @@ GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), size_t mapnum,
 	    num_workers, vector_length, async, targ_mem_desc);
 }
 
-void *
-GOMP_OFFLOAD_openacc_open_device (int n)
-{
-  return nvptx_open_device (n);
-}
-
-int
-GOMP_OFFLOAD_openacc_close_device (void *h)
-{
-  return nvptx_close_device (h);
-}
-
-void
-GOMP_OFFLOAD_openacc_set_device_num (int n)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  assert (n >= 0);
-
-  if (!nvthd->ptx_dev || nvthd->ptx_dev->ord != n)
-    (void) nvptx_open_device (n);
-}
-
-/* This can be called before the device is "opened" for the current thread, in
-   which case we can't tell which device number should be returned.  We don't
-   actually want to open the device here, so just return -1 and let the caller
-   (oacc-init.c:acc_get_device_num) handle it.  */
-
-int
-GOMP_OFFLOAD_openacc_get_device_num (void)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (nvthd && nvthd->ptx_dev)
-    return nvthd->ptx_dev->ord;
-  else
-    return -1;
-}
-
 void
 GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc)
 {
@@ -1729,14 +1825,18 @@ GOMP_OFFLOAD_openacc_async_set_async (int async)
 }
 
 void *
-GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data)
+GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
-  struct ptx_device *ptx_dev = (struct ptx_device *) targ_data;
+  struct ptx_device *ptx_dev;
   struct nvptx_thread *nvthd
     = GOMP_PLUGIN_malloc (sizeof (struct nvptx_thread));
   CUresult r;
   CUcontext thd_ctx;
 
+  ptx_dev = ptx_devices[ord];
+
+  assert (ptx_dev);
+
   r = cuCtxGetCurrent (&thd_ctx);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
diff --git a/libgomp/target.c b/libgomp/target.c
index ba2d231..c67502a 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -163,7 +163,6 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mem_map;
 
   if (mapnum == 0)
     return tgt;
@@ -571,7 +570,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (tgt->mem_map, k);
+	splay_tree_remove (&devicep->mem_map, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -1086,10 +1085,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
     {
       optional_present = optional_total = 0;
       DLSYM_OPT (openacc.exec, openacc_parallel);
-      DLSYM_OPT (openacc.open_device, openacc_open_device);
-      DLSYM_OPT (openacc.close_device, openacc_close_device);
-      DLSYM_OPT (openacc.get_device_num, openacc_get_device_num);
-      DLSYM_OPT (openacc.set_device_num, openacc_set_device_num);
       DLSYM_OPT (openacc.register_async_cleanup,
 		 openacc_register_async_cleanup);
       DLSYM_OPT (openacc.async_test, openacc_async_test);
@@ -1198,7 +1193,6 @@ gomp_target_init (void)
 		current_device.mem_map.root = NULL;
 		current_device.is_initialized = false;
 		current_device.openacc.data_environ = NULL;
-		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
index 84045db..a4cf7f2 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
@@ -58,7 +58,7 @@ main (int argc, char **argv)
       acc_set_device_num (1, (acc_device_t) 0);
 
       devnum = acc_get_device_num (devtype);
-      if (devnum != 0)
+      if (devnum != 1)
 	abort ();
   }
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-26 20:41                                 ` Ilya Verbin
@ 2015-03-30 16:42                                   ` Jakub Jelinek
  2015-03-30 21:43                                     ` Julian Brown
                                                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Jakub Jelinek @ 2015-03-30 16:42 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Thomas Schwinge, Julian Brown, gcc-patches, Kirill Yukhin

On Thu, Mar 26, 2015 at 11:41:30PM +0300, Ilya Verbin wrote:
> Here is the latest patch for libgomp and mic plugin.
> make check-target-libgomp using intelmic emul passed.
> Also I used a testcase from the attachment.

This applies cleanly.

> Latest ptx part is here, I guess:
> https://gcc.gnu.org/ml/gcc-patches/2015-02/msg01407.html

But the one Julian posted doesn't apply on top of your patch.
If there is any interdiff needed on top of your patch, can it be
posted against trunk + your patch?

> +/* Insert mapping of host -> target address pairs to splay tree.  */
> +
> +static void
> +gomp_splay_tree_insert_mapping (struct gomp_device_descr *devicep,
> +				struct addr_pair *host_addr,
> +				struct addr_pair *tgt_addr)
> +{
> +  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
> +  tgt->refcount = 1;
> +  tgt->array = gomp_malloc (sizeof (*tgt->array));
> +  tgt->tgt_start = tgt_addr->start;
> +  tgt->tgt_end = tgt_addr->end;
> +  tgt->to_free = NULL;
> +  tgt->list_count = 0;
> +  tgt->device_descr = devicep;
> +  splay_tree_node node = tgt->array;
> +  splay_tree_key k = &node->key;
> +  k->host_start = host_addr->start;
> +  k->host_end = host_addr->end;
> +  k->tgt_offset = 0;
> +  k->refcount = 1;
> +  k->copy_from = false;
> +  k->tgt = tgt;
> +  node->left = NULL;
> +  node->right = NULL;
> +  splay_tree_insert (&devicep->mem_map, node);
> +}

What is the reason to register and allocate these one at a time, rather than
using one struct target_mem_desc with one tgt->array for all splay tree
nodes registered from one image?
Perhaps you would just use tgt_start of 0 and tgt_end of 0 too (to make it
clear it is special) and just use tgt_offset relative to that (i.e.
absolute), but having to malloc each node individually and having to malloc
a target_mem_desc for each one sounds expensive.
Everything is freed just once anyway, isn't it?

> @@ -654,6 +727,18 @@ void
>  GOMP_offload_register (void *host_table, enum offload_target_type target_type,
>  		       void *target_data)
>  {
> +  int i;
> +  gomp_mutex_lock (&register_lock);
> +
> +  /* Load image to all initialized devices.  */
> +  for (i = 0; i < num_devices; i++)
> +    {
> +      struct gomp_device_descr *devicep = &devices[i];
> +      if (devicep->type == target_type && devicep->is_initialized)
> +	gomp_offload_image_to_device (devicep, host_table, target_data);

Shouldn't either this function, or gomp_offload_image_to_device lock
also devicep->lock mutex and unlock at the end?
Where exactly I guess depends on if the devicep->* hook calls should be
guarded with the mutex or not.  If yes, it should be this function and
gomp_init_device.

> +      if (devicep->type != target_type || !devicep->is_initialized)
> +	continue;
> +

Similarly.

> +      devicep->unload_image_func (devicep->target_id, target_data);
> +
> +      /* Remove mapping from splay tree.  */
> +      for (j = 0; j < num_funcs; j++)
> +	{
> +	  struct splay_tree_key_s k;
> +	  k.host_start = (uintptr_t) host_func_table[j];
> +	  k.host_end = k.host_start + 1;
> +	  splay_tree_remove (&devicep->mem_map, &k);
> +	}
> +
> +      for (j = 0; j < num_vars; j++)
> +	{
> +	  struct splay_tree_key_s k;
> +	  k.host_start = (uintptr_t) host_var_table[j*2];
> +	  k.host_end = k.host_start + (uintptr_t) host_var_table[j*2+1];
> +	  splay_tree_remove (&devicep->mem_map, &k);
> +	}
> +    }

Aren't you leaking here all the tgt->array and tgt allocations here?
Though, if you change it to just two allocations (one tgt, one array),
you'd need to free just once.

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-26 12:09                               ` Jakub Jelinek
  2015-03-26 20:41                                 ` Ilya Verbin
@ 2015-03-27 15:21                                 ` Julian Brown
  1 sibling, 0 replies; 30+ messages in thread
From: Julian Brown @ 2015-03-27 15:21 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Ilya Verbin, Thomas Schwinge, gcc-patches, Kirill Yukhin

[-- Attachment #1: Type: text/plain, Size: 7685 bytes --]

On Thu, 26 Mar 2015 13:09:19 +0100
Jakub Jelinek <jakub@redhat.com> wrote:

> On Mon, Mar 23, 2015 at 10:44:39PM +0300, Ilya Verbin wrote:
> > If it is too late for such global changes (rework initialization in
> > libgomp, change mic and ptx plugins), then here is a small
> > workaround patch to fix offloading from libraries.  Likely, it will
> > not affect OpenACC programs with one image.  make
> > check-target-libgomp passed.
> 
> Sorry for not getting to this earlier, really busy with severe
> regressions bugfixing lately.
> 
> Anyway, IMHO it is not too late to fixing it properly, after all,
> the current code is majorly broken.  As I've said earlier, e.g. the
> lack of mutex guarding gomp_target_init (which is using pthread_once
> guaranteed to be run just once) vs. concurrent GOMP_offload_register
> calls (if those are run from ctors, then I guess something like
> dl_load_lock ensures at least on glibc that multiple
> GOMP_offload_register calls aren't performed at the same time) in
> accessing/reallocating offload_images and num_offload_images and the
> lack of support to register further images after the gomp_target_init
> call (if you dlopen further shared libraries) is really bad.  And it
> would be really nice to support the unloading.
> 
> But I'm afraid I'm lost in what is the latest posted patch for that,
> and how has it been tested (whether just on MIC or MIC emul, or also
> for nvptx).
> 
> So can you please post a link to the latest full patch and how it has
> been tested, and if it is still error prone if say one thread executes
> GOMP_target the first time and another at the same time dlopens some
> shared library that has offloading regions in it, fix that too?

I couldn't say about that -- I don't have all the state on the locking
problems at the moment.

> We still have a week or so to get this sorted out.

Apologies again for the delay in getting this out. Here's a "current"
version of the patch against the gomp4 branch (on top of Ilya's
load/unload patch) which passes testing for nvptx/openacc/libgomp,
modulo the usual (timing-related) noise in the lib-83.c test.

Thomas, can we get this tested on mainline and with MIC emulation?

This version fixes (some if not all) regressions with multiple
NVidia devices, and removes the memory-map lock and is_initialised
fields, reverting to just a splay tree in gomp_device_descr (as the
code was before the OpenACC merge). (Our multi-GPU machine is
temporarily out of action, so I can't easily test that setup at the
moment).

HTH,

Julian

ChangeLog

    gcc/
    * config/nvptx/mkoffload.c (process): Support variable mapping.

    libgomp/
    * libgomp.h (target_mem_desc: Remove mem_map field.
    (struct gomp_memory_mapping): Remove.
    (acc_dispatch_t): Remove open_device_func, close_device_func,
    get_device_num_func, set_device_num_func, target_data members.
    Change create_thread_data_func argument to device number instead of
    generic pointer.
    (struct gomp_device_descr): Replace mem_map field with splay tree
    directly.
    (gomp_free_memmap): Update prototype.
    * oacc-async.c (assert.h): Include.
    (acc_async_test, acc_async_test_all, acc_wait, acc_wait_async)
    (acc_wait_all, acc_wait_all_async): Use current host thread's
    active device, not base_dev.
    * oacc-cuda.c (acc_get_current_cuda_device)
    (acc_get_current_cuda_context, acc_get_cuda_stream)
    (acc_set_cuda_stream): Likewise.
    * oacc-host.c (host_dispatch): Don't set open_device_func,
    close_device_func, get_device_num_func or set_device_num_func.
    (goacc_host_init): Don't initialise host_dispatch.mem_map.lock.
    * oacc-init.c (base_dev, init_key): Remove.
    (cached_base_dev): New.
    (name_of_acc_device_t): New.
    (acc_init_1): Initialise default-numbered device, not zeroth.
    (acc_shutdown_1): Close all devices of a given type.
    (goacc_destroy_thread): Don't use base_dev.
    (lazy_open, lazy_init, lazy_init_and_open): Remove.
    (goacc_attach_host_thread_to_device): New.
    (acc_init): Reimplement with goacc_attach_host_thread_to_device.
    (acc_get_num_devices): Don't use base_dev.
    (acc_set_device_type): Reimplement.
    (acc_get_device_type): Don't use base_dev.
    (acc_get_device_num): Tweak logic.
    (acc_set_device_num): Likewise.
    (goacc_runtime_initialize): Initialize cached_base_dev not base_dev.
    (goacc_lazy_initialize): Reimplement with acc_init and
    goacc_attach_host_thread_to_device.
    * oacc-int.h (goacc_thread): Add base_dev field.
    (base_dev): Remove extern declaration.
    (goacc_attach_host_thread_to_device): Add prototype.
    * oacc-mem.c (lookup_host): Change first argument to
    gomp_device_descr. Use lock/splay tree from gomp_device_descr.
    (lookup_dev): Use lock from devicep not mem_map.
    (acc_malloc): Use current thread's device instead of base_dev.
    (acc_free): Likewise.
    (acc_memcpy_to_device): Likewise.
    (acc_memcpy_from_device): Likewise.
    (acc_deviceptr, acc_map_data, acc_unmap_data, present_create_copy)
    (delete_copyout, update_dev_host, gomp_acc_remove_pointer): Tweak
    lookup_host calls.
    * oacc-parallel.c (select_acc_device): Remove. Replace calls with
    goacc_lazy_initialize (throughout).
    (GOACC_parallel): Use lock and splay tree from gomp_device_descr not
    gomp_memory_mapping.
    * target.c (gomp_map_vars, gomp_copy_from_async, gomp_unmap_vars)
    (gomp_splay_tree_insert_mapping, GOMP_offload_unregister)
    (GOMP_target): Use splay tree and lock directly in
    gomp_device_descr, not gomp_memory_mapping.
    (gomp_update): Remove mm argument. Use splay tree and lock directly
    in gomp_device_descr.
    (gomp_free_memmap): Change argument to struct splay_tree_s.
    (gomp_load_plugin_for_device): Don't initialise openacc
    open_device, close_device, get_device_num or set_device_num hooks.
    Don't initialise target_data or deleted mem_map is_initialized,
    splay_tree.root fields.
    * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_open_device)
    (GOMP_OFFLOAD_openacc_close_device)
    (GOMP_OFFLOAD_openacc_get_device_num)
    (GOMP_OFFLOAD_openacc_set_device_num): Remove.
    (GOMP_OFFLOAD_openacc_create_thread_data): Change (unused) argument
    to int.
    * plugin/plugin-nvptx.c (pthread.h): Include.
    (ptx_inited): Remove.
    (instantiated_devices, ptx_dev_lock): New.
    (struct ptx_image_data): New.
    (ptx_devices, ptx_images, ptx_image_lock): New.
    (fini_streams_for_device): Reorder cuStreamDestroy call.
    (nvptx_get_num_devices): Remove forward declaration.
    (nvptx_init): Change return type to bool.
    (nvptx_fini): Remove.
    (nvptx_attach_host_thread_to_device): New.
    (nvptx_open_device): Return struct ptx_device* instead of void*.
    (nvptx_close_device): Change argument type to struct ptx_device*,
    return type to void.
    (nvptx_get_num_devices): Use instantiated_devices not ptx_inited.
    (kernel_target_data, kernel_host_table): Remove static globals.
    (GOMP_OFFLOAD_register_image, GOMP_OFFLOAD_get_table): Remove.
    (GOMP_OFFLOAD_init_device): Reimplement.
    (GOMP_OFFLOAD_fini_device): Likewise.
    (GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_unload_image): New.
    (GOMP_OFFLOAD_alloc, GOMP_OFFLOAD_free, GOMP_OFFLOAD_dev2host)
    (GOMP_OFFLOAD_host2dev): Use ORD argument.
    (GOMP_OFFLOAD_openacc_open_device)
    (GOMP_OFFLOAD_openacc_close_device)
    (GOMP_OFFLOAD_openacc_set_device_num)
    (GOMP_OFFLOAD_openacc_get_device_num): Remove.
    (GOMP_OFFLOAD_openacc_create_thread_data): Change argument to int
    (device number).

    libgomp/testsuite/
    * libgomp.oacc-c-c++-common/lib-9.c: Fix devnum check in test.

[-- Attachment #2: nvptx-load-unload-4.diff --]
[-- Type: text/x-patch, Size: 65298 bytes --]

commit 63091061f227f124d8d496fd3064982935178f3a
Author: Julian Brown <julian@codesourcery.com>
Date:   Mon Feb 23 11:55:41 2015 -0800

    nvptx load/unload image support, init rework
    
    fix multi-device tests
    
    more load/unload patch cleanups
    
    misc fixes

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index 02c44b6..dbc68bc 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -839,6 +839,7 @@ process (FILE *in, FILE *out)
 {
   const char *input = read_file (in);
   Token *tok = tokenize (input);
+  unsigned int nvars = 0, nfuncs = 0;
 
   do
     tok = parse_file (tok);
@@ -850,16 +851,17 @@ process (FILE *in, FILE *out)
   write_stmts (out, rev_stmts (fns));
   fprintf (out, ";\n\n");
   fprintf (out, "static const char *var_mappings[] = {\n");
-  for (id_map *id = var_ids; id; id = id->next)
+  for (id_map *id = var_ids; id; id = id->next, nvars++)
     fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
   fprintf (out, "};\n\n");
   fprintf (out, "static const char *func_mappings[] = {\n");
-  for (id_map *id = func_ids; id; id = id->next)
+  for (id_map *id = func_ids; id; id = id->next, nfuncs++)
     fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
   fprintf (out, "};\n\n");
 
   fprintf (out, "static const void *target_data[] = {\n");
-  fprintf (out, "  ptx_code, var_mappings, func_mappings\n");
+  fprintf (out, "  ptx_code, (void*) %u, var_mappings, (void*) %u, "
+		"func_mappings\n", nvars, nfuncs);
   fprintf (out, "};\n\n");
 
   fprintf (out, "extern void GOMP_offload_register (const void *, int, void *);\n");
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3fc9aa9..822d2fe 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -656,9 +656,6 @@ struct target_mem_desc {
   /* Corresponding target device descriptor.  */
   struct gomp_device_descr *device_descr;
 
-  /* Memory mapping info for the thread that created this descriptor.  */
-  struct gomp_memory_mapping *mem_map;
-
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
   splay_tree_key list[];
@@ -683,20 +680,6 @@ struct splay_tree_key_s {
 
 #include "splay-tree.h"
 
-/* Information about mapped memory regions (per device/context).  */
-
-struct gomp_memory_mapping
-{
-  /* Mutex for operating with the splay tree and other shared structures.  */
-  gomp_mutex_t lock;
-
-  /* True when tables have been added to this memory map.  */
-  bool is_initialized;
-
-  /* Splay tree containing information about mapped memory regions.  */
-  struct splay_tree_s splay_tree;
-};
-
 typedef struct acc_dispatch_t
 {
   /* This is a linked list of data mapped using the
@@ -706,18 +689,6 @@ typedef struct acc_dispatch_t
   /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
   struct target_mem_desc *data_environ;
 
-  /* Extra information required for a device instance by a given target.  */
-  /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
-  void *target_data;
-
-  /* Open or close a device instance.  */
-  void *(*open_device_func) (int n);
-  int (*close_device_func) (void *h);
-
-  /* Set or get the device number.  */
-  int (*get_device_num_func) (void);
-  void (*set_device_num_func) (int);
-
   /* Execute.  */
   void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *,
 		     unsigned short *, int, int, int, int, void *);
@@ -735,7 +706,7 @@ typedef struct acc_dispatch_t
   void (*async_set_async_func) (int);
 
   /* Create/destroy TLS data.  */
-  void *(*create_thread_data_func) (void *);
+  void *(*create_thread_data_func) (int);
   void (*destroy_thread_data_func) (void *);
 
   /* NVIDIA target specific routines.  */
@@ -783,9 +754,9 @@ struct gomp_device_descr
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
 
-  /* Memory-mapping info for this device instance.  */
-  /* Uses a separate lock.  */
-  struct gomp_memory_mapping mem_map;
+  /* Splay tree containing information about mapped memory regions for this
+     device instance.  */
+  struct splay_tree_s splay_tree;
 
   /* Mutex for the mutable data.  */
   gomp_mutex_t lock;
@@ -810,7 +781,7 @@ extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
 extern void gomp_init_tables (struct gomp_device_descr *,
 			      struct gomp_memory_mapping *);
-extern void gomp_free_memmap (struct gomp_memory_mapping *);
+extern void gomp_free_memmap (struct splay_tree_s);
 extern void gomp_fini_device (struct gomp_device_descr *);
 
 /* work.c */
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 08b7c5e..1f5827e 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -26,7 +26,7 @@
    see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
    <http://www.gnu.org/licenses/>.  */
 
-
+#include <assert.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
@@ -37,13 +37,23 @@ acc_async_test (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  return base_dev->openacc.async_test_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev->openacc.async_test_func (async);
 }
 
 int
 acc_async_test_all (void)
 {
-  return base_dev->openacc.async_test_all_func ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev->openacc.async_test_all_func ();
 }
 
 void
@@ -52,19 +62,34 @@ acc_wait (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  base_dev->openacc.async_wait_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_func (async);
 }
 
 void
 acc_wait_async (int async1, int async2)
 {
-  base_dev->openacc.async_wait_async_func (async1, async2);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_async_func (async1, async2);
 }
 
 void
 acc_wait_all (void)
 {
-  base_dev->openacc.async_wait_all_func ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_all_func ();
 }
 
 void
@@ -73,5 +98,10 @@ acc_wait_all_async (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  base_dev->openacc.async_wait_all_async_func (async);
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  thr->dev->openacc.async_wait_all_async_func (async);
 }
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index c8ef376..4aab422 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -34,51 +34,53 @@
 void *
 acc_get_current_cuda_device (void)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
-    p = base_dev->openacc.cuda.get_current_device_func ();
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
+    return thr->dev->openacc.cuda.get_current_device_func ();
 
-  return p;
+  return NULL;
 }
 
 void *
 acc_get_current_cuda_context (void)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev && base_dev->openacc.cuda.get_current_context_func)
-    p = base_dev->openacc.cuda.get_current_context_func ();
-
-  return p;
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_context_func)
+    return thr->dev->openacc.cuda.get_current_context_func ();
+ 
+  return NULL;
 }
 
 void *
 acc_get_cuda_stream (int async)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
   if (async < 0)
-    return p;
-
-  if (base_dev && base_dev->openacc.cuda.get_stream_func)
-    p = base_dev->openacc.cuda.get_stream_func (async);
+    return NULL;
 
-  return p;
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
+    return thr->dev->openacc.cuda.get_stream_func (async);
+ 
+  return NULL;
 }
 
 int
 acc_set_cuda_stream (int async, void *stream)
 {
-  int s = -1;
+  struct goacc_thread *thr;
 
   if (async < 0 || stream == NULL)
     return 0;
 
   goacc_lazy_initialize ();
 
-  if (base_dev && base_dev->openacc.cuda.set_stream_func)
-    s = base_dev->openacc.cuda.set_stream_func (async, stream);
+  thr = goacc_thread ();
+
+  if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
+    return thr->dev->openacc.cuda.set_stream_func (async, stream);
 
-  return s;
+  return -1;
 }
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 5d67c6c..6dcdbf3 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -53,17 +53,9 @@ static struct gomp_device_descr host_dispatch =
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.is_initialized = false,
-    .mem_map.splay_tree.root = NULL,
     .is_initialized = false,
 
     .openacc = {
-      .open_device_func = GOMP_OFFLOAD_openacc_open_device,
-      .close_device_func = GOMP_OFFLOAD_openacc_close_device,
-
-      .get_device_num_func = GOMP_OFFLOAD_openacc_get_device_num,
-      .set_device_num_func = GOMP_OFFLOAD_openacc_set_device_num,
-
       .exec_func = GOMP_OFFLOAD_openacc_parallel,
 
       .register_async_cleanup_func
@@ -93,7 +85,6 @@ static struct gomp_device_descr host_dispatch =
 static __attribute__ ((constructor))
 void goacc_host_init (void)
 {
-  gomp_mutex_init (&host_dispatch.mem_map.lock);
   gomp_mutex_init (&host_dispatch.lock);
   goacc_register (&host_dispatch);
 }
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 19b937a..091dbc9 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -37,14 +37,13 @@
 
 static gomp_mutex_t acc_device_lock;
 
-/* The dispatch table for the current accelerator device.  This is global, so
-   you can only have one type of device open at any given time in a program.
-   This is the "base" device in that several devices that use the same
-   dispatch table may be active concurrently: this one (the "zeroth") is used
-   for overall initialisation/shutdown, and other instances -- not necessarily
-   including this one -- may be opened and closed once the base device has
-   been initialized.  */
-struct gomp_device_descr *base_dev;
+/* A cached version of the dispatcher for the global "current" accelerator type,
+   e.g. used as the default when creating new host threads.  This is the
+   device-type equivalent of goacc_device_num (which specifies which device to
+   use out of potentially several of the same type).  If there are several
+   devices of a given type, this points at the first one.  */
+
+static struct gomp_device_descr *cached_base_dev = NULL;
 
 #if defined HAVE_TLS || defined USE_EMUTLS
 __thread struct goacc_thread *goacc_tls_data;
@@ -53,9 +52,6 @@ pthread_key_t goacc_tls_key;
 #endif
 static pthread_key_t goacc_cleanup_key;
 
-/* Current dispatcher, and how it was initialized */
-static acc_device_t init_key = _ACC_device_hwm;
-
 static struct goacc_thread *goacc_threads;
 static gomp_mutex_t goacc_thread_lock;
 
@@ -94,6 +90,21 @@ get_openacc_name (const char *name)
     return name;
 }
 
+static const char *
+name_of_acc_device_t (enum acc_device_t type)
+{
+  switch (type)
+    {
+    case acc_device_none: return "none";
+    case acc_device_default: return "default";
+    case acc_device_host: return "host";
+    case acc_device_host_nonshm: return "host_nonshm";
+    case acc_device_not_host: return "not_host";
+    case acc_device_nvidia: return "nvidia";
+    default: gomp_fatal ("unknown device type %u", (unsigned) type);
+    }
+}
+
 static struct gomp_device_descr *
 resolve_device (acc_device_t d)
 {
@@ -159,22 +170,87 @@ resolve_device (acc_device_t d)
 static struct gomp_device_descr *
 acc_init_1 (acc_device_t d)
 {
-  struct gomp_device_descr *acc_dev;
+  struct gomp_device_descr *base_dev, *acc_dev;
+  int ndevs;
 
-  acc_dev = resolve_device (d);
+  base_dev = resolve_device (d);
+
+  ndevs = base_dev->get_num_devices_func ();
+
+  if (!base_dev || ndevs <= 0 || goacc_device_num >= ndevs)
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
 
-  if (!acc_dev || acc_dev->get_num_devices_func () <= 0)
-    gomp_fatal ("device %u not supported", (unsigned)d);
+  acc_dev = &base_dev[goacc_device_num];
 
   if (acc_dev->is_initialized)
     gomp_fatal ("device already active");
 
-  /* We need to remember what we were intialized as, to check shutdown etc.  */
-  init_key = d;
-
   gomp_init_device (acc_dev);
 
-  return acc_dev;
+  return base_dev;
+}
+
+static void
+acc_shutdown_1 (acc_device_t d)
+{
+  struct gomp_device_descr *base_dev;
+  struct goacc_thread *walk;
+  int ndevs, i;
+  bool devices_active = false;
+
+  /* Get the base device for this device type.  */
+  base_dev = resolve_device (d);
+
+  if (!base_dev)
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
+
+  gomp_mutex_lock (&goacc_thread_lock);
+
+  /* Free target-specific TLS data and close all devices.  */
+  for (walk = goacc_threads; walk != NULL; walk = walk->next)
+    {
+      if (walk->target_tls)
+	base_dev->openacc.destroy_thread_data_func (walk->target_tls);
+
+      walk->target_tls = NULL;
+
+      /* This would mean the user is shutting down OpenACC in the middle of an
+         "acc data" pragma.  Likely not intentional.  */
+      if (walk->mapped_data)
+	gomp_fatal ("shutdown in 'acc data' region");
+
+      /* Similarly, if this happens then user code has done something weird.  */
+      if (walk->saved_bound_dev)
+        gomp_fatal ("shutdown during host fallback");
+
+      if (walk->dev)
+	{
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (walk->dev->splay_tree);
+	  gomp_mutex_unlock (&walk->dev->lock);
+
+	  walk->dev = NULL;
+	  walk->base_dev = NULL;
+	}
+    }
+
+  gomp_mutex_unlock (&goacc_thread_lock);
+
+  ndevs = base_dev->get_num_devices_func ();
+
+  /* Close all the devices of this type that have been opened.  */
+  for (i = 0; i < ndevs; i++)
+    {
+      struct gomp_device_descr *acc_dev = &base_dev[i];
+      if (acc_dev->is_initialized)
+        {
+	  devices_active = true;
+	  gomp_fini_device (acc_dev);
+	}
+    }
+
+  if (!devices_active)
+    gomp_fatal ("no device initialized");
 }
 
 static struct goacc_thread *
@@ -207,9 +283,11 @@ goacc_destroy_thread (void *data)
 
   if (thr)
     {
-      if (base_dev && thr->target_tls)
+      struct gomp_device_descr *acc_dev = thr->dev;
+
+      if (acc_dev && thr->target_tls)
 	{
-	  base_dev->openacc.destroy_thread_data_func (thr->target_tls);
+	  acc_dev->openacc.destroy_thread_data_func (thr->target_tls);
 	  thr->target_tls = NULL;
 	}
 
@@ -236,53 +314,49 @@ goacc_destroy_thread (void *data)
   gomp_mutex_unlock (&goacc_thread_lock);
 }
 
-/* Open the ORD'th device of the currently-active type (base_dev must be
-   initialised before calling).  If ORD is < 0, open the default-numbered
-   device (set by the ACC_DEVICE_NUM environment variable or a call to
-   acc_set_device_num), or leave any currently-opened device as is.  "Opening"
-   consists of calling the device's open_device_func hook, and setting up
-   thread-local data (maybe allocating, then initializing with information
-   pertaining to the newly-opened or previously-opened device).  */
+/* Use the ORD'th device instance for the current host thread (or -1 for the
+   current global default).  The device (and the runtime) must be initialised
+   before calling this function.  */
 
-static void
-lazy_open (int ord)
+void
+goacc_attach_host_thread_to_device (int ord)
 {
   struct goacc_thread *thr = goacc_thread ();
-  struct gomp_device_descr *acc_dev;
-
-  if (thr && thr->dev)
-    {
-      assert (ord < 0 || ord == thr->dev->target_id);
-      return;
-    }
-
-  assert (base_dev);
-
+  struct gomp_device_descr *acc_dev = NULL, *base_dev = NULL;
+  int num_devices;
+  
+  if (thr && thr->dev && (thr->dev->target_id == ord || ord < 0))
+    return;
+  
   if (ord < 0)
     ord = goacc_device_num;
-
-  /* The OpenACC 2.0 spec leaves the runtime's behaviour when an out-of-range
-     device is requested as implementation-defined (4.2 ACC_DEVICE_NUM).
-     We choose to raise an error in such a case.  */
-  if (ord >= base_dev->get_num_devices_func ())
-    gomp_fatal ("device %u does not exist", ord);
-
+  
+  /* Decide which type of device to use.  If the current thread has a device
+     type already (e.g. set by acc_set_device_type), use that, else use the
+     global default.  */
+  if (thr && thr->base_dev)
+    base_dev = thr->base_dev;
+  else
+    {
+      assert (cached_base_dev);
+      base_dev = cached_base_dev;
+    }
+  
+  num_devices = base_dev->get_num_devices_func ();
+  if (num_devices <= 0 || ord >= num_devices)
+    gomp_fatal ("device %u out of range", ord);
+  
   if (!thr)
     thr = goacc_new_thread ();
-
-  acc_dev = thr->dev = &base_dev[ord];
-
-  assert (acc_dev->target_id == ord);
-
+  
+  thr->base_dev = base_dev;
+  thr->dev = acc_dev = &base_dev[ord];
   thr->saved_bound_dev = NULL;
   thr->mapped_data = NULL;
-
-  if (!acc_dev->openacc.target_data)
-    acc_dev->openacc.target_data = acc_dev->openacc.open_device_func (ord);
-
+  
   thr->target_tls
-    = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
-
+    = acc_dev->openacc.create_thread_data_func (ord);
+  
   acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
@@ -292,75 +366,20 @@ lazy_open (int ord)
 void
 acc_init (acc_device_t d)
 {
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
 
-  base_dev = acc_init_1 (d);
-
-  lazy_open (-1);
+  cached_base_dev = acc_init_1 (d);
 
   gomp_mutex_unlock (&acc_device_lock);
+  
+  goacc_attach_host_thread_to_device (-1);
 }
 
 ialias (acc_init)
 
-static void
-acc_shutdown_1 (acc_device_t d)
-{
-  struct goacc_thread *walk;
-
-  /* We don't check whether d matches the actual device found, because
-     OpenACC 2.0 (3.2.12) says the parameters to the init and this
-     call must match (for the shutdown call anyway, it's silent on
-     others).  */
-
-  if (!base_dev)
-    gomp_fatal ("no device initialized");
-  if (d != init_key)
-    gomp_fatal ("device %u(%u) is initialized",
-		(unsigned) init_key, (unsigned) base_dev->type);
-
-  gomp_mutex_lock (&goacc_thread_lock);
-
-  /* Free target-specific TLS data and close all devices.  */
-  for (walk = goacc_threads; walk != NULL; walk = walk->next)
-    {
-      if (walk->target_tls)
-	base_dev->openacc.destroy_thread_data_func (walk->target_tls);
-
-      walk->target_tls = NULL;
-
-      /* This would mean the user is shutting down OpenACC in the middle of an
-         "acc data" pragma.  Likely not intentional.  */
-      if (walk->mapped_data)
-	gomp_fatal ("shutdown in 'acc data' region");
-
-      if (walk->dev)
-	{
-	  void *target_data = walk->dev->openacc.target_data;
-	  if (walk->dev->openacc.close_device_func (target_data) < 0)
-	    gomp_fatal ("failed to close device");
-
-	  walk->dev->openacc.target_data = target_data = NULL;
-
-	  struct gomp_memory_mapping *mem_map = &walk->dev->mem_map;
-	  gomp_mutex_lock (&mem_map->lock);
-	  gomp_free_memmap (mem_map);
-	  gomp_mutex_unlock (&mem_map->lock);
-
-	  walk->dev = NULL;
-	}
-    }
-
-  gomp_mutex_unlock (&goacc_thread_lock);
-
-  gomp_fini_device (base_dev);
-
-  base_dev = NULL;
-}
-
 void
 acc_shutdown (acc_device_t d)
 {
@@ -373,59 +392,16 @@ acc_shutdown (acc_device_t d)
 
 ialias (acc_shutdown)
 
-/* This function is called after plugins have been initialized.  It deals with
-   the "base" device, and is used to prepare the runtime for dealing with a
-   number of such devices (as implemented by some particular plugin).  If the
-   argument device type D matches a previous call to the function, return the
-   current base device, else shut the old device down and re-initialize with
-   the new device type.  */
-
-static struct gomp_device_descr *
-lazy_init (acc_device_t d)
-{
-  if (base_dev)
-    {
-      /* Re-initializing the same device, do nothing.  */
-      if (d == init_key)
-	return base_dev;
-
-      acc_shutdown_1 (init_key);
-    }
-
-  assert (!base_dev);
-
-  return acc_init_1 (d);
-}
-
-/* Ensure that plugins are loaded, initialize and open the (default-numbered)
-   device.  */
-
-static void
-lazy_init_and_open (acc_device_t d)
-{
-  if (!base_dev)
-    gomp_init_targets_once ();
-
-  gomp_mutex_lock (&acc_device_lock);
-
-  base_dev = lazy_init (d);
-
-  lazy_open (-1);
-
-  gomp_mutex_unlock (&acc_device_lock);
-}
-
 int
 acc_get_num_devices (acc_device_t d)
 {
   int n = 0;
-  const struct gomp_device_descr *acc_dev;
+  struct gomp_device_descr *acc_dev;
 
   if (d == acc_device_none)
     return 0;
 
-  if (!base_dev)
-    gomp_init_targets_once ();
+  gomp_init_targets_once ();
 
   acc_dev = resolve_device (d);
   if (!acc_dev)
@@ -440,10 +416,39 @@ acc_get_num_devices (acc_device_t d)
 
 ialias (acc_get_num_devices)
 
+/* Set the device type for the current thread only (using the current global
+   default device number), initialising that device if necessary.  Also set the
+   default device type for new threads to D.  */
+
 void
 acc_set_device_type (acc_device_t d)
 {
-  lazy_init_and_open (d);
+  struct gomp_device_descr *base_dev, *acc_dev;
+  struct goacc_thread *thr = goacc_thread ();
+
+  gomp_mutex_lock (&acc_device_lock);
+
+  if (!cached_base_dev)
+    gomp_init_targets_once ();
+
+  cached_base_dev = base_dev = resolve_device (d);
+  acc_dev = &base_dev[goacc_device_num];
+
+  if (!acc_dev->is_initialized)
+    gomp_init_device (acc_dev);
+
+  gomp_mutex_unlock (&acc_device_lock);
+
+  /* We're changing device type: invalidate the current thread's dev and
+     base_dev pointers.  */
+  if (thr && thr->base_dev != base_dev)
+    {
+      thr->base_dev = thr->dev = NULL;
+      if (thr->mapped_data)
+        gomp_fatal ("acc_set_device_type in 'acc data' region");
+    }
+
+  goacc_attach_host_thread_to_device (-1);
 }
 
 ialias (acc_set_device_type)
@@ -452,10 +457,11 @@ acc_device_t
 acc_get_device_type (void)
 {
   acc_device_t res = acc_device_none;
-  const struct gomp_device_descr *dev;
+  struct gomp_device_descr *dev;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev)
-    res = acc_device_type (base_dev->type);
+  if (thr && thr->base_dev)
+    res = acc_device_type (thr->base_dev->type);
   else
     {
       gomp_init_targets_once ();
@@ -476,78 +482,65 @@ int
 acc_get_device_num (acc_device_t d)
 {
   const struct gomp_device_descr *dev;
-  int num;
+  struct goacc_thread *thr = goacc_thread ();
 
   if (d >= _ACC_device_hwm)
     gomp_fatal ("device %u out of range", (unsigned)d);
 
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
   dev = resolve_device (d);
   if (!dev)
-    gomp_fatal ("no devices of type %u", d);
+    gomp_fatal ("device %s not supported", name_of_acc_device_t (d));
 
-  /* We might not have called lazy_open for this host thread yet, in which case
-     the get_device_num_func hook will return -1.  */
-  num = dev->openacc.get_device_num_func ();
-  if (num < 0)
-    num = goacc_device_num;
+  if (thr && thr->base_dev == dev && thr->dev)
+    return thr->dev->target_id;
 
-  return num;
+  return goacc_device_num;
 }
 
 ialias (acc_get_device_num)
 
 void
-acc_set_device_num (int n, acc_device_t d)
+acc_set_device_num (int ord, acc_device_t d)
 {
-  const struct gomp_device_descr *dev;
+  struct gomp_device_descr *base_dev, *acc_dev;
   int num_devices;
 
-  if (!base_dev)
+  if (!cached_base_dev)
     gomp_init_targets_once ();
 
-  if ((int) d == 0)
-    {
-      int i;
-
-      /* A device setting of zero sets all device types on the system to use
-         the Nth instance of that device type.  Only attempt it for initialized
-	 devices though.  */
-      for (i = acc_device_not_host + 1; i < _ACC_device_hwm; i++)
-        {
-	  dev = resolve_device (d);
-	  if (dev && dev->is_initialized)
-	    dev->openacc.set_device_num_func (n);
-	}
+  if (ord < 0)
+    ord = goacc_device_num;
 
-      /* ...and for future calls to acc_init/acc_set_device_type, etc.  */
-      goacc_device_num = n;
-    }
+  if ((int) d == 0)
+    /* Set whatever device is being used by the current host thread to use
+       device instance ORD.  It's unclear if this is supposed to affect other
+       host threads too (OpenACC 2.0 (3.2.4) acc_set_device_num).  */
+    goacc_attach_host_thread_to_device (ord);
   else
     {
-      struct goacc_thread *thr = goacc_thread ();
-
       gomp_mutex_lock (&acc_device_lock);
 
-      base_dev = lazy_init (d);
+      cached_base_dev = base_dev = resolve_device (d);
 
       num_devices = base_dev->get_num_devices_func ();
 
-      if (n >= num_devices)
-        gomp_fatal ("device %u out of range", n);
+      if (ord >= num_devices)
+        gomp_fatal ("device %u out of range", ord);
 
-      /* If we're changing the device number, de-associate this thread with
-	 the device (but don't close the device, since it may be in use by
-	 other threads).  */
-      if (thr && thr->dev && n != thr->dev->target_id)
-	thr->dev = NULL;
+      acc_dev = &base_dev[ord];
 
-      lazy_open (n);
+      if (!acc_dev->is_initialized)
+        gomp_init_device (acc_dev);
 
       gomp_mutex_unlock (&acc_device_lock);
+
+      goacc_attach_host_thread_to_device (ord);
     }
+  
+  goacc_device_num = ord;
 }
 
 ialias (acc_set_device_num)
@@ -555,10 +548,7 @@ ialias (acc_set_device_num)
 int
 acc_on_device (acc_device_t dev)
 {
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (thr && thr->dev
-      && acc_device_type (thr->dev->type) == acc_device_host_nonshm)
+  if (acc_get_device_type () == acc_device_host_nonshm)
     return dev == acc_device_host_nonshm || dev == acc_device_not_host;
 
   /* Just rely on the compiler builtin.  */
@@ -578,7 +568,7 @@ goacc_runtime_initialize (void)
 
   pthread_key_create (&goacc_cleanup_key, goacc_destroy_thread);
 
-  base_dev = NULL;
+  cached_base_dev = NULL;
 
   goacc_threads = NULL;
   gomp_mutex_init (&goacc_thread_lock);
@@ -607,9 +597,8 @@ goacc_restore_bind (void)
 }
 
 /* This is called from any OpenACC support function that may need to implicitly
-   initialize the libgomp runtime.  On exit all such initialization will have
-   been done, and both the global ACC_dev and the per-host-thread ACC_memmap
-   pointers will be valid.  */
+   initialize the libgomp runtime, either globally or from a new host thread. 
+   On exit "goacc_thread" will return a valid & populated thread block.  */
 
 attribute_hidden void
 goacc_lazy_initialize (void)
@@ -619,12 +608,8 @@ goacc_lazy_initialize (void)
   if (thr && thr->dev)
     return;
 
-  if (!base_dev)
-    lazy_init_and_open (acc_device_default);
+  if (!cached_base_dev)
+    acc_init (acc_device_default);
   else
-    {
-      gomp_mutex_lock (&acc_device_lock);
-      lazy_open (-1);
-      gomp_mutex_unlock (&acc_device_lock);
-    }
+    goacc_attach_host_thread_to_device (-1);
 }
diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h
index 85619c8..0ace737 100644
--- a/libgomp/oacc-int.h
+++ b/libgomp/oacc-int.h
@@ -56,6 +56,9 @@ acc_device_type (enum offload_target_type type)
 
 struct goacc_thread
 {
+  /* The base device for the current thread.  */
+  struct gomp_device_descr *base_dev;
+
   /* The device for the current thread.  */
   struct gomp_device_descr *dev;
 
@@ -89,10 +92,7 @@ goacc_thread (void)
 #endif
 
 void goacc_register (struct gomp_device_descr *) __GOACC_NOTHROW;
-
-/* Current dispatcher.  */
-extern struct gomp_device_descr *base_dev;
-
+void goacc_attach_host_thread_to_device (int);
 void goacc_runtime_initialize (void);
 void goacc_save_and_set_bind (acc_device_t);
 void goacc_restore_bind (void);
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 0096d51..1135be5 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -38,7 +38,7 @@
 /* Return block containing [H->S), or NULL if not contained.  */
 
 static splay_tree_key
-lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
+lookup_host (struct gomp_device_descr *devicep, void *h, size_t s)
 {
   struct splay_tree_key_s node;
   splay_tree_key key;
@@ -46,11 +46,11 @@ lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
   node.host_start = (uintptr_t) h;
   node.host_end = (uintptr_t) h + s;
 
-  gomp_mutex_lock (&mem_map->lock);
+  gomp_mutex_lock (&devicep->lock);
 
-  key = splay_tree_lookup (&mem_map->splay_tree, &node);
+  key = splay_tree_lookup (&devicep->splay_tree, &node);
 
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_unlock (&devicep->lock);
 
   return key;
 }
@@ -65,14 +65,14 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
 {
   int i;
   struct target_mem_desc *t;
-  struct gomp_memory_mapping *mem_map;
+  struct gomp_device_descr *devicep;
 
   if (!tgt)
     return NULL;
 
-  mem_map = tgt->mem_map;
+  devicep = tgt->device_descr;
 
-  gomp_mutex_lock (&mem_map->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (t = tgt; t != NULL; t = t->prev)
     {
@@ -80,7 +80,7 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
         break;
     }
 
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_unlock (&devicep->lock);
 
   if (!t)
     return NULL;
@@ -112,7 +112,9 @@ acc_malloc (size_t s)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  return base_dev->alloc_func (thr->dev->target_id, s);
+  assert (thr->dev);
+
+  return thr->dev->alloc_func (thr->dev->target_id, s);
 }
 
 /* OpenACC 2.0a (3.2.16) doesn't specify what to do in the event
@@ -127,6 +129,8 @@ acc_free (void *d)
   if (!d)
     return;
 
+  assert (thr && thr->dev);
+
   /* We don't have to call lazy open here, as the ptr value must have
      been returned by acc_malloc.  It's not permitted to pass NULL in
      (unless you got that null from acc_malloc).  */
@@ -139,7 +143,7 @@ acc_free (void *d)
      acc_unmap_data ((void *)(k->host_start + offset));
    }
 
-  base_dev->free_func (thr->dev->target_id, d);
+  thr->dev->free_func (thr->dev->target_id, d);
 }
 
 void
@@ -149,7 +153,9 @@ acc_memcpy_to_device (void *d, void *h, size_t s)
      been obtained from a routine that did that.  */
   struct goacc_thread *thr = goacc_thread ();
 
-  base_dev->host2dev_func (thr->dev->target_id, d, h, s);
+  assert (thr && thr->dev);
+
+  thr->dev->host2dev_func (thr->dev->target_id, d, h, s);
 }
 
 void
@@ -159,7 +165,9 @@ acc_memcpy_from_device (void *h, void *d, size_t s)
      been obtained from a routine that did that.  */
   struct goacc_thread *thr = goacc_thread ();
 
-  base_dev->dev2host_func (thr->dev->target_id, h, d, s);
+  assert (thr && thr->dev);
+
+  thr->dev->dev2host_func (thr->dev->target_id, h, d, s);
 }
 
 /* Return the device pointer that corresponds to host data H.  Or NULL
@@ -176,7 +184,7 @@ acc_deviceptr (void *h)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  n = lookup_host (&thr->dev->mem_map, h, 1);
+  n = lookup_host (thr->dev, h, 1);
 
   if (!n)
     return NULL;
@@ -229,7 +237,7 @@ acc_is_present (void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   if (n && ((uintptr_t)h < n->host_start
 	    || (uintptr_t)h + s > n->host_end
@@ -271,7 +279,7 @@ acc_map_data (void *h, void *d, size_t s)
 	gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map",
                     (void *)h, (int)s, (void *)d, (int)s);
 
-      if (lookup_host (&acc_dev->mem_map, h, s))
+      if (lookup_host (acc_dev, h, s))
 	gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h,
 		    (int)s);
 
@@ -296,7 +304,7 @@ acc_unmap_data (void *h)
   /* No need to call lazy open, as the address must have been mapped.  */
 
   size_t host_size;
-  splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1);
+  splay_tree_key n = lookup_host (acc_dev, h, 1);
   struct target_mem_desc *t;
 
   if (!n)
@@ -320,7 +328,7 @@ acc_unmap_data (void *h)
       t->tgt_end = 0;
       t->to_free = 0;
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
 	   tp = t, t = t->prev)
@@ -334,7 +342,7 @@ acc_unmap_data (void *h)
 	    break;
 	  }
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   gomp_unmap_vars (t, true);
@@ -358,7 +366,7 @@ present_create_copy (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
   if (n)
     {
       /* Present. */
@@ -389,13 +397,13 @@ present_create_copy (unsigned f, void *h, size_t s)
       tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
 			   false);
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
       tgt->prev = acc_dev->openacc.data_environ;
       acc_dev->openacc.data_environ = tgt;
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   return d;
@@ -436,7 +444,7 @@ delete_copyout (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -479,7 +487,7 @@ update_dev_host (int is_dev, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -532,7 +540,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   struct target_mem_desc *t;
   int minrefs = (mapnum == 1) ? 2 : 3;
 
-  n = lookup_host (&acc_dev->mem_map, h, 1);
+  n = lookup_host (acc_dev, h, 1);
 
   if (!n)
     gomp_fatal ("%p is not a mapped block", (void *)h);
@@ -543,7 +551,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
 
   struct target_mem_desc *tp;
 
-  gomp_mutex_lock (&acc_dev->mem_map.lock);
+  gomp_mutex_lock (&acc_dev->lock);
 
   if (t->refcount == minrefs)
     {
@@ -570,7 +578,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   if (force_copyfrom)
     t->list[0]->copy_from = 1;
 
-  gomp_mutex_unlock (&acc_dev->mem_map.lock);
+  gomp_mutex_unlock (&acc_dev->lock);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 727fced..ed1b965 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -46,32 +46,6 @@ find_pset (int pos, size_t mapnum, unsigned short *kinds)
   return kind == GOMP_MAP_TO_PSET;
 }
 
-
-/* Ensure that the target device for DEVICE_TYPE is initialised (and that
-   plugins have been loaded if appropriate).  The ACC_dev variable for the
-   current thread will be set appropriately for the given device type on
-   return.  */
-
-attribute_hidden void
-select_acc_device (int device_type)
-{
-  goacc_lazy_initialize ();
-
-  if (device_type == GOMP_DEVICE_HOST_FALLBACK)
-    return;
-
-  if (device_type == acc_device_none)
-    device_type = acc_device_host;
-
-  if (device_type >= 0)
-    {
-      /* NOTE: this will go badly if the surrounding data environment is set up
-         to use a different device type.  We'll just have to trust that users
-	 know what they're doing...  */
-      acc_set_device_type (device_type);
-    }
-}
-
 static void goacc_wait (int async, int num_waits, va_list ap);
 
 void
@@ -99,7 +73,7 @@ GOACC_parallel (int device, void (*fn) (void *),
   gomp_debug (0, "%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p, async=%d\n",
 	      __FUNCTION__, mapnum, hostaddrs, sizes, kinds, async);
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   thr = goacc_thread ();
   acc_dev = thr->dev;
@@ -132,9 +106,9 @@ GOACC_parallel (int device, void (*fn) (void *),
     {
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
-      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k);
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
+      tgt_fn_key = splay_tree_lookup (&acc_dev->splay_tree, &k);
+      gomp_mutex_unlock (&acc_dev->lock);
 
       if (tgt_fn_key == NULL)
 	gomp_fatal ("target function wasn't mapped");
@@ -178,7 +152,7 @@ GOACC_data_start (int device, size_t mapnum,
   gomp_debug (0, "%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p\n",
 	      __FUNCTION__, mapnum, hostaddrs, sizes, kinds);
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
@@ -225,7 +199,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
   bool data_enter = false;
   size_t i;
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   thr = goacc_thread ();
   acc_dev = thr->dev;
@@ -366,7 +340,7 @@ GOACC_kernels (int device, void (*fn) (void *),
 
   va_list ap;
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   va_start (ap, num_waits);
 
@@ -437,7 +411,7 @@ GOACC_update (int device, size_t mapnum,
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
   size_t i;
 
-  select_acc_device (device);
+  goacc_lazy_initialize ();
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index bc60f72..1faf5bc 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -119,31 +119,6 @@ GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
 }
 
 STATIC void *
-GOMP_OFFLOAD_openacc_open_device (int n)
-{
-  return (void *) (intptr_t) n;
-}
-
-STATIC int
-GOMP_OFFLOAD_openacc_close_device (void *hnd)
-{
-  return 0;
-}
-
-STATIC int
-GOMP_OFFLOAD_openacc_get_device_num (void)
-{
-  return 0;
-}
-
-STATIC void
-GOMP_OFFLOAD_openacc_set_device_num (int n)
-{
-  if (n > 0)
-    GOMP (fatal) ("device number %u out of range for host execution", n);
-}
-
-STATIC void *
 GOMP_OFFLOAD_alloc (int n __attribute__ ((unused)), size_t s)
 {
   return GOMP (malloc) (s);
@@ -254,7 +229,7 @@ GOMP_OFFLOAD_openacc_async_wait_all_async (int async __attribute__ ((unused)))
 }
 
 STATIC void *
-GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data
+GOMP_OFFLOAD_openacc_create_thread_data (int ord
 					 __attribute__ ((unused)))
 {
   return NULL;
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 483cb75..583ec87 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -133,7 +133,8 @@ struct targ_fn_descriptor
   const char *name;
 };
 
-static bool ptx_inited = false;
+static unsigned int instantiated_devices = 0;
+static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
 struct ptx_stream
 {
@@ -331,9 +332,21 @@ struct ptx_event
   struct ptx_event *next;
 };
 
+struct ptx_image_data
+{
+  void *target_data;
+  CUmodule module;
+  struct ptx_image_data *next;
+};
+
 static pthread_mutex_t ptx_event_lock;
 static struct ptx_event *ptx_events;
 
+static struct ptx_device **ptx_devices;
+
+static struct ptx_image_data *ptx_images = NULL;
+static pthread_mutex_t ptx_image_lock = PTHREAD_MUTEX_INITIALIZER;
+
 #define _XSTR(s) _STR(s)
 #define _STR(s) #s
 
@@ -450,8 +463,8 @@ fini_streams_for_device (struct ptx_device *ptx_dev)
       struct ptx_stream *s = ptx_dev->active_streams;
       ptx_dev->active_streams = ptx_dev->active_streams->next;
 
-      cuStreamDestroy (s->stream);
       map_fini (s);
+      cuStreamDestroy (s->stream);
       free (s);
     }
 
@@ -575,21 +588,21 @@ select_stream_for_async (int async, pthread_t thread, bool create,
   return stream;
 }
 
-static int nvptx_get_num_devices (void);
-
-/* Initialize the device.  */
-static int
+/* Initialize the device.  Return TRUE on success, else FALSE.  PTX_DEV_LOCK
+   should be locked on entry and remains locked on exit.  */
+static bool
 nvptx_init (void)
 {
   CUresult r;
   int rc;
+  int ndevs;
 
-  if (ptx_inited)
-    return nvptx_get_num_devices ();
+  if (instantiated_devices != 0)
+    return true;
 
   rc = verify_device_library ();
   if (rc < 0)
-    return -1;
+    return false;
 
   r = cuInit (0);
   if (r != CUDA_SUCCESS)
@@ -599,22 +612,64 @@ nvptx_init (void)
 
   pthread_mutex_init (&ptx_event_lock, NULL);
 
-  ptx_inited = true;
+  r = cuDeviceGetCount (&ndevs);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetCount error: %s", cuda_error (r));
 
-  return nvptx_get_num_devices ();
+  ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
+					    * ndevs);
+
+  return true;
 }
 
+/* Select the N'th PTX device for the current host thread.  The device must
+   have been previously opened before calling this function.  */
+
 static void
-nvptx_fini (void)
+nvptx_attach_host_thread_to_device (int n)
 {
-  ptx_inited = false;
+  CUdevice dev;
+  CUresult r;
+  struct ptx_device *ptx_dev;
+  CUcontext thd_ctx;
+
+  r = cuCtxGetDevice (&dev);
+  if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
+    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+
+  if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
+    return;
+  else
+    {
+      CUcontext old_ctx;
+
+      ptx_dev = ptx_devices[n];
+      assert (ptx_dev);
+
+      r = cuCtxGetCurrent (&thd_ctx);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
+
+      /* We don't necessarily have a current context (e.g. if it has been
+         destroyed.  Pop it if we do though.  */
+      if (thd_ctx != NULL)
+	{
+	  r = cuCtxPopCurrent (&old_ctx);
+	  if (r != CUDA_SUCCESS)
+            GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
+	}
+
+      r = cuCtxPushCurrent (ptx_dev->ctx);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuCtxPushCurrent error: %s", cuda_error (r));
+    }
 }
 
-static void *
+static struct ptx_device *
 nvptx_open_device (int n)
 {
   struct ptx_device *ptx_dev;
-  CUdevice dev;
+  CUdevice dev, ctx_dev;
   CUresult r;
   int async_engines, pi;
 
@@ -628,6 +683,21 @@ nvptx_open_device (int n)
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
 
+  r = cuCtxGetDevice (&ctx_dev);
+  if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
+    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+  
+  if (r != CUDA_ERROR_INVALID_CONTEXT && ctx_dev != dev)
+    {
+      /* The current host thread has an active context for a different device.
+         Detach it.  */
+      CUcontext old_ctx;
+      
+      r = cuCtxPopCurrent (&old_ctx);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
+    }
+
   r = cuCtxGetCurrent (&ptx_dev->ctx);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
@@ -678,17 +748,16 @@ nvptx_open_device (int n)
 
   init_streams_for_device (ptx_dev, async_engines);
 
-  return (void *) ptx_dev;
+  return ptx_dev;
 }
 
-static int
-nvptx_close_device (void *targ_data)
+static void
+nvptx_close_device (struct ptx_device *ptx_dev)
 {
   CUresult r;
-  struct ptx_device *ptx_dev = targ_data;
 
   if (!ptx_dev)
-    return 0;
+    return;
 
   fini_streams_for_device (ptx_dev);
 
@@ -700,8 +769,6 @@ nvptx_close_device (void *targ_data)
     }
 
   free (ptx_dev);
-
-  return 0;
 }
 
 static int
@@ -714,7 +781,7 @@ nvptx_get_num_devices (void)
      order to enumerate available devices, but CUDA API routines can't be used
      until cuInit has been called.  Just call it now (but don't yet do any
      further initialization).  */
-  if (!ptx_inited)
+  if (instantiated_devices == 0)
     cuInit (0);
 
   r = cuDeviceGetCount (&n);
@@ -1507,64 +1574,84 @@ GOMP_OFFLOAD_get_num_devices (void)
   return nvptx_get_num_devices ();
 }
 
-static void **kernel_target_data;
-static void **kernel_host_table;
-
 void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_data)
+GOMP_OFFLOAD_init_device (int n)
 {
-  kernel_target_data = target_data;
-  kernel_host_table = host_table;
-}
+  pthread_mutex_lock (&ptx_dev_lock);
 
-void
-GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
-{
-  (void) nvptx_init ();
+  if (!nvptx_init () || ptx_devices[n] != NULL)
+    {
+      pthread_mutex_unlock (&ptx_dev_lock);
+      return;
+    }
+
+  ptx_devices[n] = nvptx_open_device (n);
+  instantiated_devices++;
+
+  pthread_mutex_unlock (&ptx_dev_lock);
 }
 
 void
-GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
+GOMP_OFFLOAD_fini_device (int n)
 {
-  nvptx_fini ();
+  pthread_mutex_lock (&ptx_dev_lock);
+
+  if (ptx_devices[n] != NULL)
+    {
+      nvptx_attach_host_thread_to_device (n);
+      nvptx_close_device (ptx_devices[n]);
+      ptx_devices[n] = NULL;
+      instantiated_devices--;
+    }
+
+  pthread_mutex_unlock (&ptx_dev_lock);
 }
 
 int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **tablep)
+GOMP_OFFLOAD_load_image (int ord, void *target_data,
+			 struct addr_pair **target_table)
 {
   CUmodule module;
-  void **fn_table;
-  char **fn_names;
-  int fn_entries, i;
+  char **fn_names, **var_names;
+  unsigned int fn_entries, var_entries, i, j;
   CUresult r;
   struct targ_fn_descriptor *targ_fns;
+  void **img_header = (void **) target_data;
+  struct ptx_image_data *new_image;
 
-  if (nvptx_init () <= 0)
-    return 0;
+  GOMP_OFFLOAD_init_device (ord);
 
-  /* This isn't an error, because an image may legitimately have no offloaded
-     regions and so will not call GOMP_offload_register.  */
-  if (kernel_target_data == NULL)
-    return 0;
+  nvptx_attach_host_thread_to_device (ord);
+
+  link_ptx (&module, img_header[0]);
 
-  link_ptx (&module, kernel_target_data[0]);
+  pthread_mutex_lock (&ptx_image_lock);
+  new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
+  new_image->target_data = target_data;
+  new_image->module = module;
+  new_image->next = ptx_images;
+  ptx_images = new_image;
+  pthread_mutex_unlock (&ptx_image_lock);
 
-  /* kernel_target_data[0] -> ptx code
-     kernel_target_data[1] -> variable mappings
-     kernel_target_data[2] -> array of kernel names in ascii
+  /* The mkoffload utility emits a table of pointers/integers at the start of
+     each offload image:
 
-     kernel_host_table[0] -> start of function addresses (__offload_func_table)
-     kernel_host_table[1] -> end of function addresses (__offload_funcs_end)
+     img_header[0] -> ptx code
+     img_header[1] -> number of variables
+     img_header[2] -> array of variable names (pointers to strings)
+     img_header[3] -> number of kernels
+     img_header[4] -> array of kernel names (pointers to strings)
 
      The array of kernel names and the functions addresses form a
      one-to-one correspondence.  */
 
-  fn_table = kernel_host_table[0];
-  fn_names = (char **) kernel_target_data[2];
-  fn_entries = (kernel_host_table[1] - kernel_host_table[0]) / sizeof (void *);
+  var_entries = (uintptr_t) img_header[1];
+  var_names = (char **) img_header[2];
+  fn_entries = (uintptr_t) img_header[3];
+  fn_names = (char **) img_header[4];
 
-  *tablep = GOMP_PLUGIN_malloc (sizeof (struct mapping_table) * fn_entries);
+  *target_table = GOMP_PLUGIN_malloc (sizeof (struct addr_pair)
+				      * (fn_entries + var_entries));
   targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor)
 				 * fn_entries);
 
@@ -1579,38 +1666,86 @@ GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
       targ_fns[i].fn = function;
       targ_fns[i].name = (const char *) fn_names[i];
 
-      (*tablep)[i].host_start = (uintptr_t) fn_table[i];
-      (*tablep)[i].host_end = (*tablep)[i].host_start + 1;
-      (*tablep)[i].tgt_start = (uintptr_t) &targ_fns[i];
-      (*tablep)[i].tgt_end = (*tablep)[i].tgt_start + 1;
+      (*target_table)[i].start = (uintptr_t) &targ_fns[i];
+      (*target_table)[i].end = (*target_table)[i].start + 1;
     }
 
-  return fn_entries;
+  for (j = 0; j < var_entries; j++, i++)
+    {
+      CUdeviceptr var;
+      size_t bytes;
+
+      r = cuModuleGetGlobal (&var, &bytes, module, var_names[j]);
+      if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+
+      (*target_table)[i].start = (uintptr_t) var;
+      (*target_table)[i].end = (*target_table)[i].start + bytes;
+    }
+
+  return i;
+}
+
+void
+GOMP_OFFLOAD_unload_image (int tid __attribute__((unused)), void *target_data)
+{
+  void **img_header = (void **) target_data;
+  struct targ_fn_descriptor *targ_fns
+    = (struct targ_fn_descriptor *) img_header[0];
+  struct ptx_image_data *image, *prev = NULL, *newhd = NULL;
+
+  free (targ_fns);
+
+  pthread_mutex_lock (&ptx_image_lock);
+  for (image = ptx_images; image != NULL;)
+    {
+      struct ptx_image_data *next = image->next;
+
+      if (image->target_data == target_data)
+	{
+	  cuModuleUnload (image->module);
+	  free (image);
+	  if (prev)
+	    prev->next = next;
+	}
+      else
+	{
+	  prev = image;
+	  if (!newhd)
+	    newhd = image;
+	}
+
+      image = next;
+    }
+  ptx_images = newhd;
+  pthread_mutex_unlock (&ptx_image_lock);
 }
 
 void *
-GOMP_OFFLOAD_alloc (int n __attribute__ ((unused)), size_t size)
+GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_alloc (size);
 }
 
 void
-GOMP_OFFLOAD_free (int n __attribute__ ((unused)), void *ptr)
+GOMP_OFFLOAD_free (int ord, void *ptr)
 {
+  nvptx_attach_host_thread_to_device (ord);
   nvptx_free (ptr);
 }
 
 void *
-GOMP_OFFLOAD_dev2host (int ord __attribute__ ((unused)), void *dst,
-		       const void *src, size_t n)
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_dev2host (dst, src, n);
 }
 
 void *
-GOMP_OFFLOAD_host2dev (int ord __attribute__ ((unused)), void *dst,
-		       const void *src, size_t n)
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
+  nvptx_attach_host_thread_to_device (ord);
   return nvptx_host2dev (dst, src, n);
 }
 
@@ -1627,45 +1762,6 @@ GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), size_t mapnum,
 	    num_workers, vector_length, async, targ_mem_desc);
 }
 
-void *
-GOMP_OFFLOAD_openacc_open_device (int n)
-{
-  return nvptx_open_device (n);
-}
-
-int
-GOMP_OFFLOAD_openacc_close_device (void *h)
-{
-  return nvptx_close_device (h);
-}
-
-void
-GOMP_OFFLOAD_openacc_set_device_num (int n)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  assert (n >= 0);
-
-  if (!nvthd->ptx_dev || nvthd->ptx_dev->ord != n)
-    (void) nvptx_open_device (n);
-}
-
-/* This can be called before the device is "opened" for the current thread, in
-   which case we can't tell which device number should be returned.  We don't
-   actually want to open the device here, so just return -1 and let the caller
-   (oacc-init.c:acc_get_device_num) handle it.  */
-
-int
-GOMP_OFFLOAD_openacc_get_device_num (void)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (nvthd && nvthd->ptx_dev)
-    return nvthd->ptx_dev->ord;
-  else
-    return -1;
-}
-
 void
 GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc)
 {
@@ -1729,14 +1825,18 @@ GOMP_OFFLOAD_openacc_async_set_async (int async)
 }
 
 void *
-GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data)
+GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
-  struct ptx_device *ptx_dev = (struct ptx_device *) targ_data;
+  struct ptx_device *ptx_dev;
   struct nvptx_thread *nvthd
     = GOMP_PLUGIN_malloc (sizeof (struct nvptx_thread));
   CUresult r;
   CUcontext thd_ctx;
 
+  ptx_dev = ptx_devices[ord];
+
+  assert (ptx_dev);
+
   r = cuCtxGetCurrent (&thd_ctx);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
diff --git a/libgomp/target.c b/libgomp/target.c
index f443cff..fabfc02 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -150,14 +150,12 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   const int rshift = is_openacc ? 8 : 3;
   const int typemask = is_openacc ? 0xff : 0x7;
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
   struct splay_tree_key_s cur_node;
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mm;
 
   if (mapnum == 0)
     return tgt;
@@ -171,7 +169,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       tgt_size = mapnum * sizeof (void *);
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < mapnum; i++)
     {
@@ -186,7 +184,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	cur_node.host_end = cur_node.host_start + sizes[i];
       else
 	cur_node.host_end = cur_node.host_start + sizeof (void *);
-      splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+      splay_tree_key n = splay_tree_lookup (&devicep->splay_tree, &cur_node);
       if (n)
 	{
 	  tgt->list[i] = n;
@@ -271,7 +269,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      k->host_end = k->host_start + sizes[i];
 	    else
 	      k->host_end = k->host_start + sizeof (void *);
-	    splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k);
+	    splay_tree_key n = splay_tree_lookup (&devicep->splay_tree, k);
 	    if (n)
 	      {
 		tgt->list[i] = n;
@@ -291,7 +289,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		tgt->refcount++;
 		array->left = NULL;
 		array->right = NULL;
-		splay_tree_insert (&mm->splay_tree, array);
+		splay_tree_insert (&devicep->splay_tree, array);
 		switch (kind & typemask)
 		  {
 		  case GOMP_MAP_ALLOC:
@@ -329,16 +327,17 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    /* Add bias to the pointer value.  */
 		    cur_node.host_start += sizes[i];
 		    cur_node.host_end = cur_node.host_start + 1;
-		    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+		    n = splay_tree_lookup (&devicep->splay_tree, &cur_node);
 		    if (n == NULL)
 		      {
 			/* Could be possibly zero size array section.  */
 			cur_node.host_end--;
-			n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			n = splay_tree_lookup (&devicep->splay_tree, &cur_node);
 			if (n == NULL)
 			  {
 			    cur_node.host_start--;
-			    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			    n = splay_tree_lookup (&devicep->splay_tree,
+						   &cur_node);
 			    cur_node.host_start++;
 			  }
 		      }
@@ -397,17 +396,18 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			  /* Add bias to the pointer value.  */
 			  cur_node.host_start += sizes[j];
 			  cur_node.host_end = cur_node.host_start + 1;
-			  n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			  n = splay_tree_lookup (&devicep->splay_tree,
+						 &cur_node);
 			  if (n == NULL)
 			    {
 			      /* Could be possibly zero size array section.  */
 			      cur_node.host_end--;
-			      n = splay_tree_lookup (&mm->splay_tree,
+			      n = splay_tree_lookup (&devicep->splay_tree,
 						     &cur_node);
 			      if (n == NULL)
 				{
 				  cur_node.host_start--;
-				  n = splay_tree_lookup (&mm->splay_tree,
+				  n = splay_tree_lookup (&devicep->splay_tree,
 							 &cur_node);
 				  cur_node.host_start++;
 				}
@@ -479,7 +479,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	}
     }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
   return tgt;
 }
 
@@ -504,10 +504,9 @@ attribute_hidden void
 gomp_copy_from_async (struct target_mem_desc *tgt)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
   size_t i;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < tgt->list_count; i++)
     if (tgt->list[i] == NULL)
@@ -526,7 +525,7 @@ gomp_copy_from_async (struct target_mem_desc *tgt)
 				  k->host_end - k->host_start);
       }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 /* Unmap variables described by TGT.  If DO_COPYFROM is true, copy relevant
@@ -537,7 +536,6 @@ attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
 
   if (tgt->list_count == 0)
     {
@@ -545,7 +543,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
       return;
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   size_t i;
   for (i = 0; i < tgt->list_count; i++)
@@ -562,7 +560,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (&mm->splay_tree, k);
+	splay_tree_remove (&devicep->splay_tree, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -574,13 +572,12 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
   else
     gomp_unmap_tgt (tgt);
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 static void
-gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
-	     size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds,
-	     bool is_openacc)
+gomp_update (struct gomp_device_descr *devicep, size_t mapnum,
+	     void **hostaddrs, size_t *sizes, void *kinds, bool is_openacc)
 {
   size_t i;
   struct splay_tree_key_s cur_node;
@@ -592,13 +589,13 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
   if (mapnum == 0)
     return;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
   for (i = 0; i < mapnum; i++)
     if (sizes[i])
       {
 	cur_node.host_start = (uintptr_t) hostaddrs[i];
 	cur_node.host_end = cur_node.host_start + sizes[i];
-	splay_tree_key n = splay_tree_lookup (&mm->splay_tree,
+	splay_tree_key n = splay_tree_lookup (&devicep->splay_tree,
 					      &cur_node);
 	if (n)
 	  {
@@ -633,7 +630,7 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
 		      (void *) cur_node.host_start,
 		      (void *) cur_node.host_end);
       }
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 
@@ -644,7 +641,6 @@ gomp_splay_tree_insert_mapping (struct gomp_device_descr *devicep,
 				struct addr_pair *host_addr,
 				struct addr_pair *tgt_addr)
 {
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
   struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
   tgt->refcount = 1;
   tgt->array = gomp_malloc (sizeof (*tgt->array));
@@ -663,7 +659,7 @@ gomp_splay_tree_insert_mapping (struct gomp_device_descr *devicep,
   k->tgt = tgt;
   node->left = NULL;
   node->right = NULL;
-  splay_tree_insert (&mm->splay_tree, node);
+  splay_tree_insert (&devicep->splay_tree, node);
 }
 
 /* Load image pointed by TARGET_DATA to the device, specified by DEVICEP.
@@ -767,7 +763,6 @@ GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
     {
       int j;
       struct gomp_device_descr *devicep = &devices[i];
-      struct gomp_memory_mapping *mm = &devicep->mem_map;
 
       if (devicep->type != target_type || !devicep->is_initialized)
 	continue;
@@ -780,7 +775,7 @@ GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
 	  struct splay_tree_key_s k;
 	  k.host_start = (uintptr_t) host_func_table[j];
 	  k.host_end = k.host_start + 1;
-	  splay_tree_remove (&mm->splay_tree, &k);
+	  splay_tree_remove (&devicep->splay_tree, &k);
 	}
 
       for (j = 0; j < num_vars; j++)
@@ -788,7 +783,7 @@ GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
 	  struct splay_tree_key_s k;
 	  k.host_start = (uintptr_t) host_var_table[j*2];
 	  k.host_end = k.host_start + (uintptr_t) host_var_table[j*2+1];
-	  splay_tree_remove (&mm->splay_tree, &k);
+	  splay_tree_remove (&devicep->splay_tree, &k);
 	}
     }
 
@@ -822,22 +817,20 @@ gomp_init_device (struct gomp_device_descr *devicep)
   devicep->is_initialized = true;
 }
 
-/* Free address mapping tables.  MM must be locked on entry, and remains locked
-   on return.  */
+/* Free address mapping tables.  The owning device must be locked on entry, and
+   remains locked on return.  */
 
 attribute_hidden void
-gomp_free_memmap (struct gomp_memory_mapping *mm)
+gomp_free_memmap (struct splay_tree_s splay_tree)
 {
-  while (mm->splay_tree.root)
+  while (splay_tree.root)
     {
-      struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt;
+      struct target_mem_desc *tgt = splay_tree.root->key.tgt;
 
-      splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key);
+      splay_tree_remove (&splay_tree, &splay_tree.root->key);
       free (tgt->array);
       free (tgt);
     }
-
-  mm->is_initialized = false;
 }
 
 /* This function de-initializes the target device, specified by DEVICEP.
@@ -868,7 +861,6 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
 	     unsigned char *kinds)
 {
   struct gomp_device_descr *devicep = resolve_device (device);
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
 
   if (devicep == NULL
       || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
@@ -902,7 +894,7 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
       struct splay_tree_key_s k;
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      splay_tree_key tgt_fn = splay_tree_lookup (&mm->splay_tree, &k);
+      splay_tree_key tgt_fn = splay_tree_lookup (&devicep->splay_tree, &k);
       if (tgt_fn == NULL)
 	gomp_fatal ("Target function wasn't mapped");
       fn_addr = (void *) tgt_fn->tgt->tgt_start;
@@ -990,7 +982,7 @@ GOMP_target_update (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  gomp_update (devicep, &devicep->mem_map, mapnum, hostaddrs, sizes, kinds, false);
+  gomp_update (devicep, mapnum, hostaddrs, sizes, kinds, false);
 }
 
 void
@@ -1074,10 +1066,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
     {
       optional_present = optional_total = 0;
       DLSYM_OPT (openacc.exec, openacc_parallel);
-      DLSYM_OPT (openacc.open_device, openacc_open_device);
-      DLSYM_OPT (openacc.close_device, openacc_close_device);
-      DLSYM_OPT (openacc.get_device_num, openacc_get_device_num);
-      DLSYM_OPT (openacc.set_device_num, openacc_set_device_num);
       DLSYM_OPT (openacc.register_async_cleanup,
 		 openacc_register_async_cleanup);
       DLSYM_OPT (openacc.async_test, openacc_async_test);
@@ -1183,16 +1171,13 @@ gomp_target_init (void)
 		current_device.name = current_device.get_name_func ();
 		/* current_device.capabilities has already been set.  */
 		current_device.type = current_device.get_type_func ();
-		current_device.mem_map.is_initialized = false;
-		current_device.mem_map.splay_tree.root = NULL;
+		current_device.splay_tree.root = NULL;
 		current_device.is_initialized = false;
 		current_device.openacc.data_environ = NULL;
-		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
 		    devices[num_devices] = current_device;
-		    gomp_mutex_init (&devices[num_devices].mem_map.lock);
 		    gomp_mutex_init (&devices[num_devices].lock);
 		    num_devices++;
 		  }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
index 84045db..a4cf7f2 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-9.c
@@ -58,7 +58,7 @@ main (int argc, char **argv)
       acc_set_device_num (1, (acc_device_t) 0);
 
       devnum = acc_get_device_num (devtype);
-      if (devnum != 0)
+      if (devnum != 1)
 	abort ();
   }
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-26 12:09                               ` Jakub Jelinek
@ 2015-03-26 20:41                                 ` Ilya Verbin
  2015-03-30 16:42                                   ` Jakub Jelinek
  2015-03-27 15:21                                 ` Julian Brown
  1 sibling, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-03-26 20:41 UTC (permalink / raw)
  To: Jakub Jelinek, Thomas Schwinge; +Cc: Julian Brown, gcc-patches, Kirill Yukhin

[-- Attachment #1: Type: text/plain, Size: 51153 bytes --]

On Thu, Mar 26, 2015 at 13:09:19 +0100, Jakub Jelinek wrote:
> On Mon, Mar 23, 2015 at 10:44:39PM +0300, Ilya Verbin wrote:
> > If it is too late for such global changes (rework initialization in libgomp,
> > change mic and ptx plugins), then here is a small workaround patch to fix
> > offloading from libraries.  Likely, it will not affect OpenACC programs with one
> > image.  make check-target-libgomp passed.
> 
> Sorry for not getting to this earlier, really busy with severe regressions
> bugfixing lately.
> 
> Anyway, IMHO it is not too late to fixing it properly, after all,
> the current code is majorly broken.  As I've said earlier, e.g. the lack
> of mutex guarding gomp_target_init (which is using pthread_once guaranteed
> to be run just once) vs. concurrent GOMP_offload_register calls
> (if those are run from ctors, then I guess something like dl_load_lock
> ensures at least on glibc that multiple GOMP_offload_register calls aren't
> performed at the same time) in accessing/reallocating offload_images
> and num_offload_images and the lack of support to register further
> images after the gomp_target_init call (if you dlopen further shared
> libraries) is really bad.  And it would be really nice to support the
> unloading.
> 
> But I'm afraid I'm lost in what is the latest posted patch for that,
> and how has it been tested (whether just on MIC or MIC emul, or also for
> nvptx).
> 
> So can you please post a link to the latest full patch and how it has been
> tested, and if it is still error prone if say one thread executes
> GOMP_target the first time and another at the same time dlopens some shared
> library that has offloading regions in it, fix that too?
> 
> We still have a week or so to get this sorted out.

Here is the latest patch for libgomp and mic plugin.
make check-target-libgomp using intelmic emul passed.
Also I used a testcase from the attachment.

Latest ptx part is here, I guess:
https://gcc.gnu.org/ml/gcc-patches/2015-02/msg01407.html

Thomas, could you please test these 2 patches together on nvptx?


gcc/
	* config/i386/intelmic-mkoffload.c (generate_host_descr_file): Call
	GOMP_offload_unregister from the destructor.
libgomp/
	* libgomp-plugin.h (struct mapping_table): Replace with addr_pair.
	* libgomp.h (struct gomp_memory_mapping): Remove.
	(struct target_mem_desc): Change type of mem_map from
	gomp_memory_mapping * to splay_tree_s *.
	(struct gomp_device_descr): Remove register_image_func, get_table_func.
	Add load_image_func, unload_image_func.
	Change type of mem_map from gomp_memory_mapping to splay_tree_s.
	Remove offload_regions_registered.
	(gomp_init_tables): Remove.
	(gomp_free_memmap): Change type of argument from gomp_memory_mapping *
	to splay_tree_s *.
	* libgomp.map (GOMP_4.0.1): Add GOMP_offload_unregister.
	* oacc-host.c (host_dispatch): Do not initialize register_image_func,
	get_table_func, mem_map.is_initialized, mem_map.splay_tree.root,
	offload_regions_registered.
	Initialize load_image_func, unload_image_func, mem_map.root.
	(goacc_host_init): Do not initialize host_dispatch.mem_map.lock.
	* oacc-init.c (lazy_open): Don't call gomp_init_tables.
	(acc_shutdown_1): Use dev's lock and splay_tree instead of mem_map's.
	* oacc-mem.c (lookup_host): Get gomp_device_descr *dev instead of
	gomp_memory_mapping *.  Use dev's lock and splay_tree.
	(lookup_dev): Use dev's lock.
	(acc_deviceptr): Pass dev to lookup_host instead of mem_map.
	(acc_is_present): Likewise.
	(acc_map_data): Likewise.
	(acc_unmap_data): Likewise.  Use dev's lock.
	(present_create_copy): Likewise.
	(delete_copyout): Pass dev to lookup_host instead of mem_map.
	(update_dev_host): Likewise.
	(gomp_acc_remove_pointer): Likewise.  Use dev's lock.
	* oacc-parallel.c (GOACC_parallel): Use dev's lock and splay_tree.
	* plugin/plugin-host.c (GOMP_OFFLOAD_register_image): Remove.
	(GOMP_OFFLOAD_get_table): Remove
	(GOMP_OFFLOAD_load_image): New function.
	(GOMP_OFFLOAD_unload_image): New function.
	* target.c (register_lock): New mutex for offload image registration.
	(gomp_map_vars): Use dev's lock and splay_tree instead of mem_map's.
	(gomp_copy_from_async): Likewise.
	(gomp_unmap_vars): Likewise.
	(gomp_update): Remove gomp_memory_mapping argument.  Use dev's lock and
	splay_tree instead of mm's.
	(gomp_splay_tree_insert_mapping): New static function.
	(gomp_offload_image_to_device): Ditto.
	(GOMP_offload_register): Add mutex lock.
	Call gomp_offload_image_to_device for all initialized devices.
	(GOMP_offload_unregister): New function.
	(gomp_init_tables): Replace with gomp_init_device.  Replace a call to
	get_table_func from the plugin with calls to init_device_func and
	gomp_offload_image_to_device.
	(gomp_free_memmap): Change type of argument from gomp_memory_mapping *
	to splay_tree_s *.
	(GOMP_target): Do not call gomp_init_tables.  Use dev's lock and
	splay_tree instead of mm's.
	(GOMP_target_data): Do not call gomp_init_tables.
	(GOMP_target_update): Likewise.  Remove argument from gomp_update.
	(gomp_load_plugin_for_device): Replace register_image and get_table
	with load_image and unload_image in DLSYM ().
	(gomp_register_images_for_device): Remove function.
	(gomp_target_init): Do not initialize current_device.mem_map.*,
	current_device.offload_regions_registered.
	Remove call to gomp_register_images_for_device.
	Do not free offload_images and num_offload_images.
liboffloadmic/
	* plugin/libgomp-plugin-intelmic.cpp: Include map.
	(AddrVect, DevAddrVect, ImgDevAddrMap): New typedefs.
	(num_devices, num_images, address_table): New static vars.
	(num_libraries, lib_descrs): Remove static vars.
	(set_mic_lib_path): Rename to ...
	(init): ... this.  Allocate address_table and get num_devices.
	(GOMP_OFFLOAD_get_num_devices): return num_devices.
	(load_lib_and_get_table): Remove static function.
	(offload_image): New static function.
	(GOMP_OFFLOAD_get_table): Remove function.
	(GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_unload_image): New functions.


diff --git a/gcc/config/i386/intelmic-mkoffload.c b/gcc/config/i386/intelmic-mkoffload.c
index f93007c..e101f93 100644
--- a/gcc/config/i386/intelmic-mkoffload.c
+++ b/gcc/config/i386/intelmic-mkoffload.c
@@ -350,14 +350,24 @@ generate_host_descr_file (const char *host_compiler)
 	   "#ifdef __cplusplus\n"
 	   "extern \"C\"\n"
 	   "#endif\n"
-	   "void GOMP_offload_register (void *, int, void *);\n\n"
+	   "void GOMP_offload_register (void *, int, void *);\n"
+	   "void GOMP_offload_unregister (void *, int, void *);\n\n"
 
 	   "__attribute__((constructor))\n"
 	   "static void\n"
 	   "init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
+	   "}\n\n", GOMP_DEVICE_INTEL_MIC);
+
+  fprintf (src_file,
+	   "__attribute__((destructor))\n"
+	   "static void\n"
+	   "fini (void)\n"
+	   "{\n"
+	   "  GOMP_offload_unregister (&__OFFLOAD_TABLE__, %d, __offload_target_data);\n"
 	   "}\n", GOMP_DEVICE_INTEL_MIC);
+
   fclose (src_file);
 
   unsigned new_argc = 0;
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index d9cbff5..1072ae4 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -51,14 +51,12 @@ enum offload_target_type
   OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
 };
 
-/* Auxiliary struct, used for transferring a host-target address range mapping
-   from plugin to libgomp.  */
-struct mapping_table
+/* Auxiliary struct, used for transferring pairs of addresses from plugin
+   to libgomp.  */
+struct addr_pair
 {
-  uintptr_t host_start;
-  uintptr_t host_end;
-  uintptr_t tgt_start;
-  uintptr_t tgt_end;
+  uintptr_t start;
+  uintptr_t end;
 };
 
 /* Miscellaneous functions.  */
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3089401..a1d42c5 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -224,7 +224,6 @@ struct gomp_team_state
 };
 
 struct target_mem_desc;
-struct gomp_memory_mapping;
 
 /* These are the OpenMP 4.0 Internal Control Variables described in
    section 2.3.1.  Those described as having one copy per task are
@@ -657,7 +656,7 @@ struct target_mem_desc {
   struct gomp_device_descr *device_descr;
 
   /* Memory mapping info for the thread that created this descriptor.  */
-  struct gomp_memory_mapping *mem_map;
+  struct splay_tree_s *mem_map;
 
   /* List of splay keys to remove (or decrease refcount)
      at the end of region.  */
@@ -683,20 +682,6 @@ struct splay_tree_key_s {
 
 #include "splay-tree.h"
 
-/* Information about mapped memory regions (per device/context).  */
-
-struct gomp_memory_mapping
-{
-  /* Mutex for operating with the splay tree and other shared structures.  */
-  gomp_mutex_t lock;
-
-  /* True when tables have been added to this memory map.  */
-  bool is_initialized;
-
-  /* Splay tree containing information about mapped memory regions.  */
-  struct splay_tree_s splay_tree;
-};
-
 typedef struct acc_dispatch_t
 {
   /* This is a linked list of data mapped using the
@@ -773,19 +758,18 @@ struct gomp_device_descr
   unsigned int (*get_caps_func) (void);
   int (*get_type_func) (void);
   int (*get_num_devices_func) (void);
-  void (*register_image_func) (void *, void *);
   void (*init_device_func) (int);
   void (*fini_device_func) (int);
-  int (*get_table_func) (int, struct mapping_table **);
+  int (*load_image_func) (int, void *, struct addr_pair **);
+  void (*unload_image_func) (int, void *);
   void *(*alloc_func) (int, size_t);
   void (*free_func) (int, void *);
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
 
-  /* Memory-mapping info for this device instance.  */
-  /* Uses a separate lock.  */
-  struct gomp_memory_mapping mem_map;
+  /* Splay tree containing information about mapped memory regions.  */
+  struct splay_tree_s mem_map;
 
   /* Mutex for the mutable data.  */
   gomp_mutex_t lock;
@@ -793,9 +777,6 @@ struct gomp_device_descr
   /* Set to true when device is initialized.  */
   bool is_initialized;
 
-  /* True when offload regions have been registered with this device.  */
-  bool offload_regions_registered;
-
   /* OpenACC-specific data and functions.  */
   /* This is mutable because of its mutable data_environ and target_data
      members.  */
@@ -811,9 +792,7 @@ extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
 extern void gomp_copy_from_async (struct target_mem_desc *);
 extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
-extern void gomp_init_tables (struct gomp_device_descr *,
-			      struct gomp_memory_mapping *);
-extern void gomp_free_memmap (struct gomp_memory_mapping *);
+extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_fini_device (struct gomp_device_descr *);
 
 /* work.c */
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index f44174e..2b2b953 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -231,6 +231,7 @@ GOMP_4.0 {
 GOMP_4.0.1 {
   global:
 	GOMP_offload_register;
+	GOMP_offload_unregister;
 } GOMP_4.0;
 
 OACC_2.0 {
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 6aeb1e7..e4756b6 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -43,20 +43,18 @@ static struct gomp_device_descr host_dispatch =
     .get_caps_func = GOMP_OFFLOAD_get_caps,
     .get_type_func = GOMP_OFFLOAD_get_type,
     .get_num_devices_func = GOMP_OFFLOAD_get_num_devices,
-    .register_image_func = GOMP_OFFLOAD_register_image,
     .init_device_func = GOMP_OFFLOAD_init_device,
     .fini_device_func = GOMP_OFFLOAD_fini_device,
-    .get_table_func = GOMP_OFFLOAD_get_table,
+    .load_image_func = GOMP_OFFLOAD_load_image,
+    .unload_image_func = GOMP_OFFLOAD_unload_image,
     .alloc_func = GOMP_OFFLOAD_alloc,
     .free_func = GOMP_OFFLOAD_free,
     .dev2host_func = GOMP_OFFLOAD_dev2host,
     .host2dev_func = GOMP_OFFLOAD_host2dev,
     .run_func = GOMP_OFFLOAD_run,
 
-    .mem_map.is_initialized = false,
-    .mem_map.splay_tree.root = NULL,
+    .mem_map.root = NULL,
     .is_initialized = false,
-    .offload_regions_registered = false,
 
     .openacc = {
       .open_device_func = GOMP_OFFLOAD_openacc_open_device,
@@ -94,7 +92,6 @@ static struct gomp_device_descr host_dispatch =
 static __attribute__ ((constructor))
 void goacc_host_init (void)
 {
-  gomp_mutex_init (&host_dispatch.mem_map.lock);
   gomp_mutex_init (&host_dispatch.lock);
   goacc_register (&host_dispatch);
 }
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 166eb55..1e0243e 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -284,12 +284,6 @@ lazy_open (int ord)
     = acc_dev->openacc.create_thread_data_func (acc_dev->openacc.target_data);
 
   acc_dev->openacc.async_set_async_func (acc_async_sync);
-
-  struct gomp_memory_mapping *mem_map = &acc_dev->mem_map;
-  gomp_mutex_lock (&mem_map->lock);
-  if (!mem_map->is_initialized)
-    gomp_init_tables (acc_dev, mem_map);
-  gomp_mutex_unlock (&mem_map->lock);
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
@@ -351,10 +345,9 @@ acc_shutdown_1 (acc_device_t d)
 
 	  walk->dev->openacc.target_data = target_data = NULL;
 
-	  struct gomp_memory_mapping *mem_map = &walk->dev->mem_map;
-	  gomp_mutex_lock (&mem_map->lock);
-	  gomp_free_memmap (mem_map);
-	  gomp_mutex_unlock (&mem_map->lock);
+	  gomp_mutex_lock (&walk->dev->lock);
+	  gomp_free_memmap (&walk->dev->mem_map);
+	  gomp_mutex_unlock (&walk->dev->lock);
 
 	  walk->dev = NULL;
 	}
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 0096d51..fdc82e6 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -38,7 +38,7 @@
 /* Return block containing [H->S), or NULL if not contained.  */
 
 static splay_tree_key
-lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
+lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
 {
   struct splay_tree_key_s node;
   splay_tree_key key;
@@ -46,11 +46,9 @@ lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s)
   node.host_start = (uintptr_t) h;
   node.host_end = (uintptr_t) h + s;
 
-  gomp_mutex_lock (&mem_map->lock);
-
-  key = splay_tree_lookup (&mem_map->splay_tree, &node);
-
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_lock (&dev->lock);
+  key = splay_tree_lookup (&dev->mem_map, &node);
+  gomp_mutex_unlock (&dev->lock);
 
   return key;
 }
@@ -65,14 +63,11 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
 {
   int i;
   struct target_mem_desc *t;
-  struct gomp_memory_mapping *mem_map;
 
   if (!tgt)
     return NULL;
 
-  mem_map = tgt->mem_map;
-
-  gomp_mutex_lock (&mem_map->lock);
+  gomp_mutex_lock (&tgt->device_descr->lock);
 
   for (t = tgt; t != NULL; t = t->prev)
     {
@@ -80,7 +75,7 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
         break;
     }
 
-  gomp_mutex_unlock (&mem_map->lock);
+  gomp_mutex_unlock (&tgt->device_descr->lock);
 
   if (!t)
     return NULL;
@@ -176,7 +171,7 @@ acc_deviceptr (void *h)
 
   struct goacc_thread *thr = goacc_thread ();
 
-  n = lookup_host (&thr->dev->mem_map, h, 1);
+  n = lookup_host (thr->dev, h, 1);
 
   if (!n)
     return NULL;
@@ -229,7 +224,7 @@ acc_is_present (void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   if (n && ((uintptr_t)h < n->host_start
 	    || (uintptr_t)h + s > n->host_end
@@ -271,7 +266,7 @@ acc_map_data (void *h, void *d, size_t s)
 	gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map",
                     (void *)h, (int)s, (void *)d, (int)s);
 
-      if (lookup_host (&acc_dev->mem_map, h, s))
+      if (lookup_host (acc_dev, h, s))
 	gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h,
 		    (int)s);
 
@@ -296,7 +291,7 @@ acc_unmap_data (void *h)
   /* No need to call lazy open, as the address must have been mapped.  */
 
   size_t host_size;
-  splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1);
+  splay_tree_key n = lookup_host (acc_dev, h, 1);
   struct target_mem_desc *t;
 
   if (!n)
@@ -320,7 +315,7 @@ acc_unmap_data (void *h)
       t->tgt_end = 0;
       t->to_free = 0;
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
 	   tp = t, t = t->prev)
@@ -334,7 +329,7 @@ acc_unmap_data (void *h)
 	    break;
 	  }
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   gomp_unmap_vars (t, true);
@@ -358,7 +353,7 @@ present_create_copy (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
   if (n)
     {
       /* Present. */
@@ -389,13 +384,13 @@ present_create_copy (unsigned f, void *h, size_t s)
       tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
 			   false);
 
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
       tgt->prev = acc_dev->openacc.data_environ;
       acc_dev->openacc.data_environ = tgt;
 
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_unlock (&acc_dev->lock);
     }
 
   return d;
@@ -436,7 +431,7 @@ delete_copyout (unsigned f, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -479,7 +474,7 @@ update_dev_host (int is_dev, void *h, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
-  n = lookup_host (&acc_dev->mem_map, h, s);
+  n = lookup_host (acc_dev, h, s);
 
   /* No need to call lazy open, as the data must already have been
      mapped.  */
@@ -532,7 +527,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   struct target_mem_desc *t;
   int minrefs = (mapnum == 1) ? 2 : 3;
 
-  n = lookup_host (&acc_dev->mem_map, h, 1);
+  n = lookup_host (acc_dev, h, 1);
 
   if (!n)
     gomp_fatal ("%p is not a mapped block", (void *)h);
@@ -543,7 +538,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
 
   struct target_mem_desc *tp;
 
-  gomp_mutex_lock (&acc_dev->mem_map.lock);
+  gomp_mutex_lock (&acc_dev->lock);
 
   if (t->refcount == minrefs)
     {
@@ -570,7 +565,7 @@ gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
   if (force_copyfrom)
     t->list[0]->copy_from = 1;
 
-  gomp_mutex_unlock (&acc_dev->mem_map.lock);
+  gomp_mutex_unlock (&acc_dev->lock);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 0c74f54..563f9bb 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -144,9 +144,9 @@ GOACC_parallel (int device, void (*fn) (void *),
     {
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      gomp_mutex_lock (&acc_dev->mem_map.lock);
-      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k);
-      gomp_mutex_unlock (&acc_dev->mem_map.lock);
+      gomp_mutex_lock (&acc_dev->lock);
+      tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map, &k);
+      gomp_mutex_unlock (&acc_dev->lock);
 
       if (tgt_fn_key == NULL)
 	gomp_fatal ("target function wasn't mapped");
diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c
index ebf7f11..bc60f72 100644
--- a/libgomp/plugin/plugin-host.c
+++ b/libgomp/plugin/plugin-host.c
@@ -95,12 +95,6 @@ GOMP_OFFLOAD_get_num_devices (void)
 }
 
 STATIC void
-GOMP_OFFLOAD_register_image (void *host_table __attribute__ ((unused)),
-			     void *target_data __attribute__ ((unused)))
-{
-}
-
-STATIC void
 GOMP_OFFLOAD_init_device (int n __attribute__ ((unused)))
 {
 }
@@ -111,12 +105,19 @@ GOMP_OFFLOAD_fini_device (int n __attribute__ ((unused)))
 }
 
 STATIC int
-GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
-			struct mapping_table **table __attribute__ ((unused)))
+GOMP_OFFLOAD_load_image (int n __attribute__ ((unused)),
+			 void *i __attribute__ ((unused)),
+			 struct addr_pair **r __attribute__ ((unused)))
 {
   return 0;
 }
 
+STATIC void
+GOMP_OFFLOAD_unload_image (int n __attribute__ ((unused)),
+			   void *i __attribute__ ((unused)))
+{
+}
+
 STATIC void *
 GOMP_OFFLOAD_openacc_open_device (int n)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index c5dda3f..ba2d231 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -49,6 +49,9 @@ static void gomp_target_init (void);
 /* The whole initialization code for offloading plugins is only run one.  */
 static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT;
 
+/* Mutex for offload image registration.  */
+static gomp_mutex_t register_lock;
+
 /* This structure describes an offload image.
    It contains type of the target device, pointer to host table descriptor, and
    pointer to target data.  */
@@ -153,14 +156,14 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   const int rshift = is_openacc ? 8 : 3;
   const int typemask = is_openacc ? 0xff : 0x7;
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
+  struct splay_tree_s *mem_map = &devicep->mem_map;
   struct splay_tree_key_s cur_node;
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
   tgt->list_count = mapnum;
   tgt->refcount = 1;
   tgt->device_descr = devicep;
-  tgt->mem_map = mm;
+  tgt->mem_map = mem_map;
 
   if (mapnum == 0)
     return tgt;
@@ -174,7 +177,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       tgt_size = mapnum * sizeof (void *);
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < mapnum; i++)
     {
@@ -189,7 +192,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	cur_node.host_end = cur_node.host_start + sizes[i];
       else
 	cur_node.host_end = cur_node.host_start + sizeof (void *);
-      splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+      splay_tree_key n = splay_tree_lookup (mem_map, &cur_node);
       if (n)
 	{
 	  tgt->list[i] = n;
@@ -274,7 +277,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      k->host_end = k->host_start + sizes[i];
 	    else
 	      k->host_end = k->host_start + sizeof (void *);
-	    splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k);
+	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n)
 	      {
 		tgt->list[i] = n;
@@ -294,7 +297,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		tgt->refcount++;
 		array->left = NULL;
 		array->right = NULL;
-		splay_tree_insert (&mm->splay_tree, array);
+		splay_tree_insert (mem_map, array);
 		switch (kind & typemask)
 		  {
 		  case GOMP_MAP_ALLOC:
@@ -332,16 +335,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    /* Add bias to the pointer value.  */
 		    cur_node.host_start += sizes[i];
 		    cur_node.host_end = cur_node.host_start + 1;
-		    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+		    n = splay_tree_lookup (mem_map, &cur_node);
 		    if (n == NULL)
 		      {
 			/* Could be possibly zero size array section.  */
 			cur_node.host_end--;
-			n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			n = splay_tree_lookup (mem_map, &cur_node);
 			if (n == NULL)
 			  {
 			    cur_node.host_start--;
-			    n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			    n = splay_tree_lookup (mem_map, &cur_node);
 			    cur_node.host_start++;
 			  }
 		      }
@@ -400,18 +403,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			  /* Add bias to the pointer value.  */
 			  cur_node.host_start += sizes[j];
 			  cur_node.host_end = cur_node.host_start + 1;
-			  n = splay_tree_lookup (&mm->splay_tree, &cur_node);
+			  n = splay_tree_lookup (mem_map, &cur_node);
 			  if (n == NULL)
 			    {
 			      /* Could be possibly zero size array section.  */
 			      cur_node.host_end--;
-			      n = splay_tree_lookup (&mm->splay_tree,
-						     &cur_node);
+			      n = splay_tree_lookup (mem_map, &cur_node);
 			      if (n == NULL)
 				{
 				  cur_node.host_start--;
-				  n = splay_tree_lookup (&mm->splay_tree,
-							 &cur_node);
+				  n = splay_tree_lookup (mem_map, &cur_node);
 				  cur_node.host_start++;
 				}
 			    }
@@ -489,7 +490,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	}
     }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
   return tgt;
 }
 
@@ -514,10 +515,9 @@ attribute_hidden void
 gomp_copy_from_async (struct target_mem_desc *tgt)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
   size_t i;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < tgt->list_count; i++)
     if (tgt->list[i] == NULL)
@@ -536,7 +536,7 @@ gomp_copy_from_async (struct target_mem_desc *tgt)
 				  k->host_end - k->host_start);
       }
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 /* Unmap variables described by TGT.  If DO_COPYFROM is true, copy relevant
@@ -547,7 +547,6 @@ attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
-  struct gomp_memory_mapping *mm = tgt->mem_map;
 
   if (tgt->list_count == 0)
     {
@@ -555,7 +554,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
       return;
     }
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
 
   size_t i;
   for (i = 0; i < tgt->list_count; i++)
@@ -572,7 +571,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
-	splay_tree_remove (&mm->splay_tree, k);
+	splay_tree_remove (tgt->mem_map, k);
 	if (k->tgt->refcount > 1)
 	  k->tgt->refcount--;
 	else
@@ -584,13 +583,12 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
   else
     gomp_unmap_tgt (tgt);
 
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
 }
 
 static void
-gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
-	     size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds,
-	     bool is_openacc)
+gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
+	     size_t *sizes, void *kinds, bool is_openacc)
 {
   size_t i;
   struct splay_tree_key_s cur_node;
@@ -602,14 +600,13 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
   if (mapnum == 0)
     return;
 
-  gomp_mutex_lock (&mm->lock);
+  gomp_mutex_lock (&devicep->lock);
   for (i = 0; i < mapnum; i++)
     if (sizes[i])
       {
 	cur_node.host_start = (uintptr_t) hostaddrs[i];
 	cur_node.host_end = cur_node.host_start + sizes[i];
-	splay_tree_key n = splay_tree_lookup (&mm->splay_tree,
-					      &cur_node);
+	splay_tree_key n = splay_tree_lookup (&devicep->mem_map, &cur_node);
 	if (n)
 	  {
 	    int kind = get_kind (is_openacc, kinds, i);
@@ -643,10 +640,86 @@ gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm,
 		      (void *) cur_node.host_start,
 		      (void *) cur_node.host_end);
       }
-  gomp_mutex_unlock (&mm->lock);
+  gomp_mutex_unlock (&devicep->lock);
+}
+
+
+/* Insert mapping of host -> target address pairs to splay tree.  */
+
+static void
+gomp_splay_tree_insert_mapping (struct gomp_device_descr *devicep,
+				struct addr_pair *host_addr,
+				struct addr_pair *tgt_addr)
+{
+  struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
+  tgt->refcount = 1;
+  tgt->array = gomp_malloc (sizeof (*tgt->array));
+  tgt->tgt_start = tgt_addr->start;
+  tgt->tgt_end = tgt_addr->end;
+  tgt->to_free = NULL;
+  tgt->list_count = 0;
+  tgt->device_descr = devicep;
+  splay_tree_node node = tgt->array;
+  splay_tree_key k = &node->key;
+  k->host_start = host_addr->start;
+  k->host_end = host_addr->end;
+  k->tgt_offset = 0;
+  k->refcount = 1;
+  k->copy_from = false;
+  k->tgt = tgt;
+  node->left = NULL;
+  node->right = NULL;
+  splay_tree_insert (&devicep->mem_map, node);
+}
+
+/* Load image pointed by TARGET_DATA to the device, specified by DEVICEP.
+   And insert to splay tree the mapping between addresses from HOST_TABLE and
+   from loaded target image.  */
+
+static void
+gomp_offload_image_to_device (struct gomp_device_descr *devicep,
+			      void *host_table, void *target_data)
+{
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  /* Load image to device and get target addresses for the image.  */
+  struct addr_pair *target_table = NULL;
+  int i, num_target_entries
+    = devicep->load_image_func (devicep->target_id, target_data, &target_table);
+
+  if (num_target_entries != num_funcs + num_vars)
+    gomp_fatal ("Can't map target functions or variables");
+
+  /* Insert host-target address mapping into devicep->dev_splay_tree.  */
+  for (i = 0; i < num_funcs; i++)
+    {
+      struct addr_pair host_addr;
+      host_addr.start = (uintptr_t) host_func_table[i];
+      host_addr.end = host_addr.start + 1;
+      gomp_splay_tree_insert_mapping (devicep, &host_addr, &target_table[i]);
+    }
+
+  for (i = 0; i < num_vars; i++)
+    {
+      struct addr_pair host_addr;
+      host_addr.start = (uintptr_t) host_var_table[i*2];
+      host_addr.end = host_addr.start + (uintptr_t) host_var_table[i*2+1];
+      gomp_splay_tree_insert_mapping (devicep, &host_addr,
+				      &target_table[num_funcs+i]);
+    }
+
+  free (target_table);
 }
 
-/* This function should be called from every offload image.
+/* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
    the target, and TARGET_DATA needed by target plugin.  */
 
@@ -654,6 +727,18 @@ void
 GOMP_offload_register (void *host_table, enum offload_target_type target_type,
 		       void *target_data)
 {
+  int i;
+  gomp_mutex_lock (&register_lock);
+
+  /* Load image to all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      struct gomp_device_descr *devicep = &devices[i];
+      if (devicep->type == target_type && devicep->is_initialized)
+	gomp_offload_image_to_device (devicep, host_table, target_data);
+    }
+
+  /* Insert image to array of pending images.  */
   offload_images = gomp_realloc (offload_images,
 				 (num_offload_images + 1)
 				 * sizeof (struct offload_image_descr));
@@ -663,74 +748,105 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type,
   offload_images[num_offload_images].target_data = target_data;
 
   num_offload_images++;
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* This function initializes the target device, specified by DEVICEP.  DEVICEP
-   must be locked on entry, and remains locked on return.  */
+/* This function should be called from every offload image while unloading.
+   It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
+   the target, and TARGET_DATA needed by target plugin.  */
 
-attribute_hidden void
-gomp_init_device (struct gomp_device_descr *devicep)
+void
+GOMP_offload_unregister (void *host_table, enum offload_target_type target_type,
+			 void *target_data)
 {
-  devicep->init_device_func (devicep->target_id);
-  devicep->is_initialized = true;
+  void **host_func_table = ((void ***) host_table)[0];
+  void **host_funcs_end  = ((void ***) host_table)[1];
+  void **host_var_table  = ((void ***) host_table)[2];
+  void **host_vars_end   = ((void ***) host_table)[3];
+  int i;
+
+  /* The func table contains only addresses, the var table contains addresses
+     and corresponding sizes.  */
+  int num_funcs = host_funcs_end - host_func_table;
+  int num_vars  = (host_vars_end - host_var_table) / 2;
+
+  gomp_mutex_lock (&register_lock);
+
+  /* Unload image from all initialized devices.  */
+  for (i = 0; i < num_devices; i++)
+    {
+      int j;
+      struct gomp_device_descr *devicep = &devices[i];
+
+      if (devicep->type != target_type || !devicep->is_initialized)
+	continue;
+
+      devicep->unload_image_func (devicep->target_id, target_data);
+
+      /* Remove mapping from splay tree.  */
+      for (j = 0; j < num_funcs; j++)
+	{
+	  struct splay_tree_key_s k;
+	  k.host_start = (uintptr_t) host_func_table[j];
+	  k.host_end = k.host_start + 1;
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+
+      for (j = 0; j < num_vars; j++)
+	{
+	  struct splay_tree_key_s k;
+	  k.host_start = (uintptr_t) host_var_table[j*2];
+	  k.host_end = k.host_start + (uintptr_t) host_var_table[j*2+1];
+	  splay_tree_remove (&devicep->mem_map, &k);
+	}
+    }
+
+  /* Remove image from array of pending images.  */
+  for (i = 0; i < num_offload_images; i++)
+    if (offload_images[i].target_data == target_data)
+      {
+	offload_images[i] = offload_images[--num_offload_images];
+	break;
+      }
+
+  gomp_mutex_unlock (&register_lock);
 }
 
-/* Initialize address mapping tables.  MM must be locked on entry, and remains
-   locked on return.  */
+/* This function initializes the target device, specified by DEVICEP.  DEVICEP
+   must be locked on entry, and remains locked on return.  */
 
 attribute_hidden void
-gomp_init_tables (struct gomp_device_descr *devicep,
-		  struct gomp_memory_mapping *mm)
+gomp_init_device (struct gomp_device_descr *devicep)
 {
-  /* Get address mapping table for device.  */
-  struct mapping_table *table = NULL;
-  int num_entries = devicep->get_table_func (devicep->target_id, &table);
-
-  /* Insert host-target address mapping into dev_splay_tree.  */
   int i;
-  for (i = 0; i < num_entries; i++)
+  devicep->init_device_func (devicep->target_id);
+
+  /* Load to device all images registered by the moment.  */
+  for (i = 0; i < num_offload_images; i++)
     {
-      struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt));
-      tgt->refcount = 1;
-      tgt->array = gomp_malloc (sizeof (*tgt->array));
-      tgt->tgt_start = table[i].tgt_start;
-      tgt->tgt_end = table[i].tgt_end;
-      tgt->to_free = NULL;
-      tgt->list_count = 0;
-      tgt->device_descr = devicep;
-      splay_tree_node node = tgt->array;
-      splay_tree_key k = &node->key;
-      k->host_start = table[i].host_start;
-      k->host_end = table[i].host_end;
-      k->tgt_offset = 0;
-      k->refcount = 1;
-      k->copy_from = false;
-      k->tgt = tgt;
-      node->left = NULL;
-      node->right = NULL;
-      splay_tree_insert (&mm->splay_tree, node);
+      struct offload_image_descr *image = &offload_images[i];
+      if (image->type == devicep->type)
+	gomp_offload_image_to_device (devicep, image->host_table,
+				      image->target_data);
     }
 
-  free (table);
-  mm->is_initialized = true;
+  devicep->is_initialized = true;
 }
 
 /* Free address mapping tables.  MM must be locked on entry, and remains locked
    on return.  */
 
 attribute_hidden void
-gomp_free_memmap (struct gomp_memory_mapping *mm)
+gomp_free_memmap (struct splay_tree_s *mem_map)
 {
-  while (mm->splay_tree.root)
+  while (mem_map->root)
     {
-      struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt;
+      struct target_mem_desc *tgt = mem_map->root->key.tgt;
 
-      splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key);
+      splay_tree_remove (mem_map, &mem_map->root->key);
       free (tgt->array);
       free (tgt);
     }
-
-  mm->is_initialized = false;
 }
 
 /* This function de-initializes the target device, specified by DEVICEP.
@@ -791,20 +907,15 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
     fn_addr = (void *) fn;
   else
     {
-      struct gomp_memory_mapping *mm = &devicep->mem_map;
-      gomp_mutex_lock (&mm->lock);
-
-      if (!mm->is_initialized)
-	gomp_init_tables (devicep, mm);
-
+      gomp_mutex_lock (&devicep->lock);
       struct splay_tree_key_s k;
       k.host_start = (uintptr_t) fn;
       k.host_end = k.host_start + 1;
-      splay_tree_key tgt_fn = splay_tree_lookup (&mm->splay_tree, &k);
+      splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
       if (tgt_fn == NULL)
 	gomp_fatal ("Target function wasn't mapped");
 
-      gomp_mutex_unlock (&mm->lock);
+      gomp_mutex_unlock (&devicep->lock);
 
       fn_addr = (void *) tgt_fn->tgt->tgt_start;
     }
@@ -856,12 +967,6 @@ GOMP_target_data (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
   struct target_mem_desc *tgt
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     false);
@@ -897,13 +1002,7 @@ GOMP_target_update (int device, const void *unused, size_t mapnum,
     gomp_init_device (devicep);
   gomp_mutex_unlock (&devicep->lock);
 
-  struct gomp_memory_mapping *mm = &devicep->mem_map;
-  gomp_mutex_lock (&mm->lock);
-  if (!mm->is_initialized)
-    gomp_init_tables (devicep, mm);
-  gomp_mutex_unlock (&mm->lock);
-
-  gomp_update (devicep, mm, mapnum, hostaddrs, sizes, kinds, false);
+  gomp_update (devicep, mapnum, hostaddrs, sizes, kinds, false);
 }
 
 void
@@ -972,10 +1071,10 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (get_caps);
   DLSYM (get_type);
   DLSYM (get_num_devices);
-  DLSYM (register_image);
   DLSYM (init_device);
   DLSYM (fini_device);
-  DLSYM (get_table);
+  DLSYM (load_image);
+  DLSYM (unload_image);
   DLSYM (alloc);
   DLSYM (free);
   DLSYM (dev2host);
@@ -1038,22 +1137,6 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   return err == NULL;
 }
 
-/* This function adds a compatible offload image IMAGE to an accelerator device
-   DEVICE.  DEVICE must be locked on entry, and remains locked on return.  */
-
-static void
-gomp_register_image_for_device (struct gomp_device_descr *device,
-				struct offload_image_descr *image)
-{
-  if (!device->offload_regions_registered
-      && (device->type == image->type
-	  || device->type == OFFLOAD_TARGET_TYPE_HOST))
-    {
-      device->register_image_func (image->host_table, image->target_data);
-      device->offload_regions_registered = true;
-    }
-}
-
 /* This function initializes the runtime needed for offloading.
    It parses the list of offload targets and tries to load the plugins for
    these targets.  On return, the variables NUM_DEVICES and NUM_DEVICES_OPENMP
@@ -1112,17 +1195,14 @@ gomp_target_init (void)
 		current_device.name = current_device.get_name_func ();
 		/* current_device.capabilities has already been set.  */
 		current_device.type = current_device.get_type_func ();
-		current_device.mem_map.is_initialized = false;
-		current_device.mem_map.splay_tree.root = NULL;
+		current_device.mem_map.root = NULL;
 		current_device.is_initialized = false;
-		current_device.offload_regions_registered = false;
 		current_device.openacc.data_environ = NULL;
 		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
 		    devices[num_devices] = current_device;
-		    gomp_mutex_init (&devices[num_devices].mem_map.lock);
 		    gomp_mutex_init (&devices[num_devices].lock);
 		    num_devices++;
 		  }
@@ -1157,21 +1237,12 @@ gomp_target_init (void)
 
   for (i = 0; i < num_devices; i++)
     {
-      int j;
-
-      for (j = 0; j < num_offload_images; j++)
-	gomp_register_image_for_device (&devices[i], &offload_images[j]);
-
       /* The 'devices' array can be moved (by the realloc call) until we have
 	 found all the plugins, so registering with the OpenACC runtime (which
 	 takes a copy of the pointer argument) must be delayed until now.  */
       if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
 	goacc_register (&devices[i]);
     }
-
-  free (offload_images);
-  offload_images = NULL;
-  num_offload_images = 0;
 }
 
 #else /* PLUGIN_SUPPORT */
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index 3e7a958..a2d61b1 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -34,6 +34,7 @@
 #include <string.h>
 #include <utility>
 #include <vector>
+#include <map>
 #include "libgomp-plugin.h"
 #include "compiler_if_host.h"
 #include "main_target_image.h"
@@ -53,6 +54,29 @@ fprintf (stderr, "\n");					    \
 #endif
 
 
+/* Start/end addresses of functions and global variables on a device.  */
+typedef std::vector<addr_pair> AddrVect;
+
+/* Addresses for one image and all devices.  */
+typedef std::vector<AddrVect> DevAddrVect;
+
+/* Addresses for all images and all devices.  */
+typedef std::map<void *, DevAddrVect> ImgDevAddrMap;
+
+
+/* Total number of available devices.  */
+static int num_devices;
+
+/* Total number of shared libraries with offloading to Intel MIC.  */
+static int num_images;
+
+/* Two dimensional array: one key is a pointer to image,
+   second key is number of device.  Contains a vector of pointer pairs.  */
+static ImgDevAddrMap *address_table;
+
+/* Thread-safe registration of the main image.  */
+static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
+
 static VarDesc vd_host2tgt = {
   { 1, 1 },		      /* dst, src			      */
   { 1, 0 },		      /* in, out			      */
@@ -90,28 +114,17 @@ static VarDesc vd_tgt2host = {
 };
 
 
-/* Total number of shared libraries with offloading to Intel MIC.  */
-static int num_libraries;
-
-/* Pointers to the descriptors, containing pointers to host-side tables and to
-   target images.  */
-static std::vector< std::pair<void *, void *> > lib_descrs;
-
-/* Thread-safe registration of the main image.  */
-static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
-
-
 /* Add path specified in LD_LIBRARY_PATH to MIC_LD_LIBRARY_PATH, which is
    required by liboffloadmic.  */
 __attribute__((constructor))
 static void
-set_mic_lib_path (void)
+init (void)
 {
   const char *ld_lib_path = getenv (LD_LIBRARY_PATH_ENV);
   const char *mic_lib_path = getenv (MIC_LD_LIBRARY_PATH_ENV);
 
   if (!ld_lib_path)
-    return;
+    goto out;
 
   if (!mic_lib_path)
     setenv (MIC_LD_LIBRARY_PATH_ENV, ld_lib_path, 1);
@@ -133,6 +146,10 @@ set_mic_lib_path (void)
       if (!use_alloca)
 	free (mic_lib_path_new);
     }
+
+out:
+  address_table = new ImgDevAddrMap;
+  num_devices = _Offload_number_of_devices ();
 }
 
 extern "C" const char *
@@ -162,18 +179,8 @@ GOMP_OFFLOAD_get_type (void)
 extern "C" int
 GOMP_OFFLOAD_get_num_devices (void)
 {
-  int res = _Offload_number_of_devices ();
-  TRACE ("(): return %d", res);
-  return res;
-}
-
-/* This should be called from every shared library with offloading.  */
-extern "C" void
-GOMP_OFFLOAD_register_image (void *host_table, void *target_image)
-{
-  TRACE ("(host_table = %p, target_image = %p)", host_table, target_image);
-  lib_descrs.push_back (std::make_pair (host_table, target_image));
-  num_libraries++;
+  TRACE ("(): return %d", num_devices);
+  return num_devices;
 }
 
 static void
@@ -196,7 +203,8 @@ register_main_image ()
   __offload_register_image (&main_target_image);
 }
 
-/* Load offload_target_main on target.  */
+/* liboffloadmic loads and runs offload_target_main on all available devices
+   during a first call to offload ().  */
 extern "C" void
 GOMP_OFFLOAD_init_device (int device)
 {
@@ -243,9 +251,11 @@ get_target_table (int device, int &num_funcs, int &num_vars, void **&table)
     }
 }
 
+/* Offload TARGET_IMAGE to all available devices and fill address_table with
+   corresponding target addresses.  */
+
 static void
-load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
-			int &table_size)
+offload_image (void *target_image)
 {
   struct TargetImage {
     int64_t size;
@@ -254,19 +264,11 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     char data[];
   } __attribute__ ((packed));
 
-  void ***host_table_descr = (void ***) lib_descrs[lib_num].first;
-  void **host_func_start = host_table_descr[0];
-  void **host_func_end   = host_table_descr[1];
-  void **host_var_start  = host_table_descr[2];
-  void **host_var_end    = host_table_descr[3];
+  void *image_start = ((void **) target_image)[0];
+  void *image_end   = ((void **) target_image)[1];
 
-  void **target_image_descr = (void **) lib_descrs[lib_num].second;
-  void *image_start = target_image_descr[0];
-  void *image_end   = target_image_descr[1];
-
-  TRACE ("() host_table_descr { %p, %p, %p, %p }", host_func_start,
-	 host_func_end, host_var_start, host_var_end);
-  TRACE ("() target_image_descr { %p, %p }", image_start, image_end);
+  TRACE ("(target_image = %p { %p, %p })",
+	 target_image, image_start, image_end);
 
   int64_t image_size = (uintptr_t) image_end - (uintptr_t) image_start;
   TargetImage *image
@@ -279,94 +281,87 @@ load_lib_and_get_table (int device, int lib_num, mapping_table *&table,
     }
 
   image->size = image_size;
-  sprintf (image->name, "lib%010d.so", lib_num);
+  sprintf (image->name, "lib%010d.so", num_images++);
   memcpy (image->data, image_start, image->size);
 
   TRACE ("() __offload_register_image %s { %p, %d }",
 	 image->name, image_start, image->size);
   __offload_register_image (image);
 
-  int tgt_num_funcs = 0;
-  int tgt_num_vars = 0;
-  void **tgt_table = NULL;
-  get_target_table (device, tgt_num_funcs, tgt_num_vars, tgt_table);
-  free (image);
-
-  /* The func table contains only addresses, the var table contains addresses
-     and corresponding sizes.  */
-  int host_num_funcs = host_func_end - host_func_start;
-  int host_num_vars  = (host_var_end - host_var_start) / 2;
-  TRACE ("() host_num_funcs = %d, tgt_num_funcs = %d",
-	 host_num_funcs, tgt_num_funcs);
-  TRACE ("() host_num_vars = %d, tgt_num_vars = %d",
-	 host_num_vars, tgt_num_vars);
-  if (host_num_funcs != tgt_num_funcs)
+  /* Receive tables for target_image from all devices.  */
+  DevAddrVect dev_table;
+  for (int dev = 0; dev < num_devices; dev++)
     {
-      fprintf (stderr, "%s: Can't map target functions\n", __FILE__);
-      exit (1);
-    }
-  if (host_num_vars != tgt_num_vars)
-    {
-      fprintf (stderr, "%s: Can't map target variables\n", __FILE__);
-      exit (1);
-    }
+      int num_funcs = 0;
+      int num_vars = 0;
+      void **table = NULL;
 
-  table = (mapping_table *) realloc (table, (table_size + host_num_funcs
-					     + host_num_vars)
-					    * sizeof (mapping_table));
-  if (table == NULL)
-    {
-      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
-      exit (1);
-    }
+      get_target_table (dev, num_funcs, num_vars, table);
 
-  for (int i = 0; i < host_num_funcs; i++)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_func_start[i];
-      t.host_end = t.host_start + 1;
-      t.tgt_start = (uintptr_t) tgt_table[i];
-      t.tgt_end = t.tgt_start + 1;
-
-      TRACE ("() lib %d, func %d:\t0x%llx -- 0x%llx",
-	     lib_num, i, t.host_start, t.tgt_start);
-
-      table[table_size++] = t;
-    }
+      AddrVect curr_dev_table;
 
-  for (int i = 0; i < host_num_vars * 2; i += 2)
-    {
-      mapping_table t;
-      t.host_start = (uintptr_t) host_var_start[i];
-      t.host_end = t.host_start + (uintptr_t) host_var_start[i+1];
-      t.tgt_start = (uintptr_t) tgt_table[tgt_num_funcs+i];
-      t.tgt_end = t.tgt_start + (uintptr_t) tgt_table[tgt_num_funcs+i+1];
+      for (int i = 0; i < num_funcs; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[i];
+	  tgt_addr.end = tgt_addr.start + 1;
+	  TRACE ("() func %d:\t0x%llx..0x%llx", i,
+		 tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      TRACE ("() lib %d, var %d:\t0x%llx (%d) -- 0x%llx (%d)", lib_num, i/2,
-	     t.host_start, t.host_end - t.host_start,
-	     t.tgt_start, t.tgt_end - t.tgt_start);
+      for (int i = 0; i < num_vars; i++)
+	{
+	  addr_pair tgt_addr;
+	  tgt_addr.start = (uintptr_t) table[num_funcs+i*2];
+	  tgt_addr.end = tgt_addr.start + (uintptr_t) table[num_funcs+i*2+1];
+	  TRACE ("() var %d:\t0x%llx..0x%llx", i, tgt_addr.start, tgt_addr.end);
+	  curr_dev_table.push_back (tgt_addr);
+	}
 
-      table[table_size++] = t;
+      dev_table.push_back (curr_dev_table);
     }
 
-  delete [] tgt_table;
+  address_table->insert (std::make_pair (target_image, dev_table));
+
+  free (image);
 }
 
 extern "C" int
-GOMP_OFFLOAD_get_table (int device, void *result)
+GOMP_OFFLOAD_load_image (int device, void *target_image, addr_pair **result)
 {
-  TRACE ("(num_libraries = %d)", num_libraries);
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
 
-  mapping_table *table = NULL;
-  int table_size = 0;
+  /* If target_image is already present in address_table, then there is no need
+     to offload it.  */
+  if (address_table->count (target_image) == 0)
+    offload_image (target_image);
 
-  for (int i = 0; i < num_libraries; i++)
-    load_lib_and_get_table (device, i, table, table_size);
+  AddrVect *curr_dev_table = &(*address_table)[target_image][device];
+  int table_size = curr_dev_table->size ();
+  addr_pair *table = (addr_pair *) malloc (table_size * sizeof (addr_pair));
+  if (table == NULL)
+    {
+      fprintf (stderr, "%s: Can't allocate memory\n", __FILE__);
+      exit (1);
+    }
 
-  *(void **) result = table;
+  std::copy (curr_dev_table->begin (), curr_dev_table->end (), table);
+  *result = table;
   return table_size;
 }
 
+extern "C" void
+GOMP_OFFLOAD_unload_image (int device, void *target_image)
+{
+  TRACE ("(device = %d, target_image = %p)", device, target_image);
+
+  /* TODO: Currently liboffloadmic doesn't support __offload_unregister_image
+     for libraries.  */
+
+  address_table->erase (target_image);
+}
+
 extern "C" void *
 GOMP_OFFLOAD_alloc (int device, size_t size)
 {


Thanks,
  -- Ilya

[-- Attachment #2: dlopen.tgz --]
[-- Type: application/gzip, Size: 1031 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-23 19:44                             ` Ilya Verbin
@ 2015-03-26 12:09                               ` Jakub Jelinek
  2015-03-26 20:41                                 ` Ilya Verbin
  2015-03-27 15:21                                 ` Julian Brown
  0 siblings, 2 replies; 30+ messages in thread
From: Jakub Jelinek @ 2015-03-26 12:09 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Julian Brown, Thomas Schwinge, gcc-patches, Kirill Yukhin

On Mon, Mar 23, 2015 at 10:44:39PM +0300, Ilya Verbin wrote:
> If it is too late for such global changes (rework initialization in libgomp,
> change mic and ptx plugins), then here is a small workaround patch to fix
> offloading from libraries.  Likely, it will not affect OpenACC programs with one
> image.  make check-target-libgomp passed.

Sorry for not getting to this earlier, really busy with severe regressions
bugfixing lately.

Anyway, IMHO it is not too late to fixing it properly, after all,
the current code is majorly broken.  As I've said earlier, e.g. the lack
of mutex guarding gomp_target_init (which is using pthread_once guaranteed
to be run just once) vs. concurrent GOMP_offload_register calls
(if those are run from ctors, then I guess something like dl_load_lock
ensures at least on glibc that multiple GOMP_offload_register calls aren't
performed at the same time) in accessing/reallocating offload_images
and num_offload_images and the lack of support to register further
images after the gomp_target_init call (if you dlopen further shared
libraries) is really bad.  And it would be really nice to support the
unloading.

But I'm afraid I'm lost in what is the latest posted patch for that,
and how has it been tested (whether just on MIC or MIC emul, or also for
nvptx).

So can you please post a link to the latest full patch and how it has been
tested, and if it is still error prone if say one thread executes
GOMP_target the first time and another at the same time dlopens some shared
library that has offloading regions in it, fix that too?

We still have a week or so to get this sorted out.

	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-09 14:46                           ` Julian Brown
@ 2015-03-23 19:44                             ` Ilya Verbin
  2015-03-26 12:09                               ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-03-23 19:44 UTC (permalink / raw)
  To: Julian Brown, Thomas Schwinge, Jakub Jelinek; +Cc: gcc-patches, Kirill Yukhin

On Mon, Mar 09, 2015 at 14:45:55 +0000, Julian Brown wrote:
> On Fri, 6 Mar 2015 17:01:13 +0300
> Ilya Verbin <iverbin@gmail.com> wrote:
> 
> > On Thu, Feb 26, 2015 at 20:25:11 +0300, Ilya Verbin wrote:
> > > On Wed, Feb 25, 2015 at 10:36:08 +0100, Thomas Schwinge wrote:
> > > > > Julian Brown <julian@codesourcery.com> wrote:
> > > > > This is a version of the previously-posted patch to rework
> > > > > initialisation and support the proposed load/unload hooks,
> > > > > merged to gomp4 branch and tested alongside the two patches
> > > > > (from
> > > 
> > > Currently the 'struct gomp_memory_mapping' contains 'lock' and
> > > 'is_initialized'. Do you still need them?  Or we can use
> > > gomp_device_descr::lock and is_initialized instead?  If yes, then
> > > we can replace the gomp_memory_mapping structure with a splay_tree,
> > > as it was before the OpenACC merge.
> > 
> > Ping?
> 
> Apologies, I've been distracted with travel and other things. I
> suspect, as you suggest, that the gomp_memory_mapping
> lock/is_initialized fields may no longer be required. I haven't yet had
> time to address that nor all of Thomas's comments on the patch (mostly
> breakage with multiple devices), and I'm unlikely to have time this
> week either due to vacation...

If it is too late for such global changes (rework initialization in libgomp,
change mic and ptx plugins), then here is a small workaround patch to fix
offloading from libraries.  Likely, it will not affect OpenACC programs with one
image.  make check-target-libgomp passed.


	PR libgomp/65338
libgomp/
	* libgomp.h (struct gomp_device_descr): Remove
	offload_regions_registered.
	* oacc-host.c (host_dispatch): Do not initialize
	offload_regions_registered.
	* target.c (gomp_register_image_for_device): Do not check for
	offload_regions_registered.
	(gomp_target_init): Do not initialize offload_regions_registered.


diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3089401..f45fdba 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -793,9 +793,6 @@ struct gomp_device_descr
   /* Set to true when device is initialized.  */
   bool is_initialized;
 
-  /* True when offload regions have been registered with this device.  */
-  bool offload_regions_registered;
-
   /* OpenACC-specific data and functions.  */
   /* This is mutable because of its mutable data_environ and target_data
      members.  */
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 6aeb1e7..2763f44 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -56,7 +56,6 @@ static struct gomp_device_descr host_dispatch =
     .mem_map.is_initialized = false,
     .mem_map.splay_tree.root = NULL,
     .is_initialized = false,
-    .offload_regions_registered = false,
 
     .openacc = {
       .open_device_func = GOMP_OFFLOAD_openacc_open_device,
diff --git a/libgomp/target.c b/libgomp/target.c
index 50baa4d..db1f509 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1035,13 +1035,8 @@ static void
 gomp_register_image_for_device (struct gomp_device_descr *device,
 				struct offload_image_descr *image)
 {
-  if (!device->offload_regions_registered
-      && (device->type == image->type
-	  || device->type == OFFLOAD_TARGET_TYPE_HOST))
-    {
-      device->register_image_func (image->host_table, image->target_data);
-      device->offload_regions_registered = true;
-    }
+  if (device->type == image->type || device->type == OFFLOAD_TARGET_TYPE_HOST)
+    device->register_image_func (image->host_table, image->target_data);
 }
 
 /* This function initializes the runtime needed for offloading.
@@ -1105,7 +1100,6 @@ gomp_target_init (void)
 		current_device.mem_map.is_initialized = false;
 		current_device.mem_map.splay_tree.root = NULL;
 		current_device.is_initialized = false;
-		current_device.offload_regions_registered = false;
 		current_device.openacc.data_environ = NULL;
 		current_device.openacc.target_data = NULL;
 		for (i = 0; i < new_num_devices; i++)


  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-03-06 14:01                         ` Ilya Verbin
@ 2015-03-09 14:46                           ` Julian Brown
  2015-03-23 19:44                             ` Ilya Verbin
  0 siblings, 1 reply; 30+ messages in thread
From: Julian Brown @ 2015-03-09 14:46 UTC (permalink / raw)
  To: Ilya Verbin; +Cc: Thomas Schwinge, Jakub Jelinek, gcc-patches, Kirill Yukhin

On Fri, 6 Mar 2015 17:01:13 +0300
Ilya Verbin <iverbin@gmail.com> wrote:

> On Thu, Feb 26, 2015 at 20:25:11 +0300, Ilya Verbin wrote:
> > On Wed, Feb 25, 2015 at 10:36:08 +0100, Thomas Schwinge wrote:
> > > > Julian Brown <julian@codesourcery.com> wrote:
> > > > This is a version of the previously-posted patch to rework
> > > > initialisation and support the proposed load/unload hooks,
> > > > merged to gomp4 branch and tested alongside the two patches
> > > > (from
> > 
> > Currently the 'struct gomp_memory_mapping' contains 'lock' and
> > 'is_initialized'. Do you still need them?  Or we can use
> > gomp_device_descr::lock and is_initialized instead?  If yes, then
> > we can replace the gomp_memory_mapping structure with a splay_tree,
> > as it was before the OpenACC merge.
> 
> Ping?

Apologies, I've been distracted with travel and other things. I
suspect, as you suggest, that the gomp_memory_mapping
lock/is_initialized fields may no longer be required. I haven't yet had
time to address that nor all of Thomas's comments on the patch (mostly
breakage with multiple devices), and I'm unlikely to have time this
week either due to vacation...

Thanks,

Julian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-02-26 17:31                       ` Ilya Verbin
@ 2015-03-06 14:01                         ` Ilya Verbin
  2015-03-09 14:46                           ` Julian Brown
  0 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-03-06 14:01 UTC (permalink / raw)
  To: Thomas Schwinge, Julian Brown; +Cc: Jakub Jelinek, gcc-patches, Kirill Yukhin

On Thu, Feb 26, 2015 at 20:25:11 +0300, Ilya Verbin wrote:
> On Wed, Feb 25, 2015 at 10:36:08 +0100, Thomas Schwinge wrote:
> > > Julian Brown <julian@codesourcery.com> wrote:
> > > This is a version of the previously-posted patch to rework
> > > initialisation and support the proposed load/unload hooks, merged to
> > > gomp4 branch and tested alongside the two patches (from
> 
> Currently the 'struct gomp_memory_mapping' contains 'lock' and 'is_initialized'.
> Do you still need them?  Or we can use gomp_device_descr::lock and
> is_initialized instead?  If yes, then we can replace the gomp_memory_mapping
> structure with a splay_tree, as it was before the OpenACC merge.

Ping?

Thanks,
  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-02-25  9:54                     ` libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch) Thomas Schwinge
  2015-02-25 12:17                       ` Julian Brown
  2015-02-25 12:23                       ` Ilya Verbin
@ 2015-02-26 17:31                       ` Ilya Verbin
  2015-03-06 14:01                         ` Ilya Verbin
  2 siblings, 1 reply; 30+ messages in thread
From: Ilya Verbin @ 2015-02-26 17:31 UTC (permalink / raw)
  To: Thomas Schwinge, Julian Brown; +Cc: Jakub Jelinek, gcc-patches, Kirill Yukhin

Hi,

On Wed, Feb 25, 2015 at 10:36:08 +0100, Thomas Schwinge wrote:
> > Julian Brown <julian@codesourcery.com> wrote:
> > This is a version of the previously-posted patch to rework
> > initialisation and support the proposed load/unload hooks, merged to
> > gomp4 branch and tested alongside the two patches (from

Currently the 'struct gomp_memory_mapping' contains 'lock' and 'is_initialized'.
Do you still need them?  Or we can use gomp_device_descr::lock and
is_initialized instead?  If yes, then we can replace the gomp_memory_mapping
structure with a splay_tree, as it was before the OpenACC merge.

Thanks,
  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-02-25  9:54                     ` libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch) Thomas Schwinge
  2015-02-25 12:17                       ` Julian Brown
@ 2015-02-25 12:23                       ` Ilya Verbin
  2015-02-26 17:31                       ` Ilya Verbin
  2 siblings, 0 replies; 30+ messages in thread
From: Ilya Verbin @ 2015-02-25 12:23 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Julian Brown, Jakub Jelinek, gcc-patches, Kirill Yukhin

On Wed, Feb 25, 2015 at 10:36:08 +0100, Thomas Schwinge wrote:
> > Julian Brown <julian@codesourcery.com> wrote:
> > OK for gomp4 branch? I could commit Ilya's patch there too if so.
> 
> I'll leave the decision to Jakub, but, what about trunk?  As Ilya
> indicated in
> <http://news.gmane.org/find-root.php?message_id=%3C20150116231632.GB48380%40msticlxl57.ims.intel.com%3E>,
> (at least part of) these patches are fixing a regression with offloading
> From shared libraries.  (And maybe the rest qualifies as fixes and
> extensions to new code (offloading), so no danger to cause any
> regressions compared to the last GCC release?)

BTW, when I removed calls to gomp_init_tables in  <https://gcc.gnu.org/ml/gcc-patches/2015-01/msg02275.html>,
I could accidentally remove some necessary gomp_mutex_lock/unlock.
Also GOMP_offload_[un]register require some mutexes, as noted by Jakub.
I'm going to fix this.  So, I think we should commit all dependent patches to
gomp4 branch, and I will post a fix for mutexes on top of them.

Thanks,
  -- Ilya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-02-25  9:54                     ` libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch) Thomas Schwinge
@ 2015-02-25 12:17                       ` Julian Brown
  2015-02-25 12:23                       ` Ilya Verbin
  2015-02-26 17:31                       ` Ilya Verbin
  2 siblings, 0 replies; 30+ messages in thread
From: Julian Brown @ 2015-02-25 12:17 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Ilya Verbin, Jakub Jelinek, gcc-patches, Kirill Yukhin

On Wed, 25 Feb 2015 10:36:08 +0100
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi!
> 
> On Tue, 24 Feb 2015 11:29:51 +0000, Julian Brown
> <julian@codesourcery.com> wrote:
> > Test results look OK, barring a suspected harness issue (lib-83
> > failing with a timeout for nvptx
> 
> However, I'm seeing a class of testsuite regressions: all variants of
> libgomp.oacc-fortran/lib-5.f90 and libgomp.oacc-fortran/lib-7.f90
> FAIL: »libgomp: cuMemFreeHost error: invalid value«.  I see these two
> test cases contain a lot of acc_get_num_devices and similar calls --
> I've been testing this on our nvidiak20-2 system, which contains two
> Nvidia K20 cards, so maybe there's something wrong in that regard.
> (But why is this failing only for Fortran -- are we missing C/C++
> tests in that area?) Can you have a look, or want me to?

I can have a look at that.

> > --- a/gcc/config/nvptx/mkoffload.c
> > +++ b/gcc/config/nvptx/mkoffload.c
> > @@ -850,16 +851,17 @@ process (FILE *in, FILE *out)
> 
> >    fprintf (out, "static const void *target_data[] = {\n");
> > -  fprintf (out, "  ptx_code, var_mappings, func_mappings\n");
> > +  fprintf (out, "  ptx_code, (void*) %u, var_mappings, (void*) %u,
> > "
> > +		"func_mappings\n", nvars, nfuncs);
> >    fprintf (out, "};\n\n");
> 
> I wondered if it's maybe more elegant to just separate those by NULL
> delimiters instead of the size integers casted to void * (spaces
> missing)?  But then, that'd need "double scanning" in the consumer,
> libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_load_image, because we
> need to allocate an appropriately sized array, so maybe your more
> expressive approach is better indeed.

Yeah, I considered both: there's probably not much to choose between
the approaches. They use the same amount of space.

> > --- a/libgomp/oacc-async.c
> > +++ b/libgomp/oacc-async.c
> > @@ -34,44 +34,68 @@
> >  int
> >  acc_async_test (int async)
> >  {
> > +  struct goacc_thread *thr = goacc_thread ();
> > +
> >    if (async < acc_async_sync)
> >      gomp_fatal ("invalid async argument: %d", async);
> >  
> > -  return base_dev->openacc.async_test_func (async);
> > +  assert (thr->dev);
> > +
> > +  return thr->dev->openacc.async_test_func (async);
> >  }

> Here, and in several other places: is this code conforming to the
> OpenACC specification?  Do we need to (lazily) initialize in all
> these places, or in goacc_thread, or gracefully fail (see below) if
> not initialized (basically in all places where you currently assert
> (thr->dev)?
> 
>     #include <openacc.h>
>     
>     int main(int argc, char *argv[])
>     {
>       return acc_async_test(0);
>     }
> 
> [sigsegv]

Whether it conforms to the spec or not is a hard question to answer,
because a lot of behaviour is left undefined. But here are two
possibly-useful made-up guidelines:

1. Does the program work the same with OpenACC disabled?

2. Does some strange use of OpenACC functionality (including library
   calls, etc.) probably indicate user error?

Much of the lazy initialisation code is there so that (1) can be true
-- i.e., a program can use OpenACC directives without making an
explicit call to "acc_init" or other API-specific initialisation code.

But this case is an explicit call to the OpenACC runtime library, so the
program can't work without -fopenacc enabled, so we can follow
guideline (2) instead. And in this case, it's meaningless to test for
completion of async operation when no device is active.

Of course though, this should be an actual error rather than a crash.
But, I don't think we want to lazily-initialise here.

> Also, I'm not sure what the expected outcome of this code sequence is:
> 
>     acc_init(acc_device_nvidia);
>     acc_shutdown(acc_device_nvidia);
>     acc_async_test(0);
> 
>     a.out: [...]/source-gcc/libgomp/oacc-async.c:42: acc_async_test:
> Assertion `thr->dev' failed. Aborted (core dumped)
> 
> If the OpenACC specification can be read such that all this indeed is
> "undefined behavior", then aborting/crashing is OK, of course.

Again, this would probably indicate user error in a real program, so it
should raise a (real) error message.

> > --- a/libgomp/oacc-cuda.c
> > +++ b/libgomp/oacc-cuda.c
> > @@ -34,51 +34,53 @@
> >  void *
> >  acc_get_current_cuda_device (void)
> >  {
> > -  void *p = NULL;
> > +  struct goacc_thread *thr = goacc_thread ();
> >  
> > -  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
> > -    p = base_dev->openacc.cuda.get_current_device_func ();
> > +  if (thr && thr->dev &&
> > thr->dev->openacc.cuda.get_current_device_func)
> > +    return thr->dev->openacc.cuda.get_current_device_func ();
> >  
> > -  return p;
> > +  return NULL;
> >  }
> 
> Here, and in other places, it looks as if we'd fail gracefully.

Not sure about this (maybe it should be an error too?), but...

> >  int
> >  acc_set_cuda_stream (int async, void *stream)
> >  {
> > -  int s = -1;
> > +  struct goacc_thread *thr;
> >  
> >    if (async < 0 || stream == NULL)
> >      return 0;
> >  
> >    goacc_lazy_initialize ();
> >  
> > -  if (base_dev && base_dev->openacc.cuda.set_stream_func)
> > -    s = base_dev->openacc.cuda.set_stream_func (async, stream);
> > +  thr = goacc_thread ();
> > +
> > +  if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
> > +    return thr->dev->openacc.cuda.set_stream_func (async, stream);
> >  
> > -  return s;
> > +  return -1;
> >  }
> 
> This one does have a goacc_lazy_initialize call.

This one might indeed be a reasonable way of initialising the OpenACC
runtime: the user is already using CUDA or some other CUDA consumer,
and wishes to use OpenACC also. So this can reasonably be the first
"OpenACC" call in a program.

> > --- a/libgomp/oacc-init.c
> > +++ b/libgomp/oacc-init.c
> 
> > +static const char *
> > +name_of_acc_device_t (enum acc_device_t type)
> > +{
> > +  switch (type)
> > +    {
> > +    case acc_device_none: return "none";
> > +    case acc_device_default: return "default";
> > +    case acc_device_host: return "host";
> > +    case acc_device_host_nonshm: return "host_nonshm";
> > +    case acc_device_not_host: return "not_host";
> > +    case acc_device_nvidia: return "nvidia";
> > +    default: return "<unknown>";
> > +    }
> > +}
> 
> I'd have made the default case abort.  (Does a missing case actually
> trigger a compile-time error?)

We're fixing the list of available offloading plugins at build time,
yes? In that case failing on the default case is reasonable.

> > +  ptx_devices[n] = nvptx_open_device (n);
> > +  instantiated_devices |= 1 << n;
> 
> Here, and also in several other places: do we have to care about big
> values of n?

I wondered that: we could equally just use the ptx_devices array and
remove the bitmask. I'll fix that.

> >  int
> > -GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
> > -			struct mapping_table **tablep)
> > +GOMP_OFFLOAD_load_image (int ord, void *target_data,
> > +			 struct addr_pair **target_table)
> >  {
> >    CUmodule module;
> > -  void **fn_table;
> > -  char **fn_names;
> > -  int fn_entries, i;
> > +  char **fn_names, **var_names;
> > +  unsigned int fn_entries, var_entries, i, j;
> >    CUresult r;
> >    struct targ_fn_descriptor *targ_fns;
> > +  void **img_header = (void **) target_data;
> > +  struct ptx_image_data *new_image;
> > +
> > +  nvptx_attach_host_thread_to_device (ord);
> >  
> >    if (nvptx_init () <= 0)
> >      return 0;
> 
> Need to adapt to the interface change of nvptx_init.  Also, missing
> locking of ptx_dev_lock.

and these other couple of bits too.

Thanks,

Julian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
  2015-02-24 12:49                   ` Julian Brown
@ 2015-02-25  9:54                     ` Thomas Schwinge
  2015-02-25 12:17                       ` Julian Brown
                                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Thomas Schwinge @ 2015-02-25  9:54 UTC (permalink / raw)
  To: Julian Brown, Ilya Verbin, Jakub Jelinek; +Cc: gcc-patches, Kirill Yukhin

[-- Attachment #1: Type: text/plain, Size: 9291 bytes --]

Hi!

On Tue, 24 Feb 2015 11:29:51 +0000, Julian Brown <julian@codesourcery.com> wrote:
> On Wed, 4 Feb 2015 15:05:45 +0000
> Julian Brown <julian@codesourcery.com> wrote:
> 
> > The major changes are: [...]

Thanks for looking into this!

> This is a version of the previously-posted patch to rework
> initialisation and support the proposed load/unload hooks, merged to
> gomp4 branch and tested alongside the two patches (from
> https://gcc.gnu.org/wiki/Offloading#nvptx_Offloading):
> 
> http://news.gmane.org/find-root.php?message_id=%3C20150218100035.GF1746%40tucnak.redhat.com%3E
> 
> http://news.gmane.org/find-root.php?message_id=%3C546CF508.9010807%40codesourcery.com%3E
> 
> As well as Ilya Verbin's patch:
> 
> https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01605.html

(I also added
<http://news.gmane.org/find-root.php?message_id=%3C20141115000346.GF40445%40msticlxl57.ims.intel.com%3E>
to the mix.)

> Test results look OK, barring a suspected harness issue (lib-83
> failing with a timeout for nvptx

Yes; Jim's rewriting the timing code.

However, I'm seeing a class of testsuite regressions: all variants of
libgomp.oacc-fortran/lib-5.f90 and libgomp.oacc-fortran/lib-7.f90 FAIL:
»libgomp: cuMemFreeHost error: invalid value«.  I see these two test
cases contain a lot of acc_get_num_devices and similar calls -- I've been
testing this on our nvidiak20-2 system, which contains two Nvidia K20
cards, so maybe there's something wrong in that regard.  (But why is this
failing only for Fortran -- are we missing C/C++ tests in that area?)
Can you have a look, or want me to?

> OK for gomp4 branch? I could commit Ilya's patch there too if so.

I'll leave the decision to Jakub, but, what about trunk?  As Ilya
indicated in
<http://news.gmane.org/find-root.php?message_id=%3C20150116231632.GB48380%40msticlxl57.ims.intel.com%3E>,
(at least part of) these patches are fixing a regression with offloading
From shared libraries.  (And maybe the rest qualifies as fixes and
extensions to new code (offloading), so no danger to cause any
regressions compared to the last GCC release?)


I have not reviewed all your changes; just a few comments:

> --- a/gcc/config/nvptx/mkoffload.c
> +++ b/gcc/config/nvptx/mkoffload.c
> @@ -850,16 +851,17 @@ process (FILE *in, FILE *out)

>    fprintf (out, "static const void *target_data[] = {\n");
> -  fprintf (out, "  ptx_code, var_mappings, func_mappings\n");
> +  fprintf (out, "  ptx_code, (void*) %u, var_mappings, (void*) %u, "
> +		"func_mappings\n", nvars, nfuncs);
>    fprintf (out, "};\n\n");

I wondered if it's maybe more elegant to just separate those by NULL
delimiters instead of the size integers casted to void * (spaces
missing)?  But then, that'd need "double scanning" in the consumer,
libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_load_image, because we need to
allocate an appropriately sized array, so maybe your more expressive
approach is better indeed.

> --- a/libgomp/oacc-async.c
> +++ b/libgomp/oacc-async.c
> @@ -34,44 +34,68 @@
>  int
>  acc_async_test (int async)
>  {
> +  struct goacc_thread *thr = goacc_thread ();
> +
>    if (async < acc_async_sync)
>      gomp_fatal ("invalid async argument: %d", async);
>  
> -  return base_dev->openacc.async_test_func (async);
> +  assert (thr->dev);
> +
> +  return thr->dev->openacc.async_test_func (async);
>  }

(Here, and in several other places: I would have placed the declaration
of thr and its initialization just before its first use, but then, no
need to change that now.)

Here, and in several other places: is this code conforming to the OpenACC
specification?  Do we need to (lazily) initialize in all these places, or
in goacc_thread, or gracefully fail (see below) if not initialized
(basically in all places where you currently assert (thr->dev)?

    #include <openacc.h>
    
    int main(int argc, char *argv[])
    {
      return acc_async_test(0);
    }

    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ -Bbuild-gcc/x86_64-unknown-linux-gnu/./libgomp/ -Bbuild-gcc/x86_64-unknown-linux-gnu/./libgomp/.libs -Ibuild-gcc/x86_64-unknown-linux-gnu/./libgomp -Isource-gcc/libgomp -Binstall/offload-nvptx-none/libexec/gcc/x86_64-unknown-linux-gnu/5.0.0 -Binstall/offload-nvptx-none/bin -Binstall/offload-x86_64-intelmicemul-linux-gnu/libexec/gcc/x86_64-unknown-linux-gnu/5.0.0 -Binstall/offload-x86_64-intelmicemul-linux-gnu/bin -Lbuild-gcc/x86_64-unknown-linux-gnu/./libgomp/.libs -Wl,-rpath,build-gcc/x86_64-unknown-linux-gnu/./libgomp/.libs -Wall ../a.c -fopenacc -g
    $ gdb -q a.out 
    Reading symbols from a.out...done.
    (gdb) r
    Starting program: [...]/a.out 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    
    Program received signal SIGSEGV, Segmentation fault.
    acc_async_test (async=0) at [...]/source-gcc/libgomp/oacc-async.c:42
    42        assert (thr->dev);

Also, I'm not sure what the expected outcome of this code sequence is:

    acc_init(acc_device_nvidia);
    acc_shutdown(acc_device_nvidia);
    acc_async_test(0);

    a.out: [...]/source-gcc/libgomp/oacc-async.c:42: acc_async_test: Assertion `thr->dev' failed.
    Aborted (core dumped)

If the OpenACC specification can be read such that all this indeed is
"undefined behavior", then aborting/crashing is OK, of course.

> --- a/libgomp/oacc-cuda.c
> +++ b/libgomp/oacc-cuda.c
> @@ -34,51 +34,53 @@
>  void *
>  acc_get_current_cuda_device (void)
>  {
> -  void *p = NULL;
> +  struct goacc_thread *thr = goacc_thread ();
>  
> -  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
> -    p = base_dev->openacc.cuda.get_current_device_func ();
> +  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
> +    return thr->dev->openacc.cuda.get_current_device_func ();
>  
> -  return p;
> +  return NULL;
>  }

Here, and in other places, it looks as if we'd fail gracefully.

>  int
>  acc_set_cuda_stream (int async, void *stream)
>  {
> -  int s = -1;
> +  struct goacc_thread *thr;
>  
>    if (async < 0 || stream == NULL)
>      return 0;
>  
>    goacc_lazy_initialize ();
>  
> -  if (base_dev && base_dev->openacc.cuda.set_stream_func)
> -    s = base_dev->openacc.cuda.set_stream_func (async, stream);
> +  thr = goacc_thread ();
> +
> +  if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
> +    return thr->dev->openacc.cuda.set_stream_func (async, stream);
>  
> -  return s;
> +  return -1;
>  }

This one does have a goacc_lazy_initialize call.

> --- a/libgomp/oacc-init.c
> +++ b/libgomp/oacc-init.c

> +static const char *
> +name_of_acc_device_t (enum acc_device_t type)
> +{
> +  switch (type)
> +    {
> +    case acc_device_none: return "none";
> +    case acc_device_default: return "default";
> +    case acc_device_host: return "host";
> +    case acc_device_host_nonshm: return "host_nonshm";
> +    case acc_device_not_host: return "not_host";
> +    case acc_device_nvidia: return "nvidia";
> +    default: return "<unknown>";
> +    }
> +}

I'd have made the default case abort.  (Does a missing case actually
trigger a compile-time error?)

> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -46,6 +46,7 @@
>  #include <dlfcn.h>
>  #include <unistd.h>
>  #include <assert.h>
> +#include <pthread.h>

That's already being included.

> -/* Initialize the device.  */
> -static int
> +/* Initialize the device.  Return TRUE on success, else FALSE.  PTX_DEV_LOCK
> +   should be locked on entry and remains locked on exit.  */
> +static bool
>  nvptx_init (void)
>  {
>    CUresult r;
>    int rc;
> +  int ndevs;
>  
> -  if (ptx_inited)
> -    return nvptx_get_num_devices ();
> +  if (instantiated_devices != 0)
> +    return true;

>  void
> -GOMP_OFFLOAD_register_image (void *host_table, void *target_data)
> +GOMP_OFFLOAD_init_device (int n)
>  {

> +  if (!nvptx_init ()
> +      || (instantiated_devices & (1 << n)) != 0)
> +    {
> +      pthread_mutex_unlock (&ptx_dev_lock);
> +      return NULL;

GOMP_OFFLOAD_init_device has a void return type.  (Why doesn't this cause
a compile warning/error?)

> +  ptx_devices[n] = nvptx_open_device (n);
> +  instantiated_devices |= 1 << n;

Here, and also in several other places: do we have to care about big
values of n?

>  int
> -GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
> -			struct mapping_table **tablep)
> +GOMP_OFFLOAD_load_image (int ord, void *target_data,
> +			 struct addr_pair **target_table)
>  {
>    CUmodule module;
> -  void **fn_table;
> -  char **fn_names;
> -  int fn_entries, i;
> +  char **fn_names, **var_names;
> +  unsigned int fn_entries, var_entries, i, j;
>    CUresult r;
>    struct targ_fn_descriptor *targ_fns;
> +  void **img_header = (void **) target_data;
> +  struct ptx_image_data *new_image;
> +
> +  nvptx_attach_host_thread_to_device (ord);
>  
>    if (nvptx_init () <= 0)
>      return 0;

Need to adapt to the interface change of nvptx_init.  Also, missing
locking of ptx_dev_lock.


Grüße,
 Thomas

[-- Attachment #2: Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2015-04-15 14:26 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-15 14:26 libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch) Dominique Dhumieres
  -- strict thread matches above, loose matches on Subject: below --
2015-01-15 20:44 Merge current set of OpenACC changes from gomp-4_0-branch Thomas Schwinge
2015-01-16 23:22 ` Ilya Verbin
2015-01-23 18:28   ` Ilya Verbin
2015-01-26 14:01     ` Thomas Schwinge
2015-01-26 15:23       ` Ilya Verbin
2015-01-27 14:41         ` Julian Brown
2015-02-03 11:28           ` Ilya Verbin
2015-02-03 13:00             ` Julian Brown
2015-02-03 20:01               ` Ilya Verbin
2015-02-04 15:06                 ` Julian Brown
2015-02-24 12:49                   ` Julian Brown
2015-02-25  9:54                     ` libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch) Thomas Schwinge
2015-02-25 12:17                       ` Julian Brown
2015-02-25 12:23                       ` Ilya Verbin
2015-02-26 17:31                       ` Ilya Verbin
2015-03-06 14:01                         ` Ilya Verbin
2015-03-09 14:46                           ` Julian Brown
2015-03-23 19:44                             ` Ilya Verbin
2015-03-26 12:09                               ` Jakub Jelinek
2015-03-26 20:41                                 ` Ilya Verbin
2015-03-30 16:42                                   ` Jakub Jelinek
2015-03-30 21:43                                     ` Julian Brown
2015-03-31 12:52                                     ` Ilya Verbin
2015-03-31 13:08                                       ` Jakub Jelinek
2015-03-31 16:10                                         ` Ilya Verbin
2015-03-31 23:53                                           ` Ilya Verbin
2015-04-01  5:21                                             ` Jakub Jelinek
2015-04-01 13:14                                               ` Ilya Verbin
2015-04-01 13:20                                                 ` Jakub Jelinek
2015-04-01 17:26                                                   ` Ilya Verbin
2015-04-06 12:46                                                   ` Ilya Verbin
2015-04-07 15:26                                                     ` Jakub Jelinek
2015-04-08 14:32                                                       ` Julian Brown
2015-04-08 14:34                                                         ` Jakub Jelinek
2015-04-08 14:59                                                         ` Ilya Verbin
2015-04-08 16:14                                                           ` Julian Brown
2015-04-14 14:15                                                           ` Julian Brown
2015-03-31 18:25                                     ` Ilya Verbin
2015-03-31 19:06                                       ` Jakub Jelinek
2015-03-27 15:21                                 ` Julian Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).