public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Jakub Jelinek <jakub@redhat.com>
To: Ilya Verbin <iverbin@gmail.com>
Cc: Aldy Hernandez <aldyh@redhat.com>,
	gcc-patches@gcc.gnu.org,
	       Kirill Yukhin <kirill.yukhin@gmail.com>,
	       Thomas Schwinge <thomas@codesourcery.com>,
	       Alexander Monakov <amonakov@ispras.ru>,
	       Martin Jambor <mjambor@suse.cz>
Subject: Re: [gomp4.5] depend nowait support for target
Date: Fri, 13 Nov 2015 15:12:00 -0000	[thread overview]
Message-ID: <20151113151150.GQ5675@tucnak.redhat.com> (raw)
In-Reply-To: <20151113101841.GL5675@tucnak.redhat.com>

On Fri, Nov 13, 2015 at 11:18:41AM +0100, Jakub Jelinek wrote:
> For the offloading case, I actually see a problematic spot, namely that
> GOMP_PLUGIN_target_task_completion could finish too early, and get the
> task_lock before the thread that run the gomp_target_task_fn doing map_vars
> + async_run for it.  Bet I need to add further ttask state kinds and deal
> with that case (so GOMP_PLUGIN_target_task_completion would just take the
> task lock and tweak ttask state if it has not been added to the queues
> yet).
> Plus I think I want to improve the case where we are not waiting, in
> gomp_create_target_task if not waiting for dependencies actually schedule
> manually the gomp_target_task_fn.

These two have been resolved, plus target-34.c issue resolved too (the bug
was that I've been too lazy and just put target-33.c test into #pragma omp
parallel #pragma omp single, but that is invalid OpenMP, as single is a
worksharing region and #pragma omp barrier may not be encountered in such a
region.  Fixed by rewriting the testcase.

So here is a full patch that passes for me both non-offloading and
offloading, OMP_NUM_THREADS=16 (implicit on my box) as well as
OMP_NUM_THREADS=1 (explicit).  I've incorporated your incremental patch.

One option to avoid the static variable would be to pass two pointers
instead of one (async_data), one would be the callback function pointer,
another argument to it.  Or another possibility would be to say that
the async_data argument the plugin passes to liboffloadmic would be
pointer to structure, holding a function pointer (completion callback)
and the data pointer to pass to it, and then the plugin would just
GOMP_PLUGIN_malloc 2 * sizeof (void *) for it, fill it in and
register some function in itself that would call the
GOMP_PLUGIN_target_task_completion with the second structure element
as argument and then free the structure pointer.

Do you get still crashes on any of the testcases with this?

2015-11-13  Jakub Jelinek  <jakub@redhat.com>
	    Ilya Verbin  <ilya.verbin@intel.com>

	* parallel.c (gomp_resolve_num_threads): Don't assume that
	if thr->ts.team is non-NULL, then pool must be non-NULL.
	* libgomp-plugin.h (GOMP_PLUGIN_target_task_completion): Declare.
	* team.c (gomp_free_thread): Call gomp_team_end if thr->ts.team
	is artificial team created for target nowait in implicit parallel
	region.
	(gomp_team_start): For nested check, test thr->ts.level instead of
	thr->ts.team != NULL.
	* target.c (GOMP_target): Don't adjust *thr in any way around
	running offloaded task.
	(GOMP_target_ext): Likewise.  Handle target nowait.
	(GOMP_target_update_ext, GOMP_target_enter_exit_data): Check
	return value from gomp_create_target_task, if false, fallthrough
	as if no dependencies exist.
	(gomp_target_task_fn): Change return type to bool, return true
	if the task should have another part scheduled later.  Handle
	target nowait.
	(gomp_load_plugin_for_device): Initialize async_run.
	* libgomp.map (GOMP_PLUGIN_1.1): New symbol version, export
	GOMP_PLUGIN_target_task_completion.
	* task.c (priority_queue_move_task_first,
	gomp_target_task_completion, GOMP_PLUGIN_target_task_completion):
	New functions.
	(gomp_create_target_task): Change return type to bool, add
	state argument, return false if for async {{enter,exit} data,update}
	constructs no dependencies need to be waited for, handle target
	nowait.  Set task->fn to NULL instead of gomp_target_task_fn.
	(gomp_barrier_handle_tasks, GOMP_taskwait,
	gomp_task_maybe_wait_for_dependencies): Handle target nowait target
	tasks specially.
	(GOMP_taskgroup_end): Likewise.  If taskgroup is NULL, and
	thr->ts.level is 0, act as a barrier.
	target nowait tasks specially.
	* priority_queue.c (priority_queue_task_in_queue_p,
	priority_list_verify): Adjust for addition of
	GOMP_TASK_ASYNC_RUNNING kind.
	* libgomp.h (enum gomp_task_kind): Add GOMP_TASK_ASYNC_RUNNING.
	(enum gomp_target_task_state): New enum.
	(struct gomp_target_task): Add state, tgt, task and team fields.
	(gomp_create_target_task): Change return type to bool, add
	state argument.
	(gomp_target_task_fn): Change return type to bool.
	(struct gomp_device_descr): Add async_run_func.
	* testsuite/libgomp.c/target-32.c: New test.
	* testsuite/libgomp.c/target-34.c: New test.
	* testsuite/libgomp.c/target-33.c: New test.

2015-11-13  Ilya Verbin  <ilya.verbin@intel.com>

	* runtime/offload_host.cpp (task_completion_callback): New
	variable.
	(offload_proxy_task_completed_ooo): Call task_completion_callback.
	(__offload_register_task_callback): New function.
	* runtime/offload_host.h (__offload_register_task_callback): New
	declaration.
	* plugin/libgomp-plugin-intelmic.cpp (offload): Add async_data
	argument, handle async offloading.
	(register_main_image): Call register_main_image.
	(GOMP_OFFLOAD_init_device, get_target_table, GOMP_OFFLOAD_alloc,
	GOMP_OFFLOAD_free, GOMP_OFFLOAD_host2dev, GOMP_OFFLOAD_dev2host,
	GOMP_OFFLOAD_dev2dev) Adjust offload callers.
	(GOMP_OFFLOAD_async_run): New function.
	(GOMP_OFFLOAD_run): Implement using GOMP_OFFLOAD_async_run.

--- liboffloadmic/runtime/offload_host.cpp.jj	2015-11-05 11:31:05.013916598 +0100
+++ liboffloadmic/runtime/offload_host.cpp	2015-11-13 14:23:54.469798572 +0100
@@ -64,6 +64,8 @@ static void __offload_fini_library(void)
 #define GET_OFFLOAD_NUMBER(timer_data) \
     timer_data? timer_data->offload_number : 0
 
+static void (*task_completion_callback)(void *);
+
 extern "C" {
 #ifdef TARGET_WINNT
 // Windows does not support imports from libraries without actually
@@ -2507,7 +2509,7 @@ extern "C" {
         const void *info
     )
     {
-	/* TODO: Call callback function, pass info.  */
+	task_completion_callback ((void *) info);
     }
 }
 
@@ -5669,6 +5671,11 @@ extern "C" void __offload_unregister_ima
     }
 }
 
+extern "C" void __offload_register_task_callback(void (*cb)(void *))
+{
+    task_completion_callback = cb;
+}
+
 // Runtime trace interface for user programs
 
 void __offload_console_trace(int level)
--- liboffloadmic/runtime/offload_host.h.jj	2015-10-14 10:24:10.904194499 +0200
+++ liboffloadmic/runtime/offload_host.h	2015-11-13 14:23:54.470798557 +0100
@@ -376,6 +376,9 @@ extern "C" bool __offload_target_image_i
 extern "C" bool __offload_register_image(const void* image);
 extern "C" void __offload_unregister_image(const void* image);
 
+// Registers asynchronous task completion callback
+extern "C" void __offload_register_task_callback(void (*cb)(void *));
+
 // Initializes offload runtime library.
 DLL_LOCAL extern int __offload_init_library(void);
 
--- liboffloadmic/plugin/libgomp-plugin-intelmic.cpp.jj	2015-10-14 10:24:10.922194230 +0200
+++ liboffloadmic/plugin/libgomp-plugin-intelmic.cpp	2015-11-13 14:23:54.467798600 +0100
@@ -192,11 +192,23 @@ GOMP_OFFLOAD_get_num_devices (void)
 
 static void
 offload (const char *file, uint64_t line, int device, const char *name,
-	 int num_vars, VarDesc *vars, VarDesc2 *vars2)
+	 int num_vars, VarDesc *vars, VarDesc2 *vars2, const void **async_data)
 {
   OFFLOAD ofld = __offload_target_acquire1 (&device, file, line);
   if (ofld)
-    __offload_offload1 (ofld, name, 0, num_vars, vars, vars2, 0, NULL, NULL);
+    {
+      if (async_data == NULL)
+	__offload_offload1 (ofld, name, 0, num_vars, vars, vars2, 0, NULL,
+			    NULL);
+      else
+	{
+	  OffloadFlags flags;
+	  flags.flags = 0;
+	  flags.bits.omp_async = 1;
+	  __offload_offload3 (ofld, name, 0, num_vars, vars, NULL, 0, NULL,
+			      async_data, 0, NULL, flags, NULL);
+	}
+    }
   else
     {
       fprintf (stderr, "%s:%d: Offload target acquire failed\n", file, line);
@@ -208,6 +220,10 @@ static void
 register_main_image ()
 {
   __offload_register_image (&main_target_image);
+
+  /* liboffloadmic will call GOMP_PLUGIN_target_task_completion when
+     asynchronous task on target is completed.  */
+  __offload_register_task_callback (GOMP_PLUGIN_target_task_completion);
 }
 
 /* liboffloadmic loads and runs offload_target_main on all available devices
@@ -218,7 +234,7 @@ GOMP_OFFLOAD_init_device (int device)
   TRACE ("");
   pthread_once (&main_image_is_registered, register_main_image);
   offload (__FILE__, __LINE__, device, "__offload_target_init_proc", 0,
-	   NULL, NULL);
+	   NULL, NULL, NULL);
 }
 
 extern "C" void
@@ -240,7 +256,7 @@ get_target_table (int device, int &num_f
   VarDesc2 vd1g[2] = { { "num_funcs", 0 }, { "num_vars", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_table_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   int table_size = num_funcs + 2 * num_vars;
   if (table_size > 0)
@@ -254,7 +270,7 @@ get_target_table (int device, int &num_f
       VarDesc2 vd2g = { "table", 0 };
 
       offload (__FILE__, __LINE__, device, "__offload_target_table_p2", 1,
-	       &vd2, &vd2g);
+	       &vd2, &vd2g, NULL);
     }
 }
 
@@ -401,8 +417,8 @@ GOMP_OFFLOAD_alloc (int device, size_t s
   vd1[1].size = sizeof (void *);
   VarDesc2 vd1g[2] = { { "size", 0 }, { "tgt_ptr", 0 } };
 
-  offload (__FILE__, __LINE__, device, "__offload_target_alloc", 2, vd1, vd1g);
-
+  offload (__FILE__, __LINE__, device, "__offload_target_alloc", 2, vd1, vd1g,
+	   NULL);
   return tgt_ptr;
 }
 
@@ -416,7 +432,8 @@ GOMP_OFFLOAD_free (int device, void *tgt
   vd1.size = sizeof (void *);
   VarDesc2 vd1g = { "tgt_ptr", 0 };
 
-  offload (__FILE__, __LINE__, device, "__offload_target_free", 1, &vd1, &vd1g);
+  offload (__FILE__, __LINE__, device, "__offload_target_free", 1, &vd1, &vd1g,
+	   NULL);
 }
 
 extern "C" void *
@@ -435,7 +452,7 @@ GOMP_OFFLOAD_host2dev (int device, void
   VarDesc2 vd1g[2] = { { "tgt_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_host2tgt_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   VarDesc vd2 = vd_host2tgt;
   vd2.ptr = (void *) host_ptr;
@@ -443,7 +460,7 @@ GOMP_OFFLOAD_host2dev (int device, void
   VarDesc2 vd2g = { "var", 0 };
 
   offload (__FILE__, __LINE__, device, "__offload_target_host2tgt_p2", 1,
-	   &vd2, &vd2g);
+	   &vd2, &vd2g, NULL);
 
   return tgt_ptr;
 }
@@ -464,7 +481,7 @@ GOMP_OFFLOAD_dev2host (int device, void
   VarDesc2 vd1g[2] = { { "tgt_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2host_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   VarDesc vd2 = vd_tgt2host;
   vd2.ptr = (void *) host_ptr;
@@ -472,7 +489,7 @@ GOMP_OFFLOAD_dev2host (int device, void
   VarDesc2 vd2g = { "var", 0 };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2host_p2", 1,
-	   &vd2, &vd2g);
+	   &vd2, &vd2g, NULL);
 
   return host_ptr;
 }
@@ -495,22 +512,32 @@ GOMP_OFFLOAD_dev2dev (int device, void *
   VarDesc2 vd1g[3] = { { "dst_ptr", 0 }, { "src_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2tgt", 3, vd1,
-	   vd1g);
+	   vd1g, NULL);
 
   return dst_ptr;
 }
 
 extern "C" void
-GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars)
+GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
+			void *async_data)
 {
-  TRACE ("(tgt_fn = %p, tgt_vars = %p)", tgt_fn, tgt_vars);
+  TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p, async_data = %p)", device,
+	 tgt_fn, tgt_vars, async_data);
 
-  VarDesc vd1[2] = { vd_host2tgt, vd_host2tgt };
-  vd1[0].ptr = &tgt_fn;
-  vd1[0].size = sizeof (void *);
-  vd1[1].ptr = &tgt_vars;
-  vd1[1].size = sizeof (void *);
-  VarDesc2 vd1g[2] = { { "tgt_fn", 0 }, { "tgt_vars", 0 } };
+  VarDesc vd[2] = { vd_host2tgt, vd_host2tgt };
+  vd[0].ptr = &tgt_fn;
+  vd[0].size = sizeof (void *);
+  vd[1].ptr = &tgt_vars;
+  vd[1].size = sizeof (void *);
+
+  offload (__FILE__, __LINE__, device, "__offload_target_run", 2, vd, NULL,
+	   (const void **) async_data);
+}
+
+extern "C" void
+GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars)
+{
+  TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p)", device, tgt_fn, tgt_vars);
 
-  offload (__FILE__, __LINE__, device, "__offload_target_run", 2, vd1, vd1g);
+  GOMP_OFFLOAD_async_run (device, tgt_fn, tgt_vars, NULL);
 }
--- libgomp/parallel.c.jj	2015-10-14 10:24:10.000000000 +0200
+++ libgomp/parallel.c	2015-11-12 15:12:38.349901541 +0100
@@ -85,7 +85,7 @@ gomp_resolve_num_threads (unsigned speci
      nested parallel, so there is just one thread in the
      contention group as well, no need to handle it atomically.  */
   pool = thr->thread_pool;
-  if (thr->ts.team == NULL)
+  if (thr->ts.team == NULL || pool == NULL)
     {
       num_threads = max_num_threads;
       if (num_threads > icv->thread_limit_var)
--- libgomp/libgomp-plugin.h.jj	2015-10-14 10:24:10.000000000 +0200
+++ libgomp/libgomp-plugin.h	2015-11-13 11:40:03.366418330 +0100
@@ -63,6 +63,7 @@ struct addr_pair
 extern void *GOMP_PLUGIN_malloc (size_t) __attribute__ ((malloc));
 extern void *GOMP_PLUGIN_malloc_cleared (size_t) __attribute__ ((malloc));
 extern void *GOMP_PLUGIN_realloc (void *, size_t);
+void GOMP_PLUGIN_target_task_completion (void *);
 
 extern void GOMP_PLUGIN_debug (int, const char *, ...)
 	__attribute__ ((format (printf, 2, 3)));
--- libgomp/testsuite/libgomp.c/target-32.c.jj	2015-11-10 12:58:55.087951346 +0100
+++ libgomp/testsuite/libgomp.c/target-32.c	2015-11-12 13:28:55.053380366 +0100
@@ -0,0 +1,54 @@
+#include <stdlib.h>
+#include <unistd.h>
+
+int main ()
+{
+  int a = 0, b = 0, c = 0, d[7];
+
+  #pragma omp parallel
+  #pragma omp single
+  {
+    #pragma omp task depend(out: d[0])
+      a = 2;
+
+    #pragma omp target enter data nowait map(to: a,b,c) depend(in: d[0]) depend(out: d[1])
+
+    #pragma omp target nowait map(alloc: a) depend(in: d[1]) depend(out: d[2])
+      a++;
+
+    #pragma omp target nowait map(alloc: b) depend(in: d[2]) depend(out: d[3])
+    {
+      usleep (1000);
+      #pragma omp atomic update
+      b |= 4;
+    }
+
+    #pragma omp target nowait map(alloc: b) depend(in: d[2]) depend(out: d[4])
+    {
+      usleep (5000);
+      #pragma omp atomic update
+      b |= 1;
+    }
+
+    #pragma omp target nowait map(alloc: c) depend(in: d[3], d[4]) depend(out: d[5])
+    {
+      usleep (5000);
+      #pragma omp atomic update
+      c |= 8;
+    }
+
+    #pragma omp target nowait map(alloc: c) depend(in: d[3], d[4]) depend(out: d[6])
+    {
+      usleep (1000);
+      #pragma omp atomic update
+      c |= 2;
+    }
+
+    #pragma omp target exit data map(always,from: a,b,c) depend(in: d[5], d[6])
+  }
+
+  if (a != 3 || b != 5 || c != 10)
+    abort ();
+
+  return 0;
+}
--- libgomp/testsuite/libgomp.c/target-34.c.jj	2015-11-13 08:54:42.607799433 +0100
+++ libgomp/testsuite/libgomp.c/target-34.c	2015-11-13 14:16:09.104425707 +0100
@@ -0,0 +1,112 @@
+extern void abort (void);
+
+int
+main ()
+{
+  int a = 1, b = 2, c = 4, d[7];
+  #pragma omp parallel
+  {
+    #pragma omp single
+    {
+      #pragma omp taskgroup
+      {
+	#pragma omp target enter data nowait map (to: a, b, c) depend(out: d[0])
+	#pragma omp target nowait map (alloc: a, b) depend(in: d[0]) depend(out: d[1])
+	{
+	  #pragma omp atomic update
+	  a |= 4;
+	  #pragma omp atomic update
+	  b |= 8;
+	}
+	#pragma omp target nowait map (alloc: a, c) depend(in: d[0]) depend(out: d[2])
+	{
+	  #pragma omp atomic update
+	  a |= 16;
+	  #pragma omp atomic update
+	  c |= 32;
+	}
+	#pragma omp target exit data nowait map (from: a, b, c) depend(in: d[1], d[2])
+      }
+      if (a != 21 || b != 10 || c != 36)
+	abort ();
+      #pragma omp target map (tofrom: a, b) nowait
+      {
+	a &= ~16;
+	b &= ~2;
+      }
+      #pragma omp target map (tofrom: c) nowait
+      {
+	c |= 8;
+      }
+    } /* Implicit barrier here.  */
+    #pragma omp single
+    {
+      if (a != 5 || b != 8 || c != 44)
+	abort ();
+      #pragma omp target map (tofrom: a, b) nowait
+      {
+	a |= 32;
+	b |= 4;
+      }
+      #pragma omp target map (tofrom: c) nowait
+      c &= ~4;
+      #pragma omp taskwait
+      if (a != 37 || b != 12 || c != 40)
+	abort ();
+      #pragma omp target nowait map (tofrom: a, b) depend(out: d[3])
+      {
+	#pragma omp atomic update
+	a = a + 9;
+	b -= 8;
+      }
+      #pragma omp target nowait map (tofrom: a, c) depend(out: d[4])
+      {
+	#pragma omp atomic update
+	a = a + 4;
+	c >>= 1;
+      }
+      #pragma omp task if (0) depend (in: d[3], d[4]) shared (a, b, c)
+      if (a != 50 || b != 4 || c != 20)
+	abort ();
+      #pragma omp task shared (a)
+      a += 50;
+      #pragma omp target nowait map (tofrom: b)
+      b++;
+      #pragma omp target map (tofrom: c) nowait
+      c--;
+      #pragma omp taskwait
+      if (a != 100 || b != 5 || c != 19)
+	abort ();
+      #pragma omp target map (tofrom: a) nowait depend(out: d[5])
+      a++;
+      #pragma omp target map (tofrom: b) nowait depend(out: d[6])
+      b++;
+      #pragma omp target map (tofrom: a, b) depend(in: d[5], d[6])
+      {
+	if (a != 101 || b != 6)
+	  a = -9;
+	else
+	  {
+	    a = 24;
+	    b = 38;
+	  }
+      }
+      if (a != 24 || b != 38)
+	abort ();
+    } /* Implicit barrier here.  */
+    #pragma omp master
+    {
+      #pragma omp target nowait map (tofrom: a, b)
+      {
+	a *= 2;
+	b++;
+      }
+      #pragma omp target map (tofrom: c) nowait
+      c--;
+    }
+    #pragma omp barrier
+    if (a != 48 || b != 39 || c != 18)
+      abort ();
+  }
+  return 0;
+}
--- libgomp/testsuite/libgomp.c/target-33.c.jj	2015-11-12 16:20:23.332860573 +0100
+++ libgomp/testsuite/libgomp.c/target-33.c	2015-11-13 09:45:27.000000000 +0100
@@ -0,0 +1,93 @@
+extern void abort (void);
+
+int
+main ()
+{
+  int a = 1, b = 2, c = 4, d[7];
+  #pragma omp taskgroup
+  {
+    #pragma omp target enter data nowait map (to: a, b, c) depend(out: d[0])
+    #pragma omp target nowait map (alloc: a, b) depend(in: d[0]) depend(out: d[1])
+    {
+      #pragma omp atomic update
+      a |= 4;
+      #pragma omp atomic update
+      b |= 8;
+    }
+    #pragma omp target nowait map (alloc: a, c) depend(in: d[0]) depend(out: d[2])
+    {
+      #pragma omp atomic update
+      a |= 16;
+      #pragma omp atomic update
+      c |= 32;
+    }
+    #pragma omp target exit data nowait map (from: a, b, c) depend(in: d[1], d[2])
+  }
+  if (a != 21 || b != 10 || c != 36)
+    abort ();
+  #pragma omp target map (tofrom: a, b) nowait
+  {
+    a &= ~16;
+    b &= ~2;
+  }
+  #pragma omp target map (tofrom: c) nowait
+  {
+    c |= 8;
+  }
+  #pragma omp barrier
+  if (a != 5 || b != 8 || c != 44)
+    abort ();
+  #pragma omp target map (tofrom: a, b) nowait
+  {
+    a |= 32;
+    b |= 4;
+  }
+  #pragma omp target map (tofrom: c) nowait
+  {
+    c &= ~4;
+  }
+  #pragma omp taskwait
+  if (a != 37 || b != 12 || c != 40)
+    abort ();
+  #pragma omp target nowait map (tofrom: a, b) depend(out: d[3])
+  {
+    #pragma omp atomic update
+    a = a + 9;
+    b -= 8;
+  }
+  #pragma omp target nowait map (tofrom: a, c) depend(out: d[4])
+  {
+    #pragma omp atomic update
+    a = a + 4;
+    c >>= 1;
+  }
+  #pragma omp task if (0) depend (in: d[3], d[4]) shared (a, b, c)
+  if (a != 50 || b != 4 || c != 20)
+    abort ();
+  #pragma omp task shared (a)
+  a += 50;
+  #pragma omp target nowait map (tofrom: b)
+  b++;
+  #pragma omp target map (tofrom: c) nowait
+  c--;
+  #pragma omp taskwait
+  if (a != 100 || b != 5 || c != 19)
+    abort ();
+  #pragma omp target map (tofrom: a) nowait depend(out: d[5])
+  a++;
+  #pragma omp target map (tofrom: b) nowait depend(out: d[6])
+  b++;
+  #pragma omp target map (tofrom: a, b) depend(in: d[5], d[6])
+  {
+    if (a != 101 || b != 6)
+      a = -9;
+    else
+      {
+	a = 24;
+	b = 38;
+      }
+  }
+  if (a != 24 || b != 38)
+    abort ();
+  return 0;
+}
--- libgomp/team.c.jj	2015-11-09 11:14:37.000000000 +0100
+++ libgomp/team.c	2015-11-12 15:09:23.584644449 +0100
@@ -272,6 +272,8 @@ gomp_free_thread (void *arg __attribute_
       free (pool);
       thr->thread_pool = NULL;
     }
+  if (thr->ts.level == 0 && __builtin_expect (thr->ts.team != NULL, 0))
+    gomp_team_end ();
   if (thr->task != NULL)
     {
       struct gomp_task *task = thr->task;
@@ -301,7 +303,7 @@ gomp_team_start (void (*fn) (void *), vo
   struct gomp_thread **affinity_thr = NULL;
 
   thr = gomp_thread ();
-  nested = thr->ts.team != NULL;
+  nested = thr->ts.level;
   pool = thr->thread_pool;
   task = thr->task;
   icv = task ? &task->icv : &gomp_global_icv;
--- libgomp/target.c.jj	2015-11-09 11:14:37.325239961 +0100
+++ libgomp/target.c	2015-11-13 11:42:28.255345131 +0100
@@ -1348,17 +1348,7 @@ GOMP_target (int device, void (*fn) (voi
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     GOMP_MAP_VARS_TARGET);
-  struct gomp_thread old_thr, *thr = gomp_thread ();
-  old_thr = *thr;
-  memset (thr, '\0', sizeof (*thr));
-  if (gomp_places_list)
-    {
-      thr->place = old_thr.place;
-      thr->ts.place_partition_len = gomp_places_list_len;
-    }
   devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
-  gomp_free_thread (thr);
-  *thr = old_thr;
   gomp_unmap_vars (tgt_vars, true);
 }
 
@@ -1387,10 +1377,52 @@ GOMP_target_ext (int device, void (*fn)
   (void) num_teams;
   (void) thread_limit;
 
-  /* If there are depend clauses, but nowait is not present,
-     block the parent task until the dependencies are resolved
-     and then just continue with the rest of the function as if it
-     is a merged task.  */
+  if (flags & GOMP_TARGET_FLAG_NOWAIT)
+    {
+      struct gomp_thread *thr = gomp_thread ();
+      /* Create a team if we don't have any around, as nowait
+	 target tasks make sense to run asynchronously even when
+	 outside of any parallel.  */
+      if (__builtin_expect (thr->ts.team == NULL, 0))
+	{
+	  struct gomp_team *team = gomp_new_team (1);
+	  struct gomp_task *task = thr->task;
+	  struct gomp_task_icv *icv = task ? &task->icv : &gomp_global_icv;
+	  team->prev_ts = thr->ts;
+	  thr->ts.team = team;
+	  thr->ts.team_id = 0;
+	  thr->ts.work_share = &team->work_shares[0];
+	  thr->ts.last_work_share = NULL;
+#ifdef HAVE_SYNC_BUILTINS
+	  thr->ts.single_count = 0;
+#endif
+	  thr->ts.static_trip = 0;
+	  thr->task = &team->implicit_task[0];
+	  gomp_init_task (thr->task, NULL, icv);
+	  if (task)
+	    {
+	      thr->task = task;
+	      gomp_end_task ();
+	      free (task);
+	      thr->task = &team->implicit_task[0];
+	    }
+	  else
+	    pthread_setspecific (gomp_thread_destructor, thr);
+	}
+      if (thr->ts.team
+	  && !thr->task->final_task)
+	{
+	  gomp_create_target_task (devicep, fn, mapnum, hostaddrs,
+				   sizes, kinds, flags, depend,
+				   GOMP_TARGET_TASK_BEFORE_MAP);
+	  return;
+	}
+    }
+
+  /* If there are depend clauses, but nowait is not present
+     (or we are in a final task), block the parent task until the
+     dependencies are resolved and then just continue with the rest
+     of the function as if it is a merged task.  */
   if (depend != NULL)
     {
       struct gomp_thread *thr = gomp_thread ();
@@ -1410,17 +1442,7 @@ GOMP_target_ext (int device, void (*fn)
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
 		     GOMP_MAP_VARS_TARGET);
-  struct gomp_thread old_thr, *thr = gomp_thread ();
-  old_thr = *thr;
-  memset (thr, '\0', sizeof (*thr));
-  if (gomp_places_list)
-    {
-      thr->place = old_thr.place;
-      thr->ts.place_partition_len = gomp_places_list_len;
-    }
   devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
-  gomp_free_thread (thr);
-  *thr = old_thr;
   gomp_unmap_vars (tgt_vars, true);
 }
 
@@ -1527,23 +1549,25 @@ GOMP_target_update_ext (int device, size
 	      && thr->ts.team
 	      && !thr->task->final_task)
 	    {
-	      gomp_create_target_task (devicep, (void (*) (void *)) NULL,
-				       mapnum, hostaddrs, sizes, kinds,
-				       flags | GOMP_TARGET_FLAG_UPDATE,
-				       depend);
-	      return;
+	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
+					   mapnum, hostaddrs, sizes, kinds,
+					   flags | GOMP_TARGET_FLAG_UPDATE,
+					   depend, GOMP_TARGET_TASK_DATA))
+		return;
 	    }
+	  else
+	    {
+	      struct gomp_team *team = thr->ts.team;
+	      /* If parallel or taskgroup has been cancelled, don't start new
+		 tasks.  */
+	      if (team
+		  && (gomp_team_barrier_cancelled (&team->barrier)
+		      || (thr->task->taskgroup
+			  && thr->task->taskgroup->cancelled)))
+		return;
 
-	  struct gomp_team *team = thr->ts.team;
-	  /* If parallel or taskgroup has been cancelled, don't start new
-	     tasks.  */
-	  if (team
-	      && (gomp_team_barrier_cancelled (&team->barrier)
-		  || (thr->task->taskgroup
-		      && thr->task->taskgroup->cancelled)))
-	    return;
-
-	  gomp_task_maybe_wait_for_dependencies (depend);
+	      gomp_task_maybe_wait_for_dependencies (depend);
+	    }
 	}
     }
 
@@ -1647,22 +1671,25 @@ GOMP_target_enter_exit_data (int device,
 	      && thr->ts.team
 	      && !thr->task->final_task)
 	    {
-	      gomp_create_target_task (devicep, (void (*) (void *)) NULL,
-				       mapnum, hostaddrs, sizes, kinds,
-				       flags, depend);
-	      return;
+	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
+					   mapnum, hostaddrs, sizes, kinds,
+					   flags, depend,
+					   GOMP_TARGET_TASK_DATA))
+		return;
 	    }
+	  else
+	    {
+	      struct gomp_team *team = thr->ts.team;
+	      /* If parallel or taskgroup has been cancelled, don't start new
+		 tasks.  */
+	      if (team
+		  && (gomp_team_barrier_cancelled (&team->barrier)
+		      || (thr->task->taskgroup
+			  && thr->task->taskgroup->cancelled)))
+		return;
 
-	  struct gomp_team *team = thr->ts.team;
-	  /* If parallel or taskgroup has been cancelled, don't start new
-	     tasks.  */
-	  if (team
-	      && (gomp_team_barrier_cancelled (&team->barrier)
-		  || (thr->task->taskgroup
-		      && thr->task->taskgroup->cancelled)))
-	    return;
-
-	  gomp_task_maybe_wait_for_dependencies (depend);
+	      gomp_task_maybe_wait_for_dependencies (depend);
+	    }
 	}
     }
 
@@ -1694,38 +1721,65 @@ GOMP_target_enter_exit_data (int device,
     gomp_exit_data (devicep, mapnum, hostaddrs, sizes, kinds);
 }
 
-void
+bool
 gomp_target_task_fn (void *data)
 {
   struct gomp_target_task *ttask = (struct gomp_target_task *) data;
+  struct gomp_device_descr *devicep = ttask->devicep;
+
   if (ttask->fn != NULL)
     {
-      /* GOMP_target_ext */
-    }
-  else if (ttask->devicep == NULL
-	   || !(ttask->devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
-    return;
+      if (devicep == NULL
+	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+	{
+	  ttask->state = GOMP_TARGET_TASK_FALLBACK;
+	  gomp_target_fallback_firstprivate (ttask->fn, ttask->mapnum,
+					     ttask->hostaddrs, ttask->sizes,
+					     ttask->kinds);
+	  return false;
+	}
+
+      if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+	{
+	  gomp_unmap_vars (ttask->tgt, true);
+	  return false;
+	}
+
+      void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
+      ttask->tgt
+	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
+			 ttask->sizes, ttask->kinds, true,
+			 GOMP_MAP_VARS_TARGET);
+      ttask->state = GOMP_TARGET_TASK_READY_TO_RUN;
+
+      devicep->async_run_func (devicep->target_id, fn_addr,
+			       (void *) ttask->tgt->tgt_start, (void *) ttask);
+      return true;
+    }
+  else if (devicep == NULL
+	   || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+    return false;
 
   size_t i;
   if (ttask->flags & GOMP_TARGET_FLAG_UPDATE)
-    gomp_update (ttask->devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
+    gomp_update (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
 		 ttask->kinds, true);
   else if ((ttask->flags & GOMP_TARGET_FLAG_EXIT_DATA) == 0)
     for (i = 0; i < ttask->mapnum; i++)
       if ((ttask->kinds[i] & 0xff) == GOMP_MAP_STRUCT)
 	{
-	  gomp_map_vars (ttask->devicep, ttask->sizes[i] + 1,
-			 &ttask->hostaddrs[i], NULL, &ttask->sizes[i],
-			 &ttask->kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
+	  gomp_map_vars (devicep, ttask->sizes[i] + 1, &ttask->hostaddrs[i],
+			 NULL, &ttask->sizes[i], &ttask->kinds[i], true,
+			 GOMP_MAP_VARS_ENTER_DATA);
 	  i += ttask->sizes[i];
 	}
       else
-	gomp_map_vars (ttask->devicep, 1, &ttask->hostaddrs[i], NULL,
-		       &ttask->sizes[i], &ttask->kinds[i],
-		       true, GOMP_MAP_VARS_ENTER_DATA);
+	gomp_map_vars (devicep, 1, &ttask->hostaddrs[i], NULL, &ttask->sizes[i],
+		       &ttask->kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
   else
-    gomp_exit_data (ttask->devicep, ttask->mapnum, ttask->hostaddrs,
-		    ttask->sizes, ttask->kinds);
+    gomp_exit_data (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
+		    ttask->kinds);
+  return false;
 }
 
 void
@@ -2170,6 +2224,7 @@ gomp_load_plugin_for_device (struct gomp
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
     {
       DLSYM (run);
+      DLSYM (async_run);
       DLSYM (dev2dev);
     }
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
--- libgomp/libgomp.map.jj	2015-10-26 15:38:20.000000000 +0100
+++ libgomp/libgomp.map	2015-11-11 16:15:23.807818735 +0100
@@ -406,3 +406,8 @@ GOMP_PLUGIN_1.0 {
 	GOMP_PLUGIN_async_unmap_vars;
 	GOMP_PLUGIN_acc_thread;
 };
+
+GOMP_PLUGIN_1.1 {
+  global:
+	GOMP_PLUGIN_target_task_completion;
+} GOMP_PLUGIN_1.0;
--- libgomp/task.c.jj	2015-11-09 11:14:37.332239862 +0100
+++ libgomp/task.c	2015-11-13 15:36:05.954411999 +0100
@@ -480,13 +480,119 @@ ialias (GOMP_taskgroup_end)
 #undef UTYPE
 #undef GOMP_taskloop
 
-/* Called for nowait target tasks.  */
+static void inline
+priority_queue_move_task_first (enum priority_queue_type type,
+				struct priority_queue *head,
+				struct gomp_task *task)
+{
+#if _LIBGOMP_CHECKING_
+  if (!priority_queue_task_in_queue_p (type, head, task))
+    gomp_fatal ("Attempt to move first missing task %p", task);
+#endif
+  struct priority_list *list;
+  if (priority_queue_multi_p (head))
+    {
+      list = priority_queue_lookup_priority (head, task->priority);
+#if _LIBGOMP_CHECKING_
+      if (!list)
+	gomp_fatal ("Unable to find priority %d", task->priority);
+#endif
+    }
+  else
+    list = &head->l;
+  priority_list_remove (list, task_to_priority_node (type, task), 0);
+  priority_list_insert (type, list, task, task->priority,
+			PRIORITY_INSERT_BEGIN, type == PQ_CHILDREN,
+			task->parent_depends_on);
+}
+
+/* Actual body of GOMP_PLUGIN_target_task_completion that is executed
+   with team->task_lock held, or is executed in the thread that called
+   gomp_target_task_fn if GOMP_PLUGIN_target_task_completion has been
+   run before it acquires team->task_lock.  */
+
+static void
+gomp_target_task_completion (struct gomp_team *team, struct gomp_task *task)
+{
+  struct gomp_task *parent = task->parent;
+  if (parent)
+    priority_queue_move_task_first (PQ_CHILDREN, &parent->children_queue,
+				    task);
+
+  struct gomp_taskgroup *taskgroup = task->taskgroup;
+  if (taskgroup)
+    priority_queue_move_task_first (PQ_TASKGROUP, &taskgroup->taskgroup_queue,
+				    task);
+
+  priority_queue_insert (PQ_TEAM, &team->task_queue, task, task->priority,
+			 PRIORITY_INSERT_BEGIN, false,
+			 task->parent_depends_on);
+  task->kind = GOMP_TASK_WAITING;
+  if (parent && parent->taskwait)
+    {
+      if (parent->taskwait->in_taskwait)
+	{
+	  /* One more task has had its dependencies met.
+	     Inform any waiters.  */
+	  parent->taskwait->in_taskwait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+      else if (parent->taskwait->in_depend_wait)
+	{
+	  /* One more task has had its dependencies met.
+	     Inform any waiters.  */
+	  parent->taskwait->in_depend_wait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+    }
+  if (taskgroup && taskgroup->in_taskgroup_wait)
+    {
+      /* One more task has had its dependencies met.
+	 Inform any waiters.  */
+      taskgroup->in_taskgroup_wait = false;
+      gomp_sem_post (&taskgroup->taskgroup_sem);
+    }
+
+  ++team->task_queued_count;
+  gomp_team_barrier_set_task_pending (&team->barrier);
+  /* I'm afraid this can't be done after releasing team->task_lock,
+     as gomp_target_task_completion is run from unrelated thread and
+     therefore in between gomp_mutex_unlock and gomp_team_barrier_wake
+     the team could be gone already.  */
+  if (team->nthreads > team->task_running_count)
+    gomp_team_barrier_wake (&team->barrier, 1);
+}
+
+/* Signal that a target task TTASK has completed the asynchronously
+   running phase and should be requeued as a task to handle the
+   variable unmapping.  */
 
 void
+GOMP_PLUGIN_target_task_completion (void *data)
+{
+  struct gomp_target_task *ttask = (struct gomp_target_task *) data;
+  struct gomp_task *task = ttask->task;
+  struct gomp_team *team = ttask->team;
+
+  gomp_mutex_lock (&team->task_lock);
+  if (ttask->state == GOMP_TARGET_TASK_READY_TO_RUN)
+    {
+      ttask->state = GOMP_TARGET_TASK_FINISHED;
+      gomp_mutex_unlock (&team->task_lock);
+    }
+  ttask->state = GOMP_TARGET_TASK_FINISHED;
+  gomp_target_task_completion (team, task);
+  gomp_mutex_unlock (&team->task_lock);
+}
+
+/* Called for nowait target tasks.  */
+
+bool
 gomp_create_target_task (struct gomp_device_descr *devicep,
 			 void (*fn) (void *), size_t mapnum, void **hostaddrs,
 			 size_t *sizes, unsigned short *kinds,
-			 unsigned int flags, void **depend)
+			 unsigned int flags, void **depend,
+			 enum gomp_target_task_state state)
 {
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr->ts.team;
@@ -495,7 +601,7 @@ gomp_create_target_task (struct gomp_dev
   if (team
       && (gomp_team_barrier_cancelled (&team->barrier)
 	  || (thr->task->taskgroup && thr->task->taskgroup->cancelled)))
-    return;
+    return true;
 
   struct gomp_target_task *ttask;
   struct gomp_task *task;
@@ -503,19 +609,45 @@ gomp_create_target_task (struct gomp_dev
   struct gomp_taskgroup *taskgroup = parent->taskgroup;
   bool do_wake;
   size_t depend_size = 0;
+  uintptr_t depend_cnt = 0;
+  size_t tgt_align = 0, tgt_size = 0;
 
   if (depend != NULL)
-    depend_size = ((uintptr_t) depend[0]
-		   * sizeof (struct gomp_task_depend_entry));
+    {
+      depend_cnt = (uintptr_t) depend[0];
+      depend_size = depend_cnt * sizeof (struct gomp_task_depend_entry);
+    }
+  if (fn)
+    {
+      /* GOMP_MAP_FIRSTPRIVATE need to be copied first, as they are
+	 firstprivate on the target task.  */
+      size_t i;
+      for (i = 0; i < mapnum; i++)
+	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
+	  {
+	    size_t align = (size_t) 1 << (kinds[i] >> 8);
+	    if (tgt_align < align)
+	      tgt_align = align;
+	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	    tgt_size += sizes[i];
+	  }
+      if (tgt_align)
+	tgt_size += tgt_align - 1;
+      else
+	tgt_size = 0;
+    }
+
   task = gomp_malloc (sizeof (*task) + depend_size
 		      + sizeof (*ttask)
 		      + mapnum * (sizeof (void *) + sizeof (size_t)
-				  + sizeof (unsigned short)));
+				  + sizeof (unsigned short))
+		      + tgt_size);
   gomp_init_task (task, parent, gomp_icv (false));
+  task->priority = 0;
   task->kind = GOMP_TASK_WAITING;
   task->in_tied_task = parent->in_tied_task;
   task->taskgroup = taskgroup;
-  ttask = (struct gomp_target_task *) &task->depend[(uintptr_t) depend[0]];
+  ttask = (struct gomp_target_task *) &task->depend[depend_cnt];
   ttask->devicep = devicep;
   ttask->fn = fn;
   ttask->mapnum = mapnum;
@@ -524,8 +656,29 @@ gomp_create_target_task (struct gomp_dev
   memcpy (ttask->sizes, sizes, mapnum * sizeof (size_t));
   ttask->kinds = (unsigned short *) &ttask->sizes[mapnum];
   memcpy (ttask->kinds, kinds, mapnum * sizeof (unsigned short));
+  if (tgt_align)
+    {
+      char *tgt = (char *) &ttask->kinds[mapnum];
+      size_t i;
+      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
+      if (al)
+	tgt += tgt_align - al;
+      tgt_size = 0;
+      for (i = 0; i < mapnum; i++)
+	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
+	  {
+	    size_t align = (size_t) 1 << (kinds[i] >> 8);
+	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
+	    ttask->hostaddrs[i] = tgt + tgt_size;
+	    tgt_size = tgt_size + sizes[i];
+	  }
+    }
   ttask->flags = flags;
-  task->fn = gomp_target_task_fn;
+  ttask->state = state;
+  ttask->task = task;
+  ttask->team = team;
+  task->fn = NULL;
   task->fn_data = ttask;
   task->final_task = 0;
   gomp_mutex_lock (&team->task_lock);
@@ -536,19 +689,65 @@ gomp_create_target_task (struct gomp_dev
       gomp_mutex_unlock (&team->task_lock);
       gomp_finish_task (task);
       free (task);
-      return;
+      return true;
     }
-  if (taskgroup)
-    taskgroup->num_children++;
   if (depend_size)
     {
       gomp_task_handle_depend (task, parent, depend);
       if (task->num_dependees)
 	{
+	  if (taskgroup)
+	    taskgroup->num_children++;
 	  gomp_mutex_unlock (&team->task_lock);
-	  return;
+	  return true;
 	}
     }
+  if (state == GOMP_TARGET_TASK_DATA)
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      gomp_finish_task (task);
+      free (task);
+      return false;
+    }
+  if (taskgroup)
+    taskgroup->num_children++;
+  /* For async offloading, if we don't need to wait for dependencies,
+     run the gomp_target_task_fn right away, essentially schedule the
+     mapping part of the task in the current thread.  */
+  if (devicep != NULL
+      && (devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+    {
+      priority_queue_insert (PQ_CHILDREN, &parent->children_queue, task, 0,
+			     PRIORITY_INSERT_END,
+			     /*adjust_parent_depends_on=*/false,
+			     task->parent_depends_on);
+      if (taskgroup)
+	priority_queue_insert (PQ_TASKGROUP, &taskgroup->taskgroup_queue,
+			       task, 0, PRIORITY_INSERT_END,
+			       /*adjust_parent_depends_on=*/false,
+			       task->parent_depends_on);
+      task->pnode[PQ_TEAM].next = NULL;
+      task->pnode[PQ_TEAM].prev = NULL;
+      task->kind = GOMP_TASK_TIED;
+      ++team->task_count;
+      gomp_mutex_unlock (&team->task_lock);
+
+      thr->task = task;
+      gomp_target_task_fn (task->fn_data);
+      thr->task = parent;
+
+      gomp_mutex_lock (&team->task_lock);
+      task->kind = GOMP_TASK_ASYNC_RUNNING;
+      /* If GOMP_PLUGIN_target_task_completion has run already
+	 in between gomp_target_task_fn and the mutex lock,
+	 perform the requeuing here.  */
+      if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+	gomp_target_task_completion (team, task);
+      else
+	ttask->state = GOMP_TARGET_TASK_RUNNING;
+      gomp_mutex_unlock (&team->task_lock);
+      return true;
+    }
   priority_queue_insert (PQ_CHILDREN, &parent->children_queue, task, 0,
 			 PRIORITY_INSERT_BEGIN,
 			 /*adjust_parent_depends_on=*/false,
@@ -570,6 +769,7 @@ gomp_create_target_task (struct gomp_dev
   gomp_mutex_unlock (&team->task_lock);
   if (do_wake)
     gomp_team_barrier_wake (&team->barrier, 1);
+  return true;
 }
 
 /* Given a parent_depends_on task in LIST, move it to the front of its
@@ -1041,7 +1241,29 @@ gomp_barrier_handle_tasks (gomp_barrier_
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  team->task_running_count--;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
@@ -1170,7 +1392,28 @@ GOMP_taskwait (void)
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
@@ -1342,7 +1585,28 @@ gomp_task_maybe_wait_for_dependencies (v
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
@@ -1423,6 +1687,17 @@ GOMP_taskgroup_end (void)
   if (team == NULL)
     return;
   taskgroup = task->taskgroup;
+  if (__builtin_expect (taskgroup == NULL, 0)
+      && thr->ts.level == 0)
+    {
+      /* This can happen if GOMP_taskgroup_start is called when
+	 thr->ts.team == NULL, but inside of the taskgroup there
+	 is #pragma omp target nowait that creates an implicit
+	 team with a single thread.  In this case, we want to wait
+	 for all outstanding tasks in this team.  */
+      gomp_team_barrier_wait (&team->barrier);
+      return;
+    }
 
   /* The acquire barrier on load of taskgroup->num_children here
      synchronizes with the write of 0 in gomp_task_run_post_remove_taskgroup.
@@ -1450,8 +1725,8 @@ GOMP_taskgroup_end (void)
 		= priority_queue_next_task (PQ_CHILDREN, &task->children_queue,
 					    PQ_TEAM, &team->task_queue,
 					    &unused);
-            }
-          else
+	    }
+	  else
 	    {
 	      gomp_mutex_unlock (&team->task_lock);
 	      if (to_free)
@@ -1506,7 +1781,28 @@ GOMP_taskgroup_end (void)
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
--- libgomp/priority_queue.c.jj	2015-11-09 11:15:33.000000000 +0100
+++ libgomp/priority_queue.c	2015-11-10 17:52:33.769414428 +0100
@@ -85,7 +85,7 @@ priority_queue_task_in_queue_p (enum pri
    order.  LIST is a priority list of type TYPE.
 
    The expected order is that GOMP_TASK_WAITING tasks come before
-   GOMP_TASK_TIED ones.
+   GOMP_TASK_TIED/GOMP_TASK_ASYNC_RUNNING ones.
 
    If CHECK_DEPS is TRUE, we also check that parent_depends_on WAITING
    tasks come before !parent_depends_on WAITING tasks.  This is only
@@ -104,7 +104,7 @@ priority_list_verify (enum priority_queu
       struct gomp_task *t = priority_node_to_task (type, p);
       if (seen_tied && t->kind == GOMP_TASK_WAITING)
 	gomp_fatal ("priority_queue_verify: WAITING task after TIED");
-      if (t->kind == GOMP_TASK_TIED)
+      if (t->kind >= GOMP_TASK_TIED)
 	seen_tied = true;
       else if (check_deps && t->kind == GOMP_TASK_WAITING)
 	{
--- libgomp/libgomp.h.jj	2015-11-09 11:14:37.326239947 +0100
+++ libgomp/libgomp.h	2015-11-13 11:41:46.743939113 +0100
@@ -373,7 +373,12 @@ enum gomp_task_kind
   /* Task created by GOMP_task and waiting to be run.  */
   GOMP_TASK_WAITING,
   /* Task currently executing or scheduled and about to execute.  */
-  GOMP_TASK_TIED
+  GOMP_TASK_TIED,
+  /* Used for target tasks that have vars mapped and async run started,
+     but not yet completed.  Once that completes, they will be readded
+     into the queues as GOMP_TASK_WAITING in order to perform the var
+     unmapping.  */
+  GOMP_TASK_ASYNC_RUNNING
 };
 
 struct gomp_task_depend_entry
@@ -453,6 +458,8 @@ struct gomp_task
   struct gomp_task_depend_entry depend[];
 };
 
+/* This structure describes a single #pragma omp taskgroup.  */
+
 struct gomp_taskgroup
 {
   struct gomp_taskgroup *prev;
@@ -464,6 +471,19 @@ struct gomp_taskgroup
   size_t num_children;
 };
 
+/* Various state of OpenMP async offloading tasks.  */
+enum gomp_target_task_state
+{
+  GOMP_TARGET_TASK_DATA,
+  GOMP_TARGET_TASK_BEFORE_MAP,
+  GOMP_TARGET_TASK_FALLBACK,
+  GOMP_TARGET_TASK_READY_TO_RUN,
+  GOMP_TARGET_TASK_RUNNING,
+  GOMP_TARGET_TASK_FINISHED
+};
+
+/* This structure describes a target task.  */
+
 struct gomp_target_task
 {
   struct gomp_device_descr *devicep;
@@ -472,6 +492,10 @@ struct gomp_target_task
   size_t *sizes;
   unsigned short *kinds;
   unsigned int flags;
+  enum gomp_target_task_state state;
+  struct target_mem_desc *tgt;
+  struct gomp_task *task;
+  struct gomp_team *team;
   void *hostaddrs[];
 };
 
@@ -723,10 +747,10 @@ extern void gomp_init_task (struct gomp_
 extern void gomp_end_task (void);
 extern void gomp_barrier_handle_tasks (gomp_barrier_state_t);
 extern void gomp_task_maybe_wait_for_dependencies (void **);
-extern void gomp_create_target_task (struct gomp_device_descr *,
+extern bool gomp_create_target_task (struct gomp_device_descr *,
 				     void (*) (void *), size_t, void **,
 				     size_t *, unsigned short *, unsigned int,
-				     void **);
+				     void **, enum gomp_target_task_state);
 
 static void inline
 gomp_finish_task (struct gomp_task *task)
@@ -747,7 +771,7 @@ extern void gomp_free_thread (void *);
 
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
-extern void gomp_target_task_fn (void *);
+extern bool gomp_target_task_fn (void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
@@ -901,6 +925,7 @@ struct gomp_device_descr
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void *(*dev2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
+  void (*async_run_func) (int, void *, void *, void *);
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s mem_map;

	Jakub

  reply	other threads:[~2015-11-13 15:12 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-08  9:26 [gomp4.1] depend nowait support for target {update,{enter,exit} data} Jakub Jelinek
2015-10-02 19:28 ` Ilya Verbin
2015-10-15 14:02   ` Jakub Jelinek
2015-10-15 16:18     ` Alexander Monakov
2015-10-15 17:18       ` Jakub Jelinek
2015-10-15 18:11         ` Alexander Monakov
2015-10-15 16:42     ` Ilya Verbin
2015-10-16 11:50     ` Martin Jambor
2015-10-19 19:55     ` Ilya Verbin
2015-11-11 16:52       ` [gomp4.5] depend nowait support for target Jakub Jelinek
2015-11-12 17:44         ` Ilya Verbin
2015-11-12 17:58           ` Jakub Jelinek
2015-11-12 18:07             ` Ilya Verbin
2015-11-12 17:45         ` Jakub Jelinek
2015-11-12 20:52           ` Ilya Verbin
2015-11-13 10:18             ` Jakub Jelinek
2015-11-13 15:12               ` Jakub Jelinek [this message]
2015-11-13 16:37                 ` Ilya Verbin
2015-11-13 16:42                   ` Jakub Jelinek
2015-11-13 18:37                     ` Ilya Verbin
2015-11-23 14:16                 ` [hsa] " Martin Jambor
2015-11-23 14:25                   ` Jakub Jelinek
2015-11-25 15:41                     ` Martin Jambor

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151113151150.GQ5675@tucnak.redhat.com \
    --to=jakub@redhat.com \
    --cc=aldyh@redhat.com \
    --cc=amonakov@ispras.ru \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=iverbin@gmail.com \
    --cc=kirill.yukhin@gmail.com \
    --cc=mjambor@suse.cz \
    --cc=thomas@codesourcery.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).