public inbox for gcc-patches@gcc.gnu.org
* [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
@ 2021-01-21 19:33 Kwok Cheung Yeung
  2021-01-21 22:46 ` Kwok Cheung Yeung
  2021-01-29 15:03 ` Jakub Jelinek
  0 siblings, 2 replies; 22+ messages in thread
From: Kwok Cheung Yeung @ 2021-01-21 19:33 UTC
  To: Jakub Jelinek, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 2967 bytes --]

Hello

This patch addresses the intermittent hanging seen in the 
libgomp.fortran/task-detach-6.f90 test.

The main problem is due to the 'omp taskwait' in the test. GOMP_taskwait can run 
tasks, so for correct semantics it needs to be able to place finished tasks that 
have unfulfilled completion events into the detach queue, rather than just 
finishing them immediately (in effect ignoring the detach clause).

Unfinished tasks in the detach queue are still children of their parent task, so 
they can appear in next_task in the main GOMP_taskwait loop. If next_task is 
fulfilled then it can be finished immediately, otherwise it will wait on 
taskwait_sem.

omp_fulfill_event needs to be able to post the taskwait_sem semaphore as well as 
wake the team barrier. Since the semaphore is located on the parent of the task 
whose completion event is being fulfilled, I have changed the event handle to 
be a pointer to the task instead of just the completion semaphore, so that the 
parent field can be accessed.
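
A minimal sketch of why the handle has to identify the task rather than only 
its semaphore. The types here (task_model, event_handle_model) and the helper 
names are hypothetical stand-ins for illustration, not the libgomp definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for omp_event_handle_t: an integer wide enough
   to carry a pointer.  */
typedef uintptr_t event_handle_model;

/* Hypothetical, heavily reduced task record.  */
struct task_model
{
  struct task_model *parent;  /* task that created this one, or NULL */
  int completion_sem;         /* stand-in for the completion semaphore */
};

/* Encode the task itself as the event handle.  */
static event_handle_model
handle_from_task (struct task_model *task)
{
  return (event_handle_model) task;
}

/* With the task as the handle, the parent (and therefore anything stored
   on the parent, such as a taskwait semaphore) is reachable.  */
static struct task_model *
parent_from_handle (event_handle_model handle)
{
  struct task_model *task = (struct task_model *) handle;
  assert (task != NULL);
  return task->parent;
}
```

Had the handle been the address of completion_sem alone, there would be no 
portable way in this model to get from it to the parent field.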

This type of code is currently used to wake the threads for the team barrier:

   if (team->nthreads > team->task_running_count)
     gomp_team_barrier_wake (&team->barrier, 1);

This issues a gomp_team_barrier_wake if any of the threads are not running a 
task (and so might be sleeping). However, detach tasks that are queued waiting 
for a completion event are currently included in task_running_count (because the 
finish_cancelled code executed later decrements it). Since 
gomp_barrier_handle_tasks does not block if there are unfinished detached tasks 
remaining (since during development I found that doing so could cause deadlocks 
in single-threaded code), threads could be sleeping even if team->nthreads == 
team->task_running_count, and this code would fail to wake them. I fixed this by 
decrementing task_running_count when queuing an unfinished detach task, and 
skipping the decrement in finish_cancelled if the task was a queued detach task. 
I added a new gomp_task_kind GOMP_TASK_DETACHED to mark this type of task.
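
The counting problem can be illustrated with a toy model. This is only a 
sketch: team_model and both helpers are hypothetical and keep nothing but the 
two counters that the wake test reads.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two counters involved; not the real struct gomp_team.  */
struct team_model
{
  unsigned nthreads;            /* threads in the team */
  unsigned task_running_count;  /* tasks accounted as running */
};

/* The wake test quoted above: wake only if some thread is not accounted
   as running a task.  */
static bool
should_wake (const struct team_model *team)
{
  return team->nthreads > team->task_running_count;
}

/* A task body has returned but its completion event is unfulfilled.
   With DECREMENT false the task stays counted as running (the old
   behaviour); with DECREMENT true it leaves the count when it is queued
   (the behaviour described for the patch).  */
static void
queue_detached_task (struct team_model *team, bool decrement)
{
  if (decrement)
    {
      assert (team->task_running_count > 0);
      --team->task_running_count;
    }
}
```

With two threads and two tasks whose bodies have returned, the old accounting 
leaves the counters equal, so should_wake is false even though a thread may be 
asleep. Decrementing on queue makes the test true again.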

I have tried running the task-detach-6 testcase (C and Fortran) 10,000 
iterations at a time using 32 threads, on an x86_64 Linux machine with GCC built 
with --disable-linux-futex, and saw no hangs. I have checked that it bootstraps, and 
noticed no regressions in the libgomp testsuite when run without offloading.

With Nvidia and GCN offloading though, task-detach-6 hangs... I _think_ the 
reason why it 'worked' before was because the taskwait allowed tasks with detach 
clauses to always complete immediately after execution. Since that backdoor has 
been closed, task-detach-6 hangs with or without the taskwait.

I think GOMP_taskgroup_end and maybe gomp_task_maybe_wait_for_dependencies also 
need the same type of TLC as they can also run tasks, but there are currently no 
tests that exercise it.

The detach support clearly needs more work, but is this particular patch okay 
for trunk?

Thanks

Kwok

[-- Attachment #2: 0001-openmp-Fix-intermittent-hanging-of-task-detach-6-lib.patch --]
[-- Type: text/plain, Size: 12806 bytes --]

From 12cc24c937e9294d5616dd0cd9a754c02ffb26fa Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Thu, 21 Jan 2021 05:38:47 -0800
Subject: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp
 tests [PR98738]

This adds support for the task detach clause to taskwait, and fixes a
number of problems related to semaphores that may lead to a hang in
some circumstances.

2021-01-21  Kwok Cheung Yeung  <kcy@codesourcery.com>

	libgomp/

	PR libgomp/98738
	* libgomp.h (enum gomp_task_kind): Add GOMP_TASK_DETACHED.
	* task.c (task_fulfilled_p): Check detach field as well.
	(GOMP_task): Use address of task as the event handle.
	(gomp_barrier_handle_tasks): Fix indentation.  Use address of task
	as event handle. Set kind of suspended detach task to
	GOMP_TASK_DETACHED and decrement task_running_count.  Move
	finish_cancelled block out of else branch.  Skip decrement of
	task_running_count if task kind is GOMP_TASK_DETACHED.
	(GOMP_taskwait): Finish fulfilled detach tasks.  Update comment.
	Queue detach tasks that have not been fulfilled.
	(omp_fulfill_event): Use address of task as event handle.  Post
	to taskwait_sem and taskgroup_sem if necessary.  Check
	task_running_count before calling gomp_team_barrier_wake.
	* testsuite/libgomp.c-c++-common/task-detach-5.c (main): Change
	data-sharing of detach events on enclosing parallel to private.
	* testsuite/libgomp.c-c++-common/task-detach-6.c (main): Likewise.
	* testsuite/libgomp.fortran/task-detach-5.f90 (task_detach_5):
	Likewise.
	* testsuite/libgomp.fortran/task-detach-6.f90 (task_detach_6):
	Likewise.
---
 libgomp/libgomp.h                                  |   5 +-
 libgomp/task.c                                     | 155 ++++++++++++++-------
 .../testsuite/libgomp.c-c++-common/task-detach-5.c |   2 +-
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |   2 +-
 .../testsuite/libgomp.fortran/task-detach-5.f90    |   2 +-
 .../testsuite/libgomp.fortran/task-detach-6.f90    |   2 +-
 6 files changed, 115 insertions(+), 53 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index b4d0c93..b24de5c 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -481,7 +481,10 @@ enum gomp_task_kind
      but not yet completed.  Once that completes, they will be readded
      into the queues as GOMP_TASK_WAITING in order to perform the var
      unmapping.  */
-  GOMP_TASK_ASYNC_RUNNING
+  GOMP_TASK_ASYNC_RUNNING,
+  /* Task that has finished executing but is waiting for its
+     completion event to be fulfilled.  */
+  GOMP_TASK_DETACHED
 };
 
 struct gomp_task_depend_entry
diff --git a/libgomp/task.c b/libgomp/task.c
index b242e7c..dbd6284 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -330,7 +330,7 @@ gomp_task_handle_depend (struct gomp_task *task, struct gomp_task *parent,
 static bool
 task_fulfilled_p (struct gomp_task *task)
 {
-  return gomp_sem_getcount (&task->completion_sem) > 0;
+  return task->detach && gomp_sem_getcount (&task->completion_sem) > 0;
 }
 
 /* Called when encountering an explicit task directive.  If IF_CLAUSE is
@@ -419,11 +419,11 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
 	{
 	  task.detach = true;
 	  gomp_sem_init (&task.completion_sem, 0);
-	  *(void **) detach = &task.completion_sem;
+	  *(void **) detach = &task;
 	  if (data)
-	    *(void **) data = &task.completion_sem;
+	    *(void **) data = &task;
 
-	  gomp_debug (0, "New event: %p\n", &task.completion_sem);
+	  gomp_debug (0, "New event: %p\n", &task);
 	}
 
       if (thr->task)
@@ -488,11 +488,11 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
 	{
 	  task->detach = true;
 	  gomp_sem_init (&task->completion_sem, 0);
-	  *(void **) detach = &task->completion_sem;
+	  *(void **) detach = task;
 	  if (data)
-	    *(void **) data = &task->completion_sem;
+	    *(void **) data = task;
 
-	  gomp_debug (0, "New event: %p\n", &task->completion_sem);
+	  gomp_debug (0, "New event: %p\n", task);
 	}
       thr->task = task;
       if (cpyfn)
@@ -1372,14 +1372,14 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
 				 child_task, MEMMODEL_RELAXED);
 	  --team->task_detach_count;
 	  gomp_debug (0, "thread %d: found task with fulfilled event %p\n",
-		      thr->ts.team_id, &child_task->completion_sem);
+		      thr->ts.team_id, &child_task);
 
-	if (to_free)
-	  {
-	    gomp_finish_task (to_free);
-	    free (to_free);
-	    to_free = NULL;
-	  }
+	  if (to_free)
+	    {
+	      gomp_finish_task (to_free);
+	      free (to_free);
+	      to_free = NULL;
+	    }
 	  goto finish_cancelled;
 	}
 
@@ -1452,41 +1452,43 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
 	{
 	  if (child_task->detach && !task_fulfilled_p (child_task))
 	    {
+	      child_task->kind = GOMP_TASK_DETACHED;
 	      priority_queue_insert (PQ_TEAM, &team->task_detach_queue,
 				     child_task, child_task->priority,
 				     PRIORITY_INSERT_END,
 				     false, false);
 	      ++team->task_detach_count;
-	      gomp_debug (0, "thread %d: queueing task with event %p\n",
-			  thr->ts.team_id, &child_task->completion_sem);
+	      --team->task_running_count;
+	      gomp_debug (0,
+			  "thread %d: queuing detached task with event %p\n",
+			  thr->ts.team_id, child_task);
 	      child_task = NULL;
+	      continue;
 	    }
-	  else
+
+	 finish_cancelled:;
+	  size_t new_tasks
+	    = gomp_task_run_post_handle_depend (child_task, team);
+	  gomp_task_run_post_remove_parent (child_task);
+	  gomp_clear_parent (&child_task->children_queue);
+	  gomp_task_run_post_remove_taskgroup (child_task);
+	  to_free = child_task;
+	  if (!cancelled && child_task->kind != GOMP_TASK_DETACHED)
+	    team->task_running_count--;
+	  child_task = NULL;
+	  if (new_tasks > 1)
 	    {
-	     finish_cancelled:;
-	      size_t new_tasks
-		= gomp_task_run_post_handle_depend (child_task, team);
-	      gomp_task_run_post_remove_parent (child_task);
-	      gomp_clear_parent (&child_task->children_queue);
-	      gomp_task_run_post_remove_taskgroup (child_task);
-	      to_free = child_task;
-	      child_task = NULL;
-	      if (!cancelled)
-		team->task_running_count--;
-	      if (new_tasks > 1)
-		{
-		  do_wake = team->nthreads - team->task_running_count;
-		  if (do_wake > new_tasks)
-		    do_wake = new_tasks;
-		}
-	      if (--team->task_count == 0
-		  && gomp_team_barrier_waiting_for_tasks (&team->barrier))
-		{
-		  gomp_team_barrier_done (&team->barrier, state);
-		  gomp_mutex_unlock (&team->task_lock);
-		  gomp_team_barrier_wake (&team->barrier, 0);
-		  gomp_mutex_lock (&team->task_lock);
-		}
+	      do_wake = team->nthreads - team->task_running_count;
+	      if (do_wake > new_tasks)
+		do_wake = new_tasks;
+	    }
+	  if (--team->task_count == 0
+	      && gomp_team_barrier_waiting_for_tasks (&team->barrier))
+	    {
+	      gomp_team_barrier_done (&team->barrier, state);
+	      gomp_mutex_unlock (&team->task_lock);
+	      gomp_team_barrier_wake (&team->barrier, 0);
+	      gomp_mutex_lock (&team->task_lock);
 	    }
 	}
     }
@@ -1556,10 +1558,28 @@ GOMP_taskwait (void)
 	      goto finish_cancelled;
 	    }
 	}
+      else if (next_task->kind == GOMP_TASK_DETACHED
+	       && task_fulfilled_p (next_task))
+	{
+	  child_task = next_task;
+	  gomp_debug (0, "thread %d: found task with fulfilled event %p\n",
+		      thr->ts.team_id, &child_task);
+	  priority_queue_remove (PQ_TEAM, &team->task_detach_queue,
+				 child_task, MEMMODEL_RELAXED);
+	  --team->task_detach_count;
+	  if (to_free)
+	    {
+	      gomp_finish_task (to_free);
+	      free (to_free);
+	      to_free = NULL;
+	    }
+	  goto finish_cancelled;
+	}
       else
 	{
 	/* All tasks we are waiting for are either running in other
-	   threads, or they are tasks that have not had their
+	   threads, are detached and waiting for the completion event to be
+	   fulfilled, or they are tasks that have not had their
 	   dependencies met (so they're not even in the queue).  Wait
 	   for them.  */
 	  if (task->taskwait == NULL)
@@ -1614,6 +1634,21 @@ GOMP_taskwait (void)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
+	  if (child_task->detach && !task_fulfilled_p (child_task))
+	    {
+	      child_task->kind = GOMP_TASK_DETACHED;
+	      priority_queue_insert (PQ_TEAM, &team->task_detach_queue,
+				     child_task, child_task->priority,
+				     PRIORITY_INSERT_END,
+				     false, false);
+	      ++team->task_detach_count;
+	      gomp_debug (0,
+			  "thread %d: queuing detached task with event %p\n",
+			  thr->ts.team_id, child_task);
+	      child_task = NULL;
+	      continue;
+	    }
+
 	 finish_cancelled:;
 	  size_t new_tasks
 	    = gomp_task_run_post_handle_depend (child_task, team);
@@ -2402,17 +2437,41 @@ ialias (omp_in_final)
 void
 omp_fulfill_event (omp_event_handle_t event)
 {
-  gomp_sem_t *sem = (gomp_sem_t *) event;
+  struct gomp_task *task = (struct gomp_task *) event;
+  struct gomp_task *parent = task->parent;
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr ? thr->ts.team : NULL;
 
-  if (gomp_sem_getcount (sem) > 0)
-    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", sem);
+  if (gomp_sem_getcount (&task->completion_sem) > 0)
+    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", task);
 
-  gomp_debug (0, "omp_fulfill_event: %p\n", sem);
-  gomp_sem_post (sem);
-  if (team)
+  gomp_debug (0, "omp_fulfill_event: %p\n", task);
+  gomp_sem_post (&task->completion_sem);
+
+  /* Wake up any threads that may be waiting for the detached task
+     to complete.  */
+  gomp_mutex_lock (&team->task_lock);
+  if (parent && parent->taskwait)
+    {
+      if (parent->taskwait->in_taskwait)
+	{
+	  parent->taskwait->in_taskwait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+      else if (parent->taskwait->in_depend_wait)
+	{
+	  parent->taskwait->in_depend_wait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+    }
+  if (task->taskgroup && task->taskgroup->in_taskgroup_wait)
+    {
+      task->taskgroup->in_taskgroup_wait = false;
+      gomp_sem_post (&task->taskgroup->taskgroup_sem);
+    }
+  if (team && team->nthreads > team->task_running_count)
     gomp_team_barrier_wake (&team->barrier, 1);
+  gomp_mutex_unlock (&team->task_lock);
 }
 
 ialias (omp_fulfill_event)
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
index 5a01517..71bcde9 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
@@ -12,7 +12,7 @@ int main (void)
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
 
-  #pragma omp parallel firstprivate(detach_event1, detach_event2)
+  #pragma omp parallel private(detach_event1, detach_event2)
   {
     #pragma omp single
       thread_count = omp_get_num_threads();
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index b5f68cc..e7af05a 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -14,7 +14,7 @@ int main (void)
   omp_event_handle_t detach_event1, detach_event2;
 
   #pragma omp target map(tofrom: x, y, z) map(from: thread_count)
-    #pragma omp parallel firstprivate(detach_event1, detach_event2)
+    #pragma omp parallel private(detach_event1, detach_event2)
       {
 	#pragma omp single
 	  thread_count = omp_get_num_threads();
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
index 955d687..8bebb5c 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
@@ -10,7 +10,7 @@ program task_detach_5
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  !$omp parallel firstprivate(detach_event1, detach_event2)
+  !$omp parallel private(detach_event1, detach_event2)
     !$omp single
       thread_count = omp_get_num_threads()
     !$omp end single
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index 0fe2155..437ca66 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -12,7 +12,7 @@ program task_detach_6
   integer :: thread_count
 
   !$omp target map(tofrom: x, y, z) map(from: thread_count)
-    !$omp parallel firstprivate(detach_event1, detach_event2)
+    !$omp parallel private(detach_event1, detach_event2)
       !$omp single
 	thread_count = omp_get_num_threads()
       !$omp end single
-- 
2.8.1



* Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-01-21 19:33 [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738] Kwok Cheung Yeung
@ 2021-01-21 22:46 ` Kwok Cheung Yeung
  2021-01-29 15:03 ` Jakub Jelinek
  1 sibling, 0 replies; 22+ messages in thread
From: Kwok Cheung Yeung @ 2021-01-21 22:46 UTC
  To: Jakub Jelinek, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

On 21/01/2021 7:33 pm, Kwok Cheung Yeung wrote:
> With Nvidia and GCN offloading though, task-detach-6 hangs... I _think_ the 
> reason why it 'worked' before was because the taskwait allowed tasks with detach 
> clauses to always complete immediately after execution. Since that backdoor has 
> been closed, task-detach-6 hangs with or without the taskwait.

It turns out that the hang is because the team barrier threads fail to wake up 
when gomp_team_barrier_wake is called from omp_fulfill_event, because it was 
done while task_lock was held. When the lock is freed first, the wake works as 
expected and the test completes.

Is this patch okay for trunk (to be squashed into the previous patch)?

Thanks

Kwok

[-- Attachment #2: 0001-openmp-Fix-hangs-when-task-constructs-with-detach-cl.patch --]
[-- Type: text/plain, Size: 1846 bytes --]

From 2ee183c22772bc7d80d24ae75d5bd57f419712fd Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Thu, 21 Jan 2021 14:01:16 -0800
Subject: [PATCH] openmp: Fix hangs when task constructs with detach clauses
 are offloaded

2021-01-21  Kwok Cheung Yeung  <kcy@codesourcery.com>

	libgomp/
	* task.c (GOMP_task): Add thread to debug message.
	(gomp_barrier_handle_tasks): Do not take address of child_task in
	debug message.
	(omp_fulfill_event): Release team->task_lock before waking team
	barrier threads.
---
 libgomp/task.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/libgomp/task.c b/libgomp/task.c
index dbd6284..60b598e 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -492,7 +492,7 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
 	  if (data)
 	    *(void **) data = task;
 
-	  gomp_debug (0, "New event: %p\n", task);
+	  gomp_debug (0, "Thread %d: new event: %p\n", thr->ts.team_id, task);
 	}
       thr->task = task;
       if (cpyfn)
@@ -1372,7 +1372,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
 				 child_task, MEMMODEL_RELAXED);
 	  --team->task_detach_count;
 	  gomp_debug (0, "thread %d: found task with fulfilled event %p\n",
-		      thr->ts.team_id, &child_task);
+		      thr->ts.team_id, child_task);
 
 	  if (to_free)
 	    {
@@ -2470,8 +2470,12 @@ omp_fulfill_event (omp_event_handle_t event)
       gomp_sem_post (&task->taskgroup->taskgroup_sem);
     }
   if (team && team->nthreads > team->task_running_count)
-    gomp_team_barrier_wake (&team->barrier, 1);
-  gomp_mutex_unlock (&team->task_lock);
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      gomp_team_barrier_wake (&team->barrier, 1);
+    }
+  else
+    gomp_mutex_unlock (&team->task_lock);
 }
 
 ialias (omp_fulfill_event)
-- 
2.8.1



* Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-01-21 19:33 [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738] Kwok Cheung Yeung
  2021-01-21 22:46 ` Kwok Cheung Yeung
@ 2021-01-29 15:03 ` Jakub Jelinek
  2021-02-12 14:36   ` H.J. Lu
  2021-02-19 19:12   ` [WIP] " Kwok Cheung Yeung
  1 sibling, 2 replies; 22+ messages in thread
From: Jakub Jelinek @ 2021-01-29 15:03 UTC
  To: Kwok Cheung Yeung; +Cc: GCC Patches

On Thu, Jan 21, 2021 at 07:33:34PM +0000, Kwok Cheung Yeung wrote:
> The detach support clearly needs more work, but is this particular patch
> okay for trunk?

Sorry for the delay.

I'm afraid it is far from being ready.

> @@ -2402,17 +2437,41 @@ ialias (omp_in_final)
>  void
>  omp_fulfill_event (omp_event_handle_t event)
>  {
> -  gomp_sem_t *sem = (gomp_sem_t *) event;
> +  struct gomp_task *task = (struct gomp_task *) event;
> +  struct gomp_task *parent = task->parent;
>    struct gomp_thread *thr = gomp_thread ();
>    struct gomp_team *team = thr ? thr->ts.team : NULL;
>  
> -  if (gomp_sem_getcount (sem) > 0)
> -    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", sem);
> +  if (gomp_sem_getcount (&task->completion_sem) > 0)
> +    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", task);

As written earlier, the intent of omp_fulfill_event is that it should be
callable from anywhere, not necessarily one of the threads in the team.
The application could have other threads (often called unshackled threads)
from which it would call it, or just some other parallel or whatever else,
as long as it is not racy to pass in the omp_event_handle_t to there.
So,
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr ? thr->ts.team : NULL;
is incorrect, it will give you the team of the current thread, rather than
the team of the task to be fulfilled.

It can also crash if team is NULL, which will happen any time
this is called outside of a parallel.  Just try (should go into testsuite
too):
#include <omp.h>

int
main ()
{
  omp_event_handle_t ev;
  #pragma omp task detach (ev)
  omp_fulfill_event (ev);
  return 0;
}

Additionally, there is an important difference between fulfill for
included tasks and for non-included tasks, for the former there is no team
or anything to care about, for the latter there is a team and one needs to
take the task_lock, but at that point it can do pretty much everything in
omp_fulfill_event rather than handling it elsewhere.

So, what I'm suggesting is:

Replace
  bool detach;
  gomp_sem_t completion_sem;
with
  struct gomp_task_detach *detach;
and add struct gomp_task_detach that would contain everything that will be
needed (indirect so that we don't waste space for it in every task, but only
for those that have detach clause).
We need:
1) some way to tell if it is an included task or not
2) for included tasks the gomp_sem_t completion_sem
(and nothing but 1) and 2) for those),
3) struct gomp_team * for non-included tasks
4) some way to find out if the task has finished and is just waiting for
fulfill event (perhaps your GOMP_TASK_DETACHED is ok for that)
5) some way to find out if the task has been fulfilled already
(gomp_sem_t for that seems an overkill though)

1) could be done through the struct gomp_team *team; member,
set it to NULL in included tasks (no matter if they are in some team or not)
and to non-NULL team of the task (non-included tasks must have a team).

And I don't see the point of task_detach_queue if we can handle the
dependers etc. all in omp_fulfill_event, which I think we can if we take the
task_lock.

So, I think omp_fulfill_event should look at the task->detach it got,
if task->detach->team is NULL, it is included task, GOMP_task should have
initialized task->detach->completion_sem and omp_fulfill_event should just
gomp_sem_post it and that is all, GOMP_task for included task needs to
gomp_sem_wait after it finishes before it returns.

Otherwise, take the team's task_lock, and look at whether the task is still
running, in that case just set the bool that it has been fulfilled (or
whatever way of signalling 5), perhaps it can be say clearing task->detach
pointer).  When creating non-included tasks in GOMP_task with detach clause
through gomp_malloc, it would add the size needed for struct
gomp_task_detach.
But if the task is already in GOMP_TASK_DETACHED state, instead we need
while holding the task_lock do everything that would have been done normally
on task finish, but we've skipped it because it hasn't been fulfilled.
Including the waking/sem_posts when something could be waiting on that task.
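
The flow proposed above can be sketched as a small single-threaded model. 
Everything in it is hypothetical (task_model, team_model, the state names and 
the helpers are not libgomp code), locking is only marked by comments, and 
"finishing" a task is reduced to decrementing a counter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum task_state { TASK_RUNNING, TASK_DETACHED, TASK_FINISHED };

/* Reduced team: just the number of unfinished tasks.  */
struct team_model
{
  int task_count;
};

struct task_model
{
  struct team_model *team;  /* NULL models an included task */
  enum task_state state;
  bool fulfilled;           /* event fulfilled while body still running */
  int completion_posts;     /* stands in for posting completion_sem */
};

/* Stand-in for the normal end-of-task bookkeeping (dependers, parent,
   taskgroup, wake-ups).  */
static void
finish_task (struct task_model *task)
{
  task->state = TASK_FINISHED;
  --task->team->task_count;
}

/* Model of the proposed fulfill logic.  */
static void
fulfill_event (struct task_model *task)
{
  if (task->team == NULL)
    {
      /* Included task: the creating thread is waiting on the semaphore.  */
      ++task->completion_posts;
      return;
    }
  /* The real code would take the team's task lock here.  */
  if (task->state == TASK_DETACHED)
    finish_task (task);      /* body already done: finish it now */
  else
    task->fulfilled = true;  /* body still running: just record it */
  /* ... and release the task lock here.  */
}

/* Called when the body of a non-included task returns, lock held.  */
static void
task_body_done (struct task_model *task)
{
  if (task->fulfilled)
    finish_task (task);
  else
    task->state = TASK_DETACHED;
}
```

The two orderings (fulfill before the body returns, or after) both end with 
exactly one call to finish_task, which is the property the design relies on.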

Do you agree with this, or see some reason why this can't work?

And testsuite should include also cases where we wait for the tasks with
detach clause to be fulfilled at the end of taskgroup (i.e. need to cover
all of taskwait, taskgroup end and barrier).

	Jakub



* Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-01-29 15:03 ` Jakub Jelinek
@ 2021-02-12 14:36   ` H.J. Lu
  2021-02-19 19:12   ` [WIP] " Kwok Cheung Yeung
  1 sibling, 0 replies; 22+ messages in thread
From: H.J. Lu @ 2021-02-12 14:36 UTC
  To: Jakub Jelinek; +Cc: Kwok Cheung Yeung, GCC Patches

On Fri, Jan 29, 2021 at 7:53 AM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jan 21, 2021 at 07:33:34PM +0000, Kwok Cheung Yeung wrote:
> > The detach support clearly needs more work, but is this particular patch
> > okay for trunk?
>
> Sorry for the delay.
>
> I'm afraid it is far from being ready.
>
> > @@ -2402,17 +2437,41 @@ ialias (omp_in_final)
> >  void
> >  omp_fulfill_event (omp_event_handle_t event)
> >  {
> > -  gomp_sem_t *sem = (gomp_sem_t *) event;
> > +  struct gomp_task *task = (struct gomp_task *) event;
> > +  struct gomp_task *parent = task->parent;
> >    struct gomp_thread *thr = gomp_thread ();
> >    struct gomp_team *team = thr ? thr->ts.team : NULL;
> >
> > -  if (gomp_sem_getcount (sem) > 0)
> > -    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", sem);
> > +  if (gomp_sem_getcount (&task->completion_sem) > 0)
> > +    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", task);
>
> As written earlier, the intent of omp_fulfill_event is that it should be
> callable from anywhere, not necessarily one of the threads in the team.
> The application could have other threads (often called unshackled threads)
> from which it would call it, or just some other parallel or whatever else,
> as long as it is not racy to pass in the omp_event_handle_t to there.
> So,
>    struct gomp_thread *thr = gomp_thread ();
>    struct gomp_team *team = thr ? thr->ts.team : NULL;
> is incorrect, it will give you the team of the current thread, rather than
> the team of the task to be fulfilled.
>
> It can also crash if team is NULL, which will happen any time
> this is called outside of a parallel.  Just try (should go into testsuite
> too):
> #include <omp.h>
>
> int
> main ()
> {
>   omp_event_handle_t ev;
>   #pragma omp task detach (ev)
>   omp_fulfill_event (ev);
>   return 0;
> }
>
> Additionally, there is an important difference between fulfill for
> included tasks and for non-included tasks, for the former there is no team
> or anything to care about, for the latter there is a team and one needs to
> take the task_lock, but at that point it can do pretty much everything in
> omp_fulfill_event rather than handling it elsewhere.
>
> So, what I'm suggesting is:
>
> Replace
>   bool detach;
>   gomp_sem_t completion_sem;
> with
>   struct gomp_task_detach *detach;
> and add struct gomp_task_detach that would contain everything that will be
> needed (indirect so that we don't waste space for it in every task, but only
> for those that have detach clause).
> We need:
> 1) some way to tell if it is an included task or not
> 2) for included tasks the gomp_sem_t completion_sem
> (and nothing but 1) and 2) for those),
> 3) struct gomp_team * for non-included tasks
> 4) some way to find out if the task has finished and is just waiting for
> fulfill event (perhaps your GOMP_TASK_DETACHED is ok for that)
> 5) some way to find out if the task has been fulfilled already
> (gomp_sem_t for that seems an overkill though)
>
> 1) could be done through the struct gomp_team *team; member,
> set it to NULL in included tasks (no matter if they are in some team or not)
> and to non-NULL team of the task (non-included tasks must have a team).
>
> And I don't see the point of task_detach_queue if we can handle the
> dependers etc. all in omp_fulfill_event, which I think we can if we take the
> task_lock.
>
> So, I think omp_fulfill_event should look at the task->detach it got,
> if task->detach->team is NULL, it is included task, GOMP_task should have
> initialized task->detach->completion_sem and omp_fulfill_event should just
> gomp_sem_post it and that is all, GOMP_task for included task needs to
> gomp_sem_wait after it finishes before it returns.
>
> Otherwise, take the team's task_lock, and look at whether the task is still
> running, in that case just set the bool that it has been fulfilled (or
> whatever way of signalling 5), perhaps it can be say clearing task->detach
> pointer).  When creating non-included tasks in GOMP_task with detach clause
> through gomp_malloc, it would add the size needed for struct
> gomp_task_detach.
> But if the task is already in GOMP_TASK_DETACHED state, instead we need
> while holding the task_lock do everything that would have been done normally
> on task finish, but we've skipped it because it hasn't been fulfilled.
> Including the waking/sem_posts when something could be waiting on that task.
>
> Do you agree with this, or see some reason why this can't work?
>
> And testsuite should include also cases where we wait for the tasks with
> detach clause to be fulfilled at the end of taskgroup (i.e. need to cover
> all of taskwait, taskgroup end and barrier).
>

task-detach-6.f90 should be disabled for now.  It has been blocking my testers
for weeks.

--
H.J.


* [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-01-29 15:03 ` Jakub Jelinek
  2021-02-12 14:36   ` H.J. Lu
@ 2021-02-19 19:12   ` Kwok Cheung Yeung
  2021-02-22 13:49     ` Jakub Jelinek
  2021-02-23 21:43     ` Kwok Cheung Yeung
  1 sibling, 2 replies; 22+ messages in thread
From: Kwok Cheung Yeung @ 2021-02-19 19:12 UTC
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 5275 bytes --]

Hello

Sorry for taking so long in replying.

On 29/01/2021 3:03 pm, Jakub Jelinek wrote:
> It can also crash if team is NULL, which will happen any time
> this is called outside of a parallel.  Just try (should go into testsuite
> too):
> #include <omp.h>
> 
> int
> main ()
> {
>    omp_event_handle_t ev;
>    #pragma omp task detach (ev)
>    omp_fulfill_event (ev);
>    return 0;
> }
>

I have included this as task-detach-11.{c|f90}.

> Additionally, there is an important difference between fulfill for
> included tasks and for non-included tasks, for the former there is no team
> or anything to care about, for the latter there is a team and one needs to
> take the task_lock, but at that point it can do pretty much everything in
> omp_fulfill_event rather than handling it elsewhere.
> 
> So, what I'm suggesting is:
> 
> Replace
>    bool detach;
>    gomp_sem_t completion_sem;
> with
>    struct gomp_task_detach *detach;
> and add struct gomp_task_detach that would contain everything that will be
> needed (indirect so that we don't waste space for it in every task, but only
> for those that have detach clause).
> We need:
> 1) some way to tell if it is an included task or not
> 2) for included tasks the gomp_sem_t completion_sem
> (and nothing but 1) and 2) for those),
> 3) struct gomp_team * for non-included tasks
> 4) some way to find out if the task has finished and is just waiting for
> fulfill event (perhaps your GOMP_TASK_DETACHED is ok for that)
> 5) some way to find out if the task has been fulfilled already
> (gomp_sem_t for that seems an overkill though)
> 
> 1) could be done through the struct gomp_team *team; member,
> set it to NULL in included tasks (no matter if they are in some team or not)
> and to non-NULL team of the task (non-included tasks must have a team).
> 

I have opted for a union of completion_sem (for tasks that are undeferred) and a
struct gomp_team *detach_team (for deferred tasks) that holds the team if the
completion event has not yet been fulfilled, or NULL if it has. I don't see the
point of having an indirection to the union here since the union is just the
size of a pointer, so it might as well be inlined.
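
As a rough illustration of the layout described above (not the libgomp definition), the inlined union can be sketched as follows. Every sketch_* name is hypothetical, and the semaphore is assumed to be a plain int, which is what makes the union pointer-sized in this sketch; on a target where the semaphore type is larger than a pointer the union would be larger as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch only.  The real gomp_sem_t and
   struct gomp_team are defined inside libgomp and vary by target.  */
typedef int sketch_sem_t;
struct sketch_team;

enum sketch_task_kind
{
  SKETCH_TASK_UNDEFERRED,
  SKETCH_TASK_WAITING,
  SKETCH_TASK_DETACHED
};

/* The two members are never live at the same time: the task kind
   selects which one is meaningful.  */
union sketch_detach
{
  /* Meaningful only when kind == SKETCH_TASK_UNDEFERRED.  */
  sketch_sem_t completion_sem;
  /* Meaningful for every other kind.  Non-NULL while the completion
     event is still unfulfilled, NULL once it has been fulfilled.  */
  struct sketch_team *detach_team;
};

struct sketch_task
{
  enum sketch_task_kind kind;
  union sketch_detach detach;
};

/* Accessor that enforces the discriminant before reading the pointer.  */
static struct sketch_team *
sketch_team_of (const struct sketch_task *task)
{
  assert (task->kind != SKETCH_TASK_UNDEFERRED);
  return task->detach.detach_team;
}

/* Size of the inlined union, for comparison with sizeof (void *).  */
static size_t
sketch_detach_size (void)
{
  return sizeof (union sketch_detach);
}
```

Comparing sketch_detach_size () with sizeof (void *) shows whether inlining costs more than the indirection would on a given target.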

> And I don't see the point of task_detach_queue if we can handle the
> dependers etc. all in omp_fulfill_event, which I think we can if we take the
> task_lock.

I have removed the task_detach_queue. The team barrier, taskwait and 
taskgroup_end now just set the task kind to GOMP_TASK_DETACHED without 
decrementing the task_count if a task finishes with detach_team non-NULL.

> So, I think omp_fulfill_event should look at the task->detach it got,
> if task->detach->team is NULL, it is included task, GOMP_task should have
> initialized task->detach->completion_sem and omp_fulfill_event should just
> gomp_sem_post it and that is all, GOMP_task for included task needs to
> gomp_sem_wait after it finishes before it returns.

omp_fulfill_event now posts completion_sem if the task kind is
GOMP_TASK_UNDEFERRED, and GOMP_task waits for it. Since the task is executed
within GOMP_task, it already knows if the task has a detach clause or not, so we
do not need to store that information in gomp_task.
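
The undeferred path can be modelled in a few lines. This is a single-threaded sketch under stated assumptions, not the libgomp code: the semaphore is a plain counter, SKETCH_FLAG_DETACH stands in for GOMP_TASK_FLAG_DETACH, and all sketch_* names are hypothetical. It shows that the flags passed in are enough to decide whether to wait, so the task itself needs no separate detach field.

```c
#include <assert.h>

/* Hypothetical stand-in for GOMP_TASK_FLAG_DETACH.  */
#define SKETCH_FLAG_DETACH 1u

/* Single-threaded model of a counting semaphore, not a real one.  */
struct sketch_sem { int count; };

struct sketch_task { struct sketch_sem completion_sem; };

typedef void (*sketch_fn) (struct sketch_task *);

/* Models omp_fulfill_event for an undeferred task: fulfilling the same
   event twice is an error, otherwise post the semaphore.  */
static void
sketch_fulfill_event (struct sketch_task *task)
{
  assert (task->completion_sem.count == 0);
  task->completion_sem.count++;
}

/* Models a semaphore wait.  A real wait would block here; in this
   single-threaded model an unposted semaphore is reported as -1.  */
static int
sketch_sem_wait (struct sketch_sem *sem)
{
  if (sem->count == 0)
    return -1;
  sem->count--;
  return 0;
}

/* Task body that fulfills its own event, as in task-detach-11.  */
static void
sketch_body_fulfills (struct sketch_task *task)
{
  sketch_fulfill_event (task);
}

/* Task body that never fulfills the event.  */
static void
sketch_body_plain (struct sketch_task *task)
{
  (void) task;
}

/* Models the undeferred path: the task lives on the stack, the body
   runs immediately, and the flags alone decide whether to wait.  */
static int
sketch_run_undeferred (sketch_fn fn, unsigned flags)
{
  struct sketch_task task = { { 0 } };

  fn (&task);

  if ((flags & SKETCH_FLAG_DETACH) != 0)
    return sketch_sem_wait (&task.completion_sem);
  return 0;
}
```

Calling sketch_run_undeferred (sketch_body_fulfills, SKETCH_FLAG_DETACH) returns 0, while passing sketch_body_plain with the same flag returns -1, the point at which a real implementation would block until another thread fulfilled the event.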

> Otherwise, take the team's task_lock, and look at whether the task is still
> running, in that case just set the bool that it has been fulfilled (or
> whatever way of signalling 5), perhaps it can be say clearing task->detach
> pointer).

detach_team is now set to NULL when the event is fulfilled if the task has not 
started yet or is still executing (checked by the kind). In that case, when the 
task finishes executing, it behaves just like a task without detach would and 
finishes normally.
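
The two orderings for a deferred task (fulfilled before it finishes, or finished before it is fulfilled) can be sketched as a small state machine. This is a simplified model rather than the patch itself: the sketch_* names are hypothetical, the task_lock is omitted, and the dependency, parent and taskgroup bookkeeping is reduced to the two counters. Unlike the patch, which frees the task after a late fulfill, the sketch clears detach_team so that a second fulfill can be detected.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_team
{
  int task_count;         /* tasks not yet fully completed */
  int task_detach_count;  /* finished tasks awaiting their event */
};

enum sketch_kind
{
  SKETCH_RUNNING,
  SKETCH_DETACHED,
  SKETCH_FINISHED
};

struct sketch_task
{
  enum sketch_kind kind;
  /* Non-NULL while the completion event is unfulfilled.  */
  struct sketch_team *detach_team;
};

/* Called when the task body returns (team barrier, taskwait or
   taskgroup end in the real code).  */
static void
sketch_task_finished (struct sketch_task *task, struct sketch_team *team)
{
  if (task->detach_team != NULL)
    {
      /* Event still pending: park the task; task_count is untouched.  */
      assert (task->detach_team == team);
      task->kind = SKETCH_DETACHED;
      team->task_detach_count++;
      return;
    }
  /* No pending event: finish like an ordinary task.  */
  task->kind = SKETCH_FINISHED;
  team->task_count--;
}

/* Returns 0 on success, -1 if the event was already fulfilled.  */
static int
sketch_fulfill (struct sketch_task *task)
{
  struct sketch_team *team = task->detach_team;

  if (team == NULL)
    return -1;

  /* Sketch-only: the patch frees a finished task instead of clearing.  */
  task->detach_team = NULL;

  if (task->kind != SKETCH_DETACHED)
    /* Not finished yet: the task will now finish normally.  */
    return 0;

  /* Already finished: do the completion work that was skipped.  */
  task->kind = SKETCH_FINISHED;
  team->task_count--;
  team->task_detach_count--;
  return 0;
}
```

In either order task_count drops exactly once per task, and task_detach_count is non-zero only between a finish and a later fulfill.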

> When creating non-included tasks in GOMP_task with detach clause
> through gomp_malloc, it would add the size needed for struct
> gomp_task_detach.

Not necessary with the inlined union.

> But if the task is already in GOMP_TASK_DETACHED state, instead we need
> while holding the task_lock do everything that would have been done normally
> on task finish, but we've skipped it because it hasn't been fulfilled.
> Including the waking/sem_posts when something could be waiting on that task.
> 
> Do you agree with this, or see some reason why this can't work?

The main problem I see is this code in gomp_barrier_handle_tasks:

	  if (--team->task_count == 0
	      && gomp_team_barrier_waiting_for_tasks (&team->barrier))
	    {
	      gomp_team_barrier_done (&team->barrier, state);

We do not have access to state from within omp_fulfill_event, so how should this 
be handled?

> And testsuite should include also cases where we wait for the tasks with
> detach clause to be fulfilled at the end of taskgroup (i.e. need to cover
> all of taskwait, taskgroup end and barrier).

I have changed task-detach-[56].* to test the barrier, task-detach-[78].* to
test taskwait, and task-detach-{9,10}.* to test taskgroup (in each pair, the
first test has no target construct and the second has one).

I have included the current state of my patch. All task-detach-* tests pass when 
executed without offloading or with offloading to GCN, but with offloading to 
Nvidia, task-detach-6.* hangs consistently but everything else passes (probably 
because of the missing gomp_team_barrier_done?).

Kwok

[-- Attachment #2: 0001-openmp-Fix-intermittent-hanging-of-task-detach-6-lib.patch --]
[-- Type: text/plain, Size: 39859 bytes --]

From 31a5c736910036364fd1f0f3cf7ac28437864a27 Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Thu, 21 Jan 2021 05:38:47 -0800
Subject: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp
 tests [PR98738]

This adds support for the task detach clause to taskwait and taskgroup, and
simplifies the handling of the detach clause by moving most of the extra
handling required for detach tasks to omp_fulfill_event.

2021-02-19  Kwok Cheung Yeung  <kcy@codesourcery.com>

	libgomp/

	PR libgomp/98738
	* libgomp.h (enum gomp_task_kind): Add GOMP_TASK_DETACHED.
	(struct gomp_task): Replace detach and completion_sem fields with
	union containing completion_sem and detach_team.
	(struct gomp_team): Remove task_detach_queue.
	* task.c: Include assert.h.
	(gomp_init_task): Initialize detach_team field.
	(task_fulfilled_p): Delete.
	(GOMP_task): Use address of task as the event handle.  Remove
	initialization of detach field.  Initialize detach_team field for
	deferred tasks.
	(gomp_barrier_handle_tasks): Remove handling of task_detach_queue.
	Set kind of suspended detach task to GOMP_TASK_DETACHED and
	decrement task_running_count.  Move finish_cancelled block out of
	else branch.
	(GOMP_taskwait): Handle tasks with completion events that have not
	been fulfilled.
	(GOMP_taskgroup_end): Likewise.
	(omp_fulfill_event): Use address of task as event handle.  Post to
	completion_sem for undeferred tasks.  Clear detach_team if task
	has not finished.  For finished tasks, handle post-execution tasks,
	post to taskwait_sem and taskgroup_sem if necessary, call
	gomp_team_barrier_wake if necessary, and free task.
	* testsuite/libgomp.c-c++-common/task-detach-1.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-2.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-3.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-4.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-5.c: Fix formatting.
	Change data-sharing of detach events on enclosing parallel to private.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Likewise.  Remove
	taskwait directive.
	* testsuite/libgomp.c-c++-common/task-detach-7.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-8.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-9.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-10.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-11.c: New.
	* testsuite/libgomp.fortran/task-detach-1.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-2.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-3.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-4.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-5.f90: Fix formatting.
	Change data-sharing of detach events on enclosing parallel to private.
	* testsuite/libgomp.fortran/task-detach-6.f90: Likewise.  Remove
	taskwait directive.
	* testsuite/libgomp.fortran/task-detach-7.f90: New.
	* testsuite/libgomp.fortran/task-detach-8.f90: New.
	* testsuite/libgomp.fortran/task-detach-9.f90: New.
	* testsuite/libgomp.fortran/task-detach-10.f90: New.
	* testsuite/libgomp.fortran/task-detach-11.f90: New.
---
 libgomp/libgomp.h                                  |  18 +-
 libgomp/task.c                                     | 225 +++++++++++++--------
 libgomp/team.c                                     |   2 -
 .../testsuite/libgomp.c-c++-common/task-detach-1.c |   4 +-
 .../libgomp.c-c++-common/task-detach-10.c          |  45 +++++
 .../libgomp.c-c++-common/task-detach-11.c          |  13 ++
 .../testsuite/libgomp.c-c++-common/task-detach-2.c |   6 +-
 .../testsuite/libgomp.c-c++-common/task-detach-3.c |   6 +-
 .../testsuite/libgomp.c-c++-common/task-detach-4.c |   4 +-
 .../testsuite/libgomp.c-c++-common/task-detach-5.c |   8 +-
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |   8 +-
 .../testsuite/libgomp.c-c++-common/task-detach-7.c |  45 +++++
 .../testsuite/libgomp.c-c++-common/task-detach-8.c |  47 +++++
 .../testsuite/libgomp.c-c++-common/task-detach-9.c |  43 ++++
 .../testsuite/libgomp.fortran/task-detach-1.f90    |   4 +-
 .../testsuite/libgomp.fortran/task-detach-10.f90   |  44 ++++
 .../testsuite/libgomp.fortran/task-detach-11.f90   |  13 ++
 .../testsuite/libgomp.fortran/task-detach-2.f90    |   6 +-
 .../testsuite/libgomp.fortran/task-detach-3.f90    |   6 +-
 .../testsuite/libgomp.fortran/task-detach-4.f90    |   4 +-
 .../testsuite/libgomp.fortran/task-detach-5.f90    |   8 +-
 .../testsuite/libgomp.fortran/task-detach-6.f90    |  16 +-
 .../testsuite/libgomp.fortran/task-detach-7.f90    |  42 ++++
 .../testsuite/libgomp.fortran/task-detach-8.f90    |  45 +++++
 .../testsuite/libgomp.fortran/task-detach-9.f90    |  41 ++++
 25 files changed, 573 insertions(+), 130 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-10.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-11.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-7.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-8.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-9.f90

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index b4d0c93..90a6f02 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -481,7 +481,10 @@ enum gomp_task_kind
      but not yet completed.  Once that completes, they will be readded
      into the queues as GOMP_TASK_WAITING in order to perform the var
      unmapping.  */
-  GOMP_TASK_ASYNC_RUNNING
+  GOMP_TASK_ASYNC_RUNNING,
+  /* Task that has finished executing but is waiting for its
+     completion event to be fulfilled.  */
+  GOMP_TASK_DETACHED
 };
 
 struct gomp_task_depend_entry
@@ -545,8 +548,14 @@ struct gomp_task
      entries and the gomp_task in which they reside.  */
   struct priority_node pnode[3];
 
-  bool detach;
-  gomp_sem_t completion_sem;
+  union {
+    /* Valid only if kind == GOMP_TASK_UNDEFERRED.  */
+    gomp_sem_t completion_sem;
+    /* Valid for other values of kind.  Set to the team that executes the
+       task if the task is detached and the completion event has yet to be
+       fulfilled.  */
+    struct gomp_team *detach_team;
+  };
 
   struct gomp_task_icv icv;
   void (*fn) (void *);
@@ -688,8 +697,7 @@ struct gomp_team
   int work_share_cancelled;
   int team_cancelled;
 
-  /* Tasks waiting for their completion event to be fulfilled.  */
-  struct priority_queue task_detach_queue;
+  /* Number of tasks waiting for their completion event to be fulfilled.  */
   unsigned int task_detach_count;
 
   /* This array contains structures for implicit tasks.  */
diff --git a/libgomp/task.c b/libgomp/task.c
index b242e7c..399e18b 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -29,6 +29,7 @@
 #include "libgomp.h"
 #include <stdlib.h>
 #include <string.h>
+#include <assert.h>
 #include "gomp-constants.h"
 
 typedef struct gomp_task_depend_entry *hash_entry_type;
@@ -86,7 +87,7 @@ gomp_init_task (struct gomp_task *task, struct gomp_task *parent_task,
   task->dependers = NULL;
   task->depend_hash = NULL;
   task->depend_count = 0;
-  task->detach = false;
+  task->detach_team = NULL;
 }
 
 /* Clean up a task, after completing it.  */
@@ -327,12 +328,6 @@ gomp_task_handle_depend (struct gomp_task *task, struct gomp_task *parent,
     }
 }
 
-static bool
-task_fulfilled_p (struct gomp_task *task)
-{
-  return gomp_sem_getcount (&task->completion_sem) > 0;
-}
-
 /* Called when encountering an explicit task directive.  If IF_CLAUSE is
    false, then we must not delay in executing the task.  If UNTIED is true,
    then the task may be executed by any member of the team.
@@ -417,13 +412,13 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
 
       if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
 	{
-	  task.detach = true;
 	  gomp_sem_init (&task.completion_sem, 0);
-	  *(void **) detach = &task.completion_sem;
+	  *(void **) detach = &task;
 	  if (data)
-	    *(void **) data = &task.completion_sem;
+	    *(void **) data = &task;
 
-	  gomp_debug (0, "New event: %p\n", &task.completion_sem);
+	  gomp_debug (0, "Thread %d: new event: %p\n",
+		      thr->ts.team_id, &task);
 	}
 
       if (thr->task)
@@ -443,7 +438,7 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       else
 	fn (data);
 
-      if (task.detach && !task_fulfilled_p (&task))
+      if ((flags & GOMP_TASK_FLAG_DETACH) != 0 && detach)
 	gomp_sem_wait (&task.completion_sem);
 
       /* Access to "children" is normally done inside a task_lock
@@ -481,18 +476,17 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
 		      & ~(uintptr_t) (arg_align - 1));
       gomp_init_task (task, parent, gomp_icv (false));
       task->priority = priority;
-      task->kind = GOMP_TASK_UNDEFERRED;
       task->in_tied_task = parent->in_tied_task;
       task->taskgroup = taskgroup;
       if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
 	{
-	  task->detach = true;
-	  gomp_sem_init (&task->completion_sem, 0);
-	  *(void **) detach = &task->completion_sem;
+	  task->detach_team = team;
+
+	  *(void **) detach = task;
 	  if (data)
-	    *(void **) data = &task->completion_sem;
+	    *(void **) data = task;
 
-	  gomp_debug (0, "New event: %p\n", &task->completion_sem);
+	  gomp_debug (0, "Thread %d: new event: %p\n", thr->ts.team_id, task);
 	}
       thr->task = task;
       if (cpyfn)
@@ -1362,27 +1356,6 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
     {
       bool cancelled = false;
 
-      /* Look for a queued detached task with a fulfilled completion event
-	 that is ready to finish.  */
-      child_task = priority_queue_find (PQ_TEAM, &team->task_detach_queue,
-					task_fulfilled_p);
-      if (child_task)
-	{
-	  priority_queue_remove (PQ_TEAM, &team->task_detach_queue,
-				 child_task, MEMMODEL_RELAXED);
-	  --team->task_detach_count;
-	  gomp_debug (0, "thread %d: found task with fulfilled event %p\n",
-		      thr->ts.team_id, &child_task->completion_sem);
-
-	if (to_free)
-	  {
-	    gomp_finish_task (to_free);
-	    free (to_free);
-	    to_free = NULL;
-	  }
-	  goto finish_cancelled;
-	}
-
       if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))
 	{
 	  bool ignored;
@@ -1450,43 +1423,43 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
-	  if (child_task->detach && !task_fulfilled_p (child_task))
+	  if (child_task->detach_team)
 	    {
-	      priority_queue_insert (PQ_TEAM, &team->task_detach_queue,
-				     child_task, child_task->priority,
-				     PRIORITY_INSERT_END,
-				     false, false);
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
 	      ++team->task_detach_count;
-	      gomp_debug (0, "thread %d: queueing task with event %p\n",
-			  thr->ts.team_id, &child_task->completion_sem);
+	      --team->task_running_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in team barrier\n",
+			  thr->ts.team_id, child_task);
 	      child_task = NULL;
+	      continue;
 	    }
-	  else
+
+	 finish_cancelled:;
+	  size_t new_tasks
+	    = gomp_task_run_post_handle_depend (child_task, team);
+	  gomp_task_run_post_remove_parent (child_task);
+	  gomp_clear_parent (&child_task->children_queue);
+	  gomp_task_run_post_remove_taskgroup (child_task);
+	  to_free = child_task;
+	  if (!cancelled)
+	    team->task_running_count--;
+	  child_task = NULL;
+	  if (new_tasks > 1)
 	    {
-	     finish_cancelled:;
-	      size_t new_tasks
-		= gomp_task_run_post_handle_depend (child_task, team);
-	      gomp_task_run_post_remove_parent (child_task);
-	      gomp_clear_parent (&child_task->children_queue);
-	      gomp_task_run_post_remove_taskgroup (child_task);
-	      to_free = child_task;
-	      child_task = NULL;
-	      if (!cancelled)
-		team->task_running_count--;
-	      if (new_tasks > 1)
-		{
-		  do_wake = team->nthreads - team->task_running_count;
-		  if (do_wake > new_tasks)
-		    do_wake = new_tasks;
-		}
-	      if (--team->task_count == 0
-		  && gomp_team_barrier_waiting_for_tasks (&team->barrier))
-		{
-		  gomp_team_barrier_done (&team->barrier, state);
-		  gomp_mutex_unlock (&team->task_lock);
-		  gomp_team_barrier_wake (&team->barrier, 0);
-		  gomp_mutex_lock (&team->task_lock);
-		}
+	      do_wake = team->nthreads - team->task_running_count;
+	      if (do_wake > new_tasks)
+		do_wake = new_tasks;
+	    }
+	  if (--team->task_count == 0
+	      && gomp_team_barrier_waiting_for_tasks (&team->barrier))
+	    {
+	      gomp_team_barrier_done (&team->barrier, state);
+	      gomp_mutex_unlock (&team->task_lock);
+	      gomp_team_barrier_wake (&team->barrier, 0);
+	      gomp_mutex_lock (&team->task_lock);
 	    }
 	}
     }
@@ -1559,7 +1532,8 @@ GOMP_taskwait (void)
       else
 	{
 	/* All tasks we are waiting for are either running in other
-	   threads, or they are tasks that have not had their
+	   threads, are detached and waiting for the completion event to be
+	   fulfilled, or they are tasks that have not had their
 	   dependencies met (so they're not even in the queue).  Wait
 	   for them.  */
 	  if (task->taskwait == NULL)
@@ -1614,6 +1588,19 @@ GOMP_taskwait (void)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
+	  if (child_task->detach_team)
+	    {
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
+	      ++team->task_detach_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in taskwait\n",
+			  thr->ts.team_id, child_task);
+	      child_task = NULL;
+	      continue;
+	    }
+
 	 finish_cancelled:;
 	  size_t new_tasks
 	    = gomp_task_run_post_handle_depend (child_task, team);
@@ -2069,6 +2056,19 @@ GOMP_taskgroup_end (void)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
+	  if (child_task->detach_team)
+	    {
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
+	      ++team->task_detach_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in taskgroup\n",
+			  thr->ts.team_id, child_task);
+	      child_task = NULL;
+	      continue;
+	    }
+
 	 finish_cancelled:;
 	  size_t new_tasks
 	    = gomp_task_run_post_handle_depend (child_task, team);
@@ -2402,17 +2402,80 @@ ialias (omp_in_final)
 void
 omp_fulfill_event (omp_event_handle_t event)
 {
-  gomp_sem_t *sem = (gomp_sem_t *) event;
-  struct gomp_thread *thr = gomp_thread ();
-  struct gomp_team *team = thr ? thr->ts.team : NULL;
+  struct gomp_task *task = (struct gomp_task *) event;
+  if (task->kind == GOMP_TASK_UNDEFERRED)
+  {
+    if (gomp_sem_getcount (&task->completion_sem) > 0)
+      gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", task);
+
+    gomp_debug (0, "omp_fulfill_event: %p event for undeferred task\n", task);
+    gomp_sem_post (&task->completion_sem);
+    return;
+  }
 
-  if (gomp_sem_getcount (sem) > 0)
-    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", sem);
+  struct gomp_team *team = task->detach_team;
+  if (!team)
+    gomp_fatal ("omp_fulfill_event: %p event is invalid or has already "
+		"been fulfilled!\n", task);
 
-  gomp_debug (0, "omp_fulfill_event: %p\n", sem);
-  gomp_sem_post (sem);
-  if (team)
-    gomp_team_barrier_wake (&team->barrier, 1);
+  gomp_mutex_lock (&team->task_lock);
+  if (task->kind != GOMP_TASK_DETACHED)
+    {
+      /* The task has not finished running yet.  */
+      gomp_debug (0,
+		  "omp_fulfill_event: %p event fulfilled for unfinished "
+		  "task\n", task);
+      task->detach_team = NULL;
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+
+  gomp_debug (0, "omp_fulfill_event: %p event fulfilled for finished task\n",
+	      task);
+  size_t new_tasks = gomp_task_run_post_handle_depend (task, team);
+  gomp_task_run_post_remove_parent (task);
+  gomp_clear_parent (&task->children_queue);
+  gomp_task_run_post_remove_taskgroup (task);
+  team->task_count--;
+  team->task_detach_count--;
+
+  /* Wake up any threads that may be waiting for the detached task
+     to complete.  */
+  struct gomp_task *parent = task->parent;
+
+  if (parent && parent->taskwait)
+    {
+      if (parent->taskwait->in_taskwait)
+	{
+	  parent->taskwait->in_taskwait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+      else if (parent->taskwait->in_depend_wait)
+	{
+	  parent->taskwait->in_depend_wait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+    }
+  if (task->taskgroup && task->taskgroup->in_taskgroup_wait)
+    {
+      task->taskgroup->in_taskgroup_wait = false;
+      gomp_sem_post (&task->taskgroup->taskgroup_sem);
+    }
+
+  int do_wake = 0;
+  if (new_tasks > 1)
+    {
+      do_wake = team->nthreads - team->task_running_count;
+      if (do_wake > new_tasks)
+	do_wake = new_tasks;
+    }
+
+  gomp_mutex_unlock (&team->task_lock);
+  if (do_wake)
+    gomp_team_barrier_wake (&team->barrier, do_wake);
+
+  gomp_finish_task (task);
+  free (task);
 }
 
 ialias (omp_fulfill_event)
diff --git a/libgomp/team.c b/libgomp/team.c
index 0f3707c..9662234 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -206,7 +206,6 @@ gomp_new_team (unsigned nthreads)
   team->work_share_cancelled = 0;
   team->team_cancelled = 0;
 
-  priority_queue_init (&team->task_detach_queue);
   team->task_detach_count = 0;
 
   return team;
@@ -224,7 +223,6 @@ free_team (struct gomp_team *team)
   gomp_barrier_destroy (&team->barrier);
   gomp_mutex_destroy (&team->task_lock);
   priority_queue_free (&team->task_queue);
-  priority_queue_free (&team->task_detach_queue);
   team_free (team);
 }
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
index 8583e37..14932b0 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
@@ -14,10 +14,10 @@ int main (void)
   #pragma omp parallel
     #pragma omp single
     {
-      #pragma omp task detach(detach_event1)
+      #pragma omp task detach (detach_event1)
 	x++;
 
-      #pragma omp task detach(detach_event2)
+      #pragma omp task detach (detach_event2)
       {
 	y++;
 	omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
new file mode 100644
index 0000000..10d6746
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause on an offload device.  Each device
+   thread spawns off a chain of tasks in a taskgroup, that can then
+   be executed by any available thread.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
+      #pragma omp taskgroup
+	{
+	  #pragma omp single
+	    thread_count = omp_get_num_threads ();
+
+	  #pragma omp task detach (detach_event1) untied
+	    #pragma omp atomic update
+	      x++;
+
+	  #pragma omp task detach (detach_event2) untied
+	  {
+	    #pragma omp atomic update
+	      y++;
+	    omp_fulfill_event (detach_event1);
+	  }
+
+	  #pragma omp task untied
+	  {
+	    #pragma omp atomic update
+	      z++;
+	    omp_fulfill_event (detach_event2);
+	  }
+	}
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
new file mode 100644
index 0000000..dd002dc
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
@@ -0,0 +1,13 @@
+/* { dg-do run } */
+
+#include <omp.h>
+
+/* Test the detach clause when the task is undeferred.  */
+
+int main (void)
+{
+  omp_event_handle_t event;
+
+  #pragma omp task detach (event)
+    omp_fulfill_event (event);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
index 943ac2a..3e33c40 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
@@ -12,13 +12,13 @@ int main (void)
   omp_event_handle_t detach_event1, detach_event2;
   int x = 0, y = 0, z = 0;
 
-  #pragma omp parallel num_threads(1)
+  #pragma omp parallel num_threads (1)
     #pragma omp single
     {
-      #pragma omp task detach(detach_event1)
+      #pragma omp task detach (detach_event1)
 	x++;
 
-      #pragma omp task detach(detach_event2)
+      #pragma omp task detach (detach_event2)
       {
 	y++;
 	omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
index 2609fb1..c85857d 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
@@ -14,16 +14,16 @@ int main (void)
   #pragma omp parallel
     #pragma omp single
     {
-      #pragma omp task depend(out:dep) detach(detach_event)
+      #pragma omp task depend (out:dep) detach (detach_event)
 	x++;
 
       #pragma omp task
       {
 	y++;
-	omp_fulfill_event(detach_event);
+	omp_fulfill_event (detach_event);
       }
 
-      #pragma omp task depend(in:dep)
+      #pragma omp task depend (in:dep)
 	z++;
     }
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
index eeb9554..cd0d2b3 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
@@ -14,10 +14,10 @@ int main (void)
 
   #pragma omp parallel
     #pragma omp single
-      #pragma omp task detach(detach_event)
+      #pragma omp task detach (detach_event)
       {
 	x++;
-	omp_fulfill_event(detach_event);
+	omp_fulfill_event (detach_event);
       }
 
   assert (x == 1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
index 5a01517..382f377 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
@@ -12,16 +12,16 @@ int main (void)
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
 
-  #pragma omp parallel firstprivate(detach_event1, detach_event2)
+  #pragma omp parallel private (detach_event1, detach_event2)
   {
     #pragma omp single
-      thread_count = omp_get_num_threads();
+      thread_count = omp_get_num_threads ();
 
-    #pragma omp task detach(detach_event1) untied
+    #pragma omp task detach (detach_event1) untied
       #pragma omp atomic update
 	x++;
 
-    #pragma omp task detach(detach_event2) untied
+    #pragma omp task detach (detach_event2) untied
     {
       #pragma omp atomic update
 	y++;
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index b5f68cc..e5c2291 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -13,11 +13,11 @@ int main (void)
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
 
-  #pragma omp target map(tofrom: x, y, z) map(from: thread_count)
-    #pragma omp parallel firstprivate(detach_event1, detach_event2)
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
       {
 	#pragma omp single
-	  thread_count = omp_get_num_threads();
+	  thread_count = omp_get_num_threads ();
 
 	#pragma omp task detach(detach_event1) untied
 	  #pragma omp atomic update
@@ -36,8 +36,6 @@ int main (void)
 	    z++;
 	  omp_fulfill_event (detach_event2);
 	}
-
-	#pragma omp taskwait
       }
 
   assert (x == thread_count);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
new file mode 100644
index 0000000..3f025d6
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause.  Each thread spawns off a chain of tasks,
+   that can then be executed by any available thread.  Each thread uses
+   taskwait to wait for the child tasks to complete.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp parallel private (detach_event1, detach_event2)
+  {
+    #pragma omp single
+      thread_count = omp_get_num_threads ();
+
+    #pragma omp task detach (detach_event1) untied
+      #pragma omp atomic update
+	x++;
+
+    #pragma omp task detach (detach_event2) untied
+    {
+      #pragma omp atomic update
+	y++;
+      omp_fulfill_event (detach_event1);
+    }
+
+    #pragma omp task untied
+    {
+      #pragma omp atomic update
+	z++;
+      omp_fulfill_event (detach_event2);
+    }
+
+    #pragma omp taskwait
+  }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
new file mode 100644
index 0000000..6f77f12
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause on an offload device.  Each device
+   thread spawns off a chain of tasks, that can then be executed by
+   any available thread.  Each thread uses taskwait to wait for the
+   child tasks to complete.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
+      {
+	#pragma omp single
+	  thread_count = omp_get_num_threads ();
+
+	#pragma omp task detach (detach_event1) untied
+	  #pragma omp atomic update
+	    x++;
+
+	#pragma omp task detach (detach_event2) untied
+	{
+	  #pragma omp atomic update
+	    y++;
+	  omp_fulfill_event (detach_event1);
+	}
+
+	#pragma omp task untied
+	{
+	  #pragma omp atomic update
+	    z++;
+	  omp_fulfill_event (detach_event2);
+	}
+
+	#pragma omp taskwait
+      }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
new file mode 100644
index 0000000..5316ca5
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause.  Each thread spawns off a chain of tasks
+   in a taskgroup, that can then be executed by any available thread.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp parallel private (detach_event1, detach_event2)
+    #pragma omp taskgroup
+    {
+      #pragma omp single
+	thread_count = omp_get_num_threads ();
+
+      #pragma omp task detach (detach_event1) untied
+	#pragma omp atomic update
+	  x++;
+
+      #pragma omp task detach (detach_event2) untied
+      {
+	#pragma omp atomic update
+	  y++;
+	omp_fulfill_event (detach_event1);
+      }
+
+      #pragma omp task untied
+      {
+	#pragma omp atomic update
+	  z++;
+	omp_fulfill_event (detach_event2);
+      }
+    }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-1.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
index 217bf65..c53b1ca 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
@@ -11,11 +11,11 @@ program task_detach_1
 
   !$omp parallel
     !$omp single
-      !$omp task detach(detach_event1)
+      !$omp task detach (detach_event1)
         x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2)
+      !$omp task detach (detach_event2)
         y = y + 1
 	call omp_fulfill_event (detach_event1)
       !$omp end task
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-10.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-10.f90
new file mode 100644
index 0000000..61f0ea8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-10.f90
@@ -0,0 +1,44 @@
+! { dg-do run }
+
+! Test tasks with detach clause on an offload device.  Each device
+! thread spawns off a chain of tasks in a taskgroup, that can then
+! be executed by any available thread.
+
+program task_detach_10
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
+      !$omp taskgroup
+	!$omp single
+	  thread_count = omp_get_num_threads ()
+	!$omp end single
+
+	!$omp task detach (detach_event1) untied
+	  !$omp atomic update
+	    x = x + 1
+	!$omp end task
+
+	!$omp task detach (detach_event2) untied
+	  !$omp atomic update
+	    y = y + 1
+	  call omp_fulfill_event (detach_event1)
+	!$omp end task
+
+	!$omp task untied
+	  !$omp atomic update
+	    z = z + 1
+	  call omp_fulfill_event (detach_event2)
+	!$omp end task
+      !$omp end taskgroup
+    !$omp end parallel
+  !$omp end target
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-11.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-11.f90
new file mode 100644
index 0000000..b33baff
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-11.f90
@@ -0,0 +1,13 @@
+! { dg-do run }
+
+! Test the detach clause when the task is undeferred.
+
+program task_detach_11
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event
+
+  !$omp task detach (detach_event)
+    call omp_fulfill_event (detach_event)
+  !$omp end task
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-2.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
index ecb4829..68e3ff2 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
@@ -10,13 +10,13 @@ program task_detach_2
   integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
   integer :: x = 0, y = 0, z = 0
 
-  !$omp parallel num_threads(1)
+  !$omp parallel num_threads (1)
     !$omp single
-      !$omp task detach(detach_event1)
+      !$omp task detach (detach_event1)
         x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2)
+      !$omp task detach (detach_event2)
         y = y + 1
 	call omp_fulfill_event (detach_event1)
       !$omp end task
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-3.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
index bdf93a5..5ac68d5 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
@@ -12,16 +12,16 @@ program task_detach_3
 
   !$omp parallel
     !$omp single
-      !$omp task depend(out:dep) detach(detach_event)
+      !$omp task depend (out:dep) detach (detach_event)
         x = x + 1
       !$omp end task
 
       !$omp task
         y = y + 1
-	call omp_fulfill_event(detach_event)
+	call omp_fulfill_event (detach_event)
       !$omp end task
 
-      !$omp task depend(in:dep)
+      !$omp task depend (in:dep)
         z = z + 1
       !$omp end task
     !$omp end single
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-4.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
index 6d0843c..159624c 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
@@ -11,9 +11,9 @@ program task_detach_4
 
   !$omp parallel
     !$omp single
-      !$omp task detach(detach_event)
+      !$omp task detach (detach_event)
         x = x + 1
-	call omp_fulfill_event(detach_event)
+	call omp_fulfill_event (detach_event)
       !$omp end task
     !$omp end single
   !$omp end parallel
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
index 955d687..95bd132 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
@@ -10,17 +10,17 @@ program task_detach_5
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  !$omp parallel firstprivate(detach_event1, detach_event2)
+  !$omp parallel private (detach_event1, detach_event2)
     !$omp single
-      thread_count = omp_get_num_threads()
+      thread_count = omp_get_num_threads ()
     !$omp end single
 
-    !$omp task detach(detach_event1) untied
+    !$omp task detach (detach_event1) untied
       !$omp atomic update
 	x = x + 1
     !$omp end task
 
-    !$omp task detach(detach_event2) untied
+    !$omp task detach (detach_event2) untied
       !$omp atomic update
 	y = y + 1
       call omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index 0fe2155..b2c476f 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -11,30 +11,28 @@ program task_detach_6
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  !$omp target map(tofrom: x, y, z) map(from: thread_count)
-    !$omp parallel firstprivate(detach_event1, detach_event2)
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
       !$omp single
-	thread_count = omp_get_num_threads()
+	thread_count = omp_get_num_threads ()
       !$omp end single
 
-      !$omp task detach(detach_event1) untied
+      !$omp task detach (detach_event1) untied
 	!$omp atomic update
 	  x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2) untied
+      !$omp task detach (detach_event2) untied
 	!$omp atomic update
 	  y = y + 1
-	call omp_fulfill_event (detach_event1);
+	call omp_fulfill_event (detach_event1)
       !$omp end task
 
       !$omp task untied
 	!$omp atomic update
 	  z = z + 1
-	call omp_fulfill_event (detach_event2);
+	call omp_fulfill_event (detach_event2)
       !$omp end task
-
-      !$omp taskwait
     !$omp end parallel
   !$omp end target
 
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-7.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-7.f90
new file mode 100644
index 0000000..32e715e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-7.f90
@@ -0,0 +1,42 @@
+! { dg-do run }
+
+! Test tasks with detach clause.  Each thread spawns off a chain of tasks,
+! that can then be executed by any available thread.  Each thread uses
+! taskwait to wait for the child tasks to complete.
+
+program task_detach_7
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp parallel private (detach_event1, detach_event2)
+    !$omp single
+      thread_count = omp_get_num_threads()
+    !$omp end single
+
+    !$omp task detach (detach_event1) untied
+      !$omp atomic update
+	x = x + 1
+    !$omp end task
+
+    !$omp task detach (detach_event2) untied
+      !$omp atomic update
+	y = y + 1
+      call omp_fulfill_event (detach_event1)
+    !$omp end task
+
+    !$omp task untied
+      !$omp atomic update
+	z = z + 1
+      call omp_fulfill_event (detach_event2)
+    !$omp end task
+
+    !$omp taskwait
+  !$omp end parallel
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-8.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-8.f90
new file mode 100644
index 0000000..e760eab
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-8.f90
@@ -0,0 +1,45 @@
+! { dg-do run }
+
+! Test tasks with detach clause on an offload device.  Each device
+! thread spawns off a chain of tasks, that can then be executed by
+! any available thread.  Each thread uses taskwait to wait for the
+! child tasks to complete.
+
+program task_detach_8
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
+      !$omp single
+	thread_count = omp_get_num_threads ()
+      !$omp end single
+
+      !$omp task detach (detach_event1) untied
+	!$omp atomic update
+	  x = x + 1
+      !$omp end task
+
+      !$omp task detach (detach_event2) untied
+	!$omp atomic update
+	  y = y + 1
+	call omp_fulfill_event (detach_event1)
+      !$omp end task
+
+      !$omp task untied
+	!$omp atomic update
+	  z = z + 1
+	call omp_fulfill_event (detach_event2)
+      !$omp end task
+
+      !$omp taskwait
+    !$omp end parallel
+  !$omp end target
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-9.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-9.f90
new file mode 100644
index 0000000..540c6de
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-9.f90
@@ -0,0 +1,41 @@
+! { dg-do run }
+
+! Test tasks with detach clause.  Each thread spawns off a chain of tasks
+! in a taskgroup, that can then be executed by any available thread.
+
+program task_detach_9
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp parallel private (detach_event1, detach_event2)
+    !$omp taskgroup
+      !$omp single
+	thread_count = omp_get_num_threads ()
+      !$omp end single
+
+      !$omp task detach (detach_event1) untied
+	!$omp atomic update
+	  x = x + 1
+      !$omp end task
+
+      !$omp task detach (detach_event2) untied
+	!$omp atomic update
+	  y = y + 1
+	call omp_fulfill_event (detach_event1);
+      !$omp end task
+
+      !$omp task untied
+	!$omp atomic update
+	  z = z + 1
+	call omp_fulfill_event (detach_event2);
+      !$omp end task
+    !$omp end taskgroup
+  !$omp end parallel
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
-- 
2.8.1



* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-19 19:12   ` [WIP] " Kwok Cheung Yeung
@ 2021-02-22 13:49     ` Jakub Jelinek
  2021-02-22 18:14       ` Jakub Jelinek
  2021-02-24 18:17       ` Kwok Cheung Yeung
  2021-02-23 21:43     ` Kwok Cheung Yeung
  1 sibling, 2 replies; 22+ messages in thread
From: Jakub Jelinek @ 2021-02-22 13:49 UTC (permalink / raw)
  To: Kwok Cheung Yeung; +Cc: GCC Patches

On Fri, Feb 19, 2021 at 07:12:42PM +0000, Kwok Cheung Yeung wrote:
> I have opted for a union of completion_sem (for tasks that are undeferred)
> and a struct gomp_team *detach_team (for deferred tasks) that holds the team
> if the completion event has not yet been fulfilled, or NULL if it has been. I don't see
> the point of having an indirection to the union here since the union is just
> the size of a pointer, so it might as well be inlined.

I see three issues with the union of completion_sem and detach_team done
that way.

1) while linux --enable-futex and accel gomp_sem_t is small (int), rtems
   and especially posix gomp_sem_t is large; so while it might be a good
   idea to inline gomp_sem_t on config/{linux,accel} into the union, for
   the rest it might be better to use indirection; if it is only for the
   undeferred tasks, it could be just using an automatic variable and
   put into the struct address of that; could be done either always,
   or define some macro in config/{linux,accel}/sem.h that gomp_sem_t is
   small and decide on the indirection based on that macro
2) kind == GOMP_TASK_UNDEFERRED is true also for the deferred tasks while
   running the cpyfn callback; guess this could be dealt with making sure
   the detach handling is done only after
      thr->task = task;
      if (cpyfn)
        {
          cpyfn (arg, data);
          task->copy_ctors_done = true;
        }
      else
        memcpy (arg, data, arg_size);
      thr->task = parent;
      task->kind = GOMP_TASK_WAITING;
      task->fn = fn;
      task->fn_data = arg;
      task->final_task = (flags & GOMP_TASK_FLAG_FINAL) >> 1;
   I see you've instead removed the GOMP_TASK_UNDEFERRED but the rationale
   for that is that the copy constructors are being run synchronously
3) kind is not constant, for the deferred tasks it can change over the
   lifetime of the task, as you've said in the comments, it is kind ==
   GOMP_TASK_UNDEFERRED vs. other values; while the changes of task->kind
   are done while holding the task lock, omp_fulfill_event reads it before
   locking that lock, so I think it needs to be done using
   if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) == GOMP_TASK_UNDEFERRED)
   Pedantically the stores to task->kind also need to be done
   with __atomic_store_n MEMMODEL_RELAXED.

Now, similarly to 3) on task->kind, task->detach_team is a similar case,
again, some other omp_fulfill_event can clear it (under lock, but still read
outside of the lock), so it
probably should be read with
  struct gomp_team *team
    = __atomic_load_n (&task->detach_team, MEMMODEL_RELAXED);
And again, pedantically the detach_team stores should be atomic relaxed
stores too.  
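
A minimal sketch of the access pattern suggested above.  The names
struct demo_task, struct demo_team, demo_load_team and demo_store_team
are hypothetical stand-ins, not the libgomp types, and __ATOMIC_RELAXED
is used here in the role MEMMODEL_RELAXED plays in the snippets above:

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are not the
   libgomp struct gomp_team / struct gomp_task definitions.  */
struct demo_team { int id; };
struct demo_task { struct demo_team *detach_team; };

/* Read the field without holding the lock.  The relaxed atomic load
   only guarantees the pointer value is not torn; it imposes no
   ordering on other memory accesses.  */
struct demo_team *
demo_load_team (struct demo_task *task)
{
  return __atomic_load_n (&task->detach_team, __ATOMIC_RELAXED);
}

/* Writers pair the unlocked reads with a relaxed atomic store.  In the
   scheme discussed above the writer would also hold the task lock;
   that lock is omitted from this sketch.  */
void
demo_store_team (struct demo_task *task, struct demo_team *team)
{
  __atomic_store_n (&task->detach_team, team, __ATOMIC_RELAXED);
}
```

A reader that gets NULL back would treat the event as already fulfilled;
anything that needs a consistent view of the rest of the task still has
to take the lock.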

> > Do you agree with this, or see some reason why this can't work?
> 
> The main problem I see is this code in gomp_barrier_handle_tasks:
> 
> 	  if (--team->task_count == 0
> 	      && gomp_team_barrier_waiting_for_tasks (&team->barrier))
> 	    {
> 	      gomp_team_barrier_done (&team->barrier, state);
> 
> We do not have access to state from within omp_fulfill_event, so how should
> this be handled?

Sure, omp_fulfill_event shouldn't do any waiting, it needs to awake anything
that could have been waiting.

> @@ -688,8 +697,7 @@ struct gomp_team
>    int work_share_cancelled;
>    int team_cancelled;
>  
> -  /* Tasks waiting for their completion event to be fulfilled.  */
> -  struct priority_queue task_detach_queue;
> +  /* Number of tasks waiting for their completion event to be fulfilled.  */
>    unsigned int task_detach_count;

Do we need task_detach_count?  Currently it is only initialized and
incremented/decremented, but never tested for anything.
Though see below.

> +  gomp_debug (0, "omp_fulfill_event: %p event fulfilled for finished task\n",
> +	      task);
> +  size_t new_tasks = gomp_task_run_post_handle_depend (task, team);
> +  gomp_task_run_post_remove_parent (task);
> +  gomp_clear_parent (&task->children_queue);
> +  gomp_task_run_post_remove_taskgroup (task);
> +  team->task_count--;
> +  team->task_detach_count--;
> +
> +  /* Wake up any threads that may be waiting for the detached task
> +     to complete.  */
> +  struct gomp_task *parent = task->parent;
> +
> +  if (parent && parent->taskwait)
> +    {
> +      if (parent->taskwait->in_taskwait)
> +	{
> +	  parent->taskwait->in_taskwait = false;
> +	  gomp_sem_post (&parent->taskwait->taskwait_sem);
> +	}
> +      else if (parent->taskwait->in_depend_wait)
> +	{
> +	  parent->taskwait->in_depend_wait = false;
> +	  gomp_sem_post (&parent->taskwait->taskwait_sem);
> +	}
> +    }

Looking at gomp_task_run_post_remove_parent, doesn't that function
already handle the in_taskwait and in_depend_wait gomp_sem_posts?

> +  if (task->taskgroup && task->taskgroup->in_taskgroup_wait)
> +    {
> +      task->taskgroup->in_taskgroup_wait = false;
> +      gomp_sem_post (&task->taskgroup->taskgroup_sem);
> +    }

And into gomp_task_run_post_remove_taskgroup, doesn't that already
handle the in_taskgroup_wait gomp_sem_post?

> +
> +  int do_wake = 0;
> +  if (new_tasks > 1)
> +    {
> +      do_wake = team->nthreads - team->task_running_count;
> +      if (do_wake > new_tasks)
> +	do_wake = new_tasks;
> +    }
> +
> +  gomp_mutex_unlock (&team->task_lock);
> +  if (do_wake)
> +    gomp_team_barrier_wake (&team->barrier, do_wake);

I think for the barrier case we need to make a difference between
team == gomp_thread ()->ts.team case (I guess the more usual one),
where the fact that we know some thread (the one calling omp_fulfill_event)
is provably doing something other than sitting on a barrier waiting for
tasks means it is much simpler, i.e. that
gomp_team_barrier_set_task_pending (&team->barrier);
is needed only when some tasks dependent on completion of the current one
were added to the queues, but
gomp_task_run_post_handle_depend -> gomp_task_run_post_handle_dependers
takes care of that call and the above do_wake handles that too,
though the exact details I think need fixing
- in gomp_barrier_handle_tasks the reason for if (new_tasks > 1)
is that if there is a single dependent task, the current thread
just finished handling one task and so can take that single task and so no
need to wake up.  While in the omp_fulfill_event case, even if there
is just one new task, we need to schedule it to some thread and so
it is desirable to wake some thread.  All we know
(if team == gomp_thread ()->ts.team) is that at least one thread is doing
something else but that one could be busy for quite some time.
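
To make the difference between the two call sites concrete, here is a
small sketch of the wake-count computation.  demo_wake_count and its
min_new_tasks parameter are hypothetical; the arithmetic mirrors the
do_wake code quoted above, with an extra guard (an assumption of this
sketch, not in the quoted code) for the case where no thread is idle:

```c
#include <stddef.h>

/* Hypothetical helper, not part of libgomp.  Returns how many threads
   to wake after NEW_TASKS tasks became runnable.
   MIN_NEW_TASKS is 2 for a caller that can pick up one new task
   itself (the "new_tasks > 1" test in gomp_barrier_handle_tasks) and
   1 for a caller such as omp_fulfill_event that will not run the new
   task (the "new_tasks > 0" variant).  */
int
demo_wake_count (int nthreads, int task_running_count,
		 size_t new_tasks, size_t min_new_tasks)
{
  if (new_tasks < min_new_tasks)
    return 0;

  /* Threads that are not running a task and so might be sleeping.  */
  int idle = nthreads - task_running_count;
  if (idle <= 0)
    return 0;

  /* Never wake more threads than there are new tasks.  */
  if ((size_t) idle > new_tasks)
    return (int) new_tasks;
  return idle;
}
```

With 4 threads, 1 running task and a single new task, the barrier-style
threshold of 2 wakes nobody, while the threshold of 1 wakes one thread.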

And the other case is the omp_fulfill_event call from an unshackled thread,
i.e. team != gomp_thread ()->ts.team.
Here, e.g. what gomp_target_task_completion talks about applies:
  /* I'm afraid this can't be done after releasing team->task_lock,
     as gomp_target_task_completion is run from unrelated thread and
     therefore in between gomp_mutex_unlock and gomp_team_barrier_wake
     the team could be gone already.  */
Even there, there are 2 different cases.
One is where team->task_running_count > 0, at that point we know
at least one task is running and so the only thing that is unsafe is
gomp_team_barrier_wake (&team->barrier, do_wake);
after gomp_mutex_unlock (&team->task_lock); - there is a possibility
that in between the two calls the thread running omp_fulfill_event
gets interrupted or just delayed and the team finishes barrier and
is freed too.  So the gomp_team_barrier_wake needs to be done before
the unlock in that case.
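
The ordering constraint can be illustrated with plain pthreads.  This is
only an analogy under stated assumptions: struct demo_team,
demo_fulfill_from_outside and demo_wait_for_tasks are hypothetical, and
a condition variable stands in for the libgomp team barrier:

```c
#include <pthread.h>

/* Hypothetical stand-in for a team; not the libgomp struct gomp_team.  */
struct demo_team
{
  pthread_mutex_t task_lock;
  pthread_cond_t barrier_cv;
  int task_count;
};

/* Called from a thread that is not a member of the team.  */
void
demo_fulfill_from_outside (struct demo_team *team)
{
  pthread_mutex_lock (&team->task_lock);
  team->task_count--;
  /* Wake the waiters while still holding the lock.  If the wake came
     after the unlock, a team thread could see task_count == 0, leave
     and free *team in between, and the wake would then touch freed
     memory.  */
  pthread_cond_broadcast (&team->barrier_cv);
  pthread_mutex_unlock (&team->task_lock);
}

/* Team side: block until every outstanding task has completed.  */
void
demo_wait_for_tasks (struct demo_team *team)
{
  pthread_mutex_lock (&team->task_lock);
  while (team->task_count > 0)
    pthread_cond_wait (&team->barrier_cv, &team->task_lock);
  pthread_mutex_unlock (&team->task_lock);
}
```

The waiter cannot get past pthread_cond_wait until the fulfilling thread
has dropped the lock, so the team stays alive for the whole wake call.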

And then there is the case where all tasks finish on a barrier but some
haven't been fulfilled yet.
In that case, when the last thread calls
gomp_team_barrier_wait_end we run in there:
  if (__builtin_expect (state & BAR_WAS_LAST, 0))
...
      if (__builtin_expect (team->task_count, 0))
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
        }
and in gomp_barrier_handle_tasks it will hit the
  gomp_mutex_lock (&team->task_lock);
  if (gomp_barrier_last_thread (state))
    {
      if (team->task_count == 0)
        {
          gomp_team_barrier_done (&team->barrier, state);
          gomp_mutex_unlock (&team->task_lock);
          gomp_team_barrier_wake (&team->barrier, 0);
          return;
        }
      gomp_team_barrier_set_waiting_for_tasks (&team->barrier);
    }
but team->task_count is not 0 (because of the one or more non-fulfilled
tasks), priority_queue_empty_p is true and so after unlocking
the function will just return.
And then it will jump to do_wait and on wake ups do
gomp_barrier_handle_tasks again.

So, I think for the team != gomp_thread ()->ts.team
&& !do_wake
&& gomp_team_barrier_waiting_for_tasks (&team->barrier)
&& team->task_detach_count == 0
case, we need to wake up 1 thread anyway and arrange for it to do:
              gomp_team_barrier_done (&team->barrier, state);
              gomp_mutex_unlock (&team->task_lock);
              gomp_team_barrier_wake (&team->barrier, 0);
Possibly in
      if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))
add
      else if (team->task_count == 0
	       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
	{
	  gomp_team_barrier_done (&team->barrier, state);
	  gomp_mutex_unlock (&team->task_lock);
	  gomp_team_barrier_wake (&team->barrier, 0);
	  if (to_free)
	    {
	      gomp_finish_task (to_free);
	      free (to_free);
	    }
	  return;
	}
but the:
          if (--team->task_count == 0
              && gomp_team_barrier_waiting_for_tasks (&team->barrier))
            {
              gomp_team_barrier_done (&team->barrier, state);
              gomp_mutex_unlock (&team->task_lock);
              gomp_team_barrier_wake (&team->barrier, 0);
              gomp_mutex_lock (&team->task_lock);
            }
in that case would then be incorrect, we don't want to do that twice.
So, either that second if would need to do the to_free handling
and return instead of taking the lock again and looping, or
perhaps we can just do
	  --team->task_count;
there instead and let the above added else if handle that?

	Jakub



* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-22 13:49     ` Jakub Jelinek
@ 2021-02-22 18:14       ` Jakub Jelinek
  2021-02-24 18:17       ` Kwok Cheung Yeung
  1 sibling, 0 replies; 22+ messages in thread
From: Jakub Jelinek @ 2021-02-22 18:14 UTC (permalink / raw)
  To: Kwok Cheung Yeung; +Cc: GCC Patches

On Mon, Feb 22, 2021 at 02:49:44PM +0100, Jakub Jelinek wrote:
> So, I think for the team != gomp_thread ()->ts.team
> && !do_wake
> && gomp_team_barrier_waiting_for_tasks (&team->barrier)
> && team->task_detach_count == 0
> case, we need to wake up 1 thread anyway and arrange for it to do:
>               gomp_team_barrier_done (&team->barrier, state);
>               gomp_mutex_unlock (&team->task_lock);
>               gomp_team_barrier_wake (&team->barrier, 0);
> Possibly in
>       if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))
> add
>       else if (team->task_count == 0
> 	       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
> 	{
> 	  gomp_team_barrier_done (&team->barrier, state);
> 	  gomp_mutex_unlock (&team->task_lock);
> 	  gomp_team_barrier_wake (&team->barrier, 0);
> 	  if (to_free)
> 	    {
> 	      gomp_finish_task (to_free);
> 	      free (to_free);
> 	    }
> 	  return;
> 	}
> but the:
>           if (--team->task_count == 0
>               && gomp_team_barrier_waiting_for_tasks (&team->barrier))
>             {
>               gomp_team_barrier_done (&team->barrier, state);
>               gomp_mutex_unlock (&team->task_lock);
>               gomp_team_barrier_wake (&team->barrier, 0);
>               gomp_mutex_lock (&team->task_lock);
>             }
> in that case would then be incorrect, we don't want to do that twice.
> So, either that second if would need to do the to_free handling
> and return instead of taking the lock again and looping, or
> perhaps we can just do
> 	  --team->task_count;
> there instead and let the above added else if handle that?

FYI, I've just tested that part of the change alone to check that it doesn't
break anything else, and it worked fine:

diff --git a/libgomp/task.c b/libgomp/task.c
index b242e7c8d20..9c27c3b5148 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -1405,6 +1405,19 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
 	  team->task_running_count++;
 	  child_task->in_tied_task = true;
 	}
+      else if (team->task_count == 0
+	       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
+	{
+	  gomp_team_barrier_done (&team->barrier, state);
+	  gomp_mutex_unlock (&team->task_lock);
+	  gomp_team_barrier_wake (&team->barrier, 0);
+	  if (to_free)
+	    {
+	      gomp_finish_task (to_free);
+	      free (to_free);
+	    }
+	  return;
+	}
       gomp_mutex_unlock (&team->task_lock);
       if (do_wake)
 	{
@@ -1479,14 +1492,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
 		  if (do_wake > new_tasks)
 		    do_wake = new_tasks;
 		}
-	      if (--team->task_count == 0
-		  && gomp_team_barrier_waiting_for_tasks (&team->barrier))
-		{
-		  gomp_team_barrier_done (&team->barrier, state);
-		  gomp_mutex_unlock (&team->task_lock);
-		  gomp_team_barrier_wake (&team->barrier, 0);
-		  gomp_mutex_lock (&team->task_lock);
-		}
+	      --team->task_count;
 	    }
 	}
     }


	Jakub



* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-19 19:12   ` [WIP] " Kwok Cheung Yeung
  2021-02-22 13:49     ` Jakub Jelinek
@ 2021-02-23 21:43     ` Kwok Cheung Yeung
  2021-02-23 21:52       ` Jakub Jelinek
  1 sibling, 1 reply; 22+ messages in thread
From: Kwok Cheung Yeung @ 2021-02-23 21:43 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

On 19/02/2021 7:12 pm, Kwok Cheung Yeung wrote:
> I have included the current state of my patch. All task-detach-* tests pass when 
> executed without offloading or with offloading to GCN, but with offloading to 
> Nvidia, task-detach-6.* hangs consistently but everything else passes (probably 
> because of the missing gomp_team_barrier_done?).
> 

It looks like the hang has nothing to do with the detach patch - this hangs 
consistently for me when offloaded to NVPTX:

#include <omp.h>

int main (void)
{
#pragma omp target
   #pragma omp parallel
     #pragma omp task
       ;
}

This doesn't hang when offloaded to GCN or the host device, or if num_threads(1) 
is specified on the omp parallel.

Kwok


* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-23 21:43     ` Kwok Cheung Yeung
@ 2021-02-23 21:52       ` Jakub Jelinek
  2021-03-11 16:52         ` Thomas Schwinge
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Jelinek @ 2021-02-23 21:52 UTC (permalink / raw)
  To: Kwok Cheung Yeung; +Cc: GCC Patches

On Tue, Feb 23, 2021 at 09:43:51PM +0000, Kwok Cheung Yeung wrote:
> On 19/02/2021 7:12 pm, Kwok Cheung Yeung wrote:
> > I have included the current state of my patch. All task-detach-* tests
> > pass when executed without offloading or with offloading to GCN, but
> > with offloading to Nvidia, task-detach-6.* hangs consistently but
> > everything else passes (probably because of the missing
> > gomp_team_barrier_done?).
> > 
> 
> It looks like the hang has nothing to do with the detach patch - this hangs
> consistently for me when offloaded to NVPTX:
> 
> #include <omp.h>
> 
> int main (void)
> {
> #pragma omp target
>   #pragma omp parallel
>     #pragma omp task
>       ;
> }
> 
> This doesn't hang when offloaded to GCN or the host device, or if
> num_threads(1) is specified on the omp parallel.

Then it can be solved separately, I'll try to have a look if I see something
bad from the dumps, but I admit I don't have much experience with debugging
NVPTX offloaded code...

	Jakub



* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-22 13:49     ` Jakub Jelinek
  2021-02-22 18:14       ` Jakub Jelinek
@ 2021-02-24 18:17       ` Kwok Cheung Yeung
  2021-02-24 19:46         ` Jakub Jelinek
  1 sibling, 1 reply; 22+ messages in thread
From: Kwok Cheung Yeung @ 2021-02-24 18:17 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 8241 bytes --]

On 22/02/2021 1:49 pm, Jakub Jelinek wrote:
> I see three issues with the union of completion_sem and detach_team done
> that way.
> 
> 1) while linux --enable-futex and accel gomp_sem_t is small (int), rtems
>     and especially posix gomp_sem_t is large; so while it might be a good
>     idea to inline gomp_sem_t on config/{linux,accel} into the union, for
>     the rest it might be better to use indirection; if it is only for the
>     undeferred tasks, it could be just using an automatic variable and
>     put into the struct address of that; could be done either always,
>     or define some macro in config/{linux,accel}/sem.h that gomp_sem_t is
>     small and decide on the indirection based on that macro

I think a pointer to an automatic variable would be simplest.
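
Roughly what I have in mind, as a sketch using POSIX sem_t rather than
gomp_sem_t.  struct demo_task, demo_fulfill_undeferred and
demo_run_undeferred are hypothetical names for illustration, not the
code in the attached patch:

```c
#include <semaphore.h>

/* Hypothetical stand-in for the task: it only stores the address of
   the semaphore, so its size does not depend on how large the
   semaphore type is on a given target.  */
struct demo_task
{
  sem_t *completion_sem;
};

/* Stand-in for fulfilling the event of an undeferred task.  */
void
demo_fulfill_undeferred (struct demo_task *task)
{
  sem_post (task->completion_sem);
}

/* Run an undeferred task body.  The semaphore is an automatic variable
   in this frame, which is safe because the frame stays live until the
   event has been fulfilled.  Returns 0 on success, -1 if the semaphore
   could not be initialized.  */
int
demo_run_undeferred (void (*body) (struct demo_task *))
{
  sem_t sem;
  struct demo_task task;

  if (sem_init (&sem, 0, 0) != 0)
    return -1;
  task.completion_sem = &sem;

  body (&task);

  /* Block until the completion event has been fulfilled.  */
  sem_wait (&sem);
  sem_destroy (&sem);
  return 0;
}
```

Passing demo_fulfill_undeferred as the body corresponds to a task that
fulfills its own event, as in task-detach-11.f90.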

> 2) kind == GOMP_TASK_UNDEFERRED is true also for the deferred tasks while
>     running the cpyfn callback; guess this could be dealt with making sure
>     the detach handling is done only after
>        thr->task = task;
>        if (cpyfn)
>          {
>            cpyfn (arg, data);
>            task->copy_ctors_done = true;
>          }
>        else
>          memcpy (arg, data, arg_size);
>        thr->task = parent;
>        task->kind = GOMP_TASK_WAITING;
>        task->fn = fn;
>        task->fn_data = arg;
>        task->final_task = (flags & GOMP_TASK_FLAG_FINAL) >> 1;
>     I see you've instead removed the GOMP_TASK_UNDEFERRED but the rationale
>     for that is that the copy constructors are being run synchronously

Can anything in cpyfn make use of the fact that kind==GOMP_TASK_UNDEFERRED while 
executing it? Anyway, if we want to keep this, then I suppose we could just add 
an extra field deferred_p that does not change for the lifetime of the task to 
indicate that the task is 'really' a deferred task.

> 3) kind is not constant, for the deferred tasks it can change over the
>     lifetime of the task, as you've said in the comments, it is kind ==
>     GOMP_TASK_UNDEFERRED vs. other values; while the changes of task->kind
>     are done while holding the task lock, omp_fulfill_event reads it before
>     locking that lock, so I think it needs to be done using
>     if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) == GOMP_TASK_UNDEFERRED)
>     Pedantically the stores to task->kind also need to be done
>     with __atomic_store_n MEMMODEL_RELAXED.

If we check task->deferred_p instead (which never changes for a task after 
instantiation), is that still necessary?

> Now, similarly for 3) on task->kind, task->detach_team is similar case,
> again, some other omp_fulfill_event can clear it (under lock, but still read
> outside of the lock), so it
> probably should be read with
>    struct gomp_team *team
>      = __atomic_load_n (&task->detach_team, MEMMODEL_RELAXED);
> And again, pedantically the detach_team stores should be atomic relaxed
> stores too.
> 

Done.

> Looking at gomp_task_run_post_remove_parent, doesn't that function
> already handle the in_taskwait and in_depend_wait gomp_sem_posts?

> And into gomp_task_run_post_remove_taskgroup, doesn't that already
> handle the in_taskgroup_wait gomp_sem_post?

The extra code has been removed.

> - in gomp_barrier_handle_tasks the reason for if (new_tasks > 1)
> is that if there is a single dependent task, the current thread
> just finished handling one task and so can take that single task and so no
> need to wake up.  While in the omp_fulfill_event case, even if there
> is just one new task, we need to schedule it to some thread and so
> is desirable to wake some thread.

In that case, we could just do 'if (new_tasks > 0)' instead?

 > All we know
 > (if team == gomp_thread ()->ts.team) is that at least one thread is doing
 > something else but that one could be busy for quite some time.

Well, it should still get around to the new task eventually, so there is no 
problem in terms of correctness here. I suppose we could always wake up one more 
thread than strictly necessary, but that might have knock-on effects on 
performance elsewhere?
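
A standalone sketch of the 'new_tasks > 0' variant, for illustration only.  The
helper name wake_count is hypothetical, and the sketch assumes that the number
of running tasks never exceeds the number of threads:

```c
#include <stddef.h>

/* Number of threads to wake after NEW_TASKS tasks have become runnable:
   one per new task, capped by the number of threads that are not
   currently running a task.  Assumes RUNNING <= NTHREADS.  */
static int
wake_count (unsigned nthreads, unsigned running, size_t new_tasks)
{
  if (new_tasks == 0)
    return 0;

  size_t idle = nthreads - running;
  return (int) (idle < new_tasks ? idle : new_tasks);
}
```

With four threads of which one is busy, a single new task gives a wake count of
1, whereas the 'new_tasks > 1' test would leave it at 0.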

> And the other case is the omp_fulfill_event call from unshackeled thread,
> i.e. team != gomp_thread ()->ts.team.
> Here, e.g. what gomp_target_task_completion talks about applies:
>    /* I'm afraid this can't be done after releasing team->task_lock,
>       as gomp_target_task_completion is run from unrelated thread and
>       therefore in between gomp_mutex_unlock and gomp_team_barrier_wake
>       the team could be gone already.  */
> Even there are 2 different cases.
> One is where team->task_running_count > 0, at that point we know
> at least one task is running and so the only thing that is unsafe
> gomp_team_barrier_wake (&team->barrier, do_wake);
> after gomp_mutex_unlock (&team->task_lock); - there is a possibility
> that in between the two calls the thread running omp_fulfill_event
> gets interrupted or just delayed and the team finishes barrier and
> is freed too.  So the gomp_team_barrier_wake needs to be done before
> the unlock in that case.

The lock is now released after the gomp_team_barrier_wake call for unshackled
threads, and before it otherwise.
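
A toy sketch of that ordering, assuming a hypothetical trace structure in place
of the real lock and barrier.  The record calls stand in for gomp_mutex_unlock
and gomp_team_barrier_wake; nothing here is libgomp code:

```c
#include <stdbool.h>

/* Hypothetical trace of the two operations, used only to show their order.  */
enum step { STEP_UNLOCK, STEP_WAKE };

struct trace
{
  enum step steps[2];
  int count;
};

static void
record (struct trace *t, enum step s)
{
  t->steps[t->count++] = s;
}

/* A thread that belongs to the team may drop the lock first.  A thread
   outside the team must wake before unlocking, because the team could
   be freed as soon as the lock is released.  */
static void
unlock_and_wake (struct trace *t, bool shackled_thread_p, int do_wake)
{
  if (shackled_thread_p)
    record (t, STEP_UNLOCK);	/* stands in for gomp_mutex_unlock */
  if (do_wake)
    record (t, STEP_WAKE);	/* stands in for gomp_team_barrier_wake */
  if (!shackled_thread_p)
    record (t, STEP_UNLOCK);	/* stands in for gomp_mutex_unlock */
}
```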

> And then there is the case where all tasks finish on a barrier but some
> haven't been fulfilled yet.
> In that case, when the last thread calls
...
> So, I think for the team != gomp_thread ()->ts.team
> && !do_wake
> && gomp_team_barrier_waiting_for_tasks (&team->barrier)
> && team->task_detach_count == 0
> case, we need to wake up 1 thread anyway and arrange for it to do:
>                gomp_team_barrier_done (&team->barrier, state);
>                gomp_mutex_unlock (&team->task_lock);
>                gomp_team_barrier_wake (&team->barrier, 0);
> Possibly in
>        if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))
> add
>        else if (team->task_count == 0
> 	       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
> 	{
> 	  gomp_team_barrier_done (&team->barrier, state);
> 	  gomp_mutex_unlock (&team->task_lock);
> 	  gomp_team_barrier_wake (&team->barrier, 0);
> 	  if (to_free)
> 	    {
> 	      gomp_finish_task (to_free);
> 	      free (to_free);
> 	    }
> 	  return;
> 	}
> but the:
>            if (--team->task_count == 0
>                && gomp_team_barrier_waiting_for_tasks (&team->barrier))
>              {
>                gomp_team_barrier_done (&team->barrier, state);
>                gomp_mutex_unlock (&team->task_lock);
>                gomp_team_barrier_wake (&team->barrier, 0);
>                gomp_mutex_lock (&team->task_lock);
>              }
> in that case would then be incorrect, we don't want to do that twice.
> So, either that second if would need to do the to_free handling
> and return instead of taking the lock again and looping, or
> perhaps we can just do
> 	  --team->task_count;
> there instead and let the above added else if handle that?
>

I have applied your patch to move the gomp_team_barrier_done call, and in 
omp_fulfill_event I now ensure that a single thread is woken up so that 
gomp_barrier_handle_tasks can signal that the barrier can finish.

I'm having some trouble coming up with a testcase for this scenario, though. 
I tried a testcase like the following to get threads in separate teams:

   #pragma omp teams num_teams (2) shared (event, started)
     #pragma omp parallel num_threads (1)
       if (omp_get_team_num () == 0)
	{
	  #pragma omp task detach (event)
	    started = 1;
	}
       else
         // Wait for started to become 1
	omp_fulfill_event (event);

but it does not work because GOMP_teams_reg launches the enclosed block 
sequentially:

   for (gomp_team_num = 0; gomp_team_num < num_teams; gomp_team_num++)
     fn (data);

and when the first team launches, it blocks waiting for the detach event in 
GOMP_parallel_end->gomp_team_end->gomp_team_barrier_wait_end, and never gets 
around to launching the second team. If I omit the 'omp parallel' (to try to get 
an undeferred task), GCC refuses to compile (only 'distribute', 'parallel' or 
'loop' regions are allowed to be strictly nested inside 'teams' region). And you 
can't nest 'omp teams' inside an 'omp parallel' either. Is there any way of 
doing this within OpenMP or do we have to resort to creating threads outside of 
OpenMP?

Thanks

Kwok

[-- Attachment #2: 0001-openmp-Fix-intermittent-hanging-of-task-detach-6-lib.patch --]
[-- Type: text/plain, Size: 41150 bytes --]

From 0fa4deb89f3778ccacd64b01de377ba2b7879db1 Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Thu, 21 Jan 2021 05:38:47 -0800
Subject: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp
 tests [PR98738]

This adds support for the task detach clause to taskwait and taskgroup, and
simplifies the handling of the detach clause by moving most of the extra
handling required for detach tasks to omp_fulfill_event.

2021-02-24  Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Jakub Jelinek  <jakub@redhat.com>

	libgomp/

	PR libgomp/98738
	* libgomp.h (enum gomp_task_kind): Add GOMP_TASK_DETACHED.
	(struct gomp_task): Replace detach and completion_sem fields with
	union containing completion_sem and detach_team.  Add deferred_p
	field.
	(struct gomp_team): Remove task_detach_queue.
	* task.c: Include assert.h.
	(gomp_init_task): Initialize deferred_p and detach_team fields.
	(task_fulfilled_p): Delete.
	(GOMP_task): Use address of task as the event handle.  Remove
	initialization of detach field.  Initialize deferred_p field.
	Use automatic local for completion_sem.  Initialize detach_team field
	for deferred tasks.
	(gomp_barrier_handle_tasks): Remove handling of task_detach_queue.
	Set kind of suspended detach task to GOMP_TASK_DETACHED and
	decrement task_running_count.  Move finish_cancelled block out of
	else branch.  Relocate call to gomp_team_barrier_done.
	(GOMP_taskwait): Handle tasks with completion events that have not
	been fulfilled.
	(GOMP_taskgroup_end): Likewise.
	(omp_fulfill_event): Use address of task as event handle.  Post to
	completion_sem for undeferred tasks.  Clear detach_team if task
	has not finished.  For finished tasks, handle post-execution tasks,
	call gomp_team_barrier_wake if necessary, and free task.
	* team.c (gomp_new_team): Remove initialization of task_detach_queue.
	(free_team): Remove free of task_detach_queue.
	* testsuite/libgomp.c-c++-common/task-detach-1.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-2.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-3.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-4.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-5.c: Fix formatting.
	Change data-sharing of detach events on enclosing parallel to private.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Likewise.  Remove
	taskwait directive.
	* testsuite/libgomp.c-c++-common/task-detach-7.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-8.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-9.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-10.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-11.c: New.
	* testsuite/libgomp.fortran/task-detach-1.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-2.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-3.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-4.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-5.f90: Fix formatting.
	Change data-sharing of detach events on enclosing parallel to private.
	* testsuite/libgomp.fortran/task-detach-6.f90: Likewise.  Remove
	taskwait directive.
	* testsuite/libgomp.fortran/task-detach-7.f90: New.
	* testsuite/libgomp.fortran/task-detach-8.f90: New.
	* testsuite/libgomp.fortran/task-detach-9.f90: New.
	* testsuite/libgomp.fortran/task-detach-10.f90: New.
	* testsuite/libgomp.fortran/task-detach-11.f90: New.
---
 libgomp/libgomp.h                                  |  19 +-
 libgomp/task.c                                     | 236 ++++++++++++++-------
 libgomp/team.c                                     |   2 -
 .../testsuite/libgomp.c-c++-common/task-detach-1.c |   4 +-
 .../libgomp.c-c++-common/task-detach-10.c          |  45 ++++
 .../libgomp.c-c++-common/task-detach-11.c          |  13 ++
 .../testsuite/libgomp.c-c++-common/task-detach-2.c |   6 +-
 .../testsuite/libgomp.c-c++-common/task-detach-3.c |   6 +-
 .../testsuite/libgomp.c-c++-common/task-detach-4.c |   4 +-
 .../testsuite/libgomp.c-c++-common/task-detach-5.c |   8 +-
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |   8 +-
 .../testsuite/libgomp.c-c++-common/task-detach-7.c |  45 ++++
 .../testsuite/libgomp.c-c++-common/task-detach-8.c |  47 ++++
 .../testsuite/libgomp.c-c++-common/task-detach-9.c |  43 ++++
 .../testsuite/libgomp.fortran/task-detach-1.f90    |   4 +-
 .../testsuite/libgomp.fortran/task-detach-10.f90   |  44 ++++
 .../testsuite/libgomp.fortran/task-detach-11.f90   |  13 ++
 .../testsuite/libgomp.fortran/task-detach-2.f90    |   6 +-
 .../testsuite/libgomp.fortran/task-detach-3.f90    |   6 +-
 .../testsuite/libgomp.fortran/task-detach-4.f90    |   4 +-
 .../testsuite/libgomp.fortran/task-detach-5.f90    |   8 +-
 .../testsuite/libgomp.fortran/task-detach-6.f90    |  16 +-
 .../testsuite/libgomp.fortran/task-detach-7.f90    |  42 ++++
 .../testsuite/libgomp.fortran/task-detach-8.f90    |  45 ++++
 .../testsuite/libgomp.fortran/task-detach-9.f90    |  41 ++++
 25 files changed, 584 insertions(+), 131 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-10.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-11.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-7.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-8.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-9.f90

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index b4d0c93..cd10d12 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -481,7 +481,10 @@ enum gomp_task_kind
      but not yet completed.  Once that completes, they will be readded
      into the queues as GOMP_TASK_WAITING in order to perform the var
      unmapping.  */
-  GOMP_TASK_ASYNC_RUNNING
+  GOMP_TASK_ASYNC_RUNNING,
+  /* Task that has finished executing but is waiting for its
+     completion event to be fulfilled.  */
+  GOMP_TASK_DETACHED
 };
 
 struct gomp_task_depend_entry
@@ -545,8 +548,15 @@ struct gomp_task
      entries and the gomp_task in which they reside.  */
   struct priority_node pnode[3];
 
-  bool detach;
-  gomp_sem_t completion_sem;
+  union {
+    /* Valid only if deferred_p is false.  */
+    gomp_sem_t *completion_sem;
+    /* Valid only if deferred_p is true.  Set to the team that executes the
+       task if the task is detached and the completion event has yet to be
+       fulfilled.  */
+    struct gomp_team *detach_team;
+  };
+  bool deferred_p;
 
   struct gomp_task_icv icv;
   void (*fn) (void *);
@@ -688,8 +698,7 @@ struct gomp_team
   int work_share_cancelled;
   int team_cancelled;
 
-  /* Tasks waiting for their completion event to be fulfilled.  */
-  struct priority_queue task_detach_queue;
+  /* Number of tasks waiting for their completion event to be fulfilled.  */
   unsigned int task_detach_count;
 
   /* This array contains structures for implicit tasks.  */
diff --git a/libgomp/task.c b/libgomp/task.c
index b242e7c..79df733 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -29,6 +29,7 @@
 #include "libgomp.h"
 #include <stdlib.h>
 #include <string.h>
+#include <assert.h>
 #include "gomp-constants.h"
 
 typedef struct gomp_task_depend_entry *hash_entry_type;
@@ -86,7 +87,8 @@ gomp_init_task (struct gomp_task *task, struct gomp_task *parent_task,
   task->dependers = NULL;
   task->depend_hash = NULL;
   task->depend_count = 0;
-  task->detach = false;
+  task->deferred_p = true;
+  task->detach_team = NULL;
 }
 
 /* Clean up a task, after completing it.  */
@@ -327,12 +329,6 @@ gomp_task_handle_depend (struct gomp_task *task, struct gomp_task *parent,
     }
 }
 
-static bool
-task_fulfilled_p (struct gomp_task *task)
-{
-  return gomp_sem_getcount (&task->completion_sem) > 0;
-}
-
 /* Called when encountering an explicit task directive.  If IF_CLAUSE is
    false, then we must not delay in executing the task.  If UNTIED is true,
    then the task may be executed by any member of the team.
@@ -398,6 +394,7 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       || team->task_count > 64 * team->nthreads)
     {
       struct gomp_task task;
+      gomp_sem_t completion_sem;
 
       /* If there are depend clauses and earlier deferred sibling tasks
 	 with depend clauses, check if there isn't a dependency.  If there
@@ -414,16 +411,18 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       task.final_task = (thr->task && thr->task->final_task)
 			|| (flags & GOMP_TASK_FLAG_FINAL);
       task.priority = priority;
+      task.deferred_p = false;
 
       if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
 	{
-	  task.detach = true;
-	  gomp_sem_init (&task.completion_sem, 0);
-	  *(void **) detach = &task.completion_sem;
+	  gomp_sem_init (&completion_sem, 0);
+	  task.completion_sem = &completion_sem;
+	  *(void **) detach = &task;
 	  if (data)
-	    *(void **) data = &task.completion_sem;
+	    *(void **) data = &task;
 
-	  gomp_debug (0, "New event: %p\n", &task.completion_sem);
+	  gomp_debug (0, "Thread %d: new event: %p\n",
+		      thr->ts.team_id, &task);
 	}
 
       if (thr->task)
@@ -443,8 +442,8 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       else
 	fn (data);
 
-      if (task.detach && !task_fulfilled_p (&task))
-	gomp_sem_wait (&task.completion_sem);
+      if ((flags & GOMP_TASK_FLAG_DETACH) != 0 && detach)
+	gomp_sem_wait (&completion_sem);
 
       /* Access to "children" is normally done inside a task_lock
 	 mutex region, but the only way this particular task.children
@@ -484,15 +483,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       task->kind = GOMP_TASK_UNDEFERRED;
       task->in_tied_task = parent->in_tied_task;
       task->taskgroup = taskgroup;
+      task->deferred_p = true;
       if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
 	{
-	  task->detach = true;
-	  gomp_sem_init (&task->completion_sem, 0);
-	  *(void **) detach = &task->completion_sem;
+	  task->detach_team = team;
+
+	  *(void **) detach = task;
 	  if (data)
-	    *(void **) data = &task->completion_sem;
+	    *(void **) data = task;
 
-	  gomp_debug (0, "New event: %p\n", &task->completion_sem);
+	  gomp_debug (0, "Thread %d: new event: %p\n", thr->ts.team_id, task);
 	}
       thr->task = task;
       if (cpyfn)
@@ -1362,27 +1362,6 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
     {
       bool cancelled = false;
 
-      /* Look for a queued detached task with a fulfilled completion event
-	 that is ready to finish.  */
-      child_task = priority_queue_find (PQ_TEAM, &team->task_detach_queue,
-					task_fulfilled_p);
-      if (child_task)
-	{
-	  priority_queue_remove (PQ_TEAM, &team->task_detach_queue,
-				 child_task, MEMMODEL_RELAXED);
-	  --team->task_detach_count;
-	  gomp_debug (0, "thread %d: found task with fulfilled event %p\n",
-		      thr->ts.team_id, &child_task->completion_sem);
-
-	if (to_free)
-	  {
-	    gomp_finish_task (to_free);
-	    free (to_free);
-	    to_free = NULL;
-	  }
-	  goto finish_cancelled;
-	}
-
       if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))
 	{
 	  bool ignored;
@@ -1405,6 +1384,19 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
 	  team->task_running_count++;
 	  child_task->in_tied_task = true;
 	}
+      else if (team->task_count == 0
+	       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
+	{
+	  gomp_team_barrier_done (&team->barrier, state);
+	  gomp_mutex_unlock (&team->task_lock);
+	  gomp_team_barrier_wake (&team->barrier, 0);
+	  if (to_free)
+	    {
+	      gomp_finish_task (to_free);
+	      free (to_free);
+	    }
+	  return;
+	}
       gomp_mutex_unlock (&team->task_lock);
       if (do_wake)
 	{
@@ -1450,44 +1442,37 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
-	  if (child_task->detach && !task_fulfilled_p (child_task))
+	  if (child_task->detach_team)
 	    {
-	      priority_queue_insert (PQ_TEAM, &team->task_detach_queue,
-				     child_task, child_task->priority,
-				     PRIORITY_INSERT_END,
-				     false, false);
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
 	      ++team->task_detach_count;
-	      gomp_debug (0, "thread %d: queueing task with event %p\n",
-			  thr->ts.team_id, &child_task->completion_sem);
+	      --team->task_running_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in team barrier\n",
+			  thr->ts.team_id, child_task);
 	      child_task = NULL;
+	      continue;
 	    }
-	  else
+
+	 finish_cancelled:;
+	  size_t new_tasks
+	    = gomp_task_run_post_handle_depend (child_task, team);
+	  gomp_task_run_post_remove_parent (child_task);
+	  gomp_clear_parent (&child_task->children_queue);
+	  gomp_task_run_post_remove_taskgroup (child_task);
+	  to_free = child_task;
+	  if (!cancelled)
+	    team->task_running_count--;
+	  child_task = NULL;
+	  if (new_tasks > 1)
 	    {
-	     finish_cancelled:;
-	      size_t new_tasks
-		= gomp_task_run_post_handle_depend (child_task, team);
-	      gomp_task_run_post_remove_parent (child_task);
-	      gomp_clear_parent (&child_task->children_queue);
-	      gomp_task_run_post_remove_taskgroup (child_task);
-	      to_free = child_task;
-	      child_task = NULL;
-	      if (!cancelled)
-		team->task_running_count--;
-	      if (new_tasks > 1)
-		{
-		  do_wake = team->nthreads - team->task_running_count;
-		  if (do_wake > new_tasks)
-		    do_wake = new_tasks;
-		}
-	      if (--team->task_count == 0
-		  && gomp_team_barrier_waiting_for_tasks (&team->barrier))
-		{
-		  gomp_team_barrier_done (&team->barrier, state);
-		  gomp_mutex_unlock (&team->task_lock);
-		  gomp_team_barrier_wake (&team->barrier, 0);
-		  gomp_mutex_lock (&team->task_lock);
-		}
+	      do_wake = team->nthreads - team->task_running_count;
+	      if (do_wake > new_tasks)
+		do_wake = new_tasks;
 	    }
+	  --team->task_count;
 	}
     }
 }
@@ -1559,7 +1544,8 @@ GOMP_taskwait (void)
       else
 	{
 	/* All tasks we are waiting for are either running in other
-	   threads, or they are tasks that have not had their
+	   threads, are detached and waiting for the completion event to be
+	   fulfilled, or they are tasks that have not had their
 	   dependencies met (so they're not even in the queue).  Wait
 	   for them.  */
 	  if (task->taskwait == NULL)
@@ -1614,6 +1600,19 @@ GOMP_taskwait (void)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
+	  if (child_task->detach_team)
+	    {
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
+	      ++team->task_detach_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in taskwait\n",
+			  thr->ts.team_id, child_task);
+	      child_task = NULL;
+	      continue;
+	    }
+
 	 finish_cancelled:;
 	  size_t new_tasks
 	    = gomp_task_run_post_handle_depend (child_task, team);
@@ -2069,6 +2068,19 @@ GOMP_taskgroup_end (void)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
+	  if (child_task->detach_team)
+	    {
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
+	      ++team->task_detach_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in taskgroup\n",
+			  thr->ts.team_id, child_task);
+	      child_task = NULL;
+	      continue;
+	    }
+
 	 finish_cancelled:;
 	  size_t new_tasks
 	    = gomp_task_run_post_handle_depend (child_task, team);
@@ -2402,17 +2414,77 @@ ialias (omp_in_final)
 void
 omp_fulfill_event (omp_event_handle_t event)
 {
-  gomp_sem_t *sem = (gomp_sem_t *) event;
-  struct gomp_thread *thr = gomp_thread ();
-  struct gomp_team *team = thr ? thr->ts.team : NULL;
+  struct gomp_task *task = (struct gomp_task *) event;
+  if (!task->deferred_p)
+    {
+      if (gomp_sem_getcount (task->completion_sem) > 0)
+	gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", task);
 
-  if (gomp_sem_getcount (sem) > 0)
-    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", sem);
+      gomp_debug (0, "omp_fulfill_event: %p event for undeferred task\n",
+		  task);
+      gomp_sem_post (task->completion_sem);
+      return;
+    }
 
-  gomp_debug (0, "omp_fulfill_event: %p\n", sem);
-  gomp_sem_post (sem);
-  if (team)
-    gomp_team_barrier_wake (&team->barrier, 1);
+  struct gomp_team *team = __atomic_load_n (&task->detach_team,
+					    MEMMODEL_RELAXED);
+  if (!team)
+    gomp_fatal ("omp_fulfill_event: %p event is invalid or has already "
+		"been fulfilled!\n", task);
+
+  gomp_mutex_lock (&team->task_lock);
+  if (task->kind != GOMP_TASK_DETACHED)
+    {
+      /* The task has not finished running yet.  */
+      gomp_debug (0,
+		  "omp_fulfill_event: %p event fulfilled for unfinished "
+		  "task\n", task);
+      __atomic_store_n (&task->detach_team, NULL, MEMMODEL_RELAXED);
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+
+  gomp_debug (0, "omp_fulfill_event: %p event fulfilled for finished task\n",
+	      task);
+  size_t new_tasks = gomp_task_run_post_handle_depend (task, team);
+  gomp_task_run_post_remove_parent (task);
+  gomp_clear_parent (&task->children_queue);
+  gomp_task_run_post_remove_taskgroup (task);
+  team->task_count--;
+  team->task_detach_count--;
+
+  int do_wake = 0;
+  bool shackled_thread_p = team == gomp_thread ()->ts.team;
+  if (new_tasks > 0)
+    {
+      /* Wake up threads to run new tasks.  */
+      do_wake = team->nthreads - team->task_running_count;
+      if (do_wake > new_tasks)
+	do_wake = new_tasks;
+    }
+
+  if (!shackled_thread_p
+      && !do_wake
+      && gomp_team_barrier_waiting_for_tasks (&team->barrier)
+      && team->task_detach_count == 0)
+    {
+      /* Ensure that at least one thread is woken up to signal that the
+	 barrier can finish.  */
+      do_wake = 1;
+    }
+
+  /* If we are running in an unshackled thread, the team might vanish before
+     gomp_team_barrier_wake is run if we release the lock first, so keep the
+     lock for the call in that case.  */
+  if (shackled_thread_p)
+    gomp_mutex_unlock (&team->task_lock);
+  if (do_wake)
+    gomp_team_barrier_wake (&team->barrier, do_wake);
+  if (!shackled_thread_p)
+    gomp_mutex_unlock (&team->task_lock);
+
+  gomp_finish_task (task);
+  free (task);
 }
 
 ialias (omp_fulfill_event)
diff --git a/libgomp/team.c b/libgomp/team.c
index 0f3707c..9662234 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -206,7 +206,6 @@ gomp_new_team (unsigned nthreads)
   team->work_share_cancelled = 0;
   team->team_cancelled = 0;
 
-  priority_queue_init (&team->task_detach_queue);
   team->task_detach_count = 0;
 
   return team;
@@ -224,7 +223,6 @@ free_team (struct gomp_team *team)
   gomp_barrier_destroy (&team->barrier);
   gomp_mutex_destroy (&team->task_lock);
   priority_queue_free (&team->task_queue);
-  priority_queue_free (&team->task_detach_queue);
   team_free (team);
 }
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
index 8583e37..14932b0 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
@@ -14,10 +14,10 @@ int main (void)
   #pragma omp parallel
     #pragma omp single
     {
-      #pragma omp task detach(detach_event1)
+      #pragma omp task detach (detach_event1)
 	x++;
 
-      #pragma omp task detach(detach_event2)
+      #pragma omp task detach (detach_event2)
       {
 	y++;
 	omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
new file mode 100644
index 0000000..10d6746
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause on an offload device.  Each device
+   thread spawns off a chain of tasks in a taskgroup, that can then
+   be executed by any available thread.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
+      #pragma omp taskgroup
+	{
+	  #pragma omp single
+	    thread_count = omp_get_num_threads ();
+
+	  #pragma omp task detach (detach_event1) untied
+	    #pragma omp atomic update
+	      x++;
+
+	  #pragma omp task detach (detach_event2) untied
+	  {
+	    #pragma omp atomic update
+	      y++;
+	    omp_fulfill_event (detach_event1);
+	  }
+
+	  #pragma omp task untied
+	  {
+	    #pragma omp atomic update
+	      z++;
+	    omp_fulfill_event (detach_event2);
+	  }
+	}
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
new file mode 100644
index 0000000..dd002dc
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
@@ -0,0 +1,13 @@
+/* { dg-do run } */
+
+#include <omp.h>
+
+/* Test the detach clause when the task is undeferred.  */
+
+int main (void)
+{
+  omp_event_handle_t event;
+
+  #pragma omp task detach (event)
+    omp_fulfill_event (event);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
index 943ac2a..3e33c40 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
@@ -12,13 +12,13 @@ int main (void)
   omp_event_handle_t detach_event1, detach_event2;
   int x = 0, y = 0, z = 0;
 
-  #pragma omp parallel num_threads(1)
+  #pragma omp parallel num_threads (1)
     #pragma omp single
     {
-      #pragma omp task detach(detach_event1)
+      #pragma omp task detach (detach_event1)
 	x++;
 
-      #pragma omp task detach(detach_event2)
+      #pragma omp task detach (detach_event2)
       {
 	y++;
 	omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
index 2609fb1..c85857d 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
@@ -14,16 +14,16 @@ int main (void)
   #pragma omp parallel
     #pragma omp single
     {
-      #pragma omp task depend(out:dep) detach(detach_event)
+      #pragma omp task depend (out:dep) detach (detach_event)
 	x++;
 
       #pragma omp task
       {
 	y++;
-	omp_fulfill_event(detach_event);
+	omp_fulfill_event (detach_event);
       }
 
-      #pragma omp task depend(in:dep)
+      #pragma omp task depend (in:dep)
 	z++;
     }
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
index eeb9554..cd0d2b3 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
@@ -14,10 +14,10 @@ int main (void)
 
   #pragma omp parallel
     #pragma omp single
-      #pragma omp task detach(detach_event)
+      #pragma omp task detach (detach_event)
       {
 	x++;
-	omp_fulfill_event(detach_event);
+	omp_fulfill_event (detach_event);
       }
 
   assert (x == 1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
index 5a01517..382f377 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
@@ -12,16 +12,16 @@ int main (void)
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
 
-  #pragma omp parallel firstprivate(detach_event1, detach_event2)
+  #pragma omp parallel private (detach_event1, detach_event2)
   {
     #pragma omp single
-      thread_count = omp_get_num_threads();
+      thread_count = omp_get_num_threads ();
 
-    #pragma omp task detach(detach_event1) untied
+    #pragma omp task detach (detach_event1) untied
       #pragma omp atomic update
 	x++;
 
-    #pragma omp task detach(detach_event2) untied
+    #pragma omp task detach (detach_event2) untied
     {
       #pragma omp atomic update
 	y++;
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index b5f68cc..e5c2291 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -13,11 +13,11 @@ int main (void)
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
 
-  #pragma omp target map(tofrom: x, y, z) map(from: thread_count)
-    #pragma omp parallel firstprivate(detach_event1, detach_event2)
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
       {
 	#pragma omp single
-	  thread_count = omp_get_num_threads();
+	  thread_count = omp_get_num_threads ();
 
 	#pragma omp task detach(detach_event1) untied
 	  #pragma omp atomic update
@@ -36,8 +36,6 @@ int main (void)
 	    z++;
 	  omp_fulfill_event (detach_event2);
 	}
-
-	#pragma omp taskwait
       }
 
   assert (x == thread_count);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
new file mode 100644
index 0000000..3f025d6
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause.  Each thread spawns off a chain of tasks,
+   that can then be executed by any available thread.  Each thread uses
+   taskwait to wait for the child tasks to complete.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp parallel private (detach_event1, detach_event2)
+  {
+    #pragma omp single
+      thread_count = omp_get_num_threads ();
+
+    #pragma omp task detach (detach_event1) untied
+      #pragma omp atomic update
+	x++;
+
+    #pragma omp task detach (detach_event2) untied
+    {
+      #pragma omp atomic update
+	y++;
+      omp_fulfill_event (detach_event1);
+    }
+
+    #pragma omp task untied
+    {
+      #pragma omp atomic update
+	z++;
+      omp_fulfill_event (detach_event2);
+    }
+
+    #pragma omp taskwait
+  }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
new file mode 100644
index 0000000..6f77f12
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause on an offload device.  Each device
+   thread spawns off a chain of tasks, that can then be executed by
+   any available thread.  Each thread uses taskwait to wait for the
+   child tasks to complete.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
+      {
+	#pragma omp single
+	  thread_count = omp_get_num_threads ();
+
+	#pragma omp task detach (detach_event1) untied
+	  #pragma omp atomic update
+	    x++;
+
+	#pragma omp task detach (detach_event2) untied
+	{
+	  #pragma omp atomic update
+	    y++;
+	  omp_fulfill_event (detach_event1);
+	}
+
+	#pragma omp task untied
+	{
+	  #pragma omp atomic update
+	    z++;
+	  omp_fulfill_event (detach_event2);
+	}
+
+	#pragma omp taskwait
+      }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
new file mode 100644
index 0000000..5316ca5
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause.  Each thread spawns off a chain of tasks
+   in a taskgroup, that can then be executed by any available thread.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp parallel private (detach_event1, detach_event2)
+    #pragma omp taskgroup
+    {
+      #pragma omp single
+	thread_count = omp_get_num_threads ();
+
+      #pragma omp task detach (detach_event1) untied
+	#pragma omp atomic update
+	  x++;
+
+      #pragma omp task detach (detach_event2) untied
+      {
+	#pragma omp atomic update
+	  y++;
+	omp_fulfill_event (detach_event1);
+      }
+
+      #pragma omp task untied
+      {
+	#pragma omp atomic update
+	  z++;
+	omp_fulfill_event (detach_event2);
+      }
+    }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-1.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
index 217bf65..c53b1ca 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
@@ -11,11 +11,11 @@ program task_detach_1
 
   !$omp parallel
     !$omp single
-      !$omp task detach(detach_event1)
+      !$omp task detach (detach_event1)
         x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2)
+      !$omp task detach (detach_event2)
         y = y + 1
 	call omp_fulfill_event (detach_event1)
       !$omp end task
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-10.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-10.f90
new file mode 100644
index 0000000..61f0ea8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-10.f90
@@ -0,0 +1,44 @@
+! { dg-do run }
+
+! Test tasks with detach clause on an offload device.  Each device
+! thread spawns off a chain of tasks in a taskgroup, that can then
+! be executed by any available thread.
+
+program task_detach_10
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
+      !$omp taskgroup
+	!$omp single
+	  thread_count = omp_get_num_threads ()
+	!$omp end single
+
+	!$omp task detach (detach_event1) untied
+	  !$omp atomic update
+	    x = x + 1
+	!$omp end task
+
+	!$omp task detach (detach_event2) untied
+	  !$omp atomic update
+	    y = y + 1
+	  call omp_fulfill_event (detach_event1)
+	!$omp end task
+
+	!$omp task untied
+	  !$omp atomic update
+	    z = z + 1
+	  call omp_fulfill_event (detach_event2)
+	!$omp end task
+      !$omp end taskgroup
+    !$omp end parallel
+  !$omp end target
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-11.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-11.f90
new file mode 100644
index 0000000..b33baff
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-11.f90
@@ -0,0 +1,13 @@
+! { dg-do run }
+
+! Test the detach clause when the task is undeferred.
+
+program task_detach_11
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event
+
+  !$omp task detach (detach_event)
+    call omp_fulfill_event (detach_event)
+  !$omp end task
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-2.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
index ecb4829..68e3ff2 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
@@ -10,13 +10,13 @@ program task_detach_2
   integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
   integer :: x = 0, y = 0, z = 0
 
-  !$omp parallel num_threads(1)
+  !$omp parallel num_threads (1)
     !$omp single
-      !$omp task detach(detach_event1)
+      !$omp task detach (detach_event1)
         x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2)
+      !$omp task detach (detach_event2)
         y = y + 1
 	call omp_fulfill_event (detach_event1)
       !$omp end task
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-3.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
index bdf93a5..5ac68d5 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
@@ -12,16 +12,16 @@ program task_detach_3
 
   !$omp parallel
     !$omp single
-      !$omp task depend(out:dep) detach(detach_event)
+      !$omp task depend (out:dep) detach (detach_event)
         x = x + 1
       !$omp end task
 
       !$omp task
         y = y + 1
-	call omp_fulfill_event(detach_event)
+	call omp_fulfill_event (detach_event)
       !$omp end task
 
-      !$omp task depend(in:dep)
+      !$omp task depend (in:dep)
         z = z + 1
       !$omp end task
     !$omp end single
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-4.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
index 6d0843c..159624c 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
@@ -11,9 +11,9 @@ program task_detach_4
 
   !$omp parallel
     !$omp single
-      !$omp task detach(detach_event)
+      !$omp task detach (detach_event)
         x = x + 1
-	call omp_fulfill_event(detach_event)
+	call omp_fulfill_event (detach_event)
       !$omp end task
     !$omp end single
   !$omp end parallel
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
index 955d687..95bd132 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
@@ -10,17 +10,17 @@ program task_detach_5
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  !$omp parallel firstprivate(detach_event1, detach_event2)
+  !$omp parallel private (detach_event1, detach_event2)
     !$omp single
-      thread_count = omp_get_num_threads()
+      thread_count = omp_get_num_threads ()
     !$omp end single
 
-    !$omp task detach(detach_event1) untied
+    !$omp task detach (detach_event1) untied
       !$omp atomic update
 	x = x + 1
     !$omp end task
 
-    !$omp task detach(detach_event2) untied
+    !$omp task detach (detach_event2) untied
       !$omp atomic update
 	y = y + 1
       call omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index 0fe2155..b2c476f 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -11,30 +11,28 @@ program task_detach_6
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  !$omp target map(tofrom: x, y, z) map(from: thread_count)
-    !$omp parallel firstprivate(detach_event1, detach_event2)
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
       !$omp single
-	thread_count = omp_get_num_threads()
+	thread_count = omp_get_num_threads ()
       !$omp end single
 
-      !$omp task detach(detach_event1) untied
+      !$omp task detach (detach_event1) untied
 	!$omp atomic update
 	  x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2) untied
+      !$omp task detach (detach_event2) untied
 	!$omp atomic update
 	  y = y + 1
-	call omp_fulfill_event (detach_event1);
+	call omp_fulfill_event (detach_event1)
       !$omp end task
 
       !$omp task untied
 	!$omp atomic update
 	  z = z + 1
-	call omp_fulfill_event (detach_event2);
+	call omp_fulfill_event (detach_event2)
       !$omp end task
-
-      !$omp taskwait
     !$omp end parallel
   !$omp end target
 
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-7.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-7.f90
new file mode 100644
index 0000000..32e715e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-7.f90
@@ -0,0 +1,42 @@
+! { dg-do run }
+
+! Test tasks with detach clause.  Each thread spawns off a chain of tasks,
+! that can then be executed by any available thread.  Each thread uses
+! taskwait to wait for the child tasks to complete.
+
+program task_detach_7
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp parallel private (detach_event1, detach_event2)
+    !$omp single
+      thread_count = omp_get_num_threads ()
+    !$omp end single
+
+    !$omp task detach (detach_event1) untied
+      !$omp atomic update
+	x = x + 1
+    !$omp end task
+
+    !$omp task detach (detach_event2) untied
+      !$omp atomic update
+	y = y + 1
+      call omp_fulfill_event (detach_event1)
+    !$omp end task
+
+    !$omp task untied
+      !$omp atomic update
+	z = z + 1
+      call omp_fulfill_event (detach_event2)
+    !$omp end task
+
+    !$omp taskwait
+  !$omp end parallel
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-8.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-8.f90
new file mode 100644
index 0000000..e760eab
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-8.f90
@@ -0,0 +1,45 @@
+! { dg-do run }
+
+! Test tasks with detach clause on an offload device.  Each device
+! thread spawns off a chain of tasks, that can then be executed by
+! any available thread.  Each thread uses taskwait to wait for the
+! child tasks to complete.
+
+program task_detach_8
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
+      !$omp single
+	thread_count = omp_get_num_threads ()
+      !$omp end single
+
+      !$omp task detach (detach_event1) untied
+	!$omp atomic update
+	  x = x + 1
+      !$omp end task
+
+      !$omp task detach (detach_event2) untied
+	!$omp atomic update
+	  y = y + 1
+	call omp_fulfill_event (detach_event1)
+      !$omp end task
+
+      !$omp task untied
+	!$omp atomic update
+	  z = z + 1
+	call omp_fulfill_event (detach_event2)
+      !$omp end task
+
+      !$omp taskwait
+    !$omp end parallel
+  !$omp end target
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-9.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-9.f90
new file mode 100644
index 0000000..540c6de
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-9.f90
@@ -0,0 +1,41 @@
+! { dg-do run }
+
+! Test tasks with detach clause.  Each thread spawns off a chain of tasks
+! in a taskgroup, that can then be executed by any available thread.
+
+program task_detach_9
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp parallel private (detach_event1, detach_event2)
+    !$omp taskgroup
+      !$omp single
+	thread_count = omp_get_num_threads ()
+      !$omp end single
+
+      !$omp task detach (detach_event1) untied
+	!$omp atomic update
+	  x = x + 1
+      !$omp end task
+
+      !$omp task detach (detach_event2) untied
+	!$omp atomic update
+	  y = y + 1
+	call omp_fulfill_event (detach_event1)
+      !$omp end task
+
+      !$omp task untied
+	!$omp atomic update
+	  z = z + 1
+	call omp_fulfill_event (detach_event2)
+      !$omp end task
+    !$omp end taskgroup
+  !$omp end parallel
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
-- 
2.8.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-24 18:17       ` Kwok Cheung Yeung
@ 2021-02-24 19:46         ` Jakub Jelinek
  2021-02-25 16:21           ` Kwok Cheung Yeung
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Jelinek @ 2021-02-24 19:46 UTC (permalink / raw)
  To: Kwok Cheung Yeung; +Cc: GCC Patches

On Wed, Feb 24, 2021 at 06:17:01PM +0000, Kwok Cheung Yeung wrote:
> > 1) while linux --enable-futex and accel gomp_sem_t is small (int), rtems
> >     and especially posix gomp_sem_t is large; so while it might be a good
> >     idea to inline gomp_sem_t on config/{linux,accel} into the union, for
> >     the rest it might be better to use indirection; if it is only for the
> >     undeferred tasks, it could be just using an automatic variable and
> >     put into the struct address of that; could be done either always,
> >     or define some macro in config/{linux,accel}/sem.h that gomp_sem_t is
> >     small and decide on the indirection based on that macro
> 
> I think a pointer to an automatic variable would be simplest.

Agreed.

> Can anything in cpyfn make use of the fact that kind==GOMP_TASK_UNDEFERRED
> while executing it? Anyway, if we want to keep this, then I suppose we could
> just add an extra field deferred_p that does not change for the lifetime of
> the task to indicate that the task is 'really' a deferred task.

Adding a bool is fine, but see below.

> > 3) kind is not constant, for the deferred tasks it can change over the
> >     lifetime of the task, as you've said in the comments, it is kind ==
> >     GOMP_TASK_UNDEFERRED vs. other values; while the changes of task->kind
> >     are done while holding the task lock, omp_fulfill_event reads it before
> >     locking that lock, so I think it needs to be done using
> >     if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) == GOMP_TASK_UNDEFERRED)
> >     Pedantically the stores to task->kind also need to be done
> >     with __atomic_store_n MEMMODEL_RELAXED.
> 
> If we check task->deferred_p instead (which never changes for a task after
> instantiation), is that still necessary?

Not for kind or the new field.
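
For reference, a minimal sketch of the lock-free read discussed in 3), had
the check stayed on the mutable kind field.  The struct, enum and function
names below are hypothetical stand-ins, not the real libgomp definitions;
only the __atomic_load_n/__atomic_store_n builtins are real GCC API, with
__ATOMIC_RELAXED standing in for libgomp's MEMMODEL_RELAXED.

```c
#include <assert.h>

/* Hypothetical, reduced stand-ins for enum gomp_task_kind and
   struct gomp_task; the real types have more members.  */
enum task_kind { TASK_UNDEFERRED, TASK_WAITING, TASK_TIED };

struct task
{
  enum task_kind kind;
};

/* Reader that does not hold the task lock: a relaxed atomic load avoids
   a data race with concurrent writers but imposes no ordering.  */
static inline int
task_undeferred_p (struct task *task)
{
  return __atomic_load_n (&task->kind, __ATOMIC_RELAXED) == TASK_UNDEFERRED;
}

/* Writer, assumed to be called with the task lock held: the store is
   still atomic so that it pairs with the unlocked reader above.  */
static inline void
task_set_kind (struct task *task, enum task_kind kind)
{
  __atomic_store_n (&task->kind, kind, __ATOMIC_RELAXED);
}
```

A field that is written once before the task is published and never
changed afterwards, as proposed for deferred_p, does not need this.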

> > - in gomp_barrier_handle_tasks the reason for if (new_tasks > 1)
> > is that if there is a single dependent task, the current thread
> > just finished handling one task and so can take that single task and so no
> > need to wake up.  While in the omp_fulfill_event case, even if there
> > is just one new task, we need to schedule it to some thread and so
> > it is desirable to wake some thread.
> 
> In that case, we could just do 'if (new_tasks > 0)' instead?

Yes.
> 
> > All we know
> > (if team == gomp_thread ()->ts.team) is that at least one thread is doing
> > something else but that one could be busy for quite some time.
> 
> Well, it should still get around to the new task eventually, so there is no
> problem in terms of correctness here. I suppose we could always wake up one
> more thread than strictly necessary, but that might have knock-on effects on
> performance elsewhere?

Yeah, waking something unnecessarily is always going to cause performance
problems.

> I have applied your patch to move the gomp_team_barrier_done, and in
> omp_fulfill_event, I ensure that a single thread is woken up so that
> gomp_barrier_handle_tasks can signal for the barrier to finish.
> 
> I'm having some trouble coming up with a testcase to test this scenario
> though. I tried having a testcase like this to have threads in separate
> teams:

The unshackled thread testcase would probably need a pthread_create
call and would have to be restricted to POSIX threads targets.
The teams in host teams (or target) have no way to serialize, at least
not in OpenMP, e.g. it can always be implemented like we do ATM.

But I guess that testcase can be done incrementally.

> @@ -545,8 +548,15 @@ struct gomp_task
>       entries and the gomp_task in which they reside.  */
>    struct priority_node pnode[3];
>  
> -  bool detach;
> -  gomp_sem_t completion_sem;
> +  union {
> +    /* Valid only if deferred_p is false.  */
> +    gomp_sem_t *completion_sem;
> +    /* Valid only if deferred_p is true.  Set to the team that executes the
> +       task if the task is detached and the completion event has yet to be
> +       fulfilled.  */
> +    struct gomp_team *detach_team;
> +  };
> +  bool deferred_p;
>  
>    struct gomp_task_icv icv;
>    void (*fn) (void *);

What I don't like is that this creates too much wasteful padding
in a struct that should be as small as possible.
At least on 64-bit hosts which we care about most, pahole shows with your
patch:
struct gomp_task {
        struct gomp_task *         parent;               /*     0     8 */
        struct priority_queue      children_queue;       /*     8    32 */
        struct gomp_taskgroup *    taskgroup;            /*    40     8 */
        struct gomp_dependers_vec * dependers;           /*    48     8 */
        struct htab *              depend_hash;          /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct gomp_taskwait *     taskwait;             /*    64     8 */
        size_t                     depend_count;         /*    72     8 */
        size_t                     num_dependees;        /*    80     8 */
        int                        priority;             /*    88     4 */

        /* XXX 4 bytes hole, try to pack */

        struct priority_node       pnode[3];             /*    96    48 */
        /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
        union {
                gomp_sem_t *       completion_sem;       /*   144     8 */
                struct gomp_team * detach_team;          /*   144     8 */
        };                                               /*   144     8 */
        _Bool                      deferred_p;           /*   152     1 */

        /* XXX 7 bytes hole, try to pack */

        struct gomp_task_icv       icv;                  /*   160    40 */
        /* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
        void                       (*fn)(void *);        /*   200     8 */
        void *                     fn_data;              /*   208     8 */
        enum gomp_task_kind        kind;                 /*   216     4 */
        _Bool                      in_tied_task;         /*   220     1 */
        _Bool                      final_task;           /*   221     1 */
        _Bool                      copy_ctors_done;      /*   222     1 */
        _Bool                      parent_depends_on;    /*   223     1 */
        struct gomp_task_depend_entry depend[];          /*   224     0 */

        /* size: 224, cachelines: 4, members: 21 */
        /* sum members: 213, holes: 2, sum holes: 11 */
        /* last cacheline: 32 bytes */
};

So perhaps it might be better to put the new fields before the int priority;
field, in the order bool deferred_p; union { };
That way, there will be just 3 bytes hole in the whole struct,
not 4 + 7 byte holes.

>  
> -      if (task.detach && !task_fulfilled_p (&task))
> -	gomp_sem_wait (&task.completion_sem);
> +      if ((flags & GOMP_TASK_FLAG_DETACH) != 0 && detach)
> +	gomp_sem_wait (&completion_sem);

I think gomp_sem_destroy is missing here (inside the conditional, if the
semaphore was only initialized conditionally).  Furthermore, I don't
understand the && detach; the earlier code assumes that if
(flags & GOMP_TASK_FLAG_DETACH) != 0 then it can dereference
*(void **) detach, so the && detach seems to be unnecessary.

> @@ -484,15 +483,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
>        task->kind = GOMP_TASK_UNDEFERRED;
>        task->in_tied_task = parent->in_tied_task;
>        task->taskgroup = taskgroup;
> +      task->deferred_p = true;
>        if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
>  	{
> -	  task->detach = true;
> -	  gomp_sem_init (&task->completion_sem, 0);
> -	  *(void **) detach = &task->completion_sem;

I think you can move task->deferred_p into the if stmt.
> +  if (!shackled_thread_p
> +      && !do_wake
> +      && gomp_team_barrier_waiting_for_tasks (&team->barrier)
> +      && team->task_detach_count == 0)

&& team->task_detach_count == 0 is cheaper than the
  && gomp_team_barrier_waiting_for_tasks (&team->barrier)
so please swap those two.

> +    {
> +      /* Ensure that at least one thread is woken up to signal that the
> +	 barrier can finish.  */
> +      do_wake = 1;
> +    }

Please drop the {}s around the single do_wake = 1; stmt.

Otherwise LGTM.

	Jakub


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-24 19:46         ` Jakub Jelinek
@ 2021-02-25 16:21           ` Kwok Cheung Yeung
  2021-02-25 16:38             ` Jakub Jelinek
  0 siblings, 1 reply; 22+ messages in thread
From: Kwok Cheung Yeung @ 2021-02-25 16:21 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 7110 bytes --]

On 24/02/2021 7:46 pm, Jakub Jelinek wrote:
>> @@ -545,8 +548,15 @@ struct gomp_task
>>        entries and the gomp_task in which they reside.  */
>>     struct priority_node pnode[3];
>>   
>> -  bool detach;
>> -  gomp_sem_t completion_sem;
>> +  union {
>> +    /* Valid only if deferred_p is false.  */
>> +    gomp_sem_t *completion_sem;
>> +    /* Valid only if deferred_p is true.  Set to the team that executes the
>> +       task if the task is detached and the completion event has yet to be
>> +       fulfilled.  */
>> +    struct gomp_team *detach_team;
>> +  };
>> +  bool deferred_p;
>>   
>>     struct gomp_task_icv icv;
>>     void (*fn) (void *);
> 
> What I don't like is that this creates too much wasteful padding
> in a struct that should be as small as possible.
> At least on 64-bit hosts which we care about most, pahole shows with your
> patch:
> struct gomp_task {
>          struct gomp_task *         parent;               /*     0     8 */
>          struct priority_queue      children_queue;       /*     8    32 */
>          struct gomp_taskgroup *    taskgroup;            /*    40     8 */
>          struct gomp_dependers_vec * dependers;           /*    48     8 */
>          struct htab *              depend_hash;          /*    56     8 */
>          /* --- cacheline 1 boundary (64 bytes) --- */
>          struct gomp_taskwait *     taskwait;             /*    64     8 */
>          size_t                     depend_count;         /*    72     8 */
>          size_t                     num_dependees;        /*    80     8 */
>          int                        priority;             /*    88     4 */
> 
>          /* XXX 4 bytes hole, try to pack */
> 
>          struct priority_node       pnode[3];             /*    96    48 */
>          /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
>          union {
>                  gomp_sem_t *       completion_sem;       /*   144     8 */
>                  struct gomp_team * detach_team;          /*   144     8 */
>          };                                               /*   144     8 */
>          _Bool                      deferred_p;           /*   152     1 */
> 
>          /* XXX 7 bytes hole, try to pack */
> 
>          struct gomp_task_icv       icv;                  /*   160    40 */
>          /* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
>          void                       (*fn)(void *);        /*   200     8 */
>          void *                     fn_data;              /*   208     8 */
>          enum gomp_task_kind        kind;                 /*   216     4 */
>          _Bool                      in_tied_task;         /*   220     1 */
>          _Bool                      final_task;           /*   221     1 */
>          _Bool                      copy_ctors_done;      /*   222     1 */
>          _Bool                      parent_depends_on;    /*   223     1 */
>          struct gomp_task_depend_entry depend[];          /*   224     0 */
> 
>          /* size: 224, cachelines: 4, members: 21 */
>          /* sum members: 213, holes: 2, sum holes: 11 */
>          /* last cacheline: 32 bytes */
> };
> 
> So perhaps it might be better to put the new fields before the int priority;
> field, in the order bool deferred_p; union { };
> That way, there will be just 3 bytes hole in the whole struct,
> not 4 + 7 byte holes.
>

Moving the fields in that order before priority results in the same holes due to 
the alignment requirement of the pointers:

         size_t                     num_dependees;        /*    80     8 */
         _Bool                      deferred_p;           /*    88     1 */

         /* XXX 7 bytes hole, try to pack */

         union {
                 gomp_sem_t *       completion_sem;       /*    96     8 */
                 struct gomp_team * detach_team;          /*    96     8 */
         };                                               /*    96     8 */
         int                        priority;             /*   104     4 */

         /* XXX 4 bytes hole, try to pack */

         struct priority_node pnode[3];                   /*   112    48 */

Reversing the order reduces the hole to 3 bytes:

         size_t                     num_dependees;        /*    80     8 */
         union {
                 gomp_sem_t *       completion_sem;       /*    88     8 */
                 struct gomp_team * detach_team;          /*    88     8 */
         };                                               /*    88     8 */
         _Bool                      deferred_p;           /*    96     1 */

         /* XXX 3 bytes hole, try to pack */

         int                        priority;             /*   100     4 */
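
The effect of the two orderings can be reproduced outside libgomp with a
cut-down pair of structs.  This is only a sketch: the struct names and
the reduced member list are made up for illustration, and the offsets in
the comments assume a typical LP64 target (8-byte pointers and size_t,
4-byte int).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag before the pointer: the pointer must be 8-byte aligned, so 7
   bytes of padding follow the flag, and another 4 follow the int.  */
struct bool_first
{
  size_t num_dependees;  /* offset 0 */
  bool deferred_p;       /* offset 8, then a 7-byte hole */
  void *detach_team;     /* offset 16 */
  int priority;          /* offset 24, then a 4-byte hole */
  void *next;            /* offset 32; total size 40 */
};

/* Pointer before the flag: the int only needs 4-byte alignment, so it
   can share the 8-byte slot with the flag, leaving a 3-byte hole.  */
struct ptr_first
{
  size_t num_dependees;  /* offset 0 */
  void *detach_team;     /* offset 8 */
  bool deferred_p;       /* offset 16, then a 3-byte hole */
  int priority;          /* offset 20 */
  void *next;            /* offset 24; total size 32 */
};

/* Bytes saved by the second ordering (8 under the LP64 assumption).  */
static inline size_t
padding_saved (void)
{
  return sizeof (struct bool_first) - sizeof (struct ptr_first);
}
```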

If we were really determined to get rid of the hole, we could stash deferred_p 
in the least-significant bit of the pointer union, but I think that might be 
more trouble than it is worth...
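
For the record, the pointer-tagging idea would look roughly like the
sketch below.  None of this is in the patch; the helper names are
hypothetical, and it relies on the assumption that the stored pointers
are at least 2-byte aligned so that bit 0 is always free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a pointer and a one-bit flag into a single word.  Assumes PTR is
   at least 2-byte aligned, i.e. its least-significant bit is zero.  */
static inline uintptr_t
tagged_pack (void *ptr, bool flag)
{
  return (uintptr_t) ptr | (uintptr_t) flag;
}

/* Recover the pointer by masking off the tag bit.  */
static inline void *
tagged_ptr (uintptr_t word)
{
  return (void *) (word & ~(uintptr_t) 1);
}

/* Recover the flag from the tag bit.  */
static inline bool
tagged_flag (uintptr_t word)
{
  return (word & 1) != 0;
}
```

Every access to the union would then have to go through such helpers,
which is the extra trouble referred to above.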

>>   
>> -      if (task.detach && !task_fulfilled_p (&task))
>> -	gomp_sem_wait (&task.completion_sem);
>> +      if ((flags & GOMP_TASK_FLAG_DETACH) != 0 && detach)
>> +	gomp_sem_wait (&completion_sem);
> 
> I think gomp_sem_destroy is missing here (inside the conditional, if the
> semaphore was only initialized conditionally).  Furthermore, I don't
> understand the && detach; the earlier code assumes that if
> (flags & GOMP_TASK_FLAG_DETACH) != 0 then it can dereference
> *(void **) detach, so the && detach seems to be unnecessary.

I have added a call to gomp_sem_destroy, and removed the redundant check for detach.

>> @@ -484,15 +483,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
>>         task->kind = GOMP_TASK_UNDEFERRED;
>>         task->in_tied_task = parent->in_tied_task;
>>         task->taskgroup = taskgroup;
>> +      task->deferred_p = true;
>>         if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
>>   	{
>> -	  task->detach = true;
>> -	  gomp_sem_init (&task->completion_sem, 0);
>> -	  *(void **) detach = &task->completion_sem;
> 
> I think you can move task->deferred_p into the if stmt.

That can be done (since the code for detach is currently the only thing using 
it), but I think it would be better to have deferred_p always have the right 
value, regardless of whether or not it is used? Otherwise that might lead to 
some confusion if it is later used by something else.

>> +  if (!shackled_thread_p
>> +      && !do_wake
>> +      && gomp_team_barrier_waiting_for_tasks (&team->barrier)
>> +      && team->task_detach_count == 0)
> 
> && team->task_detach_count == 0 is cheaper than the
>    && gomp_team_barrier_waiting_for_tasks (&team->barrier)
> so please swap those two.
>

Done.

>> +    {
>> +      /* Ensure that at least one thread is woken up to signal that the
>> +	 barrier can finish.  */
>> +      do_wake = 1;
>> +    }
> 
> Please drop the {}s around the single do_wake = 1; stmt.

Okay. I put the braces in because it looked a little odd with the comment.

> Otherwise LGTM.
> 

I will get this committed later if the regression tests finish with no surprises.

Thank you for your time in reviewing this patch!

Kwok

[-- Attachment #2: 0001-openmp-Fix-intermittent-hanging-of-task-detach-6-lib.patch --]
[-- Type: text/plain, Size: 41434 bytes --]

From 462c86549de28f28d5a71e4a7e83e2c17fd19c17 Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Thu, 21 Jan 2021 05:38:47 -0800
Subject: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp
 tests [PR98738]

This adds support for the task detach clause to taskwait and taskgroup, and
simplifies the handling of the detach clause by moving most of the extra
handling required for detach tasks to omp_fulfill_event.

2021-02-25  Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Jakub Jelinek  <jakub@redhat.com>

	libgomp/

	PR libgomp/98738
	* libgomp.h (enum gomp_task_kind): Add GOMP_TASK_DETACHED.
	(struct gomp_task): Replace detach and completion_sem fields with
	union containing completion_sem and detach_team.  Add deferred_p
	field.
	(struct gomp_team): Remove task_detach_queue.
	* task.c: Include assert.h.
	(gomp_init_task): Initialize deferred_p and detach_team fields.
	(task_fulfilled_p): Delete.
	(GOMP_task): Use address of task as the event handle.  Remove
	initialization of detach field.  Initialize deferred_p field.
	Use automatic local for completion_sem.  Initialize detach_team field
	for deferred tasks.
	(gomp_barrier_handle_tasks): Remove handling of task_detach_queue.
	Set kind of suspended detach task to GOMP_TASK_DETACHED and
	decrement task_running_count.  Move finish_cancelled block out of
	else branch.  Relocate call to gomp_team_barrier_done.
	(GOMP_taskwait): Handle tasks with completion events that have not
	been fulfilled.
	(GOMP_taskgroup_end): Likewise.
	(omp_fulfill_event): Use address of task as event handle.  Post to
	completion_sem for undeferred tasks.  Clear detach_team if task
	has not finished.  For finished tasks, handle post-execution tasks,
	call gomp_team_barrier_wake if necessary, and free task.
	* team.c (gomp_new_team): Remove initialization of task_detach_queue.
	(free_team): Remove free of task_detach_queue.
	* testsuite/libgomp.c-c++-common/task-detach-1.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-2.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-3.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-4.c: Fix formatting.
	* testsuite/libgomp.c-c++-common/task-detach-5.c: Fix formatting.
	Change data-sharing of detach events on enclosing parallel to private.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Likewise.  Remove
	taskwait directive.
	* testsuite/libgomp.c-c++-common/task-detach-7.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-8.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-9.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-10.c: New.
	* testsuite/libgomp.c-c++-common/task-detach-11.c: New.
	* testsuite/libgomp.fortran/task-detach-1.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-2.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-3.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-4.f90: Fix formatting.
	* testsuite/libgomp.fortran/task-detach-5.f90: Fix formatting.
	Change data-sharing of detach events on enclosing parallel to private.
	* testsuite/libgomp.fortran/task-detach-6.f90: Likewise.  Remove
	taskwait directive.
	* testsuite/libgomp.fortran/task-detach-7.f90: New.
	* testsuite/libgomp.fortran/task-detach-8.f90: New.
	* testsuite/libgomp.fortran/task-detach-9.f90: New.
	* testsuite/libgomp.fortran/task-detach-10.f90: New.
	* testsuite/libgomp.fortran/task-detach-11.f90: New.
---
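Note (not part of the commit message): the ordering rule for deferred
detach tasks described above can be sketched as a toy model.  This is only
an illustration under simplifying assumptions: the toy_* names are
hypothetical and are not the libgomp types, there is no task_lock, no
dependency or parent handling, and an invalid event returns false where
the real code calls gomp_fatal.  The finished task also clears detach_team
here, whereas the real code frees the task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the task kinds; not the libgomp enum.  */
enum toy_kind { TOY_RUNNING, TOY_DETACHED, TOY_DONE };

struct toy_team
{
  unsigned task_count;		/* Tasks not yet completed.  */
  unsigned task_detach_count;	/* Finished tasks awaiting their event.  */
};

struct toy_task
{
  enum toy_kind kind;
  /* Non-NULL while the completion event is still unfulfilled.  */
  struct toy_team *detach_team;
};

/* Final bookkeeping once both the body and the event are done.  */
void
toy_complete (struct toy_task *task, struct toy_team *team)
{
  assert (team->task_count > 0);
  task->kind = TOY_DONE;
  task->detach_team = NULL;
  team->task_count--;
}

/* Models what the barrier, taskwait and taskgroup loops do when a
   task body returns.  */
void
toy_body_finished (struct toy_task *task, struct toy_team *team)
{
  if (task->detach_team)
    {
      /* Event not fulfilled yet: park the task instead of finishing it.  */
      task->kind = TOY_DETACHED;
      team->task_detach_count++;
      return;
    }
  toy_complete (task, team);
}

/* Models the fulfill path for a deferred task.  Returns false for an
   invalid or already fulfilled event.  */
bool
toy_fulfill (struct toy_task *task)
{
  struct toy_team *team = task->detach_team;
  if (!team)
    return false;
  if (task->kind != TOY_DETACHED)
    {
      /* Body still running: record the fulfilment and let the
	 normal finish path complete the task later.  */
      task->detach_team = NULL;
      return true;
    }
  /* Body already finished: the fulfilling thread completes the task.  */
  team->task_detach_count--;
  toy_complete (task, team);
  return true;
}
```

Whichever of toy_body_finished and toy_fulfill runs second is the one that
completes the task, which is the property the taskwait and taskgroup
changes rely on.
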
 libgomp/libgomp.h                                  |  21 +-
 libgomp/task.c                                     | 237 ++++++++++++++-------
 libgomp/team.c                                     |   2 -
 .../testsuite/libgomp.c-c++-common/task-detach-1.c |   4 +-
 .../libgomp.c-c++-common/task-detach-10.c          |  45 ++++
 .../libgomp.c-c++-common/task-detach-11.c          |  13 ++
 .../testsuite/libgomp.c-c++-common/task-detach-2.c |   6 +-
 .../testsuite/libgomp.c-c++-common/task-detach-3.c |   6 +-
 .../testsuite/libgomp.c-c++-common/task-detach-4.c |   4 +-
 .../testsuite/libgomp.c-c++-common/task-detach-5.c |   8 +-
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |   8 +-
 .../testsuite/libgomp.c-c++-common/task-detach-7.c |  45 ++++
 .../testsuite/libgomp.c-c++-common/task-detach-8.c |  47 ++++
 .../testsuite/libgomp.c-c++-common/task-detach-9.c |  43 ++++
 .../testsuite/libgomp.fortran/task-detach-1.f90    |   4 +-
 .../testsuite/libgomp.fortran/task-detach-10.f90   |  44 ++++
 .../testsuite/libgomp.fortran/task-detach-11.f90   |  13 ++
 .../testsuite/libgomp.fortran/task-detach-2.f90    |   6 +-
 .../testsuite/libgomp.fortran/task-detach-3.f90    |   6 +-
 .../testsuite/libgomp.fortran/task-detach-4.f90    |   4 +-
 .../testsuite/libgomp.fortran/task-detach-5.f90    |   8 +-
 .../testsuite/libgomp.fortran/task-detach-6.f90    |  16 +-
 .../testsuite/libgomp.fortran/task-detach-7.f90    |  42 ++++
 .../testsuite/libgomp.fortran/task-detach-8.f90    |  45 ++++
 .../testsuite/libgomp.fortran/task-detach-9.f90    |  41 ++++
 25 files changed, 586 insertions(+), 132 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-10.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-11.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-7.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-8.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/task-detach-9.f90

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index b4d0c93..ef1bb49 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -481,7 +481,10 @@ enum gomp_task_kind
      but not yet completed.  Once that completes, they will be readded
      into the queues as GOMP_TASK_WAITING in order to perform the var
      unmapping.  */
-  GOMP_TASK_ASYNC_RUNNING
+  GOMP_TASK_ASYNC_RUNNING,
+  /* Task that has finished executing but is waiting for its
+     completion event to be fulfilled.  */
+  GOMP_TASK_DETACHED
 };
 
 struct gomp_task_depend_entry
@@ -537,6 +540,16 @@ struct gomp_task
      into the various queues to be scheduled.  */
   size_t num_dependees;
 
+  union {
+      /* Valid only if deferred_p is false.  */
+      gomp_sem_t *completion_sem;
+      /* Valid only if deferred_p is true.  Set to the team that executes the
+	 task if the task is detached and the completion event has yet to be
+	 fulfilled.  */
+      struct gomp_team *detach_team;
+    };
+  bool deferred_p;
+
   /* Priority of this task.  */
   int priority;
   /* The priority node for this task in each of the different queues.
@@ -545,9 +558,6 @@ struct gomp_task
      entries and the gomp_task in which they reside.  */
   struct priority_node pnode[3];
 
-  bool detach;
-  gomp_sem_t completion_sem;
-
   struct gomp_task_icv icv;
   void (*fn) (void *);
   void *fn_data;
@@ -688,8 +698,7 @@ struct gomp_team
   int work_share_cancelled;
   int team_cancelled;
 
-  /* Tasks waiting for their completion event to be fulfilled.  */
-  struct priority_queue task_detach_queue;
+  /* Number of tasks waiting for their completion event to be fulfilled.  */
   unsigned int task_detach_count;
 
   /* This array contains structures for implicit tasks.  */
diff --git a/libgomp/task.c b/libgomp/task.c
index b242e7c..d263e54 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -29,6 +29,7 @@
 #include "libgomp.h"
 #include <stdlib.h>
 #include <string.h>
+#include <assert.h>
 #include "gomp-constants.h"
 
 typedef struct gomp_task_depend_entry *hash_entry_type;
@@ -86,7 +87,8 @@ gomp_init_task (struct gomp_task *task, struct gomp_task *parent_task,
   task->dependers = NULL;
   task->depend_hash = NULL;
   task->depend_count = 0;
-  task->detach = false;
+  task->deferred_p = true;
+  task->detach_team = NULL;
 }
 
 /* Clean up a task, after completing it.  */
@@ -327,12 +329,6 @@ gomp_task_handle_depend (struct gomp_task *task, struct gomp_task *parent,
     }
 }
 
-static bool
-task_fulfilled_p (struct gomp_task *task)
-{
-  return gomp_sem_getcount (&task->completion_sem) > 0;
-}
-
 /* Called when encountering an explicit task directive.  If IF_CLAUSE is
    false, then we must not delay in executing the task.  If UNTIED is true,
    then the task may be executed by any member of the team.
@@ -398,6 +394,7 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       || team->task_count > 64 * team->nthreads)
     {
       struct gomp_task task;
+      gomp_sem_t completion_sem;
 
       /* If there are depend clauses and earlier deferred sibling tasks
 	 with depend clauses, check if there isn't a dependency.  If there
@@ -414,16 +411,18 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       task.final_task = (thr->task && thr->task->final_task)
 			|| (flags & GOMP_TASK_FLAG_FINAL);
       task.priority = priority;
+      task.deferred_p = false;
 
       if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
 	{
-	  task.detach = true;
-	  gomp_sem_init (&task.completion_sem, 0);
-	  *(void **) detach = &task.completion_sem;
+	  gomp_sem_init (&completion_sem, 0);
+	  task.completion_sem = &completion_sem;
+	  *(void **) detach = &task;
 	  if (data)
-	    *(void **) data = &task.completion_sem;
+	    *(void **) data = &task;
 
-	  gomp_debug (0, "New event: %p\n", &task.completion_sem);
+	  gomp_debug (0, "Thread %d: new event: %p\n",
+		      thr->ts.team_id, &task);
 	}
 
       if (thr->task)
@@ -443,8 +442,11 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       else
 	fn (data);
 
-      if (task.detach && !task_fulfilled_p (&task))
-	gomp_sem_wait (&task.completion_sem);
+      if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
+	{
+	  gomp_sem_wait (&completion_sem);
+	  gomp_sem_destroy (&completion_sem);
+	}
 
       /* Access to "children" is normally done inside a task_lock
 	 mutex region, but the only way this particular task.children
@@ -484,15 +486,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       task->kind = GOMP_TASK_UNDEFERRED;
       task->in_tied_task = parent->in_tied_task;
       task->taskgroup = taskgroup;
+      task->deferred_p = true;
       if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
 	{
-	  task->detach = true;
-	  gomp_sem_init (&task->completion_sem, 0);
-	  *(void **) detach = &task->completion_sem;
+	  task->detach_team = team;
+
+	  *(void **) detach = task;
 	  if (data)
-	    *(void **) data = &task->completion_sem;
+	    *(void **) data = task;
 
-	  gomp_debug (0, "New event: %p\n", &task->completion_sem);
+	  gomp_debug (0, "Thread %d: new event: %p\n", thr->ts.team_id, task);
 	}
       thr->task = task;
       if (cpyfn)
@@ -1362,27 +1365,6 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
     {
       bool cancelled = false;
 
-      /* Look for a queued detached task with a fulfilled completion event
-	 that is ready to finish.  */
-      child_task = priority_queue_find (PQ_TEAM, &team->task_detach_queue,
-					task_fulfilled_p);
-      if (child_task)
-	{
-	  priority_queue_remove (PQ_TEAM, &team->task_detach_queue,
-				 child_task, MEMMODEL_RELAXED);
-	  --team->task_detach_count;
-	  gomp_debug (0, "thread %d: found task with fulfilled event %p\n",
-		      thr->ts.team_id, &child_task->completion_sem);
-
-	if (to_free)
-	  {
-	    gomp_finish_task (to_free);
-	    free (to_free);
-	    to_free = NULL;
-	  }
-	  goto finish_cancelled;
-	}
-
       if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))
 	{
 	  bool ignored;
@@ -1405,6 +1387,19 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
 	  team->task_running_count++;
 	  child_task->in_tied_task = true;
 	}
+      else if (team->task_count == 0
+	       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
+	{
+	  gomp_team_barrier_done (&team->barrier, state);
+	  gomp_mutex_unlock (&team->task_lock);
+	  gomp_team_barrier_wake (&team->barrier, 0);
+	  if (to_free)
+	    {
+	      gomp_finish_task (to_free);
+	      free (to_free);
+	    }
+	  return;
+	}
       gomp_mutex_unlock (&team->task_lock);
       if (do_wake)
 	{
@@ -1450,44 +1445,37 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
-	  if (child_task->detach && !task_fulfilled_p (child_task))
+	  if (child_task->detach_team)
 	    {
-	      priority_queue_insert (PQ_TEAM, &team->task_detach_queue,
-				     child_task, child_task->priority,
-				     PRIORITY_INSERT_END,
-				     false, false);
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
 	      ++team->task_detach_count;
-	      gomp_debug (0, "thread %d: queueing task with event %p\n",
-			  thr->ts.team_id, &child_task->completion_sem);
+	      --team->task_running_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in team barrier\n",
+			  thr->ts.team_id, child_task);
 	      child_task = NULL;
+	      continue;
 	    }
-	  else
+
+	 finish_cancelled:;
+	  size_t new_tasks
+	    = gomp_task_run_post_handle_depend (child_task, team);
+	  gomp_task_run_post_remove_parent (child_task);
+	  gomp_clear_parent (&child_task->children_queue);
+	  gomp_task_run_post_remove_taskgroup (child_task);
+	  to_free = child_task;
+	  if (!cancelled)
+	    team->task_running_count--;
+	  child_task = NULL;
+	  if (new_tasks > 1)
 	    {
-	     finish_cancelled:;
-	      size_t new_tasks
-		= gomp_task_run_post_handle_depend (child_task, team);
-	      gomp_task_run_post_remove_parent (child_task);
-	      gomp_clear_parent (&child_task->children_queue);
-	      gomp_task_run_post_remove_taskgroup (child_task);
-	      to_free = child_task;
-	      child_task = NULL;
-	      if (!cancelled)
-		team->task_running_count--;
-	      if (new_tasks > 1)
-		{
-		  do_wake = team->nthreads - team->task_running_count;
-		  if (do_wake > new_tasks)
-		    do_wake = new_tasks;
-		}
-	      if (--team->task_count == 0
-		  && gomp_team_barrier_waiting_for_tasks (&team->barrier))
-		{
-		  gomp_team_barrier_done (&team->barrier, state);
-		  gomp_mutex_unlock (&team->task_lock);
-		  gomp_team_barrier_wake (&team->barrier, 0);
-		  gomp_mutex_lock (&team->task_lock);
-		}
+	      do_wake = team->nthreads - team->task_running_count;
+	      if (do_wake > new_tasks)
+		do_wake = new_tasks;
 	    }
+	  --team->task_count;
 	}
     }
 }
@@ -1559,7 +1547,8 @@ GOMP_taskwait (void)
       else
 	{
 	/* All tasks we are waiting for are either running in other
-	   threads, or they are tasks that have not had their
+	   threads, are detached and waiting for the completion event to be
+	   fulfilled, or they are tasks that have not had their
 	   dependencies met (so they're not even in the queue).  Wait
 	   for them.  */
 	  if (task->taskwait == NULL)
@@ -1614,6 +1603,19 @@ GOMP_taskwait (void)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
+	  if (child_task->detach_team)
+	    {
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
+	      ++team->task_detach_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in taskwait\n",
+			  thr->ts.team_id, child_task);
+	      child_task = NULL;
+	      continue;
+	    }
+
 	 finish_cancelled:;
 	  size_t new_tasks
 	    = gomp_task_run_post_handle_depend (child_task, team);
@@ -2069,6 +2071,19 @@ GOMP_taskgroup_end (void)
       gomp_mutex_lock (&team->task_lock);
       if (child_task)
 	{
+	  if (child_task->detach_team)
+	    {
+	      assert (child_task->detach_team == team);
+	      child_task->kind = GOMP_TASK_DETACHED;
+	      ++team->task_detach_count;
+	      gomp_debug (0,
+			  "thread %d: task with event %p finished without "
+			  "completion event fulfilled in taskgroup\n",
+			  thr->ts.team_id, child_task);
+	      child_task = NULL;
+	      continue;
+	    }
+
 	 finish_cancelled:;
 	  size_t new_tasks
 	    = gomp_task_run_post_handle_depend (child_task, team);
@@ -2402,17 +2417,75 @@ ialias (omp_in_final)
 void
 omp_fulfill_event (omp_event_handle_t event)
 {
-  gomp_sem_t *sem = (gomp_sem_t *) event;
-  struct gomp_thread *thr = gomp_thread ();
-  struct gomp_team *team = thr ? thr->ts.team : NULL;
+  struct gomp_task *task = (struct gomp_task *) event;
+  if (!task->deferred_p)
+    {
+      if (gomp_sem_getcount (task->completion_sem) > 0)
+	gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", task);
 
-  if (gomp_sem_getcount (sem) > 0)
-    gomp_fatal ("omp_fulfill_event: %p event already fulfilled!\n", sem);
+      gomp_debug (0, "omp_fulfill_event: %p event for undeferred task\n",
+		  task);
+      gomp_sem_post (task->completion_sem);
+      return;
+    }
 
-  gomp_debug (0, "omp_fulfill_event: %p\n", sem);
-  gomp_sem_post (sem);
-  if (team)
-    gomp_team_barrier_wake (&team->barrier, 1);
+  struct gomp_team *team = __atomic_load_n (&task->detach_team,
+					    MEMMODEL_RELAXED);
+  if (!team)
+    gomp_fatal ("omp_fulfill_event: %p event is invalid or has already "
+		"been fulfilled!\n", task);
+
+  gomp_mutex_lock (&team->task_lock);
+  if (task->kind != GOMP_TASK_DETACHED)
+    {
+      /* The task has not finished running yet.  */
+      gomp_debug (0,
+		  "omp_fulfill_event: %p event fulfilled for unfinished "
+		  "task\n", task);
+      __atomic_store_n (&task->detach_team, NULL, MEMMODEL_RELAXED);
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+
+  gomp_debug (0, "omp_fulfill_event: %p event fulfilled for finished task\n",
+	      task);
+  size_t new_tasks = gomp_task_run_post_handle_depend (task, team);
+  gomp_task_run_post_remove_parent (task);
+  gomp_clear_parent (&task->children_queue);
+  gomp_task_run_post_remove_taskgroup (task);
+  team->task_count--;
+  team->task_detach_count--;
+
+  int do_wake = 0;
+  bool shackled_thread_p = team == gomp_thread ()->ts.team;
+  if (new_tasks > 0)
+    {
+      /* Wake up threads to run new tasks.  */
+      do_wake = team->nthreads - team->task_running_count;
+      if (do_wake > new_tasks)
+	do_wake = new_tasks;
+    }
+
+  if (!shackled_thread_p
+      && !do_wake
+      && team->task_detach_count == 0
+      && gomp_team_barrier_waiting_for_tasks (&team->barrier))
+    /* Ensure that at least one thread is woken up to signal that the
+       barrier can finish.  */
+    do_wake = 1;
+
+  /* If we are running in an unshackled thread, the team might vanish before
+     gomp_team_barrier_wake is run if we release the lock first, so keep the
+     lock for the call in that case.  */
+  if (shackled_thread_p)
+    gomp_mutex_unlock (&team->task_lock);
+  if (do_wake)
+    gomp_team_barrier_wake (&team->barrier, do_wake);
+  if (!shackled_thread_p)
+    gomp_mutex_unlock (&team->task_lock);
+
+  gomp_finish_task (task);
+  free (task);
 }
 
 ialias (omp_fulfill_event)
diff --git a/libgomp/team.c b/libgomp/team.c
index 0f3707c..9662234 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -206,7 +206,6 @@ gomp_new_team (unsigned nthreads)
   team->work_share_cancelled = 0;
   team->team_cancelled = 0;
 
-  priority_queue_init (&team->task_detach_queue);
   team->task_detach_count = 0;
 
   return team;
@@ -224,7 +223,6 @@ free_team (struct gomp_team *team)
   gomp_barrier_destroy (&team->barrier);
   gomp_mutex_destroy (&team->task_lock);
   priority_queue_free (&team->task_queue);
-  priority_queue_free (&team->task_detach_queue);
   team_free (team);
 }
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
index 8583e37..14932b0 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-1.c
@@ -14,10 +14,10 @@ int main (void)
   #pragma omp parallel
     #pragma omp single
     {
-      #pragma omp task detach(detach_event1)
+      #pragma omp task detach (detach_event1)
 	x++;
 
-      #pragma omp task detach(detach_event2)
+      #pragma omp task detach (detach_event2)
       {
 	y++;
 	omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
new file mode 100644
index 0000000..10d6746
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-10.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause on an offload device.  Each device
+   thread spawns off a chain of tasks in a taskgroup, that can then
+   be executed by any available thread.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
+      #pragma omp taskgroup
+	{
+	  #pragma omp single
+	    thread_count = omp_get_num_threads ();
+
+	  #pragma omp task detach (detach_event1) untied
+	    #pragma omp atomic update
+	      x++;
+
+	  #pragma omp task detach (detach_event2) untied
+	  {
+	    #pragma omp atomic update
+	      y++;
+	    omp_fulfill_event (detach_event1);
+	  }
+
+	  #pragma omp task untied
+	  {
+	    #pragma omp atomic update
+	      z++;
+	    omp_fulfill_event (detach_event2);
+	  }
+	}
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
new file mode 100644
index 0000000..dd002dc
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-11.c
@@ -0,0 +1,13 @@
+/* { dg-do run } */
+
+#include <omp.h>
+
+/* Test the detach clause when the task is undeferred.  */
+
+int main (void)
+{
+  omp_event_handle_t event;
+
+  #pragma omp task detach (event)
+    omp_fulfill_event (event);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
index 943ac2a..3e33c40 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-2.c
@@ -12,13 +12,13 @@ int main (void)
   omp_event_handle_t detach_event1, detach_event2;
   int x = 0, y = 0, z = 0;
 
-  #pragma omp parallel num_threads(1)
+  #pragma omp parallel num_threads (1)
     #pragma omp single
     {
-      #pragma omp task detach(detach_event1)
+      #pragma omp task detach (detach_event1)
 	x++;
 
-      #pragma omp task detach(detach_event2)
+      #pragma omp task detach (detach_event2)
       {
 	y++;
 	omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
index 2609fb1..c85857d 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-3.c
@@ -14,16 +14,16 @@ int main (void)
   #pragma omp parallel
     #pragma omp single
     {
-      #pragma omp task depend(out:dep) detach(detach_event)
+      #pragma omp task depend (out:dep) detach (detach_event)
 	x++;
 
       #pragma omp task
       {
 	y++;
-	omp_fulfill_event(detach_event);
+	omp_fulfill_event (detach_event);
       }
 
-      #pragma omp task depend(in:dep)
+      #pragma omp task depend (in:dep)
 	z++;
     }
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
index eeb9554..cd0d2b3 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-4.c
@@ -14,10 +14,10 @@ int main (void)
 
   #pragma omp parallel
     #pragma omp single
-      #pragma omp task detach(detach_event)
+      #pragma omp task detach (detach_event)
       {
 	x++;
-	omp_fulfill_event(detach_event);
+	omp_fulfill_event (detach_event);
       }
 
   assert (x == 1);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
index 5a01517..382f377 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-5.c
@@ -12,16 +12,16 @@ int main (void)
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
 
-  #pragma omp parallel firstprivate(detach_event1, detach_event2)
+  #pragma omp parallel private (detach_event1, detach_event2)
   {
     #pragma omp single
-      thread_count = omp_get_num_threads();
+      thread_count = omp_get_num_threads ();
 
-    #pragma omp task detach(detach_event1) untied
+    #pragma omp task detach (detach_event1) untied
       #pragma omp atomic update
 	x++;
 
-    #pragma omp task detach(detach_event2) untied
+    #pragma omp task detach (detach_event2) untied
     {
       #pragma omp atomic update
 	y++;
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index b5f68cc..e5c2291 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -13,11 +13,11 @@ int main (void)
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
 
-  #pragma omp target map(tofrom: x, y, z) map(from: thread_count)
-    #pragma omp parallel firstprivate(detach_event1, detach_event2)
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
       {
 	#pragma omp single
-	  thread_count = omp_get_num_threads();
+	  thread_count = omp_get_num_threads ();
 
 	#pragma omp task detach(detach_event1) untied
 	  #pragma omp atomic update
@@ -36,8 +36,6 @@ int main (void)
 	    z++;
 	  omp_fulfill_event (detach_event2);
 	}
-
-	#pragma omp taskwait
       }
 
   assert (x == thread_count);
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
new file mode 100644
index 0000000..3f025d6
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-7.c
@@ -0,0 +1,45 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause.  Each thread spawns off a chain of tasks,
+   that can then be executed by any available thread.  Each thread uses
+   taskwait to wait for the child tasks to complete.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp parallel private (detach_event1, detach_event2)
+  {
+    #pragma omp single
+      thread_count = omp_get_num_threads ();
+
+    #pragma omp task detach (detach_event1) untied
+      #pragma omp atomic update
+	x++;
+
+    #pragma omp task detach (detach_event2) untied
+    {
+      #pragma omp atomic update
+	y++;
+      omp_fulfill_event (detach_event1);
+    }
+
+    #pragma omp task untied
+    {
+      #pragma omp atomic update
+	z++;
+      omp_fulfill_event (detach_event2);
+    }
+
+    #pragma omp taskwait
+  }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
new file mode 100644
index 0000000..6f77f12
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-8.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause on an offload device.  Each device
+   thread spawns off a chain of tasks, that can then be executed by
+   any available thread.  Each thread uses taskwait to wait for the
+   child tasks to complete.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp target map (tofrom: x, y, z) map (from: thread_count)
+    #pragma omp parallel private (detach_event1, detach_event2)
+      {
+	#pragma omp single
+	  thread_count = omp_get_num_threads ();
+
+	#pragma omp task detach (detach_event1) untied
+	  #pragma omp atomic update
+	    x++;
+
+	#pragma omp task detach (detach_event2) untied
+	{
+	  #pragma omp atomic update
+	    y++;
+	  omp_fulfill_event (detach_event1);
+	}
+
+	#pragma omp task untied
+	{
+	  #pragma omp atomic update
+	    z++;
+	  omp_fulfill_event (detach_event2);
+	}
+
+	#pragma omp taskwait
+      }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
new file mode 100644
index 0000000..5316ca5
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-9.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <assert.h>
+
+/* Test tasks with detach clause.  Each thread spawns off a chain of tasks
+   in a taskgroup, that can then be executed by any available thread.  */
+
+int main (void)
+{
+  int x = 0, y = 0, z = 0;
+  int thread_count;
+  omp_event_handle_t detach_event1, detach_event2;
+
+  #pragma omp parallel private (detach_event1, detach_event2)
+    #pragma omp taskgroup
+    {
+      #pragma omp single
+	thread_count = omp_get_num_threads ();
+
+      #pragma omp task detach (detach_event1) untied
+	#pragma omp atomic update
+	  x++;
+
+      #pragma omp task detach (detach_event2) untied
+      {
+	#pragma omp atomic update
+	  y++;
+	omp_fulfill_event (detach_event1);
+      }
+
+      #pragma omp task untied
+      {
+	#pragma omp atomic update
+	  z++;
+	omp_fulfill_event (detach_event2);
+      }
+    }
+
+  assert (x == thread_count);
+  assert (y == thread_count);
+  assert (z == thread_count);
+}
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-1.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
index 217bf65..c53b1ca 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-1.f90
@@ -11,11 +11,11 @@ program task_detach_1
 
   !$omp parallel
     !$omp single
-      !$omp task detach(detach_event1)
+      !$omp task detach (detach_event1)
         x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2)
+      !$omp task detach (detach_event2)
         y = y + 1
 	call omp_fulfill_event (detach_event1)
       !$omp end task
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-10.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-10.f90
new file mode 100644
index 0000000..61f0ea8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-10.f90
@@ -0,0 +1,44 @@
+! { dg-do run }
+
+! Test tasks with detach clause on an offload device.  Each device
+! thread spawns off a chain of tasks in a taskgroup, that can then
+! be executed by any available thread.
+
+program task_detach_10
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
+      !$omp taskgroup
+	!$omp single
+	  thread_count = omp_get_num_threads ()
+	!$omp end single
+
+	!$omp task detach (detach_event1) untied
+	  !$omp atomic update
+	    x = x + 1
+	!$omp end task
+
+	!$omp task detach (detach_event2) untied
+	  !$omp atomic update
+	    y = y + 1
+	  call omp_fulfill_event (detach_event1)
+	!$omp end task
+
+	!$omp task untied
+	  !$omp atomic update
+	    z = z + 1
+	  call omp_fulfill_event (detach_event2)
+	!$omp end task
+      !$omp end taskgroup
+    !$omp end parallel
+  !$omp end target
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-11.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-11.f90
new file mode 100644
index 0000000..b33baff
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-11.f90
@@ -0,0 +1,13 @@
+! { dg-do run }
+
+! Test the detach clause when the task is undeferred.
+
+program task_detach_11
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event
+
+  !$omp task detach (detach_event)
+    call omp_fulfill_event (detach_event)
+  !$omp end task
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-2.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
index ecb4829..68e3ff2 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-2.f90
@@ -10,13 +10,13 @@ program task_detach_2
   integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
   integer :: x = 0, y = 0, z = 0
 
-  !$omp parallel num_threads(1)
+  !$omp parallel num_threads (1)
     !$omp single
-      !$omp task detach(detach_event1)
+      !$omp task detach (detach_event1)
         x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2)
+      !$omp task detach (detach_event2)
         y = y + 1
 	call omp_fulfill_event (detach_event1)
       !$omp end task
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-3.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
index bdf93a5..5ac68d5 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-3.f90
@@ -12,16 +12,16 @@ program task_detach_3
 
   !$omp parallel
     !$omp single
-      !$omp task depend(out:dep) detach(detach_event)
+      !$omp task depend (out:dep) detach (detach_event)
         x = x + 1
       !$omp end task
 
       !$omp task
         y = y + 1
-	call omp_fulfill_event(detach_event)
+	call omp_fulfill_event (detach_event)
       !$omp end task
 
-      !$omp task depend(in:dep)
+      !$omp task depend (in:dep)
         z = z + 1
       !$omp end task
     !$omp end single
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-4.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
index 6d0843c..159624c 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-4.f90
@@ -11,9 +11,9 @@ program task_detach_4
 
   !$omp parallel
     !$omp single
-      !$omp task detach(detach_event)
+      !$omp task detach (detach_event)
         x = x + 1
-	call omp_fulfill_event(detach_event)
+	call omp_fulfill_event (detach_event)
       !$omp end task
     !$omp end single
   !$omp end parallel
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
index 955d687..95bd132 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-5.f90
@@ -10,17 +10,17 @@ program task_detach_5
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  !$omp parallel firstprivate(detach_event1, detach_event2)
+  !$omp parallel private (detach_event1, detach_event2)
     !$omp single
-      thread_count = omp_get_num_threads()
+      thread_count = omp_get_num_threads ()
     !$omp end single
 
-    !$omp task detach(detach_event1) untied
+    !$omp task detach (detach_event1) untied
       !$omp atomic update
 	x = x + 1
     !$omp end task
 
-    !$omp task detach(detach_event2) untied
+    !$omp task detach (detach_event2) untied
       !$omp atomic update
 	y = y + 1
       call omp_fulfill_event (detach_event1);
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index 0fe2155..b2c476f 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -11,30 +11,28 @@ program task_detach_6
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  !$omp target map(tofrom: x, y, z) map(from: thread_count)
-    !$omp parallel firstprivate(detach_event1, detach_event2)
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
       !$omp single
-	thread_count = omp_get_num_threads()
+	thread_count = omp_get_num_threads ()
       !$omp end single
 
-      !$omp task detach(detach_event1) untied
+      !$omp task detach (detach_event1) untied
 	!$omp atomic update
 	  x = x + 1
       !$omp end task
 
-      !$omp task detach(detach_event2) untied
+      !$omp task detach (detach_event2) untied
 	!$omp atomic update
 	  y = y + 1
-	call omp_fulfill_event (detach_event1);
+	call omp_fulfill_event (detach_event1)
       !$omp end task
 
       !$omp task untied
 	!$omp atomic update
 	  z = z + 1
-	call omp_fulfill_event (detach_event2);
+	call omp_fulfill_event (detach_event2)
       !$omp end task
-
-      !$omp taskwait
     !$omp end parallel
   !$omp end target
 
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-7.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-7.f90
new file mode 100644
index 0000000..32e715e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-7.f90
@@ -0,0 +1,42 @@
+! { dg-do run }
+
+! Test tasks with detach clause.  Each thread spawns off a chain of tasks,
+! that can then be executed by any available thread.  Each thread uses
+! taskwait to wait for the child tasks to complete.
+
+program task_detach_7
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp parallel private (detach_event1, detach_event2)
+    !$omp single
+      thread_count = omp_get_num_threads()
+    !$omp end single
+
+    !$omp task detach (detach_event1) untied
+      !$omp atomic update
+	x = x + 1
+    !$omp end task
+
+    !$omp task detach (detach_event2) untied
+      !$omp atomic update
+	y = y + 1
+      call omp_fulfill_event (detach_event1)
+    !$omp end task
+
+    !$omp task untied
+      !$omp atomic update
+	z = z + 1
+      call omp_fulfill_event (detach_event2)
+    !$omp end task
+
+    !$omp taskwait
+  !$omp end parallel
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-8.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-8.f90
new file mode 100644
index 0000000..e760eab
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-8.f90
@@ -0,0 +1,45 @@
+! { dg-do run }
+
+! Test tasks with detach clause on an offload device.  Each device
+! thread spawns off a chain of tasks, that can then be executed by
+! any available thread.  Each thread uses taskwait to wait for the
+! child tasks to complete.
+
+program task_detach_8
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp target map (tofrom: x, y, z) map (from: thread_count)
+    !$omp parallel private (detach_event1, detach_event2)
+      !$omp single
+	thread_count = omp_get_num_threads ()
+      !$omp end single
+
+      !$omp task detach (detach_event1) untied
+	!$omp atomic update
+	  x = x + 1
+      !$omp end task
+
+      !$omp task detach (detach_event2) untied
+	!$omp atomic update
+	  y = y + 1
+	call omp_fulfill_event (detach_event1)
+      !$omp end task
+
+      !$omp task untied
+	!$omp atomic update
+	  z = z + 1
+	call omp_fulfill_event (detach_event2)
+      !$omp end task
+
+      !$omp taskwait
+    !$omp end parallel
+  !$omp end target
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-9.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-9.f90
new file mode 100644
index 0000000..540c6de
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-9.f90
@@ -0,0 +1,41 @@
+! { dg-do run }
+
+! Test tasks with detach clause.  Each thread spawns off a chain of tasks
+! in a taskgroup, that can then be executed by any available thread.
+
+program task_detach_9
+  use omp_lib
+
+  integer (kind=omp_event_handle_kind) :: detach_event1, detach_event2
+  integer :: x = 0, y = 0, z = 0
+  integer :: thread_count
+
+  !$omp parallel private (detach_event1, detach_event2)
+    !$omp taskgroup
+      !$omp single
+	thread_count = omp_get_num_threads ()
+      !$omp end single
+
+      !$omp task detach (detach_event1) untied
+	!$omp atomic update
+	  x = x + 1
+      !$omp end task
+
+      !$omp task detach (detach_event2) untied
+	!$omp atomic update
+	  y = y + 1
+	call omp_fulfill_event (detach_event1);
+      !$omp end task
+
+      !$omp task untied
+	!$omp atomic update
+	  z = z + 1
+	call omp_fulfill_event (detach_event2);
+      !$omp end task
+    !$omp end taskgroup
+  !$omp end parallel
+
+  if (x /= thread_count) stop 1
+  if (y /= thread_count) stop 2
+  if (z /= thread_count) stop 3
+end program
-- 
2.8.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-25 16:21           ` Kwok Cheung Yeung
@ 2021-02-25 16:38             ` Jakub Jelinek
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Jelinek @ 2021-02-25 16:38 UTC (permalink / raw)
  To: Kwok Cheung Yeung; +Cc: GCC Patches

On Thu, Feb 25, 2021 at 04:21:31PM +0000, Kwok Cheung Yeung wrote:
> Reversing the order reduces the hole to 3 bytes:
> 
>         size_t                     num_dependees;        /*    80     8 */
>         union {
>                 gomp_sem_t *       completion_sem;       /*    88     8 */
>                 struct gomp_team * detach_team;          /*    88     8 */
>         };                                               /*    88     8 */
>         _Bool                      deferred_p;           /*    96     1 */
> 
>         /* XXX 3 bytes hole, try to pack */
> 
>         int                        priority;             /*   100     4 */
> 
> If we were really determined to get rid of the hole, we could stash
> deferred_p in the least-significant bit of the pointer union, but I think

Sorry, indeed, I was thinking about how it would be packed after priority,
not before it, but putting it in between priority and the related prio queues
is undesirable.  So union, deferred_p, priority is the right order.
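
To make the padding arithmetic above concrete, here is a minimal sketch,
not the real struct gomp_task.  The names layout_sketch, sem_sketch,
team_sketch and hole_before_priority are hypothetical; only the member
order (pointer union, deferred_p, priority) is taken from the quoted
layout dump.  It assumes a C11 compiler and an ABI where bool occupies 1
byte and int is 4-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opaque stand-ins for gomp_sem_t and struct gomp_team.  */
struct sem_sketch;
struct team_sketch;

/* Reduced stand-in that mirrors only the member order under discussion:
   the pointer union, then deferred_p, then priority.  */
struct layout_sketch
{
  size_t num_dependees;
  union
  {
    struct sem_sketch *completion_sem;
    struct team_sketch *detach_team;
  };
  bool deferred_p;
  int priority;
};

/* Number of padding bytes the compiler inserts between deferred_p and
   priority so that priority lands on an int-aligned offset.  */
size_t
hole_before_priority (void)
{
  size_t end_of_flag = offsetof (struct layout_sketch, deferred_p)
		       + sizeof (bool);
  return offsetof (struct layout_sketch, priority) - end_of_flag;
}
```

Under the stated assumptions hole_before_priority () evaluates to 3,
which matches the "3 bytes hole" annotation in the dump quoted above.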

> > I think you can move task->deferred_p into the if stmt.
> 
> That can be done (since the code for detach is currently the only thing
> using it), but I think it would be better to have deferred_p always have the
> right value, regardless of whether or not it is used? Otherwise that might
> lead to some confusion if it is later used by something else.

Ok either way.

> I will get this committed later if the regression tests finish with no surprises.

Two more nits below.

> @@ -86,7 +87,8 @@ gomp_init_task (struct gomp_task *task, struct gomp_task *parent_task,
>    task->dependers = NULL;
>    task->depend_hash = NULL;
>    task->depend_count = 0;
> -  task->detach = false;
> +  task->deferred_p = true;
> +  task->detach_team = NULL;
>  }

Please initialize deferred_p to false rather than true: gomp_init_task is
called in many places and, except for that one spot in GOMP_task (and one in
taskloop.c), the tasks are undeferred (e.g. the implicit tasks in parallel,
the initial one, etc.).
Maybe also reorder the fields initialized there to match the order of
increasing field offsets.

> @@ -414,16 +411,18 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
>        task.final_task = (thr->task && thr->task->final_task)
>  			|| (flags & GOMP_TASK_FLAG_FINAL);
>        task.priority = priority;
> +      task.deferred_p = false;

And then this shouldn't be here; gomp_init_task has already initialized it
that way.
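
The two nits above amount to the following pattern, shown here as a
minimal sketch rather than the actual libgomp code.  struct task_sketch,
init_task_sketch, setup_undeferred and setup_deferred are hypothetical
names; the real gomp_init_task and GOMP_task carry far more state.  The
common initializer defaults to "undeferred" and writes the fields in
increasing-offset order, and only the deferred path overrides the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct gomp_task.  */
struct task_sketch
{
  void *detach_team;
  bool deferred_p;
  int priority;
};

/* Common initializer: fields are written in increasing-offset order and
   the default is "undeferred", since most callers create such tasks.  */
void
init_task_sketch (struct task_sketch *task)
{
  task->detach_team = NULL;
  task->deferred_p = false;
  task->priority = 0;
}

/* Undeferred path: no need to touch deferred_p again, the initializer
   has already set it to false.  */
void
setup_undeferred (struct task_sketch *task, int priority)
{
  init_task_sketch (task);
  task->priority = priority;
}

/* Deferred path: the one place that flips the flag to true.  */
void
setup_deferred (struct task_sketch *task, int priority)
{
  init_task_sketch (task);
  task->priority = priority;
  task->deferred_p = true;
}
```

With this shape a new caller of the initializer gets the common
(undeferred) value without having to remember the flag at all.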

	Jakub


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-02-23 21:52       ` Jakub Jelinek
@ 2021-03-11 16:52         ` Thomas Schwinge
  2021-03-25 12:02           ` Thomas Schwinge
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Schwinge @ 2021-03-11 16:52 UTC (permalink / raw)
  To: Jakub Jelinek, Kwok Cheung Yeung; +Cc: GCC Patches, Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 1857 bytes --]

Hi!

On 2021-02-23T22:52:38+0100, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On Tue, Feb 23, 2021 at 09:43:51PM +0000, Kwok Cheung Yeung wrote:
>> On 19/02/2021 7:12 pm, Kwok Cheung Yeung wrote:
>> > I have included the current state of my patch. All task-detach-* tests
>> > pass when executed without offloading or with offloading to GCN, but
>> > with offloading to Nvidia, task-detach-6.* hangs consistently but
>> > everything else passes (probably because of the missing
>> > gomp_team_barrier_done?).
>>
>> It looks like the hang has nothing to do with the detach patch - this hangs
>> consistently for me when offloaded to NVPTX:
>>
>> #include <omp.h>
>>
>> int main (void)
>> {
>> #pragma omp target
>>   #pragma omp parallel
>>     #pragma omp task
>>       ;
>> }
>>
>> This doesn't hang when offloaded to GCN or the host device, or if
>> num_threads(1) is specified on the omp parallel.

So, I reproduced this the hard way;
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98738#c13> :-/

Please always file issues when you run into such things.  I've now filed
PR99555 "[OpenMP/nvptx] Execution-time hang for simple nested OpenMP
'target'/'parallel'/'task' constructs".

> Then it can be solved separately, I'll try to have a look if I see something
> bad from the dumps, but I admit I don't have much experience with debugging
> NVPTX offloaded code...

Any luck?


Until this gets resolved properly, OK to push something like the attached
(currently testing) "Avoid OpenMP/nvptx execution-time hangs for simple
nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]"?


Grüße
 Thomas


-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Avoid-OpenMP-nvptx-execution-time-hangs-for-simple-n.patch --]
[-- Type: text/x-diff, Size: 4404 bytes --]

From 61cb5c237ec3a402696797e5459f181d501cd0fb Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 11 Mar 2021 17:01:22 +0100
Subject: [PATCH] Avoid OpenMP/nvptx execution-time hangs for simple nested
 OpenMP 'target'/'parallel'/'task' constructs [PR99555]

... awaiting proper resolution, of course.

	libgomp/
	PR99555
	* testsuite/lib/on_device_arch.c: New file.
	* testsuite/libgomp.c/pr99555-1.c: Likewise.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Until resolved,
	skip for nvptx offloading, with error status.
	* testsuite/libgomp.fortran/task-detach-6.f90: Likewise.
---
 libgomp/testsuite/lib/on_device_arch.c        | 30 +++++++++++++++++++
 .../libgomp.c-c++-common/task-detach-6.c      |  7 +++++
 libgomp/testsuite/libgomp.c/pr99555-1.c       | 19 ++++++++++++
 .../libgomp.fortran/task-detach-6.f90         | 13 ++++++++
 4 files changed, 69 insertions(+)
 create mode 100644 libgomp/testsuite/lib/on_device_arch.c
 create mode 100644 libgomp/testsuite/libgomp.c/pr99555-1.c

diff --git a/libgomp/testsuite/lib/on_device_arch.c b/libgomp/testsuite/lib/on_device_arch.c
new file mode 100644
index 00000000000..1c0753c3181
--- /dev/null
+++ b/libgomp/testsuite/lib/on_device_arch.c
@@ -0,0 +1,30 @@
+#include <gomp-constants.h>
+
+/* static */ int
+device_arch_nvptx (void)
+{
+  return GOMP_DEVICE_NVIDIA_PTX;
+}
+
+#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
+/* static */ int
+device_arch (void)
+{
+  return GOMP_DEVICE_DEFAULT;
+}
+
+static int
+on_device_arch (int d)
+{
+  int d_cur;
+  #pragma omp target map(from:d_cur)
+  d_cur = device_arch ();
+
+  return d_cur == d;
+}
+
+int
+on_device_arch_nvptx ()
+{
+  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index e5c2291e6ff..4a3e4a2a3d2 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -1,5 +1,8 @@
 /* { dg-do run } */
 
+/* { dg-additional-sources "../lib/on_device_arch.c" } */
+extern int on_device_arch_nvptx ();
+
 #include <omp.h>
 #include <assert.h>
 
@@ -9,6 +12,10 @@
 
 int main (void)
 {
+  //TODO See '../libgomp.c/pr99555-1.c'.
+  if (on_device_arch_nvptx ())
+    __builtin_abort (); //TODO Until resolved, skip, with error status.
+
   int x = 0, y = 0, z = 0;
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
new file mode 100644
index 00000000000..9ba330959d8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -0,0 +1,19 @@
+// PR99555 "[OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs"
+
+// { dg-additional-options "-O0" }
+
+// { dg-additional-sources "../lib/on_device_arch.c" }
+extern int on_device_arch_nvptx ();
+
+int main (void)
+{
+  if (on_device_arch_nvptx ())
+    __builtin_abort (); //TODO Until resolved, skip, with error status.
+
+#pragma omp target
+#pragma omp parallel // num_threads(1)
+#pragma omp task
+  ;
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index b2c476fd6a6..eda20e73bb8 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -1,5 +1,8 @@
 ! { dg-do run }
 
+! { dg-additional-sources ../lib/on_device_arch.c }
+  ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
+
 ! Test tasks with detach clause on an offload device.  Each device
 ! thread spawns off a chain of tasks, that can then be executed by
 ! any available thread.
@@ -11,6 +14,16 @@ program task_detach_6
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
+  interface
+    integer function on_device_arch_nvptx() bind(C)
+    end function on_device_arch_nvptx
+  end interface
+
+  !TODO See '../libgomp.c/pr99555-1.c'.
+  if (on_device_arch_nvptx () /= 0) then
+     error stop !TODO Until resolved, skip, with error status.
+  end if
+
   !$omp target map (tofrom: x, y, z) map (from: thread_count)
     !$omp parallel private (detach_event1, detach_event2)
       !$omp single
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-03-11 16:52         ` Thomas Schwinge
@ 2021-03-25 12:02           ` Thomas Schwinge
  2021-03-26 14:42             ` [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]) Tobias Burnus
  2021-04-09 11:00             ` [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738] Thomas Schwinge
  0 siblings, 2 replies; 22+ messages in thread
From: Thomas Schwinge @ 2021-03-25 12:02 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Kwok Cheung Yeung, Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 2225 bytes --]

Hi!

On 2021-03-11T17:52:55+0100, I wrote:
> On 2021-02-23T22:52:38+0100, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>> On Tue, Feb 23, 2021 at 09:43:51PM +0000, Kwok Cheung Yeung wrote:
>>> On 19/02/2021 7:12 pm, Kwok Cheung Yeung wrote:
>>> > I have included the current state of my patch. All task-detach-* tests
>>> > pass when executed without offloading or with offloading to GCN, but
>>> > with offloading to Nvidia, task-detach-6.* hangs consistently but
>>> > everything else passes (probably because of the missing
>>> > gomp_team_barrier_done?).
>>>
>>> It looks like the hang has nothing to do with the detach patch - this hangs
>>> consistently for me when offloaded to NVPTX:
>>>
>>> #include <omp.h>
>>>
>>> int main (void)
>>> {
>>> #pragma omp target
>>>   #pragma omp parallel
>>>     #pragma omp task
>>>       ;
>>> }
>>>
>>> This doesn't hang when offloaded to GCN or the host device, or if
>>> num_threads(1) is specified on the omp parallel.
>
> So, I reproduced this the hard way;
> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98738#c13> :-/
>
> Please always file issues when you run into such things.  I've now filed
> PR99555 "[OpenMP/nvptx] Execution-time hang for simple nested OpenMP
> 'target'/'parallel'/'task' constructs".
>
>> Then it can be solved separately, I'll try to have a look if I see something
>> bad from the dumps, but I admit I don't have much experience with debugging
>> NVPTX offloaded code...
>
> Any luck?
>
>
> Until this gets resolved properly, OK to push something like the attached
> (currently testing) "Avoid OpenMP/nvptx execution-time hangs for simple
> nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]"?

As posted, I've now pushed "Avoid OpenMP/nvptx execution-time hangs for
simple nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]" to
master branch in commit d99111fd8e12deffdd9a965ce17e8a760d531ec3, see
attached.  "... awaiting proper resolution, of course."


Grüße
 Thomas



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Avoid-OpenMP-nvptx-execution-time-hangs-for-simple-n.patch --]
[-- Type: text/x-diff, Size: 4420 bytes --]

From d99111fd8e12deffdd9a965ce17e8a760d531ec3 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 11 Mar 2021 17:01:22 +0100
Subject: [PATCH] Avoid OpenMP/nvptx execution-time hangs for simple nested
 OpenMP 'target'/'parallel'/'task' constructs [PR99555]

... awaiting proper resolution, of course.

	libgomp/
	PR target/99555
	* testsuite/lib/on_device_arch.c: New file.
	* testsuite/libgomp.c/pr99555-1.c: Likewise.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Until resolved,
	skip for nvptx offloading, with error status.
	* testsuite/libgomp.fortran/task-detach-6.f90: Likewise.
---
 libgomp/testsuite/lib/on_device_arch.c        | 30 +++++++++++++++++++
 .../libgomp.c-c++-common/task-detach-6.c      |  7 +++++
 libgomp/testsuite/libgomp.c/pr99555-1.c       | 19 ++++++++++++
 .../libgomp.fortran/task-detach-6.f90         | 13 ++++++++
 4 files changed, 69 insertions(+)
 create mode 100644 libgomp/testsuite/lib/on_device_arch.c
 create mode 100644 libgomp/testsuite/libgomp.c/pr99555-1.c

diff --git a/libgomp/testsuite/lib/on_device_arch.c b/libgomp/testsuite/lib/on_device_arch.c
new file mode 100644
index 000000000000..1c0753c31814
--- /dev/null
+++ b/libgomp/testsuite/lib/on_device_arch.c
@@ -0,0 +1,30 @@
+#include <gomp-constants.h>
+
+/* static */ int
+device_arch_nvptx (void)
+{
+  return GOMP_DEVICE_NVIDIA_PTX;
+}
+
+#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
+/* static */ int
+device_arch (void)
+{
+  return GOMP_DEVICE_DEFAULT;
+}
+
+static int
+on_device_arch (int d)
+{
+  int d_cur;
+  #pragma omp target map(from:d_cur)
+  d_cur = device_arch ();
+
+  return d_cur == d;
+}
+
+int
+on_device_arch_nvptx ()
+{
+  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index e5c2291e6ff0..4a3e4a2a3d28 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -1,5 +1,8 @@
 /* { dg-do run } */
 
+/* { dg-additional-sources "../lib/on_device_arch.c" } */
+extern int on_device_arch_nvptx ();
+
 #include <omp.h>
 #include <assert.h>
 
@@ -9,6 +12,10 @@
 
 int main (void)
 {
+  //TODO See '../libgomp.c/pr99555-1.c'.
+  if (on_device_arch_nvptx ())
+    __builtin_abort (); //TODO Until resolved, skip, with error status.
+
   int x = 0, y = 0, z = 0;
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
new file mode 100644
index 000000000000..9ba330959d80
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -0,0 +1,19 @@
+// PR99555 "[OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs"
+
+// { dg-additional-options "-O0" }
+
+// { dg-additional-sources "../lib/on_device_arch.c" }
+extern int on_device_arch_nvptx ();
+
+int main (void)
+{
+  if (on_device_arch_nvptx ())
+    __builtin_abort (); //TODO Until resolved, skip, with error status.
+
+#pragma omp target
+#pragma omp parallel // num_threads(1)
+#pragma omp task
+  ;
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index b2c476fd6a6b..eda20e73bb84 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -1,5 +1,8 @@
 ! { dg-do run }
 
+! { dg-additional-sources ../lib/on_device_arch.c }
+  ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
+
 ! Test tasks with detach clause on an offload device.  Each device
 ! thread spawns off a chain of tasks, that can then be executed by
 ! any available thread.
@@ -11,6 +14,16 @@ program task_detach_6
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
+  interface
+    integer function on_device_arch_nvptx() bind(C)
+    end function on_device_arch_nvptx
+  end interface
+
+  !TODO See '../libgomp.c/pr99555-1.c'.
+  if (on_device_arch_nvptx () /= 0) then
+     error stop !TODO Until resolved, skip, with error status.
+  end if
+
   !$omp target map (tofrom: x, y, z) map (from: thread_count)
     !$omp parallel private (detach_event1, detach_event2)
       !$omp single
-- 
2.30.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])
  2021-03-25 12:02           ` Thomas Schwinge
@ 2021-03-26 14:42             ` Tobias Burnus
  2021-03-26 14:46               ` Jakub Jelinek
  2021-04-09 11:00             ` [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738] Thomas Schwinge
  1 sibling, 1 reply; 22+ messages in thread
From: Tobias Burnus @ 2021-03-26 14:42 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2547 bytes --]

Hi Thomas, hi all,

your commit causes compile failures:

cc1: fatal error: ../lib/on_device_arch.c: No such file or directory

FAIL: libgomp.c/../libgomp.c-c++-common/task-detach-6.c (test for excess errors)
FAIL: libgomp.c/pr99555-1.c (test for excess errors)
FAIL: libgomp.fortran/task-detach-6.f90   -O0  (test for excess errors)

That's with embedded testing, where files are copied into the test directory, i.e.
cp .../libgomp/testsuite/libgomp.fortran/../lib/on_device_arch.c on_device_arch.c
cp .../libgomp/testsuite/libgomp.fortran/task-detach-6.f90 task-detach-6.f90
and then executed as:
powerpc64le-none-linux-gnu-gcc $TESTDIR/task-detach-6.f90 ../lib/on_device_arch.c
which fails.

How about the following patch? It moves the aux function to libgomp.c-c++-common/on_device_arch.c
and #includes it in the new wrapper files libgomp.{c,fortran}/on_device_arch.c.
(Based on the observation that #include with relative paths always works,
while dg-additional-sources may not, depending on how the testsuite is run.)

OK? Or does anyone have a better suggestion?

Tobias

PS: The testcases still FAIL with nvptx offloading – but now at execution time.
I think that's expected, isn't it? (→PR99555?)
FAIL: libgomp.c/../libgomp.c-c++-common/pr96390.c execution test
FAIL: libgomp.c/../libgomp.c-c++-common/task-detach-6.c execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O0  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O1  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O2  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O3 -g  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -Os  execution test

On 25.03.21 13:02, Thomas Schwinge wrote:
>> Until this gets resolved properly, OK to push something like the attached
>> (currently testing) "Avoid OpenMP/nvptx execution-time hangs for simple
>> nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]"?
> [...] I've now pushed "Avoid OpenMP/nvptx execution-time hangs for
> simple nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]" to
> master branch in commit d99111fd8e12deffdd9a965ce17e8a760d531ec3, see
> attached.  "... awaiting proper resolution, of course."

[-- Attachment #2: on_arch.diff --]
[-- Type: text/x-patch, Size: 4889 bytes --]

libgomp: Fix on_device_arch.c aux-file handling [PR99555]

libgomp/ChangeLog:

	PR target/99555
	* testsuite/libgomp.c-c++-common/task-detach-6.c:
	* testsuite/libgomp.c/pr99555-1.c:
	* testsuite/libgomp.fortran/task-detach-6.f90:
	* testsuite/lib/on_device_arch.c: Removed.
	* testsuite/libgomp.c-c++-common/on_device_arch.c: New test.
	* testsuite/libgomp.c/on_device_arch.c: New test.
	* testsuite/libgomp.fortran/on_device_arch.c: New test.

 libgomp/testsuite/lib/on_device_arch.c             | 30 --------------------
 .../libgomp.c-c++-common/on_device_arch.c          | 33 ++++++++++++++++++++++
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |  2 +-
 libgomp/testsuite/libgomp.c/on_device_arch.c       |  3 ++
 libgomp/testsuite/libgomp.c/pr99555-1.c            |  2 +-
 libgomp/testsuite/libgomp.fortran/on_device_arch.c |  3 ++
 .../testsuite/libgomp.fortran/task-detach-6.f90    |  2 +-
 7 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/libgomp/testsuite/lib/on_device_arch.c b/libgomp/testsuite/lib/on_device_arch.c
deleted file mode 100644
index 1c0753c..0000000
--- a/libgomp/testsuite/lib/on_device_arch.c
+++ /dev/null
@@ -1,30 +0,0 @@
-#include <gomp-constants.h>
-
-/* static */ int
-device_arch_nvptx (void)
-{
-  return GOMP_DEVICE_NVIDIA_PTX;
-}
-
-#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
-/* static */ int
-device_arch (void)
-{
-  return GOMP_DEVICE_DEFAULT;
-}
-
-static int
-on_device_arch (int d)
-{
-  int d_cur;
-  #pragma omp target map(from:d_cur)
-  d_cur = device_arch ();
-
-  return d_cur == d;
-}
-
-int
-on_device_arch_nvptx ()
-{
-  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
-}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.c b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.c
new file mode 100644
index 0000000..00524b5
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.c
@@ -0,0 +1,33 @@
+/* Auxiliar file.  */
+/* { dg-do compile  { target skip-all-targets } } */
+/* Note: this file is also #included in ../libgomp.fortran/on_device_arch.c  */
+#include <gomp-constants.h>
+
+/* static */ int
+device_arch_nvptx (void)
+{
+  return GOMP_DEVICE_NVIDIA_PTX;
+}
+
+#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
+/* static */ int
+device_arch (void)
+{
+  return GOMP_DEVICE_DEFAULT;
+}
+
+static int
+on_device_arch (int d)
+{
+  int d_cur;
+  #pragma omp target map(from:d_cur)
+  d_cur = device_arch ();
+
+  return d_cur == d;
+}
+
+int
+on_device_arch_nvptx ()
+{
+  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index 4a3e4a2..c88cec2 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -1,6 +1,6 @@
 /* { dg-do run } */
 
-/* { dg-additional-sources "../lib/on_device_arch.c" } */
+/* { dg-additional-sources "on_device_arch.c" } */
 extern int on_device_arch_nvptx ();
 
 #include <omp.h>
diff --git a/libgomp/testsuite/libgomp.c/on_device_arch.c b/libgomp/testsuite/libgomp.c/on_device_arch.c
new file mode 100644
index 0000000..af71103
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/on_device_arch.c
@@ -0,0 +1,3 @@
+/* Auxiliar file.  */
+/* { dg-do compile  { target skip-all-targets } } */
+#include "../libgomp.c-c++-common/on_device_arch.c"
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
index 9ba3309..d661a888 100644
--- a/libgomp/testsuite/libgomp.c/pr99555-1.c
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -2,7 +2,7 @@
 
 // { dg-additional-options "-O0" }
 
-// { dg-additional-sources "../lib/on_device_arch.c" }
+// { dg-additional-sources "on_device_arch.c" }
 extern int on_device_arch_nvptx ();
 
 int main (void)
diff --git a/libgomp/testsuite/libgomp.fortran/on_device_arch.c b/libgomp/testsuite/libgomp.fortran/on_device_arch.c
new file mode 100644
index 0000000..af71103
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/on_device_arch.c
@@ -0,0 +1,3 @@
+/* Auxiliar file.  */
+/* { dg-do compile  { target skip-all-targets } } */
+#include "../libgomp.c-c++-common/on_device_arch.c"
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index eda20e7..bd0beb6 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -1,6 +1,6 @@
 ! { dg-do run }
 
-! { dg-additional-sources ../lib/on_device_arch.c }
+! { dg-additional-sources on_device_arch.c }
   ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
 
 ! Test tasks with detach clause on an offload device.  Each device


* Re: [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])
  2021-03-26 14:42             ` [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]) Tobias Burnus
@ 2021-03-26 14:46               ` Jakub Jelinek
  2021-03-26 15:19                 ` Tobias Burnus
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Jelinek @ 2021-03-26 14:46 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Thomas Schwinge, gcc-patches

On Fri, Mar 26, 2021 at 03:42:22PM +0100, Tobias Burnus wrote:
> How about the following patch? It moves the aux function to libgomp.c-c++-common/on_device_arch.c
> and #includes it in the new wrapper files libgomp.{c,fortran}/on_device_arch.c.
> (Based on the observation that #include with relative paths always works,
> while dg-additional-sources may not, depending on how the testsuite is run.)
> 
> OK? Or does anyone have a better suggestion?

For C/C++, why do we call it on_device_arch.c at all?  Can't it just be
on_device_arch.h that is #included in each test instead of additional
sources?  If we don't like inlining, just use noinline attribute, but I
don't see why inlining would hurt.
For Fortran, sure, we can't include it, so let's add
libgomp.fortran/on_device_arch.c that #includes that header.
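
A minimal sketch of the header-style helper described above, assuming it is
#included directly by each C/C++ test.  The sketch_* names and the enum values
are hypothetical stand-ins (the real file uses <gomp-constants.h> and an OpenMP
'declare variant'); the noinline attribute is shown only to illustrate the
optional opt-out from inlining.

```c
#include <assert.h>

/* Hypothetical stand-ins for the constants from <gomp-constants.h>.  */
enum { SKETCH_DEVICE_DEFAULT = 0, SKETCH_DEVICE_NVIDIA_PTX = 5 };

/* 'static' keeps the definition local to each test that #includes this;
   'noinline' is optional and only matters if inlining is unwanted.  */
static __attribute__((noinline)) int
sketch_device_arch (void)
{
  return SKETCH_DEVICE_DEFAULT;
}

static __attribute__((noinline)) int
sketch_on_device_arch (int d)
{
  int d_cur;
  /* Ignored without -fopenmp; runs on the host if no offload device exists.  */
  #pragma omp target map(from:d_cur)
  d_cur = sketch_device_arch ();

  /* Sanity check on the host side, after the target region.  */
  assert (d_cur == SKETCH_DEVICE_DEFAULT || d_cur == SKETCH_DEVICE_NVIDIA_PTX);
  return d_cur == d;
}

int
sketch_on_device_arch_nvptx (void)
{
  return sketch_on_device_arch (SKETCH_DEVICE_NVIDIA_PTX);
}
```

A test would then #include the header and call sketch_on_device_arch_nvptx ()
directly, so no dg-additional-sources directive is needed for C/C++.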

	Jakub



* Re: [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])
  2021-03-26 14:46               ` Jakub Jelinek
@ 2021-03-26 15:19                 ` Tobias Burnus
  2021-03-26 15:22                   ` Jakub Jelinek
  0 siblings, 1 reply; 22+ messages in thread
From: Tobias Burnus @ 2021-03-26 15:19 UTC (permalink / raw)
  To: Jakub Jelinek, Tobias Burnus; +Cc: gcc-patches, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1105 bytes --]

Hi Jakub,

great suggestion – I have now done as proposed.

On 26.03.21 15:46, Jakub Jelinek via Gcc-patches wrote:
> On Fri, Mar 26, 2021 at 03:42:22PM +0100, Tobias Burnus wrote:
>> How about the following patch? It moves the aux function to libgomp.c-c++-common/on_device_arch.c
>> and #includes it in the new wrapper files libgomp.{c,fortran}/on_device_arch.c.
>> (Based on the observation that #include with relative paths always works,
>> while dg-additional-sources may not, depending on how the testsuite is run.) [...]
> For C/C++, why do we call it on_device_arch.c at all?  Can't it just be
> on_device_arch.h that is #included in each test instead of additional
> sources?  If we don't like inlining, just use noinline attribute, but I
> don't see why inlining would hurt.
> For Fortran, sure, we can't include it, so let's add
> libgomp.fortran/on_device_arch.c that #includes that header.

OK?

Tobias

-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf

[-- Attachment #2: on_arch.diff --]
[-- Type: text/x-patch, Size: 4608 bytes --]

libgomp: Fix on_device_arch.c aux-file handling [PR99555]

libgomp/ChangeLog:

	PR target/99555
	* testsuite/lib/on_device_arch.c: Move to ...
	* testsuite/libgomp.c-c++-common/on_device_arch.h: ... here.
	* testsuite/libgomp.fortran/on_device_arch.c: New file;
	#include on_device_arch.h.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: #include
	on_device_arch.h instead of using dg-additional-sources.
	* testsuite/libgomp.c/pr99555-1.c: Likewise.
	* testsuite/libgomp.fortran/task-detach-6.f90: Update to use
	on_device_arch.c without relative paths.

 libgomp/testsuite/lib/on_device_arch.c             | 30 ----------------------
 .../libgomp.c-c++-common/on_device_arch.h          | 30 ++++++++++++++++++++++
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |  4 +--
 libgomp/testsuite/libgomp.c/pr99555-1.c            |  3 +--
 libgomp/testsuite/libgomp.fortran/on_device_arch.c |  3 +++
 .../testsuite/libgomp.fortran/task-detach-6.f90    |  2 +-
 6 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/libgomp/testsuite/lib/on_device_arch.c b/libgomp/testsuite/lib/on_device_arch.c
deleted file mode 100644
index 1c0753c..0000000
--- a/libgomp/testsuite/lib/on_device_arch.c
+++ /dev/null
@@ -1,30 +0,0 @@
-#include <gomp-constants.h>
-
-/* static */ int
-device_arch_nvptx (void)
-{
-  return GOMP_DEVICE_NVIDIA_PTX;
-}
-
-#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
-/* static */ int
-device_arch (void)
-{
-  return GOMP_DEVICE_DEFAULT;
-}
-
-static int
-on_device_arch (int d)
-{
-  int d_cur;
-  #pragma omp target map(from:d_cur)
-  d_cur = device_arch ();
-
-  return d_cur == d;
-}
-
-int
-on_device_arch_nvptx ()
-{
-  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
-}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h
new file mode 100644
index 0000000..1c0753c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h
@@ -0,0 +1,30 @@
+#include <gomp-constants.h>
+
+/* static */ int
+device_arch_nvptx (void)
+{
+  return GOMP_DEVICE_NVIDIA_PTX;
+}
+
+#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
+/* static */ int
+device_arch (void)
+{
+  return GOMP_DEVICE_DEFAULT;
+}
+
+static int
+on_device_arch (int d)
+{
+  int d_cur;
+  #pragma omp target map(from:d_cur)
+  d_cur = device_arch ();
+
+  return d_cur == d;
+}
+
+int
+on_device_arch_nvptx ()
+{
+  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index 4a3e4a2..119d7f5 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -1,10 +1,8 @@
 /* { dg-do run } */
 
-/* { dg-additional-sources "../lib/on_device_arch.c" } */
-extern int on_device_arch_nvptx ();
-
 #include <omp.h>
 #include <assert.h>
+#include "on_device_arch.h"
 
 /* Test tasks with detach clause on an offload device.  Each device
    thread spawns off a chain of tasks, that can then be executed by
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
index 9ba3309..0dc17bf 100644
--- a/libgomp/testsuite/libgomp.c/pr99555-1.c
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -2,8 +2,7 @@
 
 // { dg-additional-options "-O0" }
 
-// { dg-additional-sources "../lib/on_device_arch.c" }
-extern int on_device_arch_nvptx ();
+#include "../libgomp.c-c++-common/on_device_arch.h"
 
 int main (void)
 {
diff --git a/libgomp/testsuite/libgomp.fortran/on_device_arch.c b/libgomp/testsuite/libgomp.fortran/on_device_arch.c
new file mode 100644
index 0000000..98822c4
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/on_device_arch.c
@@ -0,0 +1,3 @@
+/* Auxiliar file.  */
+/* { dg-do compile  { target skip-all-targets } } */
+#include "../libgomp.c-c++-common/on_device_arch.h"
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index eda20e7..bd0beb6 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -1,6 +1,6 @@
 ! { dg-do run }
 
-! { dg-additional-sources ../lib/on_device_arch.c }
+! { dg-additional-sources on_device_arch.c }
   ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
 
 ! Test tasks with detach clause on an offload device.  Each device


* Re: [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])
  2021-03-26 15:19                 ` Tobias Burnus
@ 2021-03-26 15:22                   ` Jakub Jelinek
  2021-03-29  9:09                     ` [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] Thomas Schwinge
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Jelinek @ 2021-03-26 15:22 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Tobias Burnus, gcc-patches, Thomas Schwinge

On Fri, Mar 26, 2021 at 04:19:56PM +0100, Tobias Burnus wrote:
> Hi Jakub,
> 
> great suggestion – I have now done as proposed.
> 
> On 26.03.21 15:46, Jakub Jelinek via Gcc-patches wrote:
> > On Fri, Mar 26, 2021 at 03:42:22PM +0100, Tobias Burnus wrote:
> > > How about the following patch? It moves the aux function to libgomp.c-c++-common/on_device_arch.c
> > > and #includes it in the new wrapper files libgomp.{c,fortran}/on_device_arch.c.
> > > (Based on the observation that #include with relative paths always works,
> > > while dg-additional-sources may not, depending on how the testsuite is run.) [...]
> > For C/C++, why do we call it on_device_arch.c at all?  Can't it just be
> > on_device_arch.h that is #included in each test instead of additional
> > sources?  If we don't like inlining, just use noinline attribute, but I
> > don't see why inlining would hurt.
> > For Fortran, sure, we can't include it, so let's add
> > libgomp.fortran/on_device_arch.c that #includes that header.
> 
> OK?

LGTM, but please give Thomas a chance to chime in.

	Jakub



* Re: [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555]
  2021-03-26 15:22                   ` Jakub Jelinek
@ 2021-03-29  9:09                     ` Thomas Schwinge
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Schwinge @ 2021-03-29  9:09 UTC (permalink / raw)
  To: Jakub Jelinek, Tobias Burnus; +Cc: gcc-patches

Hi!

On 2021-03-26T16:22:20+0100, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Mar 26, 2021 at 04:19:56PM +0100, Tobias Burnus wrote:
>> On 26.03.21 15:46, Jakub Jelinek via Gcc-patches wrote:
>> > On Fri, Mar 26, 2021 at 03:42:22PM +0100, Tobias Burnus wrote:
>> > > How about the following patch? It moves the aux function to libgomp.c-c++-common/on_device_arch.c
>> > > and #includes it in the new wrapper files libgomp.{c,fortran}/on_device_arch.c.
>> > > (Based on the observation that #include with relative paths always works,
>> > > while dg-additional-sources may not, depending on how the testsuite is run.) [...]

I didn't know that 'dg-additional-sources' had such issues.

>> > For C/C++, why do we call it on_device_arch.c at all?  Can't it just be
>> > on_device_arch.h that is #included in each test instead of additional
>> > sources?  If we don't like inlining, just use noinline attribute, but I
>> > don't see why inlining would hurt.
>> > For Fortran, sure, we can't include it, so let's add
>> > libgomp.fortran/on_device_arch.c that #includes that header.

No strong opinion on my side -- I simply did the same for C/C++/Fortran.

>> OK?
>
> LGTM, but please give Thomas a chance to chime in.

ACK, thanks.


And, I hope you did appreciate that I used OpenMP 'declare variant' for
this.  :-)


Grüße
 Thomas


* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-03-25 12:02           ` Thomas Schwinge
  2021-03-26 14:42             ` [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]) Tobias Burnus
@ 2021-04-09 11:00             ` Thomas Schwinge
  2021-04-15  9:19               ` Thomas Schwinge
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Schwinge @ 2021-04-09 11:00 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Kwok Cheung Yeung, Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 3102 bytes --]

Hi!

On 2021-03-25T12:02:15+0100, I wrote:
> On 2021-03-11T17:52:55+0100, I wrote:
>> On 2021-02-23T22:52:38+0100, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>>> On Tue, Feb 23, 2021 at 09:43:51PM +0000, Kwok Cheung Yeung wrote:
>>>> On 19/02/2021 7:12 pm, Kwok Cheung Yeung wrote:
>>>> > I have included the current state of my patch. All task-detach-* tests
>>>> > pass when executed without offloading or with offloading to GCN, but
>>>> > with offloading to Nvidia, task-detach-6.* hangs consistently but
>>>> > everything else passes (probably because of the missing
>>>> > gomp_team_barrier_done?).
>>>>
>>>> It looks like the hang has nothing to do with the detach patch - this hangs
>>>> consistently for me when offloaded to NVPTX:
>>>>
>>>> #include <omp.h>
>>>>
>>>> int main (void)
>>>> {
>>>> #pragma omp target
>>>>   #pragma omp parallel
>>>>     #pragma omp task
>>>>       ;
>>>> }
>>>>
>>>> This doesn't hang when offloaded to GCN or the host device, or if
>>>> num_threads(1) is specified on the omp parallel.
>>
>> So, I reproduced this the hard way;
>> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98738#c13> :-/
>>
>> Please always file issues when you run into such things.  I've now filed
>> PR99555 "[OpenMP/nvptx] Execution-time hang for simple nested OpenMP
>> 'target'/'parallel'/'task' constructs".
>>
>>> Then it can be solved separately, I'll try to have a look if I see something
>>> bad from the dumps, but I admit I don't have much experience with debugging
>>> NVPTX offloaded code...
>>
>> Any luck?
>>
>>
>> Until this gets resolved properly, OK to push something like the attached
>> (currently testing) "Avoid OpenMP/nvptx execution-time hangs for simple
>> nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]"?
>
> As posted, I've now pushed "Avoid OpenMP/nvptx execution-time hangs for
> simple nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]" to
> master branch in commit d99111fd8e12deffdd9a965ce17e8a760d531ec3, see
> attached.  "... awaiting proper resolution, of course."

> +  if (on_device_arch_nvptx ())
> +    __builtin_abort (); //TODO Until resolved, skip, with error status.

Actually, we can do better: do try to execute this trivial OpenMP code
(expected to complete in no time), but for nvptx offloading "make sure
that we exit quickly, with error status", and XFAIL that.  So that we'll
get XFAIL -> XPASS when this starts to work for nvptx offloading.  Is
that attached "XFAIL OpenMP/nvptx execution-time hangs for simple nested
OpenMP 'target'/'parallel'/'task' constructs [PR99555]" OK to push?

There are other testcases that '#include <unistd.h>' -- do we have to
worry about 'alarm' not being available in some configurations where the
libgomp testsuite executes (and OpenMP 'target' doesn't already fail for
other reasons)?
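
As a rough illustration of the 'alarm' idea (not the patch itself), here is a
minimal sketch assuming a POSIX system: arm an alarm before code that might
hang, so that the default SIGALRM action terminates the process with an error
status if the deadline passes.  The sketch_* names are hypothetical, and,
unlike the patch, this sketch cancels the alarm once the work has completed.

```c
#include <assert.h>
#include <unistd.h> /* For 'alarm'.  */

/* Hypothetical stand-in for the region that might hang.  */
static int
sketch_trivial_work (void)
{
  int sum = 0;
  for (int i = 0; i < 10; i++)
    sum += i;
  return sum;
}

int
sketch_run_with_deadline (unsigned int seconds)
{
  /* If the work below never returns, SIGALRM's default action
     terminates the process after 'seconds' seconds.  */
  alarm (seconds);

  int result = sketch_trivial_work ();

  /* The work completed: cancel the pending alarm.  'alarm (0)' returns
     the number of seconds that were left on the previous alarm.  */
  unsigned int remaining = alarm (0);
  assert (remaining <= seconds);

  return result;
}
```

With the hang present, the process dies from SIGALRM and the test is reported
as XFAIL; once the hang is fixed, the same test completes normally and shows up
as XPASS.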


Grüße
 Thomas



[-- Attachment #2: 0001-XFAIL-OpenMP-nvptx-execution-time-hangs-for-simple-n.patch --]
[-- Type: text/x-diff, Size: 3976 bytes --]

From ac247a5962955b20cbf5e4e1a5c4dad81591aeb7 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 7 Apr 2021 10:36:36 +0200
Subject: [PATCH] XFAIL OpenMP/nvptx execution-time hangs for simple nested
 OpenMP 'target'/'parallel'/'task' constructs [PR99555]

... still awaiting proper resolution, of course.

	libgomp/
	PR target/99555
	* testsuite/libgomp.c/pr99555-1.c <nvptx offload device>: Until
	resolved, make sure that we exit quickly, with error status,
	XFAILed.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Likewise.
	* testsuite/libgomp.fortran/task-detach-6.f90: Likewise.
---
 libgomp/testsuite/lib/libgomp.exp                    | 12 ++++++++++++
 .../testsuite/libgomp.c-c++-common/task-detach-6.c   |  5 ++++-
 libgomp/testsuite/libgomp.c/pr99555-1.c              |  5 ++++-
 libgomp/testsuite/libgomp.fortran/task-detach-6.f90  |  3 ++-
 4 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/libgomp/testsuite/lib/libgomp.exp b/libgomp/testsuite/lib/libgomp.exp
index 72d001186a5..14dcfdfd00a 100644
--- a/libgomp/testsuite/lib/libgomp.exp
+++ b/libgomp/testsuite/lib/libgomp.exp
@@ -401,6 +401,18 @@ proc check_effective_target_offload_device_shared_as { } {
     } ]
 }
 
+# Return 1 if using nvptx offload device.
+proc check_effective_target_offload_device_nvptx { } {
+    return [check_runtime_nocache offload_device_nvptx {
+      #include <omp.h>
+      #include "testsuite/libgomp.c-c++-common/on_device_arch.h"
+      int main ()
+	{
+	  return !on_device_arch_nvptx ();
+	}
+    } ]
+}
+
 # Return 1 if at least one Nvidia GPU is accessible.
 
 proc check_effective_target_openacc_nvidia_accel_present { } {
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index 119d7f52f8f..f18b57bf047 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -2,6 +2,8 @@
 
 #include <omp.h>
 #include <assert.h>
+#include <unistd.h> // For 'alarm'.
+
 #include "on_device_arch.h"
 
 /* Test tasks with detach clause on an offload device.  Each device
@@ -12,7 +14,8 @@ int main (void)
 {
   //TODO See '../libgomp.c/pr99555-1.c'.
   if (on_device_arch_nvptx ())
-    __builtin_abort (); //TODO Until resolved, skip, with error status.
+    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
+		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
 
   int x = 0, y = 0, z = 0;
   int thread_count;
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
index 0dc17bfa337..bd33b93716b 100644
--- a/libgomp/testsuite/libgomp.c/pr99555-1.c
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -2,12 +2,15 @@
 
 // { dg-additional-options "-O0" }
 
+#include <unistd.h> // For 'alarm'.
+
 #include "../libgomp.c-c++-common/on_device_arch.h"
 
 int main (void)
 {
   if (on_device_arch_nvptx ())
-    __builtin_abort (); //TODO Until resolved, skip, with error status.
+    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
+		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
 
 #pragma omp target
 #pragma omp parallel // num_threads(1)
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index bd0beb63179..e4373b4c6f1 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -21,7 +21,8 @@ program task_detach_6
 
   !TODO See '../libgomp.c/pr99555-1.c'.
   if (on_device_arch_nvptx () /= 0) then
-     error stop !TODO Until resolved, skip, with error status.
+     call alarm (4, 0); !TODO Until resolved, make sure that we exit quickly, with error status.
+     ! { dg-xfail-run-if "PR99555" { offload_device_nvptx } }
   end if
 
   !$omp target map (tofrom: x, y, z) map (from: thread_count)
-- 
2.17.1



* Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
  2021-04-09 11:00             ` [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738] Thomas Schwinge
@ 2021-04-15  9:19               ` Thomas Schwinge
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Schwinge @ 2021-04-15  9:19 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Kwok Cheung Yeung, Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 3028 bytes --]

Hi!

On 2021-04-09T13:00:39+0200, I wrote:
> On 2021-03-25T12:02:15+0100, I wrote:
>> On 2021-03-11T17:52:55+0100, I wrote:
>>> On 2021-02-23T22:52:38+0100, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>>>> On Tue, Feb 23, 2021 at 09:43:51PM +0000, Kwok Cheung Yeung wrote:
>>>>> On 19/02/2021 7:12 pm, Kwok Cheung Yeung wrote:
>>>>> > I have included the current state of my patch. All task-detach-* tests
>>>>> > pass when executed without offloading or with offloading to GCN, but
>>>>> > with offloading to Nvidia, task-detach-6.* hangs consistently but
>>>>> > everything else passes (probably because of the missing
>>>>> > gomp_team_barrier_done?).
>>>>>
>>>>> It looks like the hang has nothing to do with the detach patch - this hangs
>>>>> consistently for me when offloaded to NVPTX:
>>>>>
>>>>> #include <omp.h>
>>>>>
>>>>> int main (void)
>>>>> {
>>>>> #pragma omp target
>>>>>   #pragma omp parallel
>>>>>     #pragma omp task
>>>>>       ;
>>>>> }
>>>>>
>>>>> This doesn't hang when offloaded to GCN or the host device, or if
>>>>> num_threads(1) is specified on the omp parallel.
>>>
>>> So, I reproduced this the hard way;
>>> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98738#c13> :-/
>>>
>>> Please always file issues when you run into such things.  I've now filed
>>> PR99555 "[OpenMP/nvptx] Execution-time hang for simple nested OpenMP
>>> 'target'/'parallel'/'task' constructs".
>>>
>>>> Then it can be solved separately, I'll try to have a look if I see something
>>>> bad from the dumps, but I admit I don't have much experience with debugging
>>>> NVPTX offloaded code...
>>>
>>> Any luck?
>>>
>>>
>>> Until this gets resolved properly, OK to push something like the attached
>>> (currently testing) "Avoid OpenMP/nvptx execution-time hangs for simple
>>> nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]"?
>>
>> As posted, I've now pushed "Avoid OpenMP/nvptx execution-time hangs for
>> simple nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]" to
>> master branch in commit d99111fd8e12deffdd9a965ce17e8a760d531ec3, see
>> attached.  "... awaiting proper resolution, of course."
>
>> +  if (on_device_arch_nvptx ())
>> +    __builtin_abort (); //TODO Until resolved, skip, with error status.
>
> Actually, we can do better: do try to execute this trivial OpenMP code
> (expected to complete in no time), but for nvptx offloading "make sure
> that we exit quickly, with error status", and XFAIL that.  So that we'll
> get XFAIL -> XPASS when this starts to work for nvptx offloading.

Pushed "XFAIL OpenMP/nvptx execution-time hangs for simple nested OpenMP
'target'/'parallel'/'task' constructs [PR99555]" to master branch in
commit 4dd9e1c541e0eb921d62c8652c854b1259e56aac, see attached.


Grüße
 Thomas



[-- Attachment #2: 0001-XFAIL-OpenMP-nvptx-execution-time-hangs-for-simple-n.patch --]
[-- Type: text/x-diff, Size: 4058 bytes --]

From 4dd9e1c541e0eb921d62c8652c854b1259e56aac Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 7 Apr 2021 10:36:36 +0200
Subject: [PATCH] XFAIL OpenMP/nvptx execution-time hangs for simple nested
 OpenMP 'target'/'parallel'/'task' constructs [PR99555]

... still awaiting proper resolution, of course.

	libgomp/
	PR target/99555
	* testsuite/lib/libgomp.exp
	(check_effective_target_offload_device_nvptx): New.
	* testsuite/libgomp.c/pr99555-1.c <nvptx offload device>: Until
	resolved, make sure that we exit quickly, with error status,
	XFAILed.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Likewise.
	* testsuite/libgomp.fortran/task-detach-6.f90: Likewise.
---
 libgomp/testsuite/lib/libgomp.exp                    | 12 ++++++++++++
 .../testsuite/libgomp.c-c++-common/task-detach-6.c   |  5 ++++-
 libgomp/testsuite/libgomp.c/pr99555-1.c              |  5 ++++-
 libgomp/testsuite/libgomp.fortran/task-detach-6.f90  |  3 ++-
 4 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/libgomp/testsuite/lib/libgomp.exp b/libgomp/testsuite/lib/libgomp.exp
index 72d001186a5..14dcfdfd00a 100644
--- a/libgomp/testsuite/lib/libgomp.exp
+++ b/libgomp/testsuite/lib/libgomp.exp
@@ -401,6 +401,18 @@ proc check_effective_target_offload_device_shared_as { } {
     } ]
 }
 
+# Return 1 if using nvptx offload device.
+proc check_effective_target_offload_device_nvptx { } {
+    return [check_runtime_nocache offload_device_nvptx {
+      #include <omp.h>
+      #include "testsuite/libgomp.c-c++-common/on_device_arch.h"
+      int main ()
+	{
+	  return !on_device_arch_nvptx ();
+	}
+    } ]
+}
+
 # Return 1 if at least one Nvidia GPU is accessible.
 
 proc check_effective_target_openacc_nvidia_accel_present { } {
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index 119d7f52f8f..f18b57bf047 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -2,6 +2,8 @@
 
 #include <omp.h>
 #include <assert.h>
+#include <unistd.h> // For 'alarm'.
+
 #include "on_device_arch.h"
 
 /* Test tasks with detach clause on an offload device.  Each device
@@ -12,7 +14,8 @@ int main (void)
 {
   //TODO See '../libgomp.c/pr99555-1.c'.
   if (on_device_arch_nvptx ())
-    __builtin_abort (); //TODO Until resolved, skip, with error status.
+    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
+		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
 
   int x = 0, y = 0, z = 0;
   int thread_count;
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
index 0dc17bfa337..bd33b93716b 100644
--- a/libgomp/testsuite/libgomp.c/pr99555-1.c
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -2,12 +2,15 @@
 
 // { dg-additional-options "-O0" }
 
+#include <unistd.h> // For 'alarm'.
+
 #include "../libgomp.c-c++-common/on_device_arch.h"
 
 int main (void)
 {
   if (on_device_arch_nvptx ())
-    __builtin_abort (); //TODO Until resolved, skip, with error status.
+    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
+		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
 
 #pragma omp target
 #pragma omp parallel // num_threads(1)
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index bd0beb63179..e4373b4c6f1 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -21,7 +21,8 @@ program task_detach_6
 
   !TODO See '../libgomp.c/pr99555-1.c'.
   if (on_device_arch_nvptx () /= 0) then
-     error stop !TODO Until resolved, skip, with error status.
+     call alarm (4, 0); !TODO Until resolved, make sure that we exit quickly, with error status.
+     ! { dg-xfail-run-if "PR99555" { offload_device_nvptx } }
   end if
 
   !$omp target map (tofrom: x, y, z) map (from: thread_count)
-- 
2.30.2



Thread overview: 22+ messages
2021-01-21 19:33 [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738] Kwok Cheung Yeung
2021-01-21 22:46 ` Kwok Cheung Yeung
2021-01-29 15:03 ` Jakub Jelinek
2021-02-12 14:36   ` H.J. Lu
2021-02-19 19:12   ` [WIP] " Kwok Cheung Yeung
2021-02-22 13:49     ` Jakub Jelinek
2021-02-22 18:14       ` Jakub Jelinek
2021-02-24 18:17       ` Kwok Cheung Yeung
2021-02-24 19:46         ` Jakub Jelinek
2021-02-25 16:21           ` Kwok Cheung Yeung
2021-02-25 16:38             ` Jakub Jelinek
2021-02-23 21:43     ` Kwok Cheung Yeung
2021-02-23 21:52       ` Jakub Jelinek
2021-03-11 16:52         ` Thomas Schwinge
2021-03-25 12:02           ` Thomas Schwinge
2021-03-26 14:42             ` [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]) Tobias Burnus
2021-03-26 14:46               ` Jakub Jelinek
2021-03-26 15:19                 ` Tobias Burnus
2021-03-26 15:22                   ` Jakub Jelinek
2021-03-29  9:09                     ` [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] Thomas Schwinge
2021-04-09 11:00             ` [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738] Thomas Schwinge
2021-04-15  9:19               ` Thomas Schwinge
