[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end Consider the following omp fragment. ... #pragma omp target #pragma omp parallel num_threads (2) #pragma omp task ; ... This hangs at -O0 for nvptx. Investigating the behaviour gives us the following trace of events: - both threads execute GOMP_task, where they: - deposit a task, and - execute gomp_team_barrier_wake - thread 1 executes gomp_team_barrier_wait_end and, not being the last thread, proceeds to wait at the team barrier - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it calls gomp_barrier_handle_tasks, where it: - executes both tasks and marks the team barrier done - executes a gomp_team_barrier_wake which wakes up thread 1 - thread 1 exits the team barrier - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at the team barrier. - thread 0 hangs. To understand why there is a hang here, it's good to understand how things are setup for nvptx. The libgomp/config/nvptx/bar.c implementation is a copy of the libgomp/config/linux/bar.c implementation, with uses of both futex_wake and do_wait replaced with uses of ptx insn bar.sync: ... if (bar->total > 1) asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); ... The point where thread 0 goes to wait at the team barrier, corresponds in the linux implementation with a do_wait. In the linux case, the call to do_wait doesn't hang, because it's waiting for bar->generation to become a certain value, and if bar->generation already has that value, it just proceeds, without any need for coordination with other threads. In the nvtpx case, the bar.sync waits until thread 1 joins it in the same logical barrier, which never happens: thread 1 is lingering in the thread pool at the thread pool barrier (using a different logical barrier), waiting to join a new team. The easiest way to fix this is to revert to the posix implementation for bar.{c,h}. That however falls back on a busy-waiting approach, and does not take advantage of the ptx bar.sync insn. Instead, we revert to the linux implementation for bar.c, and implement the primitives futex_wait and futex_wake using the bar.sync insn. This is a WIP version that does not yet take performance into consideration, but instead focuses on copying a working version as completely as possible, and isolating the machine-specific changes to as little functions as possible. The bar.sync insn takes an argument specifying how many threads are participating, and that doesn't play well with the futex syntax where it's not clear in advance how many threads will be woken up. This is solved by waking up all waiting threads each time a futex_wait or futex_wake happens, and possibly going back to sleep with an updated thread count. Tested libgomp on x86_64 with nvptx accelerator, both as-is and with do_spin hardcoded to 1. libgomp/ChangeLog: 2021-04-20 Tom de Vries PR target/99555 * config/nvptx/bar.c (futex_wait, futex_wake): New function. (cpu_relax): Copy from config/linux/futex.h. (do_spin, do_wait): Copy from config/linux/wait.h. (gomp_barrier_wait_end, gomp_barrier_wait_last) (gomp_team_barrier_wake, gomp_team_barrier_wait_end): (gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Copy from config/linux/bar.c. Add bar argument to futex_wait/futex_wake. * config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock. (gomp_barrier_init): Init new fields. * testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific workarounds. * testsuite/libgomp.c/pr99555-1.c: Same. * testsuite/libgomp.fortran/task-detach-6.f90: Same. futex implementation for nvptx --- libgomp/config/nvptx/bar.c | 152 ++++++++++++++++++--- libgomp/config/nvptx/bar.h | 4 + .../testsuite/libgomp.c-c++-common/task-detach-6.c | 8 -- libgomp/testsuite/libgomp.c/pr99555-1.c | 8 -- .../testsuite/libgomp.fortran/task-detach-6.f90 | 12 -- 5 files changed, 139 insertions(+), 45 deletions(-) diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c index c5c2fa8829b..4d6bb646330 100644 --- a/libgomp/config/nvptx/bar.c +++ b/libgomp/config/nvptx/bar.c @@ -30,6 +30,125 @@ #include #include "libgomp.h" +static inline void +futex_wait (gomp_barrier_t *bar, int *addr, int val) +{ + if (bar->total <= 1) + /* A barrier with one thread, nop. */ + return; + + gomp_mutex_lock (&bar->lock); + + /* Futex semantics: only go to sleep if *addr == val. */ + if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_RELAXED) != val, 0)) + { + gomp_mutex_unlock (&bar->lock); + return; + } + + /* Register as waiter. */ + unsigned int waiters + = __atomic_add_fetch (&bar->waiters, 1, MEMMODEL_ACQ_REL); + if (waiters == 0) + __builtin_abort (); + unsigned int waiter_id = waiters; + + if (waiters > 1) + { + /* Wake other threads in bar.sync. */ + asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters)); + + /* Ensure that they have updated waiters. */ + asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters)); + } + + gomp_mutex_unlock (&bar->lock); + + while (1) + { + /* Wait for next thread in barrier. */ + asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1))); + + /* Get updated waiters. */ + unsigned int updated_waiters + = __atomic_load_n (&bar->waiters, MEMMODEL_RELAXED); + + /* Notify that we have updated waiters. */ + asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1))); + + waiters = updated_waiters; + + if (waiter_id > waiters) + /* A wake happened, and we're in the group of woken threads. */ + break; + + /* Continue waiting. */ + } +} + +static inline void +futex_wake (gomp_barrier_t *bar, int *addr, int count) +{ + if (bar->total <= 1) + /* A barrier with one thread, nop. */ + return; + + gomp_mutex_lock (&bar->lock); + unsigned int waiters = __atomic_load_n (&bar->waiters, MEMMODEL_RELAXED); + if (waiters == 0) + { + /* No threads to wake. */ + gomp_mutex_unlock (&bar->lock); + return; + } + + if (count == INT_MAX) + /* Release all threads. */ + __atomic_store_n (&bar->waiters, 0, MEMMODEL_RELEASE); + else if (count < bar->total) + /* Release count threads. */ + __atomic_add_fetch (&bar->waiters, -count, MEMMODEL_ACQ_REL); + else + /* Count has an illegal value. */ + __builtin_abort (); + + /* Wake other threads in bar.sync. */ + asm ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1))); + + /* Let them get the updated waiters. */ + asm ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1))); + + gomp_mutex_unlock (&bar->lock); +} + +static inline void +cpu_relax (void) +{ + __asm volatile ("" : : : "memory"); +} + +static inline int do_spin (int *addr, int val) +{ + unsigned long long i, count = gomp_spin_count_var; + + if (__builtin_expect (__atomic_load_n (&gomp_managed_threads, + MEMMODEL_RELAXED) + > gomp_available_cpus, 0)) + count = gomp_throttled_spin_count_var; + for (i = 0; i < count; i++) + if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_RELAXED) != val, 0)) + return 0; + else + cpu_relax (); + return 1; +} + + +static inline void do_wait (gomp_barrier_t *bar, int *addr, int val) +{ + if (do_spin (addr, val)) + futex_wait (bar, addr, val); +} void gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state) @@ -40,9 +159,14 @@ gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state) bar->awaited = bar->total; __atomic_store_n (&bar->generation, bar->generation + BAR_INCR, MEMMODEL_RELEASE); + futex_wake (bar, (int *) &bar->generation, INT_MAX); + } + else + { + do + do_wait (bar, (int *) &bar->generation, state); + while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) == state); } - if (bar->total > 1) - asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); } void @@ -61,17 +185,15 @@ gomp_barrier_wait (gomp_barrier_t *bar) void gomp_barrier_wait_last (gomp_barrier_t *bar) { - /* Deferring to gomp_barrier_wait does not use the optimization opportunity - allowed by the interface contract for all-but-last participants. The - original implementation in config/linux/bar.c handles this better. */ - gomp_barrier_wait (bar); + gomp_barrier_state_t state = gomp_barrier_wait_start (bar); + if (state & BAR_WAS_LAST) + gomp_barrier_wait_end (bar, state); } void gomp_team_barrier_wake (gomp_barrier_t *bar, int count) { - if (bar->total > 1) - asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); + futex_wake (bar, (int *) &bar->generation, count == 0 ? INT_MAX : count); } void @@ -97,8 +219,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state) state &= ~BAR_CANCELLED; state += BAR_INCR - BAR_WAS_LAST; __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE); - if (bar->total > 1) - asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); + futex_wake (bar, (int *) &bar->generation, INT_MAX); return; } } @@ -107,8 +228,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state) state &= ~BAR_CANCELLED; do { - if (bar->total > 1) - asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); + do_wait (bar, (int *) &bar->generation, generation); gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE); if (__builtin_expect (gen & BAR_TASK_PENDING, 0)) { @@ -162,8 +282,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar, { state += BAR_INCR - BAR_WAS_LAST; __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE); - if (bar->total > 1) - asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); + futex_wake (bar, (int *) &bar->generation, INT_MAX); return false; } } @@ -174,8 +293,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar, generation = state; do { - if (bar->total > 1) - asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); + do_wait (bar, (int *) &bar->generation, generation); gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE); if (__builtin_expect (gen & BAR_CANCELLED, 0)) return true; @@ -208,5 +326,5 @@ gomp_team_barrier_cancel (struct gomp_team *team) } team->barrier.generation |= BAR_CANCELLED; gomp_mutex_unlock (&team->task_lock); - gomp_team_barrier_wake (&team->barrier, INT_MAX); + futex_wake (&team->barrier, (int *) &team->barrier.generation, INT_MAX); } diff --git a/libgomp/config/nvptx/bar.h b/libgomp/config/nvptx/bar.h index 9bf3d914a02..c69426e1629 100644 --- a/libgomp/config/nvptx/bar.h +++ b/libgomp/config/nvptx/bar.h @@ -38,6 +38,8 @@ typedef struct unsigned generation; unsigned awaited; unsigned awaited_final; + unsigned waiters; + gomp_mutex_t lock; } gomp_barrier_t; typedef unsigned int gomp_barrier_state_t; @@ -57,6 +59,8 @@ static inline void gomp_barrier_init (gomp_barrier_t *bar, unsigned count) bar->awaited = count; bar->awaited_final = count; bar->generation = 0; + bar->waiters = 0; + gomp_mutex_init (&bar->lock); } static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count) diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c index f18b57bf047..e5c2291e6ff 100644 --- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c +++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c @@ -2,9 +2,6 @@ #include #include -#include // For 'alarm'. - -#include "on_device_arch.h" /* Test tasks with detach clause on an offload device. Each device thread spawns off a chain of tasks, that can then be executed by @@ -12,11 +9,6 @@ int main (void) { - //TODO See '../libgomp.c/pr99555-1.c'. - if (on_device_arch_nvptx ()) - alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status. - { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */ - int x = 0, y = 0, z = 0; int thread_count; omp_event_handle_t detach_event1, detach_event2; diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c index bd33b93716b..7386e016fd2 100644 --- a/libgomp/testsuite/libgomp.c/pr99555-1.c +++ b/libgomp/testsuite/libgomp.c/pr99555-1.c @@ -2,16 +2,8 @@ // { dg-additional-options "-O0" } -#include // For 'alarm'. - -#include "../libgomp.c-c++-common/on_device_arch.h" - int main (void) { - if (on_device_arch_nvptx ()) - alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status. - { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */ - #pragma omp target #pragma omp parallel // num_threads(1) #pragma omp task diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 index e4373b4c6f1..03a3b61540d 100644 --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 @@ -1,6 +1,5 @@ ! { dg-do run } -! { dg-additional-sources on_device_arch.c } ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" } ! Test tasks with detach clause on an offload device. Each device @@ -14,17 +13,6 @@ program task_detach_6 integer :: x = 0, y = 0, z = 0 integer :: thread_count - interface - integer function on_device_arch_nvptx() bind(C) - end function on_device_arch_nvptx - end interface - - !TODO See '../libgomp.c/pr99555-1.c'. - if (on_device_arch_nvptx () /= 0) then - call alarm (4, 0); !TODO Until resolved, make sure that we exit quickly, with error status. - ! { dg-xfail-run-if "PR99555" { offload_device_nvptx } } - end if - !$omp target map (tofrom: x, y, z) map (from: thread_count) !$omp parallel private (detach_event1, detach_event2) !$omp single