public inbox for gcc-patches@gcc.gnu.org
From: Alexander Monakov <amonakov@ispras.ru>
To: Tom de Vries <tdevries@suse.de>
Cc: gcc-patches@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>,
	 Andrew Stubbs <ams@codesourcery.com>
Subject: Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
Date: Tue, 20 Apr 2021 19:11:45 +0300 (MSK)	[thread overview]
Message-ID: <alpine.LNX.2.20.13.2104201849200.19608@monopod.intra.ispras.ru> (raw)
In-Reply-To: <20210420112344.GA7277@delia>

Hello Tom,

Thank you for the investigation and the detailed writeup. It was difficult for
me to infer the internal API contracts here (and it still is); sorry about the
mistake.

Most importantly: does GCN handle this, and if so, how? I think the solution
should be the same for config/gcn and config/nvptx (I guess this is a question
for Andrew).

Some comments inline below:

On Tue, 20 Apr 2021, Tom de Vries wrote:

> Hi,
> 
> Consider the following omp fragment.
> ...
>   #pragma omp target
>   #pragma omp parallel num_threads (2)
>   #pragma omp task
>     ;
> ...
> 
> This hangs at -O0 for nvptx.
> 
> Investigating the behaviour gives us the following trace of events:
> - both threads execute GOMP_task, where they:
>   - deposit a task, and
>   - execute gomp_team_barrier_wake
> - thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
>   proceeds to wait at the team barrier

Shouldn't it try to handle deposited tasks before suspending on the barrier?

I guess you are describing what the code does; I'm just pointing out that I'm
confused about why it behaves that way.

> - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
>   calls gomp_barrier_handle_tasks, where it:
>   - executes both tasks and marks the team barrier done
>   - executes a gomp_team_barrier_wake which wakes up thread 1
> - thread 1 exits the team barrier

Up to this point it looks reasonable.

> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>   the team barrier.

At this point the code should realize that the team barrier was already released
and not attempt to wait on it again. Maybe by inspecting the generation counter?

I may be wrong though, I don't understand the overall flow well enough yet.

> - thread 0 hangs.
> 
> To understand why there is a hang here, it's good to understand how things
> are setup for nvptx.  The libgomp/config/nvptx/bar.c implementation is
> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
> futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
> ...
>   if (bar->total > 1)
>     asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> ...
> 
> The point where thread 0 goes to wait at the team barrier, corresponds in
> the linux implementation with a do_wait.  In the linux case, the call to
> do_wait doesn't hang, because it's waiting for bar->generation to become
> a certain value, and if bar->generation already has that value, it just
> proceeds, without any need for coordination with other threads.
> 
> In the nvptx case, the bar.sync waits until thread 1 joins it in the same
> logical barrier, which never happens: thread 1 is lingering in the
> thread pool at the thread pool barrier (using a different logical barrier),
> waiting to join a new team.
> 
> The easiest way to fix this is to revert to the posix implementation for
> bar.{c,h}.
> 
> Another way would be to revert to the linux implementation for bar.{c,h},
> and implement the primitives futex_wait and do_wait using nvptx insns.

I don't think implementing futex_wait is possible on nvptx.

Alexander

> This patch instead implements a minimal fix (which makes the implementation
> deviate further from the linux one).
> 
> The hang was only observed in gomp_team_barrier_wait_end, but we propagate the
> fix to its twin gomp_team_barrier_wait_cancel_end as well.
> 
> The fix is based on the assumptions that at the point of the fix, after the
> call to gomp_barrier_handle_tasks:
> - all tasks are done
>   (an assert is added to check this), and consequently:
> - the executing thread is the only thread left in the team barrier
>   (so it's accurate to set nthreads to 1).
> 
> Tested libgomp on x86_64 with nvptx accelerator.
> 
> Any comments?
> 
> Thanks,
> - Tom
> 
> [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
> 
> libgomp/ChangeLog:
> 
> 2021-04-20  Tom de Vries  <tdevries@suse.de>
> 
> 	PR target/99555
> 	* config/nvptx/bar.c (gomp_team_barrier_wait_end)
> 	(gomp_team_barrier_wait_cancel_end): Don't try to sync with team threads
> 	that have left the team barrier.
> 	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
> 	workarounds.
> 	* testsuite/libgomp.c/pr99555-1.c: Same.
> 	* testsuite/libgomp.fortran/task-detach-6.f90: Same.
> 
> ---
>  libgomp/config/nvptx/bar.c                         | 32 ++++++++++++++++------
>  .../testsuite/libgomp.c-c++-common/task-detach-6.c |  8 ------
>  libgomp/testsuite/libgomp.c/pr99555-1.c            |  8 ------
>  .../testsuite/libgomp.fortran/task-detach-6.f90    | 12 --------
>  4 files changed, 24 insertions(+), 36 deletions(-)
> 
> diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
> index c5c2fa8829b..058a8d4d5ca 100644
> --- a/libgomp/config/nvptx/bar.c
> +++ b/libgomp/config/nvptx/bar.c
> @@ -78,6 +78,7 @@ void
>  gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>  {
>    unsigned int generation, gen;
> +  unsigned int nthreads = bar->total;
>  
>    if (__builtin_expect (state & BAR_WAS_LAST, 0))
>      {
> @@ -90,6 +91,15 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>        if (__builtin_expect (team->task_count, 0))
>  	{
>  	  gomp_barrier_handle_tasks (state);
> +	  /* Assert that all tasks have been handled.  */
> +	  if (team->task_count != 0)
> +	    __builtin_abort ();
> +	  /* In gomp_barrier_handle_tasks, the team barrier has been marked
> +	     as done, and all pending threads woken up.  So this is now the
> +	     last and only thread in the barrier.  Adjust nthreads to
> +	     reflect the new situation, to make sure we don't hang
> +	     indefinitely at the bar.sync below.  */
> +	  nthreads = 1;
>  	  state &= ~BAR_WAS_LAST;
>  	}
>        else
> @@ -97,8 +107,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>  	  state &= ~BAR_CANCELLED;
>  	  state += BAR_INCR - BAR_WAS_LAST;
>  	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
> -	  if (bar->total > 1)
> -	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +	  if (nthreads > 1)
> +	    asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>  	  return;
>  	}
>      }
> @@ -107,8 +117,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>    state &= ~BAR_CANCELLED;
>    do
>      {
> -      if (bar->total > 1)
> -	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +      if (nthreads > 1)
> +	asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>        if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
>  	{
> @@ -140,6 +150,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
>  				   gomp_barrier_state_t state)
>  {
>    unsigned int generation, gen;
> +  unsigned int nthreads = bar->total;
>  
>    if (__builtin_expect (state & BAR_WAS_LAST, 0))
>      {
> @@ -156,14 +167,19 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
>        if (__builtin_expect (team->task_count, 0))
>  	{
>  	  gomp_barrier_handle_tasks (state);
> +	  /* Assert that all tasks have been handled.  */
> +	  if (team->task_count != 0)
> +	    __builtin_abort ();
> +	  /* See comment in gomp_team_barrier_wait_end.  */
> +	  nthreads = 1;
>  	  state &= ~BAR_WAS_LAST;
>  	}
>        else
>  	{
>  	  state += BAR_INCR - BAR_WAS_LAST;
>  	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
> -	  if (bar->total > 1)
> -	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +	  if (nthreads > 1)
> +	    asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>  	  return false;
>  	}
>      }
> @@ -174,8 +190,8 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
>    generation = state;
>    do
>      {
> -      if (bar->total > 1)
> -	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +      if (nthreads > 1)
> +	asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>        if (__builtin_expect (gen & BAR_CANCELLED, 0))
>  	return true;
> diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> index f18b57bf047..e5c2291e6ff 100644
> --- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> +++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> @@ -2,9 +2,6 @@
>  
>  #include <omp.h>
>  #include <assert.h>
> -#include <unistd.h> // For 'alarm'.
> -
> -#include "on_device_arch.h"
>  
>  /* Test tasks with detach clause on an offload device.  Each device
>     thread spawns off a chain of tasks, that can then be executed by
> @@ -12,11 +9,6 @@
>  
>  int main (void)
>  {
> -  //TODO See '../libgomp.c/pr99555-1.c'.
> -  if (on_device_arch_nvptx ())
> -    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
> -		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
> -
>    int x = 0, y = 0, z = 0;
>    int thread_count;
>    omp_event_handle_t detach_event1, detach_event2;
> diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
> index bd33b93716b..7386e016fd2 100644
> --- a/libgomp/testsuite/libgomp.c/pr99555-1.c
> +++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
> @@ -2,16 +2,8 @@
>  
>  // { dg-additional-options "-O0" }
>  
> -#include <unistd.h> // For 'alarm'.
> -
> -#include "../libgomp.c-c++-common/on_device_arch.h"
> -
>  int main (void)
>  {
> -  if (on_device_arch_nvptx ())
> -    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
> -		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
> -
>  #pragma omp target
>  #pragma omp parallel // num_threads(1)
>  #pragma omp task
> diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> index e4373b4c6f1..03a3b61540d 100644
> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> @@ -1,6 +1,5 @@
>  ! { dg-do run }
>  
> -! { dg-additional-sources on_device_arch.c }
>    ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
>  
>  ! Test tasks with detach clause on an offload device.  Each device
> @@ -14,17 +13,6 @@ program task_detach_6
>    integer :: x = 0, y = 0, z = 0
>    integer :: thread_count
>  
> -  interface
> -    integer function on_device_arch_nvptx() bind(C)
> -    end function on_device_arch_nvptx
> -  end interface
> -
> -  !TODO See '../libgomp.c/pr99555-1.c'.
> -  if (on_device_arch_nvptx () /= 0) then
> -     call alarm (4, 0); !TODO Until resolved, make sure that we exit quickly, with error status.
> -     ! { dg-xfail-run-if "PR99555" { offload_device_nvptx } }
> -  end if
> -
>    !$omp target map (tofrom: x, y, z) map (from: thread_count)
>      !$omp parallel private (detach_event1, detach_event2)
>        !$omp single
> 


Thread overview: 12+ messages
2021-04-20 11:23 Tom de Vries
2021-04-20 16:11 ` Alexander Monakov [this message]
2021-04-21 16:10   ` Tom de Vries
2021-04-21 17:02     ` Alexander Monakov
2021-04-22 11:11       ` Tom de Vries
2021-04-23 15:45         ` Alexander Monakov
2021-04-23 16:48           ` Tom de Vries
2021-05-19 14:52             ` [PING][PATCH][libgomp, " Tom de Vries
2022-02-22 14:52               ` Tom de Vries
2021-05-20  9:52             ` [PATCH][libgomp, " Thomas Schwinge
2021-05-20 11:41               ` Tom de Vries
2021-11-26 12:10             ` *PING* " Tobias Burnus
