public inbox for gcc-patches@gcc.gnu.org
* [PATCH] libgomp, nvptx: Honor OpenMP 5.1 num_teams lower bound
@ 2021-11-12 13:20 Jakub Jelinek
  2021-11-12 13:27 ` [PATCH] libgomp, nvptx, v2: " Jakub Jelinek
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2021-11-12 13:20 UTC (permalink / raw)
  To: Tobias Burnus, Andrew Stubbs, Tom de Vries; +Cc: gcc-patches

Hi!

Here is a completely untested attempt at implementing what I was talking
about: for num_teams_upper 0, or whenever num_teams_lower <= num_blocks,
the current implementation is fine, but if the user explicitly asks for more
teams than we can provide in hardware, we need to stop assuming that
omp_get_team_num () is equal to the hw team id, and instead need to use some
team-specific memory (I believe that is .shared for PTX), or, if none is
provided, an array indexed by the hw team id, and run some teams serially
within the same hw thread.
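The intended round-robin mapping of logical teams onto hardware CTAs can be
sketched as a host-side simulation (illustration only, not part of the patch;
the function name is made up):

```c
/* Host-side sketch of how GOMP_teams4 is meant to distribute num_teams
   logical teams over num_blocks hardware CTAs: CTA b runs teams
   b, b + num_blocks, b + 2*num_blocks, ...  The inner loop corresponds
   to repeated GOMP_teams4 calls with first == false bumping
   __gomp_team_num by num_blocks until we run out of teams.  */

/* Record, for each logical team, which CTA executed it; return the
   total number of teams executed.  'ran' must have room for
   num_teams ints.  */
static unsigned
simulate_teams (unsigned num_blocks, unsigned num_teams, int *ran)
{
  unsigned executed = 0;
  for (unsigned block_id = 0; block_id < num_blocks; block_id++)
    for (unsigned team = block_id; team < num_teams; team += num_blocks)
      {
	ran[team] = block_id;
	executed++;
      }
  return executed;
}
```

With 3 CTAs and 7 requested teams, CTA 0 runs teams 0, 3, 6 serially, CTA 1
runs 1 and 4, and CTA 2 runs 2 and 5, so every team runs exactly once.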

This patch assumes that .shared variables are initialized to 0,
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html lists
in Table 7. .shared as non-initializable.  If that isn't the case,
we need to initialize it somewhere for the case of #pragma omp target
without #pragma omp teams in it, maybe in libgcc/config/nvptx/crt0.c ?

2021-11-12  Jakub Jelinek  <jakub@redhat.com>

	* config/nvptx/target.c (__gomp_team_num): Define using inline asm as
	a .shared var.
	(GOMP_teams4): Use __gomp_team_num as the team number instead of
	%ctaid.x.  If first, initialize it to %ctaid.x.  If num_teams_lower
	is bigger than num_blocks, use num_teams_lower teams and arrange for
	bumping of __gomp_team_num if !first and returning false once we run
	out of teams.
	* config/nvptx/teams.c (__gomp_team_num): Declare using inline asm as
	an external .shared var.
	(omp_get_team_num): Return __gomp_team_num value instead of %ctaid.x.

--- libgomp/config/nvptx/target.c.jj	2021-11-12 12:41:11.433501988 +0100
+++ libgomp/config/nvptx/target.c	2021-11-12 14:02:56.231477929 +0100
@@ -26,28 +26,43 @@
 #include "libgomp.h"
 #include <limits.h>
 
+asm ("\n// BEGIN GLOBAL VAR DECL: __gomp_team_num"
+     "\n.visible .shared .align 4 .u32 __gomp_team_num[1];");
+
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
 	     unsigned int thread_limit, bool first)
 {
+  unsigned int num_blocks, block_id;
+  asm ("mov.u32 %0, %%nctaid.x;" : "=r" (num_blocks));
   if (!first)
-    return false;
+    {
+      unsigned in team_num;
+      if (num_blocks > gomp_num_teams_var)
+	return false;
+      asm ("ld.shared.u32\t%0, [__gomp_team_num]" : "=r" (team_num));
+      if (team_num > gomp_num_teams_var - num_blocks)
+	return false;
+      asm ("st.shared.u32\t[__gomp_team_num], %0"
+	   : : "r" (team_num + num_blocks));
+      return true;
+    }
   if (thread_limit)
     {
       struct gomp_task_icv *icv = gomp_icv (true);
       icv->thread_limit_var
 	= thread_limit > INT_MAX ? UINT_MAX : thread_limit;
     }
-  unsigned int num_blocks, block_id;
-  asm ("mov.u32 %0, %%nctaid.x;" : "=r" (num_blocks));
-  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (block_id));
-  /* FIXME: If num_teams_lower > num_blocks, we want to loop multiple
-     times for some CTAs.  */
-  (void) num_teams_lower;
-  if (!num_teams_upper || num_teams_upper >= num_blocks)
+  if (!num_teams_upper)
     num_teams_upper = num_blocks;
-  else if (block_id >= num_teams_upper)
+  else if (num_blocks < num_teams_lower)
+    num_teams_upper = num_teams_lower;
+  else if (num_blocks < num_teams_upper)
+    num_teams_upper = num_blocks;
+  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (block_id));
+  if (block_id >= num_teams_upper)
     return false;
+  asm ("st.shared.u32\t[__gomp_team_num], %0" : : "r" (block_id));
   gomp_num_teams_var = num_teams_upper - 1;
   return true;
 }
--- libgomp/config/nvptx/teams.c.jj	2021-01-05 00:13:58.255297642 +0100
+++ libgomp/config/nvptx/teams.c	2021-11-12 13:55:59.950421993 +0100
@@ -28,6 +28,9 @@
 
 #include "libgomp.h"
 
+asm ("\n// BEGIN GLOBAL VAR DECL: __gomp_team_num"
+     "\n.extern .shared .align 4 .u32 __gomp_team_num[1];");
+
 void
 GOMP_teams_reg (void (*fn) (void *), void *data, unsigned int num_teams,
 		unsigned int thread_limit, unsigned int flags)
@@ -48,9 +50,9 @@ omp_get_num_teams (void)
 int
 omp_get_team_num (void)
 {
-  int ctaid;
-  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (ctaid));
-  return ctaid;
+  int team_num;
+  asm ("ld.shared.u32\t%0, [__gomp_team_num]" : "=r" (team_num));
+  return team_num;
 }
 
 ialias (omp_get_num_teams)

	Jakub



* [PATCH] libgomp, nvptx, v2: Honor OpenMP 5.1 num_teams lower bound
  2021-11-12 13:20 [PATCH] libgomp, nvptx: Honor OpenMP 5.1 num_teams lower bound Jakub Jelinek
@ 2021-11-12 13:27 ` Jakub Jelinek
  2021-11-12 17:58   ` [PATCH] libgomp, nvptx, v3: " Jakub Jelinek
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2021-11-12 13:27 UTC (permalink / raw)
  To: Tobias Burnus, Andrew Stubbs, Tom de Vries, gcc-patches

On Fri, Nov 12, 2021 at 02:20:23PM +0100, Jakub Jelinek via Gcc-patches wrote:
> This patch assumes that .shared variables are initialized to 0,
> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html lists
> in Table 7. .shared as non-initializable.  If that isn't the case,
> we need to initialize it somewhere for the case of #pragma omp target
> without #pragma omp teams in it, maybe in libgcc/config/nvptx/crt0.c ?

A quick look at libgcc/config/nvptx/crt0.c shows the target supports
__attribute__((shared)), so perhaps either following instead, or, if
.shared isn't preinitialized to zero, defining the variable in
libgcc/config/nvptx/crt0.c , adding there __gomp_team_num = 0;
and adding extern keyword before int __gomp_team_num __attribute__((shared));
in libgomp/config/nvptx/target.c.

2021-11-12  Jakub Jelinek  <jakub@redhat.com>

	* config/nvptx/target.c (__gomp_team_num): Define as
	__attribute__((shared)) var.
	(GOMP_teams4): Use __gomp_team_num as the team number instead of
	%ctaid.x.  If first, initialize it to %ctaid.x.  If num_teams_lower
	is bigger than num_blocks, use num_teams_lower teams and arrange for
	bumping of __gomp_team_num if !first and returning false once we run
	out of teams.
	* config/nvptx/teams.c (__gomp_team_num): Declare as
	extern __attribute__((shared)) var.
	(omp_get_team_num): Return __gomp_team_num value instead of %ctaid.x.

--- libgomp/config/nvptx/target.c.jj	2021-11-12 12:41:11.433501988 +0100
+++ libgomp/config/nvptx/target.c	2021-11-12 14:21:39.451426717 +0100
@@ -26,28 +26,41 @@
 #include "libgomp.h"
 #include <limits.h>
 
+int __gomp_team_num __attribute__((shared));
+
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
 	     unsigned int thread_limit, bool first)
 {
+  unsigned int num_blocks, block_id;
+  asm ("mov.u32 %0, %%nctaid.x;" : "=r" (num_blocks));
   if (!first)
-    return false;
+    {
+      unsigned in team_num;
+      if (num_blocks > gomp_num_teams_var)
+	return false;
+      team_num = __gomp_team_num;
+      if (team_num > gomp_num_teams_var - num_blocks)
+	return false;
+      __gomp_team_num = team_num + num_blocks;
+      return true;
+    }
   if (thread_limit)
     {
       struct gomp_task_icv *icv = gomp_icv (true);
       icv->thread_limit_var
 	= thread_limit > INT_MAX ? UINT_MAX : thread_limit;
     }
-  unsigned int num_blocks, block_id;
-  asm ("mov.u32 %0, %%nctaid.x;" : "=r" (num_blocks));
-  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (block_id));
-  /* FIXME: If num_teams_lower > num_blocks, we want to loop multiple
-     times for some CTAs.  */
-  (void) num_teams_lower;
-  if (!num_teams_upper || num_teams_upper >= num_blocks)
+  if (!num_teams_upper)
     num_teams_upper = num_blocks;
-  else if (block_id >= num_teams_upper)
+  else if (num_blocks < num_teams_lower)
+    num_teams_upper = num_teams_lower;
+  else if (num_blocks < num_teams_upper)
+    num_teams_upper = num_blocks;
+  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (block_id));
+  if (block_id >= num_teams_upper)
     return false;
+  __gomp_team_num = block_id;
   gomp_num_teams_var = num_teams_upper - 1;
   return true;
 }
--- libgomp/config/nvptx/teams.c.jj	2021-01-05 00:13:58.255297642 +0100
+++ libgomp/config/nvptx/teams.c	2021-11-12 14:22:06.443039863 +0100
@@ -28,6 +28,8 @@
 
 #include "libgomp.h"
 
+extern int __gomp_team_num __attribute__((shared));
+
 void
 GOMP_teams_reg (void (*fn) (void *), void *data, unsigned int num_teams,
 		unsigned int thread_limit, unsigned int flags)
@@ -48,9 +50,7 @@ omp_get_num_teams (void)
 int
 omp_get_team_num (void)
 {
-  int ctaid;
-  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (ctaid));
-  return ctaid;
+  return __gomp_team_num;
 }
 
 ialias (omp_get_num_teams)


	Jakub



* [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
  2021-11-12 13:27 ` [PATCH] libgomp, nvptx, v2: " Jakub Jelinek
@ 2021-11-12 17:58   ` Jakub Jelinek
  2021-11-12 19:16     ` Alexander Monakov
  2021-11-12 21:08     ` Alexander Monakov
  0 siblings, 2 replies; 8+ messages in thread
From: Jakub Jelinek @ 2021-11-12 17:58 UTC (permalink / raw)
  To: Tom de Vries, Alexander Monakov; +Cc: gcc-patches, Tobias Burnus

On Fri, Nov 12, 2021 at 02:27:16PM +0100, Jakub Jelinek via Gcc-patches wrote:
> On Fri, Nov 12, 2021 at 02:20:23PM +0100, Jakub Jelinek via Gcc-patches wrote:
> > This patch assumes that .shared variables are initialized to 0,
> > https://docs.nvidia.com/cuda/parallel-thread-execution/index.html lists
> > in Table 7. .shared as non-initializable.  If that isn't the case,
> > we need to initialize it somewhere for the case of #pragma omp target
> > without #pragma omp teams in it, maybe in libgcc/config/nvptx/crt0.c ?
> 
> A quick look at libgcc/config/nvptx/crt0.c shows the target supports
> __attribute__((shared)), so perhaps either following instead, or, if
> .shared isn't preinitialized to zero, defining the variable in
> libgcc/config/nvptx/crt0.c , adding there __gomp_team_num = 0;
> and adding extern keyword before int __gomp_team_num __attribute__((shared));
> in libgomp/config/nvptx/target.c.

And finally here is a third version, which fixes a typo in the previous
patch (in instead of int) and actually initializes the shared var because
PTX documentation doesn't say anything about how the shared vars are
initialized.

Tested on x86_64-linux with nvptx-none offloading, ok for trunk?

2021-11-12  Jakub Jelinek  <jakub@redhat.com>

	* config/nvptx/team.c (__gomp_team_num): Define as
	__attribute__((shared)) var.
	(gomp_nvptx_main): Initialize __gomp_team_num to 0.
	* config/nvptx/target.c (__gomp_team_num): Declare as
	extern __attribute__((shared)) var.
	(GOMP_teams4): Use __gomp_team_num as the team number instead of
	%ctaid.x.  If first, initialize it to %ctaid.x.  If num_teams_lower
	is bigger than num_blocks, use num_teams_lower teams and arrange for
	bumping of __gomp_team_num if !first and returning false once we run
	out of teams.
	* config/nvptx/teams.c (__gomp_team_num): Declare as
	extern __attribute__((shared)) var.
	(omp_get_team_num): Return __gomp_team_num value instead of %ctaid.x.

--- libgomp/config/nvptx/team.c.jj	2021-05-25 13:43:02.793121350 +0200
+++ libgomp/config/nvptx/team.c	2021-11-12 17:49:02.847341650 +0100
@@ -32,6 +32,7 @@
 #include <string.h>
 
 struct gomp_thread *nvptx_thrs __attribute__((shared,nocommon));
+int __gomp_team_num __attribute__((shared));
 
 static void gomp_thread_start (struct gomp_thread_pool *);
 
@@ -57,6 +58,7 @@ gomp_nvptx_main (void (*fn) (void *), vo
       /* Starting additional threads is not supported.  */
       gomp_global_icv.dyn_var = true;
 
+      __gomp_team_num = 0;
       nvptx_thrs = alloca (ntids * sizeof (*nvptx_thrs));
       memset (nvptx_thrs, 0, ntids * sizeof (*nvptx_thrs));
 
--- libgomp/config/nvptx/target.c.jj	2021-11-12 15:57:29.400632875 +0100
+++ libgomp/config/nvptx/target.c	2021-11-12 17:47:39.499533296 +0100
@@ -26,28 +26,41 @@
 #include "libgomp.h"
 #include <limits.h>
 
+extern int __gomp_team_num __attribute__((shared));
+
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
 	     unsigned int thread_limit, bool first)
 {
+  unsigned int num_blocks, block_id;
+  asm ("mov.u32 %0, %%nctaid.x;" : "=r" (num_blocks));
   if (!first)
-    return false;
+    {
+      unsigned int team_num;
+      if (num_blocks > gomp_num_teams_var)
+	return false;
+      team_num = __gomp_team_num;
+      if (team_num > gomp_num_teams_var - num_blocks)
+	return false;
+      __gomp_team_num = team_num + num_blocks;
+      return true;
+    }
   if (thread_limit)
     {
       struct gomp_task_icv *icv = gomp_icv (true);
       icv->thread_limit_var
 	= thread_limit > INT_MAX ? UINT_MAX : thread_limit;
     }
-  unsigned int num_blocks, block_id;
-  asm ("mov.u32 %0, %%nctaid.x;" : "=r" (num_blocks));
-  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (block_id));
-  /* FIXME: If num_teams_lower > num_blocks, we want to loop multiple
-     times for some CTAs.  */
-  (void) num_teams_lower;
-  if (!num_teams_upper || num_teams_upper >= num_blocks)
+  if (!num_teams_upper)
     num_teams_upper = num_blocks;
-  else if (block_id >= num_teams_upper)
+  else if (num_blocks < num_teams_lower)
+    num_teams_upper = num_teams_lower;
+  else if (num_blocks < num_teams_upper)
+    num_teams_upper = num_blocks;
+  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (block_id));
+  if (block_id >= num_teams_upper)
     return false;
+  __gomp_team_num = block_id;
   gomp_num_teams_var = num_teams_upper - 1;
   return true;
 }
--- libgomp/config/nvptx/teams.c.jj	2021-05-25 13:43:02.793121350 +0200
+++ libgomp/config/nvptx/teams.c	2021-11-12 17:37:18.933361024 +0100
@@ -28,6 +28,8 @@
 
 #include "libgomp.h"
 
+extern int __gomp_team_num __attribute__((shared));
+
 void
 GOMP_teams_reg (void (*fn) (void *), void *data, unsigned int num_teams,
 		unsigned int thread_limit, unsigned int flags)
@@ -48,9 +50,7 @@ omp_get_num_teams (void)
 int
 omp_get_team_num (void)
 {
-  int ctaid;
-  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (ctaid));
-  return ctaid;
+  return __gomp_team_num;
 }
 
 ialias (omp_get_num_teams)


	Jakub



* Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
  2021-11-12 17:58   ` [PATCH] libgomp, nvptx, v3: " Jakub Jelinek
@ 2021-11-12 19:16     ` Alexander Monakov
  2021-11-12 19:47       ` Jakub Jelinek
  2021-11-12 21:08     ` Alexander Monakov
  1 sibling, 1 reply; 8+ messages in thread
From: Alexander Monakov @ 2021-11-12 19:16 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Tom de Vries, Tobias Burnus, gcc-patches

Hello Jakub,

On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote:

> On Fri, Nov 12, 2021 at 02:27:16PM +0100, Jakub Jelinek via Gcc-patches wrote:
> > On Fri, Nov 12, 2021 at 02:20:23PM +0100, Jakub Jelinek via Gcc-patches wrote:
> > > This patch assumes that .shared variables are initialized to 0,
> > > https://docs.nvidia.com/cuda/parallel-thread-execution/index.html lists
> > > in Table 7. .shared as non-initializable.  If that isn't the case,
> > > we need to initialize it somewhere for the case of #pragma omp target
> > > without #pragma omp teams in it, maybe in libgcc/config/nvptx/crt0.c ?
> > 
> > A quick look at libgcc/config/nvptx/crt0.c shows the target supports
> > __attribute__((shared)), so perhaps either following instead, or, if
> > .shared isn't preinitialized to zero, defining the variable in
> > libgcc/config/nvptx/crt0.c , adding there __gomp_team_num = 0;
> > and adding extern keyword before int __gomp_team_num __attribute__((shared));
> > in libgomp/config/nvptx/target.c.
> 
> And finally here is a third version, which fixes a typo in the previous
> patch (in instead of int) and actually initializes the shared var because
> PTX documentation doesn't say anything about how the shared vars are
> initialized.
> 
> Tested on x86_64-linux with nvptx-none offloading, ok for trunk?

I suspect there may be a misunderstanding here, or maybe your explanation is
incomplete. I don't think the intention of the standard was to force such
complexity. You can launch as many blocks on the GPU as you like, limited only
by the bitwidth of the indexing register used in hardware; NVIDIA guarantees
at least INT_MAX blocks (in fact almost 1<<63 blocks if you launch a
three-dimensional grid with INT_MAX x 65535 x 65535 blocks).

The hardware will schedule blocks automatically (so, for example, if the
hardware can run 40 blocks simultaneously and you launch 100, it may launch
blocks 0 to 39, then when one of those finishes it will launch the 40th block,
and so on).

So isn't the solution simply to adjust the logic around
nvptx_adjust_launch_bounds in GOMP_OFFLOAD_run, that is, if there's a lower
bound specified, use it instead of what adjust_launch_bounds is computing as
max_blocks?
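The suggested adjustment might look roughly like this (a sketch with a
hypothetical helper name; the real logic would sit where GOMP_OFFLOAD_run
consumes the result of nvptx_adjust_launch_bounds):

```c
/* Hypothetical sketch of the suggestion above: when a num_teams lower
   bound is known on the host before launch, request at least that many
   blocks instead of capping the grid at the plugin's computed
   max_blocks; the hardware then schedules the extra blocks itself.  */
static unsigned
choose_grid_size (unsigned max_blocks, unsigned num_teams_lower)
{
  if (num_teams_lower > max_blocks)
    return num_teams_lower;	/* honor the OpenMP 5.1 lower bound */
  return max_blocks;		/* default launch-bounds heuristic */
}
```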

Yours,
Alexander


* Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
  2021-11-12 19:16     ` Alexander Monakov
@ 2021-11-12 19:47       ` Jakub Jelinek
  2021-11-12 19:49         ` Jakub Jelinek
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2021-11-12 19:47 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Tom de Vries, Tobias Burnus, gcc-patches

On Fri, Nov 12, 2021 at 10:16:11PM +0300, Alexander Monakov wrote:
> I suspect there may be a misunderstanding here, or maybe your explanation is
> incomplete. I don't think the intention of the standard was to force such
> complexity. You can launch as many blocks on the GPU as you like, limited only
> by the bitwidth of the indexing register used in hardware, NVIDIA guarantees
> at least INT_MAX blocks (in fact almost 1<<63 blocks if you launch a
> three-dimensional grid with INT_MAX x 65535 x 65535 blocks).
> 
> The hardware will schedule blocks automatically (so for example if the hardware
> can run 40 blocks simultaneously and you launch 100, the hardware may launch
> blocks 0 to 39, then when one of those finishes it will launch the 40'th block
> and so on).
> 
> So isn't the solution simply to adjust the logic around
> nvptx_adjust_launch_bounds in GOMP_OFFLOAD_run, that is, if there's a lower
> bound specified, use it instead of what adjust_launch_bounds is computing as
> max_blocks?

The problem is that the argument of the num_teams clause isn't always known
before target is launched.
While gimplify.c tries hard to figure it out as often as possible, and the
standard makes it easy for the combined target teams case (there the
expressions in the num_teams/thread_limit clauses are evaluated on the host
before the target construct, so the plugin is told the expected number and,
unless CUDA decides to allocate fewer than requested, we are fine), there are
cases where target is not combined with teams.  Per the spec the expressions
then need to be evaluated on the target, not on the host (gimplify.c still
tries to optimize some of those cases, e.g. by checking whether it is some
simple arithmetic expression where all the vars would be firstprivatized),
and in that case we create some default number of CTAs and only later on
find out what the user asked for.
extern int foo (void);
#pragma omp declare target to (foo)
void bar (void)
{
  #pragma omp target
  #pragma omp teams num_teams (foo ())
  ;
}
is such a case: we simply don't know, and foo () needs to be called in the
target region.  In OpenMP 5.0 we had the option to always create fewer teams
if we decided so (of course at least 1), but in 5.1 we don't have that
option: if there is just one expression, we need to create exactly that many
teams, and if it is num_teams (foo () - 10 : foo () + 10), we need to be
within that range (inclusive).
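For reference, the way the v1-v3 patches pick the actual team count from the
clause bounds and the number of launched CTAs can be restated as a pure
function (extracted here for illustration; the helper name is made up):

```c
/* The num_teams selection logic from the patch's GOMP_teams4, lifted
   into a pure function.  upper == 0 means no num_teams request was
   made; num_blocks is %nctaid.x, the number of launched CTAs.  */
static unsigned
pick_num_teams (unsigned lower, unsigned upper, unsigned num_blocks)
{
  if (!upper)
    return num_blocks;	/* no request: one team per CTA */
  if (num_blocks < lower)
    return lower;	/* too few CTAs: some teams run serially */
  if (num_blocks < upper)
    return num_blocks;	/* inside the range: use every CTA */
  return upper;	/* more CTAs than allowed: excess CTAs bail out */
}
```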

	Jakub



* Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
  2021-11-12 19:47       ` Jakub Jelinek
@ 2021-11-12 19:49         ` Jakub Jelinek
  2021-11-12 21:21           ` Alexander Monakov
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2021-11-12 19:49 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Tom de Vries, Tobias Burnus, gcc-patches

On Fri, Nov 12, 2021 at 08:47:09PM +0100, Jakub Jelinek wrote:
> The problem is that the argument of the num_teams clause isn't always known
> before target is launched.

There was a design mistake that the clause has been put on teams rather than
on target (well, for host teams we need it on teams), and 5.1 actually
partially fixes this up for thread_limit by allowing that clause on both,
but not for num_teams.

	Jakub



* Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
  2021-11-12 17:58   ` [PATCH] libgomp, nvptx, v3: " Jakub Jelinek
  2021-11-12 19:16     ` Alexander Monakov
@ 2021-11-12 21:08     ` Alexander Monakov
  1 sibling, 0 replies; 8+ messages in thread
From: Alexander Monakov @ 2021-11-12 21:08 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Tom de Vries, Tobias Burnus, gcc-patches

On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote:

> --- libgomp/config/nvptx/team.c.jj	2021-05-25 13:43:02.793121350 +0200
> +++ libgomp/config/nvptx/team.c	2021-11-12 17:49:02.847341650 +0100
> @@ -32,6 +32,7 @@
>  #include <string.h>
>  
>  struct gomp_thread *nvptx_thrs __attribute__((shared,nocommon));
> +int __gomp_team_num __attribute__((shared));

It's going to be weird to have two declarations next to each other, one with
'nocommon', one without. Could you put 'nocommon' on the new one as well, and
then, if you like, add extern declarations for both variables and drop the
attribute (in a separate patch)?

Alexander


* Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
  2021-11-12 19:49         ` Jakub Jelinek
@ 2021-11-12 21:21           ` Alexander Monakov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexander Monakov @ 2021-11-12 21:21 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Tobias Burnus, gcc-patches



On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote:

> On Fri, Nov 12, 2021 at 08:47:09PM +0100, Jakub Jelinek wrote:
> > The problem is that the argument of the num_teams clause isn't always known
> > before target is launched.
> 
> There was a design mistake that the clause has been put on teams rather than
> on target (well, for host teams we need it on teams), and 5.1 actually
> partially fixes this up for thread_limit by allowing that clause on both,
> but not for num_teams.

If this is a mistake in the standard, can GCC say "the spec is bad; fix the
spec" and refuse to implement support, since it penalizes the common case?

Technically, this could be implemented without penalizing the common case via
CUDA "dynamic parallelism" where you initially launch just one block on the
device that figures out the dimensions and then performs a GPU-side launch of
the required amount of blocks, but that's a nontrivial amount of work.

I looked over your patch.  I sent a small nitpick about 'nocommon' in a
separate message, and I still think it's better to adjust GOMP_OFFLOAD_run to
take into account the lower bound when it's known on the host side; otherwise
you do static scheduling of blocks, which is going to be inferior to dynamic
scheduling.  Imagine the lower bound is 3 and the maximum number of resident
blocks is 2: then you first do teams 0 and 1 in parallel, then you do team 2
from the 0th block, while in fact you want to do it from whichever block
finished its initial team first.
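The scheduling difference can be made concrete with a small host-side
simulation (illustration only; the per-team costs are made up, and unit
team durations would make the two schemes tie):

```c
/* Static round-robin (what the patch does once CTAs are launched):
   block b runs teams b, b + nblocks, ...; the makespan is the busiest
   block's total run time.  */
static int
static_makespan (const int *cost, int nteams, int nblocks)
{
  int makespan = 0;
  for (int b = 0; b < nblocks; b++)
    {
      int total = 0;
      for (int t = b; t < nteams; t += nblocks)
	total += cost[t];
      if (total > makespan)
	makespan = total;
    }
  return makespan;
}

/* Dynamic scheduling (what the hardware does when you launch one real
   block per team): each team goes to whichever block frees up first.  */
static int
dynamic_makespan (const int *cost, int nteams, int nblocks)
{
  int free_at[64] = { 0 };	/* assume nblocks <= 64 */
  for (int t = 0; t < nteams; t++)
    {
      int b = 0;		/* find the earliest-free block */
      for (int i = 1; i < nblocks; i++)
	if (free_at[i] < free_at[b])
	  b = i;
      free_at[b] += cost[t];
    }
  int makespan = 0;
  for (int i = 0; i < nblocks; i++)
    if (free_at[i] > makespan)
      makespan = free_at[i];
  return makespan;
}
```

With 3 teams of costs {5, 1, 1} and 2 resident blocks, the static scheme
pins team 2 to block 0 (total 5 + 1 = 6), while dynamic scheduling lets
block 1 pick it up after finishing team 1 (makespan 5).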

Alexander

