[gomp-nvptx 0/5] Reorganize soft-stack setup

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [gomp-nvptx 0/5] Reorganize soft-stack setup
@ 2016-02-15 18:44 Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell

I've committed the following 5-patch series to amonakov/gomp-nvptx git branch.
The first two patches are unrelated fixes to previously landed code.  Patches
3-5 reorganize the way initial soft-stack setup is done.

Previously soft-stacks used to be allocated by libgomp/config/nvptx/team.c in
a function that wrapped gomp_nvptx_main.  However:

  - the default device heap is only 8 MB, which is not enough for multiple
    teams with 128 KiB per-warp stacks; libgomp plugin would need to increase
    heap size;
  - device heap persists between launches, so it's possible to leak
    soft-stack allocations if a team exits without cleaning up;
  - device malloc is rather slow, so I'd like to eliminate or reuse device
    allocations as much as possible; it's easier to arrange reuse of soft
    stack storage from the host side;
  - there's a chicken-and-egg problem with setting up soft stacks from C code.

So the above motivates a transition to a scheme where libgomp core is
oblivious to soft stack setup, and instead the storage is allocated from the
libgomp plugin (via cuMemAlloc) and passed to the compiler-emitted entry
function as the 2nd (base pointer) and 3rd (per-warp size) arguments.  This
obviously addresses bullets 1-2 above, bullet 4 is addressed since the entry
code is emitted in assembly from the backend, and bullet 3 is left to a
followup change: cuMemAlloc is roughly as slow on the host as malloc is slow
on the device, but we should be able to reuse allocations on the host.

This changes the binary interface between libgomp plugin (GOMP_OFFLOAD_run)
and compiler-emitted kernel entry functions for OpenMP target regions.  For
now, I am free to do that on the branch without worries, but if a similar
change is required in the future after a release, libgomp plugin should be
able to detect which arguments the entry expects.  Assuming the argument list
is only appended to, libgomp plugin only needs to know the argument count.  So
a possible solution is to invent a tagging mechanism when the change needs to
be made, and provide the default 3 arguments to untagged entries.  Old libgomp
plugins unaware of the change should be able to detect failure to provide
sufficient arguments to entries emitted from new compiler from the failure of
cuLaunchKernel

Alexander Monakov (5):
  libgomp plugin: correct types
  Revert "nvptx plugin: bump heap size to 1GB"
  nvptx backend: set up stacks in entry code
  libgomp: remove __nvptx_stacks setup code
  libgomp plugin: manage soft-stack storage

 gcc/ChangeLog.gomp-nvptx      |  6 +++++
 gcc/config/nvptx/nvptx.c      | 57 ++++++++++++++++++++++++++++++------------
 libgomp/ChangeLog.gomp-nvptx  | 26 +++++++++++++++++++
 libgomp/config/nvptx/team.c   | 31 ++++-------------------
 libgomp/plugin/plugin-nvptx.c | 58 +++++++++++++++++++++++++++++++++++--------
 5 files changed, 126 insertions(+), 52 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
  2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
  2016-02-22 16:20   ` Nathan Sidwell
  2016-02-15 18:44 ` [gomp-nvptx 4/5] libgomp: remove __nvptx_stacks setup code Alexander Monakov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell

This patch implements the NVPTX backend part of the transition to
host-allocated soft stacks.  The compiler-emitted kernel entry code now
accepts a pointer to stack storage and per-warp stack size, and initialized
__nvptx_stacks based on that (as well as trivially initializing __nvptx_uni).

The rewritten part of write_omp_entry now uses macro-expanded assembly
snippets to avoid highly repetitive dynamic code accounting for 32/64-bit
differences.

	* config/nvptx/nvptx.c (write_omp_entry): Expand entry code to
	initialize __nvptx_uni and __nvptx_stacks (based on pointer to storage
	allocated by the libgomp plugin).

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index efd0f8e..81dd9a2 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -979,8 +979,10 @@ nvptx_init_unisimt_predicate (FILE *file)
 /* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
 
    extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
-   void __attribute__((kernel)) NAME(void *arg)
+   void __attribute__((kernel)) NAME (void *arg, char *stack, size_t stacksize)
    {
+     __nvptx_stacks[tid.y] = stack + stacksize * (ctaid.x * ntid.y + tid.y + 1);
+     __nvptx_uni[tid.y] = 0;
      gomp_nvptx_main (ORIG, arg);
    }
    ORIG itself should not be emitted as a PTX .entry function.  */
@@ -1000,21 +1002,44 @@ write_omp_entry (std::stringstream &s, const char *name, const char *orig)
       s << ".extern .func gomp_nvptx_main";
       s << "(.param" << sfx << " %in_ar1, .param" << sfx << " %in_ar2);\n";
     }
-  s << ".visible .entry " << name << "(.param" << sfx << " %in_ar1)\n";
-  s << "{\n";
-  s << "\t.reg" << sfx << " %ar1;\n";
-  s << "\t.reg" << sfx << " %r1;\n";
-  s << "\tld.param" << sfx << " %ar1, [%in_ar1];\n";
-  s << "\tmov" << sfx << " %r1, " << orig << ";\n";
-  s << "\t{\n";
-  s << "\t\t.param" << sfx << " %out_arg0;\n";
-  s << "\t\t.param" << sfx << " %out_arg1;\n";
-  s << "\t\tst.param" << sfx << " [%out_arg0], %r1;\n";
-  s << "\t\tst.param" << sfx << " [%out_arg1], %ar1;\n";
-  s << "\t\tcall.uni gomp_nvptx_main, (%out_arg0, %out_arg1);\n";
-  s << "\t}\n";
-  s << "\tret;\n";
-  s << "}\n";
+#define ENTRY_TEMPLATE(PS, PS_BYTES, MAD_PS_32) "\
+ (.param.u" PS " %arg, .param.u" PS " %stack, .param.u" PS " %sz)\n\
+{\n\
+	.reg.u32 %r<3>;\n\
+	.reg.u" PS " %R<4>;\n\
+	mov.u32 %r0, %tid.y;\n\
+	mov.u32 %r1, %ntid.y;\n\
+	mov.u32 %r2, %ctaid.x;\n\
+	cvt.u" PS ".u32 %R1, %r0;\n\
+	" MAD_PS_32 " %R1, %r1, %r2, %R1;\n\
+	mov.u" PS " %R0, __nvptx_stacks;\n\
+	" MAD_PS_32 " %R0, %r0, " PS_BYTES ", %R0;\n\
+	ld.param.u" PS " %R2, [%stack];\n\
+	ld.param.u" PS " %R3, [%sz];\n\
+	add.u" PS " %R2, %R2, %R3;\n\
+	mad.lo.u" PS " %R2, %R1, %R3, %R2;\n\
+	st.shared.u" PS " [%R0], %R2;\n\
+	mov.u" PS " %R0, __nvptx_uni;\n\
+	" MAD_PS_32 " %R0, %r0, 4, %R0;\n\
+	mov.u32 %r0, 0;\n\
+	st.shared.u32 [%R0], %r0;\n\
+	mov.u" PS " %R0, \0;\n\
+	ld.param.u" PS " %R1, [%arg];\n\
+	{\n\
+		.param.u" PS " %P<2>;\n\
+		st.param.u" PS " [%P0], %R0;\n\
+		st.param.u" PS " [%P1], %R1;\n\
+		call.uni gomp_nvptx_main, (%P0, %P1);\n\
+	}\n\
+	ret.uni;\n\
+}\n"
+  static const char template64[] = ENTRY_TEMPLATE ("64", "8", "mad.wide.u32");
+  static const char template32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32  ");
+#undef ENTRY_TEMPLATE
+  const char *template_1 = TARGET_ABI64 ? template64 : template32;
+  const char *template_2 = template_1 + strlen (template64) + 1;
+  s << ".visible .entry " << name << template_1 << orig << template_2;
+  need_softstack_decl = need_unisimt_decl = true;
 }
 
 /* Implement ASM_DECLARE_FUNCTION_NAME.  Writes the start of a ptx

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
  2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
@ 2016-02-22 16:20   ` Nathan Sidwell
  2016-02-22 20:25     ` Alexander Monakov
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-02-22 16:20 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches

On 02/15/16 13:44, Alexander Monakov wrote:
> This patch implements the NVPTX backend part of the transition to

> +  static const char template64[] = ENTRY_TEMPLATE ("64", "8", "mad.wide.u32");
> +  static const char template32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32  ");
> +#undef ENTRY_TEMPLATE
> +  const char *template_1 = TARGET_ABI64 ? template64 : template32;
> +  const char *template_2 = template_1 + strlen (template64) + 1;
                                          ^^^
this looks mighty suspicious -- are you presuming some specific placement of 
template64 & template32? (and even then I  think it'll only work for TARGET_ABI64)

> +  s << ".visible .entry " << name << template_1 << orig << template_2;
> +  need_softstack_decl = need_unisimt_decl = true;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
  2016-02-22 16:20   ` Nathan Sidwell
@ 2016-02-22 20:25     ` Alexander Monakov
  2016-02-22 20:31       ` Nathan Sidwell
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Monakov @ 2016-02-22 20:25 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc-patches

On Mon, 22 Feb 2016, Nathan Sidwell wrote:
> On 02/15/16 13:44, Alexander Monakov wrote:
> > This patch implements the NVPTX backend part of the transition to
> 
> > +  static const char template64[] = ENTRY_TEMPLATE ("64", "8",
> > "mad.wide.u32");
> > +  static const char template32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32
> > ");
> > +#undef ENTRY_TEMPLATE
> > +  const char *template_1 = TARGET_ABI64 ? template64 : template32;
> > +  const char *template_2 = template_1 + strlen (template64) + 1;
>                                          ^^^
> this looks mighty suspicious -- are you presuming some specific placement of
> template64 & template32? (and even then I  think it'll only work for
> TARGET_ABI64)

Template strings have an embedded nul character at the position where ORIG
goes, so template_2 is set to point at the position following the embedded nul
in template_1.  Offset of the embedded nul is the same in each template
string, so it doesn't matter which goes into the argument of strlen (but
supplying template64 or template32 instead of template_1 allows easier folding).

Alexander

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
  2016-02-22 20:25     ` Alexander Monakov
@ 2016-02-22 20:31       ` Nathan Sidwell
  2016-02-23  7:56         ` Alexander Monakov
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-02-22 20:31 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches

On 02/22/16 15:25, Alexander Monakov wrote:

> Template strings have an embedded nul character at the position where ORIG
> goes, so template_2 is set to point at the position following the embedded nul
> in template_1.  Offset of the embedded nul is the same in each template
> string, so it doesn't matter which goes into the argument of strlen (but
> supplying template64 or template32 instead of template_1 allows easier folding).

ew, that's disgusting!

nathan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
  2016-02-22 20:31       ` Nathan Sidwell
@ 2016-02-23  7:56         ` Alexander Monakov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-23  7:56 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc-patches

On Mon, 22 Feb 2016, Nathan Sidwell wrote:
> On 02/22/16 15:25, Alexander Monakov wrote:
> 
> > Template strings have an embedded nul character at the position where ORIG
> > goes, so template_2 is set to point at the position following the embedded
> > nul
> > in template_1.  Offset of the embedded nul is the same in each template
> > string, so it doesn't matter which goes into the argument of strlen (but
> > supplying template64 or template32 instead of template_1 allows easier
> > folding).
> 
> ew, that's disgusting!

So it'll blend in perfectly well with the rest of the ptx stuff in gcc, right?
Sorry, could not resist.

Please understand that I considered other approaches, and saw none that would
appear more beautiful/less ugly.  I'd be happy to add some comments to the
code, similar to my explanatory text above, if that's the problem.  If not,
and you actually imply that the code is not good (rather than delivering a
friendly jab) please give me some specific feedback to act on.

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [gomp-nvptx 4/5] libgomp: remove __nvptx_stacks setup code
  2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 5/5] libgomp plugin: manage soft-stack storage Alexander Monakov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
  To: gcc-patches

This patch implements the NVPTX libgomp part of the transition to
host-allocated soft stacks.  The wrapper around gomp_nvptx_main previously
responsible for that is no longer needed.

This mostly reverts commit b408f1293e29a009ba70a3fda7b800277e1f310a.

	* config/nvptx/team.c: (gomp_nvptx_main_1): Rename back to...
	(gomp_nvptx_main): ...this; delete the wrapper.

diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
index bc8c4e6..b9f9f9f 100644
--- a/libgomp/config/nvptx/team.c
+++ b/libgomp/config/nvptx/team.c
@@ -34,9 +34,12 @@ struct gomp_thread *nvptx_thrs __attribute__((shared));
 
 static void gomp_thread_start (struct gomp_thread_pool *);
 
-static void __attribute__((noinline))
-gomp_nvptx_main_1 (void (*fn) (void *), void *fn_data, int ntids, int tid)
+void
+gomp_nvptx_main (void (*fn) (void *), void *fn_data)
 {
+  int tid, ntids;
+  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
+  asm ("mov.u32 %0, %%ntid.y;" : "=r" (ntids));
   if (tid == 0)
     {
       gomp_global_icv.nthreads_var = ntids;
@@ -69,30 +72,6 @@ gomp_nvptx_main_1 (void (*fn) (void *), void *fn_data, int ntids, int tid)
     }
 }
 
-void
-gomp_nvptx_main (void (*fn) (void *), void *fn_data)
-{
-  int tid, ntids;
-  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
-  asm ("mov.u32 %0, %%ntid.y;" : "=r" (ntids));
-  char *stacks = 0;
-  int *__nvptx_uni;
-  asm ("cvta.shared.u64 %0, __nvptx_uni;" : "=r" (__nvptx_uni));
-  __nvptx_uni[tid] = 0;
-  if (tid == 0)
-    {
-      size_t stacksize = 131072;
-      stacks = gomp_malloc (stacksize * ntids);
-      char **__nvptx_stacks = 0;
-      asm ("cvta.shared.u64 %0, __nvptx_stacks;" : "=r" (__nvptx_stacks));
-      for (int i = 0; i < ntids; i++)
-	__nvptx_stacks[i] = stacks + stacksize * (i + 1);
-    }
-  asm ("bar.sync 0;");
-  gomp_nvptx_main_1 (fn, fn_data, ntids, tid);
-  free (stacks);
-}
-
 /* This function is a pthread_create entry point.  This contains the idle
    loop in which a thread waits to be called up to become part of a team.  */
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [gomp-nvptx 5/5] libgomp plugin: manage soft-stack storage
  2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 4/5] libgomp: remove __nvptx_stacks setup code Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 1/5] libgomp plugin: correct types Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 2/5] Revert "nvptx plugin: bump heap size to 1GB" Alexander Monakov
  4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
  To: gcc-patches

This patch implements the libgomp plugin part of the transition to
host-allocated soft stacks.  For now only a simple scheme with
allocation/deallocation per launch is implemented; a followup change is
planned to cache and reuse allocations when appropriate.

The call to cuLaunchKernel is changed to pass kernel entry function arguments
in a way that allows the driver to check for mismatch (but only when the
cumulative size of passed arguments is different).

	* plugin/plugin-nvptx.c (nvptx_stacks_size): New.
	(nvptx_stacks_alloc): New.
	(nvptx_stacks_free): New.
	(GOMP_OFFLOAD_run): Allocate soft-stacks storage from the host using
	the above new functions.  Use kernel launch interface that allows
	checking for mismatched total size of entry function arguments.
 
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index cb6a3ac..adf57b1 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1892,6 +1892,37 @@ nvptx_adjust_launch_bounds (struct targ_fn_descriptor *fn,
     *teams_p = max_blocks;
 }
 
+/* Return the size of per-warp stacks (see gcc -msoft-stack) to use for OpenMP
+   target regions.  */
+
+static size_t
+nvptx_stacks_size ()
+{
+  return 128 * 1024;
+}
+
+/* Return contiguous storage for NUM stacks, each SIZE bytes.  */
+
+static void *
+nvptx_stacks_alloc (size_t size, int num)
+{
+  CUdeviceptr stacks;
+  CUresult r = cuMemAlloc (&stacks, size * num);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
+  return (void *) stacks;
+}
+
+/* Release storage previously allocated by nvptx_stacks_alloc.  */
+
+static void
+nvptx_stacks_free (void *p, int num)
+{
+  CUresult r = cuMemFree ((CUdeviceptr) p);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -1899,7 +1930,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   CUresult r;
   struct ptx_device *ptx_dev = ptx_devices[ord];
   const char *maybe_abort_msg = "(perhaps abort was called)";
-  void *fn_args = &tgt_vars;
   int teams = 0, threads = 0;
 
   if (!args)
@@ -1922,10 +1952,19 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
     }
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
+  size_t stack_size = nvptx_stacks_size ();
+  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
+  void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
+  size_t fn_args_size = sizeof fn_args;
+  void *config[] = {
+    CU_LAUNCH_PARAM_BUFFER_POINTER, fn_args,
+    CU_LAUNCH_PARAM_BUFFER_SIZE, &fn_args_size,
+    CU_LAUNCH_PARAM_END
+  };
   r = cuLaunchKernel (function,
 		      teams, 1, 1,
 		      32, threads, 1,
-		      0, ptx_dev->null_stream->stream, &fn_args, 0);
+		      0, ptx_dev->null_stream->stream, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
@@ -1935,6 +1974,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 		       maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
+  nvptx_stacks_free (stacks, teams * threads);
 }
 
 void

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [gomp-nvptx 1/5] libgomp plugin: correct types
  2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
                   ` (2 preceding siblings ...)
  2016-02-15 18:44 ` [gomp-nvptx 5/5] libgomp plugin: manage soft-stack storage Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
  2016-02-15 18:44 ` [gomp-nvptx 2/5] Revert "nvptx plugin: bump heap size to 1GB" Alexander Monakov
  4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
  To: gcc-patches

Handling of arguments array wrongly assumed that 'long' would match the size
of 'void *'.  As that would break on MinGW, use 'intptr_t'.  Use 'int' for
'teams' and 'threads', as that's what cuLaunchKernel accepts.

	* plugin/plugin-nvptx.c (nvptx_adjust_launch_bounds): Adjust types.
	(GOMP_OFFLOAD_run): Ditto.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 39575d9..79fd253 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -41,6 +41,7 @@
 #include <cuda.h>
 #include <stdbool.h>
 #include <stdint.h>
+#include <limits.h>
 #include <string.h>
 #include <stdio.h>
 #include <unistd.h>
@@ -1874,7 +1875,7 @@ GOMP_OFFLOAD_openacc_set_cuda_stream (int async, void *stream)
 static void
 nvptx_adjust_launch_bounds (struct targ_fn_descriptor *fn,
 			    struct ptx_device *ptx_dev,
-			    long *teams_p, long *threads_p)
+			    int *teams_p, int *threads_p)
 {
   int max_warps_block = fn->max_threads_per_block / 32;
   /* Maximum 32 warps per block is an implementation limit in NVPTX backend
@@ -1903,19 +1904,20 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   struct ptx_device *ptx_dev = ptx_devices[ord];
   const char *maybe_abort_msg = "(perhaps abort was called)";
   void *fn_args = &tgt_vars;
-  long teams = 0, threads = 0;
+  int teams = 0, threads = 0;
 
   if (!args)
     GOMP_PLUGIN_fatal ("No target arguments provided");
   while (*args)
     {
-      long id = (long) *args++, val;
+      intptr_t id = (intptr_t) *args++, val;
       if (id & GOMP_TARGET_ARG_SUBSEQUENT_PARAM)
-	val = (long) *args++;
+	val = (intptr_t) *args++;
       else
         val = id >> GOMP_TARGET_ARG_VALUE_SHIFT;
       if ((id & GOMP_TARGET_ARG_DEVICE_MASK) != GOMP_TARGET_ARG_DEVICE_ALL)
 	continue;
+      val = val > INT_MAX ? INT_MAX : val;
       id &= GOMP_TARGET_ARG_ID_MASK;
       if (id == GOMP_TARGET_ARG_NUM_TEAMS)
 	teams = val;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [gomp-nvptx 2/5] Revert "nvptx plugin: bump heap size to 1GB"
  2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
                   ` (3 preceding siblings ...)
  2016-02-15 18:44 ` [gomp-nvptx 1/5] libgomp plugin: correct types Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
  4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
  To: gcc-patches

This reverts commit 7d36b841341cde96f6cf89c5232916062da3fe4c.

The change was not well motivated: soft stacks would not fit in the default
8 MB heap only with multiple teams.  With the transition to host-allocated
soft stacks, libgomp uses the device heap only for relatively small
allocations.

	Revert
	2015-12-09  Alexander Monakov  <amonakov@ispras.ru>

	* plugin/plugin-nvptx.c (nvptx_open_device): Adjust heap size.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 79fd253..cb6a3ac 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -694,10 +694,6 @@ nvptx_open_device (int n)
 
   init_streams_for_device (ptx_dev, async_engines);
 
-  r = cuCtxSetLimit (CU_LIMIT_MALLOC_HEAP_SIZE, 1<<30);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxSetLimit error: %s", cuda_error (r));
-
   return ptx_dev;
 }
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-02-23  7:56 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
2016-02-22 16:20   ` Nathan Sidwell
2016-02-22 20:25     ` Alexander Monakov
2016-02-22 20:31       ` Nathan Sidwell
2016-02-23  7:56         ` Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 4/5] libgomp: remove __nvptx_stacks setup code Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 5/5] libgomp plugin: manage soft-stack storage Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 1/5] libgomp plugin: correct types Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 2/5] Revert "nvptx plugin: bump heap size to 1GB" Alexander Monakov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).