* [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
2016-02-22 16:20 ` Nathan Sidwell
2016-02-15 18:44 ` [gomp-nvptx 4/5] libgomp: remove __nvptx_stacks setup code Alexander Monakov
` (3 subsequent siblings)
4 siblings, 1 reply; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
To: gcc-patches; +Cc: Nathan Sidwell
This patch implements the NVPTX backend part of the transition to
host-allocated soft stacks. The compiler-emitted kernel entry code now
accepts a pointer to stack storage and per-warp stack size, and initialized
__nvptx_stacks based on that (as well as trivially initializing __nvptx_uni).
The rewritten part of write_omp_entry now uses macro-expanded assembly
snippets to avoid highly repetitive dynamic code accounting for 32/64-bit
differences.
* config/nvptx/nvptx.c (write_omp_entry): Expand entry code to
initialize __nvptx_uni and __nvptx_stacks (based on pointer to storage
allocated by the libgomp plugin).
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index efd0f8e..81dd9a2 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -979,8 +979,10 @@ nvptx_init_unisimt_predicate (FILE *file)
/* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
- void __attribute__((kernel)) NAME(void *arg)
+ void __attribute__((kernel)) NAME (void *arg, char *stack, size_t stacksize)
{
+ __nvptx_stacks[tid.y] = stack + stacksize * (ctaid.x * ntid.y + tid.y + 1);
+ __nvptx_uni[tid.y] = 0;
gomp_nvptx_main (ORIG, arg);
}
ORIG itself should not be emitted as a PTX .entry function. */
@@ -1000,21 +1002,44 @@ write_omp_entry (std::stringstream &s, const char *name, const char *orig)
s << ".extern .func gomp_nvptx_main";
s << "(.param" << sfx << " %in_ar1, .param" << sfx << " %in_ar2);\n";
}
- s << ".visible .entry " << name << "(.param" << sfx << " %in_ar1)\n";
- s << "{\n";
- s << "\t.reg" << sfx << " %ar1;\n";
- s << "\t.reg" << sfx << " %r1;\n";
- s << "\tld.param" << sfx << " %ar1, [%in_ar1];\n";
- s << "\tmov" << sfx << " %r1, " << orig << ";\n";
- s << "\t{\n";
- s << "\t\t.param" << sfx << " %out_arg0;\n";
- s << "\t\t.param" << sfx << " %out_arg1;\n";
- s << "\t\tst.param" << sfx << " [%out_arg0], %r1;\n";
- s << "\t\tst.param" << sfx << " [%out_arg1], %ar1;\n";
- s << "\t\tcall.uni gomp_nvptx_main, (%out_arg0, %out_arg1);\n";
- s << "\t}\n";
- s << "\tret;\n";
- s << "}\n";
+#define ENTRY_TEMPLATE(PS, PS_BYTES, MAD_PS_32) "\
+ (.param.u" PS " %arg, .param.u" PS " %stack, .param.u" PS " %sz)\n\
+{\n\
+ .reg.u32 %r<3>;\n\
+ .reg.u" PS " %R<4>;\n\
+ mov.u32 %r0, %tid.y;\n\
+ mov.u32 %r1, %ntid.y;\n\
+ mov.u32 %r2, %ctaid.x;\n\
+ cvt.u" PS ".u32 %R1, %r0;\n\
+ " MAD_PS_32 " %R1, %r1, %r2, %R1;\n\
+ mov.u" PS " %R0, __nvptx_stacks;\n\
+ " MAD_PS_32 " %R0, %r0, " PS_BYTES ", %R0;\n\
+ ld.param.u" PS " %R2, [%stack];\n\
+ ld.param.u" PS " %R3, [%sz];\n\
+ add.u" PS " %R2, %R2, %R3;\n\
+ mad.lo.u" PS " %R2, %R1, %R3, %R2;\n\
+ st.shared.u" PS " [%R0], %R2;\n\
+ mov.u" PS " %R0, __nvptx_uni;\n\
+ " MAD_PS_32 " %R0, %r0, 4, %R0;\n\
+ mov.u32 %r0, 0;\n\
+ st.shared.u32 [%R0], %r0;\n\
+ mov.u" PS " %R0, \0;\n\
+ ld.param.u" PS " %R1, [%arg];\n\
+ {\n\
+ .param.u" PS " %P<2>;\n\
+ st.param.u" PS " [%P0], %R0;\n\
+ st.param.u" PS " [%P1], %R1;\n\
+ call.uni gomp_nvptx_main, (%P0, %P1);\n\
+ }\n\
+ ret.uni;\n\
+}\n"
+ static const char template64[] = ENTRY_TEMPLATE ("64", "8", "mad.wide.u32");
+ static const char template32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32 ");
+#undef ENTRY_TEMPLATE
+ const char *template_1 = TARGET_ABI64 ? template64 : template32;
+ const char *template_2 = template_1 + strlen (template64) + 1;
+ s << ".visible .entry " << name << template_1 << orig << template_2;
+ need_softstack_decl = need_unisimt_decl = true;
}
/* Implement ASM_DECLARE_FUNCTION_NAME. Writes the start of a ptx
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
@ 2016-02-22 16:20 ` Nathan Sidwell
2016-02-22 20:25 ` Alexander Monakov
0 siblings, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-02-22 16:20 UTC (permalink / raw)
To: Alexander Monakov, gcc-patches
On 02/15/16 13:44, Alexander Monakov wrote:
> This patch implements the NVPTX backend part of the transition to
> + static const char template64[] = ENTRY_TEMPLATE ("64", "8", "mad.wide.u32");
> + static const char template32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32 ");
> +#undef ENTRY_TEMPLATE
> + const char *template_1 = TARGET_ABI64 ? template64 : template32;
> + const char *template_2 = template_1 + strlen (template64) + 1;
^^^
this looks mighty suspicious -- are you presuming some specific placement of
template64 & template32? (and even then I think it'll only work for TARGET_ABI64)
> + s << ".visible .entry " << name << template_1 << orig << template_2;
> + need_softstack_decl = need_unisimt_decl = true;
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
2016-02-22 16:20 ` Nathan Sidwell
@ 2016-02-22 20:25 ` Alexander Monakov
2016-02-22 20:31 ` Nathan Sidwell
0 siblings, 1 reply; 10+ messages in thread
From: Alexander Monakov @ 2016-02-22 20:25 UTC (permalink / raw)
To: Nathan Sidwell; +Cc: gcc-patches
On Mon, 22 Feb 2016, Nathan Sidwell wrote:
> On 02/15/16 13:44, Alexander Monakov wrote:
> > This patch implements the NVPTX backend part of the transition to
>
> > + static const char template64[] = ENTRY_TEMPLATE ("64", "8",
> > "mad.wide.u32");
> > + static const char template32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32
> > ");
> > +#undef ENTRY_TEMPLATE
> > + const char *template_1 = TARGET_ABI64 ? template64 : template32;
> > + const char *template_2 = template_1 + strlen (template64) + 1;
> ^^^
> this looks mighty suspicious -- are you presuming some specific placement of
> template64 & template32? (and even then I think it'll only work for
> TARGET_ABI64)
Template strings have an embedded nul character at the position where ORIG
goes, so template_2 is set to point at the position following the embedded nul
in template_1. Offset of the embedded nul is the same in each template
string, so it doesn't matter which goes into the argument of strlen (but
supplying template64 or template32 instead of template_1 allows easier folding).
Alexander
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
2016-02-22 20:25 ` Alexander Monakov
@ 2016-02-22 20:31 ` Nathan Sidwell
2016-02-23 7:56 ` Alexander Monakov
0 siblings, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-02-22 20:31 UTC (permalink / raw)
To: Alexander Monakov; +Cc: gcc-patches
On 02/22/16 15:25, Alexander Monakov wrote:
> Template strings have an embedded nul character at the position where ORIG
> goes, so template_2 is set to point at the position following the embedded nul
> in template_1. Offset of the embedded nul is the same in each template
> string, so it doesn't matter which goes into the argument of strlen (but
> supplying template64 or template32 instead of template_1 allows easier folding).
ew, that's disgusting!
nathan
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code
2016-02-22 20:31 ` Nathan Sidwell
@ 2016-02-23 7:56 ` Alexander Monakov
0 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-23 7:56 UTC (permalink / raw)
To: Nathan Sidwell; +Cc: gcc-patches
On Mon, 22 Feb 2016, Nathan Sidwell wrote:
> On 02/22/16 15:25, Alexander Monakov wrote:
>
> > Template strings have an embedded nul character at the position where ORIG
> > goes, so template_2 is set to point at the position following the embedded
> > nul
> > in template_1. Offset of the embedded nul is the same in each template
> > string, so it doesn't matter which goes into the argument of strlen (but
> > supplying template64 or template32 instead of template_1 allows easier
> > folding).
>
> ew, that's disgusting!
So it'll blend in perfectly well with the rest of the ptx stuff in gcc, right?
Sorry, could not resist.
Please understand that I considered other approaches, and saw none that would
appear more beautiful/less ugly. I'd be happy to add some comments to the
code, similar to my explanatory text above, if that's the problem. If not,
and you actually imply that the code is not good (rather than delivering a
friendly jab) please give me some specific feedback to act on.
Thanks.
Alexander
^ permalink raw reply [flat|nested] 10+ messages in thread
* [gomp-nvptx 4/5] libgomp: remove __nvptx_stacks setup code
2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 5/5] libgomp plugin: manage soft-stack storage Alexander Monakov
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
To: gcc-patches
This patch implements the NVPTX libgomp part of the transition to
host-allocated soft stacks. The wrapper around gomp_nvptx_main previously
responsible for that is no longer needed.
This mostly reverts commit b408f1293e29a009ba70a3fda7b800277e1f310a.
* config/nvptx/team.c: (gomp_nvptx_main_1): Rename back to...
(gomp_nvptx_main): ...this; delete the wrapper.
diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
index bc8c4e6..b9f9f9f 100644
--- a/libgomp/config/nvptx/team.c
+++ b/libgomp/config/nvptx/team.c
@@ -34,9 +34,12 @@ struct gomp_thread *nvptx_thrs __attribute__((shared));
static void gomp_thread_start (struct gomp_thread_pool *);
-static void __attribute__((noinline))
-gomp_nvptx_main_1 (void (*fn) (void *), void *fn_data, int ntids, int tid)
+void
+gomp_nvptx_main (void (*fn) (void *), void *fn_data)
{
+ int tid, ntids;
+ asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
+ asm ("mov.u32 %0, %%ntid.y;" : "=r" (ntids));
if (tid == 0)
{
gomp_global_icv.nthreads_var = ntids;
@@ -69,30 +72,6 @@ gomp_nvptx_main_1 (void (*fn) (void *), void *fn_data, int ntids, int tid)
}
}
-void
-gomp_nvptx_main (void (*fn) (void *), void *fn_data)
-{
- int tid, ntids;
- asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
- asm ("mov.u32 %0, %%ntid.y;" : "=r" (ntids));
- char *stacks = 0;
- int *__nvptx_uni;
- asm ("cvta.shared.u64 %0, __nvptx_uni;" : "=r" (__nvptx_uni));
- __nvptx_uni[tid] = 0;
- if (tid == 0)
- {
- size_t stacksize = 131072;
- stacks = gomp_malloc (stacksize * ntids);
- char **__nvptx_stacks = 0;
- asm ("cvta.shared.u64 %0, __nvptx_stacks;" : "=r" (__nvptx_stacks));
- for (int i = 0; i < ntids; i++)
- __nvptx_stacks[i] = stacks + stacksize * (i + 1);
- }
- asm ("bar.sync 0;");
- gomp_nvptx_main_1 (fn, fn_data, ntids, tid);
- free (stacks);
-}
-
/* This function is a pthread_create entry point. This contains the idle
loop in which a thread waits to be called up to become part of a team. */
^ permalink raw reply [flat|nested] 10+ messages in thread
* [gomp-nvptx 5/5] libgomp plugin: manage soft-stack storage
2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 3/5] nvptx backend: set up stacks in entry code Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 4/5] libgomp: remove __nvptx_stacks setup code Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 1/5] libgomp plugin: correct types Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 2/5] Revert "nvptx plugin: bump heap size to 1GB" Alexander Monakov
4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
To: gcc-patches
This patch implements the libgomp plugin part of the transition to
host-allocated soft stacks. For now only a simple scheme with
allocation/deallocation per launch is implemented; a followup change is
planned to cache and reuse allocations when appropriate.
The call to cuLaunchKernel is changed to pass kernel entry function arguments
in a way that allows the driver to check for mismatch (but only when the
cumulative size of passed arguments is different).
* plugin/plugin-nvptx.c (nvptx_stacks_size): New.
(nvptx_stacks_alloc): New.
(nvptx_stacks_free): New.
(GOMP_OFFLOAD_run): Allocate soft-stacks storage from the host using
the above new functions. Use kernel launch interface that allows
checking for mismatched total size of entry function arguments.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index cb6a3ac..adf57b1 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1892,6 +1892,37 @@ nvptx_adjust_launch_bounds (struct targ_fn_descriptor *fn,
*teams_p = max_blocks;
}
+/* Return the size of per-warp stacks (see gcc -msoft-stack) to use for OpenMP
+ target regions. */
+
+static size_t
+nvptx_stacks_size ()
+{
+ return 128 * 1024;
+}
+
+/* Return contiguous storage for NUM stacks, each SIZE bytes. */
+
+static void *
+nvptx_stacks_alloc (size_t size, int num)
+{
+ CUdeviceptr stacks;
+ CUresult r = cuMemAlloc (&stacks, size * num);
+ if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
+ return (void *) stacks;
+}
+
+/* Release storage previously allocated by nvptx_stacks_alloc. */
+
+static void
+nvptx_stacks_free (void *p, int num)
+{
+ CUresult r = cuMemFree ((CUdeviceptr) p);
+ if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+}
+
void
GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
{
@@ -1899,7 +1930,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
CUresult r;
struct ptx_device *ptx_dev = ptx_devices[ord];
const char *maybe_abort_msg = "(perhaps abort was called)";
- void *fn_args = &tgt_vars;
int teams = 0, threads = 0;
if (!args)
@@ -1922,10 +1952,19 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
}
nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
+ size_t stack_size = nvptx_stacks_size ();
+ void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
+ void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
+ size_t fn_args_size = sizeof fn_args;
+ void *config[] = {
+ CU_LAUNCH_PARAM_BUFFER_POINTER, fn_args,
+ CU_LAUNCH_PARAM_BUFFER_SIZE, &fn_args_size,
+ CU_LAUNCH_PARAM_END
+ };
r = cuLaunchKernel (function,
teams, 1, 1,
32, threads, 1,
- 0, ptx_dev->null_stream->stream, &fn_args, 0);
+ 0, ptx_dev->null_stream->stream, NULL, config);
if (r != CUDA_SUCCESS)
GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
@@ -1935,6 +1974,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
maybe_abort_msg);
else if (r != CUDA_SUCCESS)
GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
+ nvptx_stacks_free (stacks, teams * threads);
}
void
^ permalink raw reply [flat|nested] 10+ messages in thread
* [gomp-nvptx 1/5] libgomp plugin: correct types
2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
` (2 preceding siblings ...)
2016-02-15 18:44 ` [gomp-nvptx 5/5] libgomp plugin: manage soft-stack storage Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
2016-02-15 18:44 ` [gomp-nvptx 2/5] Revert "nvptx plugin: bump heap size to 1GB" Alexander Monakov
4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
To: gcc-patches
Handling of arguments array wrongly assumed that 'long' would match the size
of 'void *'. As that would break on MinGW, use 'intptr_t'. Use 'int' for
'teams' and 'threads', as that's what cuLaunchKernel accepts.
* plugin/plugin-nvptx.c (nvptx_adjust_launch_bounds): Adjust types.
(GOMP_OFFLOAD_run): Ditto.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 39575d9..79fd253 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -41,6 +41,7 @@
#include <cuda.h>
#include <stdbool.h>
#include <stdint.h>
+#include <limits.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
@@ -1874,7 +1875,7 @@ GOMP_OFFLOAD_openacc_set_cuda_stream (int async, void *stream)
static void
nvptx_adjust_launch_bounds (struct targ_fn_descriptor *fn,
struct ptx_device *ptx_dev,
- long *teams_p, long *threads_p)
+ int *teams_p, int *threads_p)
{
int max_warps_block = fn->max_threads_per_block / 32;
/* Maximum 32 warps per block is an implementation limit in NVPTX backend
@@ -1903,19 +1904,20 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
struct ptx_device *ptx_dev = ptx_devices[ord];
const char *maybe_abort_msg = "(perhaps abort was called)";
void *fn_args = &tgt_vars;
- long teams = 0, threads = 0;
+ int teams = 0, threads = 0;
if (!args)
GOMP_PLUGIN_fatal ("No target arguments provided");
while (*args)
{
- long id = (long) *args++, val;
+ intptr_t id = (intptr_t) *args++, val;
if (id & GOMP_TARGET_ARG_SUBSEQUENT_PARAM)
- val = (long) *args++;
+ val = (intptr_t) *args++;
else
val = id >> GOMP_TARGET_ARG_VALUE_SHIFT;
if ((id & GOMP_TARGET_ARG_DEVICE_MASK) != GOMP_TARGET_ARG_DEVICE_ALL)
continue;
+ val = val > INT_MAX ? INT_MAX : val;
id &= GOMP_TARGET_ARG_ID_MASK;
if (id == GOMP_TARGET_ARG_NUM_TEAMS)
teams = val;
^ permalink raw reply [flat|nested] 10+ messages in thread
* [gomp-nvptx 2/5] Revert "nvptx plugin: bump heap size to 1GB"
2016-02-15 18:44 [gomp-nvptx 0/5] Reorganize soft-stack setup Alexander Monakov
` (3 preceding siblings ...)
2016-02-15 18:44 ` [gomp-nvptx 1/5] libgomp plugin: correct types Alexander Monakov
@ 2016-02-15 18:44 ` Alexander Monakov
4 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-02-15 18:44 UTC (permalink / raw)
To: gcc-patches
This reverts commit 7d36b841341cde96f6cf89c5232916062da3fe4c.
The change was not well motivated: soft stacks would not fit in the default
8 MB heap only with multiple teams. With the transition to host-allocated
soft stacks, libgomp uses the device heap only for relatively small
allocations.
Revert
2015-12-09 Alexander Monakov <amonakov@ispras.ru>
* plugin/plugin-nvptx.c (nvptx_open_device): Adjust heap size.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 79fd253..cb6a3ac 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -694,10 +694,6 @@ nvptx_open_device (int n)
init_streams_for_device (ptx_dev, async_engines);
- r = cuCtxSetLimit (CU_LIMIT_MALLOC_HEAP_SIZE, 1<<30);
- if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuCtxSetLimit error: %s", cuda_error (r));
-
return ptx_dev;
}
^ permalink raw reply [flat|nested] 10+ messages in thread