public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [gomp4] Redesign oacc_parallel launch API
@ 2015-07-28 17:11 Nathan Sidwell
  2015-07-28 17:46 ` Nathan Sidwell
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Nathan Sidwell @ 2015-07-28 17:11 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1556 bytes --]

I've committed this patch to the gomp4 branch to redo the launch API.  I'll post 
a version for trunk once the versioning patch gets approved & committed.

This changes the API in a number of ways, allowing device-specific knowledge to 
be moved into the device compiler and out of the host compiler.

Firstly, we attach a tuple of launch dimensions as an attribute to the offloaded 
function's 'oacc function' attribute.  These are the constant launch dimensions. 
  Dynamic dimensions get a zero for their slot in this list.  Further this list 
can be extended in the future to an alist keyed by device_type.

Dynamic dimensions are computed on the host.  however they are passed via 
varadic args to the GOACC_parallel function (which is renamed).  The varadic 
args are passed using key/value representation, and 3 keys are currently defined:
END -- end of the varadic list
DIM - set of runtime-computed dimensions.  Only the dynamic ones are passed.
ASYNC_WAIT - an async and a set of waits (possibly zero).

I have arranged for the key to have a slot that can later be filled by 
device_type, and hence support multiple device types.

The constant dimensions can be used in expansion of the GOACC_nid function in 
the device compiler.  The device compiler could also process that list to select 
the device_type slot that is appropriate.

For PTX the backend is augmented to emit the launch dimensions into the target 
data, from whence the ptx plugin can pick them up and overwrite with any dynamic 
ones passed in from the launch function.

nathan

[-- Attachment #2: gomp4-launch.patch --]
[-- Type: text/x-patch, Size: 39869 bytes --]

2015-07-28  Nathan Sidwell  <nathan@codesourcery.com>

	include/
	* gomp-constants.h (GOMP_DIM_GANG, GOMP_DIM_WORKER,
	GOMP_DIM_VECTOR): New.
	(GOMP_DIM_MAX, GOMP_DIM_MASK): New.
	(GOMP_LAUNCH_END, GOMP_LAUNCH_DIM, GOMP_LAUNCH_ASYNC_WAIT): New.
	(GOMP_LAUNCH_CODE_SHIFT, GOMP_LAUCNH_DEVICE_SHIFT,
	GOMP_LAUNCH_OP_SHIFT): New.
	(GOMP_LAUNCH_PACK, GOMP_LAUNCH_CODE, GOMP_LAUNCH_DEVICE,
	GOMP_LAUNCH_OP): New.
	(GOMP_VERSION_NVIDIA_PTX): Increment to 1.

	gcc/
	* tree.h (OMP_CLAUSE_EXPR): New.
	* omp-low.c (creste_omp_child_function): Do not set oacc function
	attribute here.
	(oacc_launch_pack): New.
	(OACC_FN_ATTRIB): New define.
	(set_oacc_fn_attrib): New.
	(get_oacc_fn_attrib): New.
	(expand_omp_target): Reimplement openacc launch parameters.
	* omp-low.h (get_oacc_fn_attrib): Declare.
	* omp-builtins.def (BUILT_IN_GOACC_KERNELS_INTERNAL): Change type.
	(BUILT_IN_GOACC_PARALLEL): Change type and target name.
	* builtin-types.def
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR): Replace with ...
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR): ... this.
	* tree-parloops.c (create_parallel_loop): Adjust index of
	shared_size arg.
	* except.c: Include omp-low.h
	(finish_eh_generation): Call get_oacc_fn_attrib.
	* config/nvptx/mkoffload.c (process): Accumulate compute grid
	dimensions and emit them.
	* config/nvptx/nvptx.c: Include gomp-constants.h
	(nvptx_record_offload_symbol): Emit compute grid dimensions.

	libgomp/
	* libgomp.map: Add GOACC_parallel_keyed.
	* libgomp.h (struct acc_dispatch_t): Change exec_func parameters.
	* libgomp_g.h (GOACC_parallel): Replace with ...
	(GOACC_parallel_keyed): ... this.
	* oacc-parallel.c (goacc_wait): Take pointer to va_list.  Adjust
	all callers.
	(GOACC_parallel_keyed): Use varadic keyed interface for optional
	parameters.  Renamed from ...
	(GOACC_parallel): ... here.  Replace with forwarding fn.
	* plugin/plugin-host.c (GOMP_OFFLOAD_openacc_parallel): Adjust
	parameters.
	* plugin/plugin-nvptx.c (struct targ_fn_launch): New structure.
	(targ_fn_descriptor): Point to targ_fn_launch instance.
	(nvptx_exec): Adjust parameters.  Process compute dimensions.
	(struct nvptx_tdata): Adjust type.
	(GOMP_OFFLOAD_load_image_ver): Adjust function handling.
	(GOMP_OFFLOAD_openacc_parallel): Adjust.


	gcc/c-family/
	* c-common.c (DEF_FUNCTION_TYPE_VAR_12): Delete.

	gcc/fortran/
	* f95-lang.c (DEF_FUNCTION_TYPE_VAR_12): Delete.
	* types.def (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR): Replace with ...
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR): ... this.

	gcc/lto/
	* lto-lang.c (DEF_FUNCTION_TYPE_VAR_12): Delete.

Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c	(revision 226312)
+++ libgomp/oacc-parallel.c	(working copy)
@@ -171,15 +171,12 @@ goacc_deallocate_static (acc_device_t d)
   alloc_done = false;
 }
 
-static void goacc_wait (int async, int num_waits, va_list ap);
+static void goacc_wait (int async, int num_waits, va_list *ap);
 
 void
-GOACC_parallel (int device, void (*fn) (void *),
-		size_t mapnum, void **hostaddrs, size_t *sizes,
-		unsigned short *kinds,
-		int num_gangs, int num_workers, int vector_length,
-		size_t shared_size,
-		int async, int num_waits, ...)
+GOACC_parallel_keyed (int device, void (*fn) (void *), size_t mapnum,
+		      void **hostaddrs, size_t *sizes, unsigned short *kinds,
+		      size_t shared_size, ...)
 {
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
   va_list ap;
@@ -191,17 +188,21 @@ GOACC_parallel (int device, void (*fn) (
   struct splay_tree_key_s k;
   splay_tree_key tgt_fn_key;
   void (*tgt_fn);
+  int async = GOMP_ASYNC_SYNC;
+  unsigned dims[3];
+  unsigned tag;
 
+  memset (dims, 0, sizeof (dims));
 #ifdef HAVE_INTTYPES_H
   gomp_debug (0, "%s: mapnum=%"PRIu64", hostaddrs=%p, sizes=%p, kinds=%p, "
-	      "shared_size=%"PRIu64", async = %d\n",
+	      "shared_size=%"PRIu64"\n",
 	      __FUNCTION__, (uint64_t) mapnum, hostaddrs, sizes, kinds,
-	      (uint64_t) shared_size, async);
+	      (uint64_t) shared_size);
 #else
   gomp_debug (0, "%s: mapnum=%lu, hostaddrs=%p, sizes=%p, kinds=%p, "
-	      "shared_size=%lu, async=%d\n",
+	      "shared_size=%lu\n",
 	      __FUNCTION__, (unsigned long) mapnum, hostaddrs, sizes, kinds,
-	      (unsigned long) shared_size, async);
+	      (unsigned long) shared_size);
 #endif
 
   alloc_ganglocal_addrs (mapnum, hostaddrs, sizes, kinds);
@@ -249,12 +250,36 @@ GOACC_parallel (int device, void (*fn) (
   if (acc_device_type (acc_dev->type) == acc_device_host_nonshm)
     alloc_host_shared_mem (shared_size);
 
-  if (num_waits)
+  va_start (ap, shared_size);
+  /* TODO: This will need amending when device_type is implemented.  */
+  while (GOMP_LAUNCH_PACK (GOMP_LAUNCH_END, 0, 0)
+	 != (tag = va_arg (ap, unsigned)))
     {
-      va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
-      va_end (ap);
+      assert (!GOMP_LAUNCH_DEVICE (tag));
+      switch (GOMP_LAUNCH_CODE (tag))
+	{
+	case GOMP_LAUNCH_DIM:
+	  {
+	    unsigned mask = GOMP_LAUNCH_OP (tag);
+
+	    for (i = 0; i != 3; i++)
+	      if (mask & (1 << i)) /* FIXME: move to gomp-constants. */
+		dims[i] = va_arg (ap, unsigned);
+	  }
+	  break;
+
+	case GOMP_LAUNCH_ASYNC_WAIT:
+	  {
+	    unsigned num_waits = GOMP_LAUNCH_OP (tag);
+
+	    async = va_arg (ap, unsigned);
+	    if (num_waits)
+	      goacc_wait (async, num_waits, &ap);
+	    break;
+	  }
+	}
     }
+  va_end (ap);
   
   acc_dev->openacc.async_set_async_func (async);
 
@@ -287,9 +312,8 @@ GOACC_parallel (int device, void (*fn) (
 	devaddrs[i] = NULL;
     }
 
-  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, sizes, kinds,
-			      num_gangs, num_workers, vector_length, async,
-			      shared_size, tgt);
+  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+			      shared_size, async, dims, tgt);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
@@ -306,6 +330,39 @@ GOACC_parallel (int device, void (*fn) (
     free_host_shared_mem ();
 }
 
+/* Legacy entry point.   */
+
+void
+GOACC_parallel (int device, void (*fn) (void *), size_t mapnum,
+		void **hostaddrs, size_t *sizes, unsigned short *kinds,
+		int num_gangs, int num_workers, int vector_length,
+		size_t shared_size,
+		int async, int num_waits, ...)
+{
+  int waits[9];
+  unsigned ix;
+  va_list ap;
+
+  if (num_waits > 8)
+    gomp_fatal ("too many waits for legacy interface");
+
+  va_start (ap, num_waits);
+  for (ix = 0; ix != num_waits; ix++)
+    waits[ix] = va_arg (ap, int);
+  va_end (ap);
+  waits[ix] = GOMP_LAUNCH_PACK (GOMP_LAUNCH_END, 0, 0);
+  
+  GOACC_parallel_keyed (device, fn, mapnum, hostaddrs, sizes, kinds,
+			shared_size,
+			GOMP_LAUNCH_PACK (GOMP_LAUNCH_DIM, 0,
+					  GOMP_DIM_MASK (GOMP_DIM_MAX) - 1),
+			num_gangs, num_workers, vector_length,
+			GOMP_LAUNCH_PACK (GOMP_LAUNCH_ASYNC_WAIT,
+					  0, num_waits),
+			async, waits[0], waits[1], waits[2], waits[3],
+			waits[4], waits[5], waits[6], waits[7], waits[8]);
+}
+
 void
 GOACC_data_start (int device, size_t mapnum,
 		  void **hostaddrs, size_t *sizes, unsigned short *kinds)
@@ -399,7 +456,7 @@ GOACC_enter_exit_data (int device, size_
       va_list ap;
 
       va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
+      goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
 
@@ -529,15 +586,15 @@ GOACC_enter_exit_data (int device, size_
 }
 
 static void
-goacc_wait (int async, int num_waits, va_list ap)
+goacc_wait (int async, int num_waits, va_list *ap)
 {
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
   while (num_waits--)
     {
-      int qid = va_arg (ap, int);
-
+      int qid = va_arg (*ap, int);
+      
       if (acc_async_test (qid))
 	continue;
 
@@ -574,7 +631,7 @@ GOACC_update (int device, size_t mapnum,
       va_list ap;
 
       va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
+      goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
 
@@ -616,7 +673,7 @@ GOACC_wait (int async, int num_waits, ..
       va_list ap;
 
       va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
+      goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
   else if (async == acc_async_sync)
Index: libgomp/libgomp_g.h
===================================================================
--- libgomp/libgomp_g.h	(revision 226312)
+++ libgomp/libgomp_g.h	(working copy)
@@ -222,9 +222,9 @@ extern void GOACC_data_start (int, size_
 extern void GOACC_data_end (void);
 extern void GOACC_enter_exit_data (int, size_t, void **,
 				   size_t *, unsigned short *, int, int, ...);
-extern void GOACC_parallel (int, void (*) (void *), size_t,
-			    void **, size_t *, unsigned short *, int, int, int,
-			    size_t, int, int, ...);
+extern void GOACC_parallel_keyed (int, void (*) (void *), size_t,
+				  void **, size_t *, unsigned short *,
+				  size_t, ...);
 extern void GOACC_update (int, size_t, void **, size_t *,
 			  unsigned short *, int, int, ...);
 extern void GOACC_wait (int, int, ...);
Index: libgomp/libgomp.map
===================================================================
--- libgomp/libgomp.map	(revision 226312)
+++ libgomp/libgomp.map	(working copy)
@@ -337,6 +337,7 @@ GOACC_2.0.GOMP_4_BRANCH {
   global:
 	GOACC_deviceptr;
 	GOACC_get_ganglocal_ptr;
+	GOACC_parallel_keyed;
 	GOACC_register_static;
 } GOACC_2.0;
 
Index: libgomp/libgomp.h
===================================================================
--- libgomp/libgomp.h	(revision 226312)
+++ libgomp/libgomp.h	(working copy)
@@ -694,8 +694,8 @@ typedef struct acc_dispatch_t
   struct target_mem_desc *data_environ;
 
   /* Execute.  */
-  void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *,
-		     unsigned short *, int, int, int, int, size_t, void *);
+  void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t, int,
+		     unsigned *, void *);
 
   /* Async cleanup callback registration.  */
   void (*register_async_cleanup_func) (void *);
Index: libgomp/plugin/plugin-host.c
===================================================================
--- libgomp/plugin/plugin-host.c	(revision 226312)
+++ libgomp/plugin/plugin-host.c	(working copy)
@@ -170,13 +170,9 @@ GOMP_OFFLOAD_openacc_parallel (void (*fn
 			       size_t mapnum __attribute__ ((unused)),
 			       void **hostaddrs __attribute__ ((unused)),
 			       void **devaddrs __attribute__ ((unused)),
-			       size_t *sizes __attribute__ ((unused)),
-			       unsigned short *kinds __attribute__ ((unused)),
-			       int num_gangs __attribute__ ((unused)),
-			       int num_workers __attribute__ ((unused)),
-			       int vector_length __attribute__ ((unused)),
-			       int async __attribute__ ((unused)),
 			       size_t shared_size __attribute__ ((unused)),
+			       int async __attribute__ ((unused)),
+			       unsigned *dims __attribute__ ((unused)),
 			       void *targ_mem_desc __attribute__ ((unused)))
 {
 #ifdef HOST_NONSHM_PLUGIN
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 226312)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -282,12 +282,20 @@ map_push (struct ptx_stream *s, int asyn
   return;
 }
 
+/* Target data function launch information.  */
+
+struct targ_fn_launch
+{
+  const char *fn;
+  unsigned short dim[GOMP_DIM_MAX];
+};
+
 /* Descriptor of a loaded function.  */
 
 struct targ_fn_descriptor
 {
   CUfunction fn;
-  const char *name;
+  const struct targ_fn_launch *launch;
 };
 
 /* A loaded PTX image.  */
@@ -988,9 +996,8 @@ event_add (enum ptx_event_type type, CUe
 
 void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	  size_t *sizes, unsigned short *kinds, int num_gangs, int num_workers,
-	  int vector_length, int async, size_t shared_size,
-	  void *targ_mem_desc)
+	    size_t shared_size, int async, unsigned dims[GOMP_DIM_MAX],
+	    void *targ_mem_desc)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
@@ -1007,6 +1014,13 @@ nvptx_exec (void (*fn), size_t mapnum, v
   dev_str = select_stream_for_async (async, pthread_self (), false, NULL);
   assert (dev_str == nvthd->current_stream);
 
+  /* Initialize the launch dimensions.  Typically this is constant,
+     provided by the device compiler, but we must permit runtime
+     values.  */
+  for (i = 0; i != GOMP_DIM_MAX; i++)
+    if (targ_fn->launch->dim[i])
+      dims[i] = targ_fn->launch->dim[i];
+
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */
@@ -1026,8 +1040,8 @@ nvptx_exec (void (*fn), size_t mapnum, v
 
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " gangs=%u, workers=%u, vectors=%u, shared=%u\n",
-		     __FUNCTION__, targ_fn->name, num_gangs, num_workers,
-		     vector_length, (unsigned)shared_size);
+		     __FUNCTION__, targ_fn->launch->fn,
+		     dims[0], dims[1], dims[2], (unsigned)shared_size);
 
   // OpenACC		CUDA
   //
@@ -1037,8 +1051,8 @@ nvptx_exec (void (*fn), size_t mapnum, v
 
   kargs[0] = &dp;
   r = cuLaunchKernel (function,
-		      num_gangs, 1, 1,
-		      vector_length, num_workers, 1,
+		      dims[GOMP_DIM_GANG], 1, 1,
+		      dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
 		      shared_size, dev_str->stream, kargs, 0);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
@@ -1084,7 +1098,7 @@ nvptx_exec (void (*fn), size_t mapnum, v
 #endif
 
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: finished\n", __FUNCTION__,
-		     targ_fn->name);
+		     targ_fn->launch->fn);
 
 #ifndef DISABLE_ASYNC
   if (async < acc_async_noval)
@@ -1611,7 +1625,7 @@ typedef struct nvptx_tdata
   const char *const *var_names;
   size_t var_num;
 
-  const char *const *fn_names;
+  const struct targ_fn_launch *fn_descs;
   size_t fn_num;
 } nvptx_tdata_t;
 
@@ -1633,7 +1647,8 @@ GOMP_OFFLOAD_load_image_ver (unsigned ve
 			     struct addr_pair **target_table)
 {
   CUmodule module;
-  const char *const *fn_names, *const *var_names;
+  const char *const *var_names;
+  const struct targ_fn_launch *fn_descs;
   unsigned int fn_entries, var_entries, i, j;
   CUresult r;
   struct targ_fn_descriptor *targ_fns;
@@ -1662,7 +1677,7 @@ GOMP_OFFLOAD_load_image_ver (unsigned ve
   var_entries = img_header->var_num;
   var_names = img_header->var_names;
   fn_entries = img_header->fn_num;
-  fn_names = img_header->fn_names;
+  fn_descs = img_header->fn_descs;
 
   targ_tbl = GOMP_PLUGIN_malloc (sizeof (struct addr_pair)
 				 * (fn_entries + var_entries));
@@ -1685,12 +1700,12 @@ GOMP_OFFLOAD_load_image_ver (unsigned ve
     {
       CUfunction function;
 
-      r = cuModuleGetFunction (&function, module, fn_names[i]);
+      r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
 
       targ_fns->fn = function;
-      targ_fns->name = (const char *) fn_names[i];
+      targ_fns->launch = &fn_descs[i];
 
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
@@ -1770,13 +1785,12 @@ void (*device_run) (int n, void *fn_ptr,
 
 void
 GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), size_t mapnum,
-			       void **hostaddrs, void **devaddrs, size_t *sizes,
-			       unsigned short *kinds, int num_gangs,
-			       int num_workers, int vector_length, int async,
-			       size_t shared_size, void *targ_mem_desc)
+			       void **hostaddrs, void **devaddrs,
+			       size_t shared_size,
+			       int async, unsigned *dims, void *targ_mem_desc)
 {
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, sizes, kinds, num_gangs,
-	      num_workers, vector_length, async, shared_size, targ_mem_desc);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, shared_size,
+	      async, dims, targ_mem_desc);
 }
 
 void
Index: include/gomp-constants.h
===================================================================
--- include/gomp-constants.h	(revision 226312)
+++ include/gomp-constants.h	(working copy)
@@ -122,9 +122,30 @@ enum gomp_map_kind
 #define GOMP_DEVICE_ICV			-1
 #define GOMP_DEVICE_HOST_FALLBACK	-2
 
+#define GOMP_DIM_GANG	0
+#define GOMP_DIM_WORKER	1
+#define GOMP_DIM_VECTOR	2
+#define GOMP_DIM_MAX	3
+#define GOMP_DIM_MASK(X) (1 << (X))
+
+/* Varadic launch arguments.  */
+#define GOMP_LAUNCH_END 	0  /* End of args, no dev or op */
+#define GOMP_LAUNCH_DIM		1  /* Launch dimensions, op = mask */
+#define GOMP_LAUNCH_ASYNC_WAIT	2  /* Async & Waits, op = num waits.  */
+#define GOMP_LAUNCH_CODE_SHIFT	28
+#define GOMP_LAUNCH_DEVICE_SHIFT 16
+#define GOMP_LAUNCH_OP_SHIFT 0
+#define GOMP_LAUNCH_PACK(CODE,DEVICE,OP)  \
+  (((CODE) << GOMP_LAUNCH_CODE_SHIFT)		\
+   | ((DEVICE) << GOMP_LAUNCH_DEVICE_SHIFT)	\
+   | ((OP) << GOMP_LAUNCH_OP_SHIFT))
+#define GOMP_LAUNCH_CODE(X) (((X) >> GOMP_LAUNCH_CODE_SHIFT) & 0xf)
+#define GOMP_LAUNCH_DEVICE(X) (((X) >> GOMP_LAUNCH_DEVICE_SHIFT) & 0xfff)
+#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
+
 /* Versions of libgomp and device-specific plugins.  */
 #define GOMP_VERSION	0
-#define GOMP_VERSION_NVIDIA_PTX 0
+#define GOMP_VERSION_NVIDIA_PTX 1
 
 #define GOMP_VERSION_PACK(LIB, DEV) (((LIB) << 16) | (DEV))
 #define GOMP_VERSION_LIB(PACK) (((PACK) >> 16) & 0xffff)
Index: gcc/tree-parloops.c
===================================================================
--- gcc/tree-parloops.c	(revision 226312)
+++ gcc/tree-parloops.c	(working copy)
@@ -2084,7 +2084,7 @@ create_parallel_loop (struct loop *loop,
       tree data_arg = gimple_omp_target_data_arg (kernels);
       gimple_omp_target_set_data_arg (stmt, data_arg);
       tree ganglocal_size
-	= gimple_call_arg (goacc_kernels_internal, /* TODO */ 9);
+	= gimple_call_arg (goacc_kernels_internal, /* TODO */ 6);
       gimple_omp_target_set_ganglocal_size (stmt, ganglocal_size);
 
       gimple_set_location (stmt, loc);
Index: gcc/tree.h
===================================================================
--- gcc/tree.h	(revision 226312)
+++ gcc/tree.h	(working copy)
@@ -1360,6 +1360,8 @@ extern void protected_set_expr_location
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_SCHEDULE), 0)
 
 /* OpenACC clause expressions  */
+#define OMP_CLAUSE_EXPR(NODE, CLAUSE) \
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, CLAUSE), 0)
 #define OMP_CLAUSE_GANG_EXPR(NODE) \
   OMP_CLAUSE_OPERAND ( \
     OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_GANG), 0)
Index: gcc/except.c
===================================================================
--- gcc/except.c	(revision 226312)
+++ gcc/except.c	(working copy)
@@ -151,6 +151,7 @@ along with GCC; see the file COPYING3.
 #include "cfgloop.h"
 #include "builtins.h"
 #include "tree-hash-traits.h"
+#include "omp-low.h"
 
 static GTY(()) int call_site_base;
 
@@ -1491,8 +1492,7 @@ finish_eh_generation (void)
 {
   basic_block bb;
 
-  if (lookup_attribute ("oacc function",
-			DECL_ATTRIBUTES (current_function_decl)))
+  if (get_oacc_fn_attrib (current_function_decl))
     return;
 
   /* Construct the landing pads.  */
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 226312)
+++ gcc/omp-low.c	(working copy)
@@ -2302,12 +2302,6 @@ create_omp_child_function (omp_context *
       = tree_cons (get_identifier ("omp target entrypoint"),
                    NULL_TREE, DECL_ATTRIBUTES (decl));
 
-  if (is_gimple_omp_oacc (ctx->stmt)
-      && !lookup_attribute ("omp function", DECL_ATTRIBUTES (decl)))
-    DECL_ATTRIBUTES (decl)
-      = tree_cons (get_identifier ("oacc function"), NULL_TREE,
-		   DECL_ATTRIBUTES (decl));
-
   t = build_decl (DECL_SOURCE_LOCATION (decl),
 		  RESULT_DECL, NULL_TREE, void_type_node);
   DECL_ARTIFICIAL (t) = 1;
@@ -9277,6 +9271,92 @@ loop_get_oacc_kernels_region_entry (stru
     }
 }
 
+/* Encode an oacc launch argument.  This matches the GOMP_LAUNCH_PACK
+   macro on gomp-constants.h.  We do not check for overflow.  */
+
+static tree
+oacc_launch_pack (unsigned code, tree device, unsigned op)
+{
+  tree res;
+  
+  res = build_int_cst (unsigned_type_node, GOMP_LAUNCH_PACK (code, 0, op));
+  if (device)
+    {
+      device = fold_build2 (LSHIFT_EXPR, unsigned_type_node,
+			    device, build_int_cst (unsigned_type_node,
+						   GOMP_LAUNCH_DEVICE_SHIFT));
+      res = fold_build2 (BIT_IOR_EXPR, unsigned_type_node, res, device);
+    }
+  return res;
+}
+
+/* Look for compute grid dimension clauses and convert to an attribute
+   attached to FN.  This permits the target-side code to (a) massage
+   the dimensions, (b) emit that data and (c) optimize.  Non-constant
+   dimensions are pushed onto ARGS.
+
+   The attribute value is a TREE_LIST.  A set of dimensions is
+   represented as a list of INTEGER_CST.  Those that are runtime
+   expres are represented as an INTEGER_CST of zero.
+
+   TOOO. Normally the attribute will just contain a single such list.  If
+   however it contains a list of lists, this will represent the use of
+   device_type.  Each member of the outer list is an assoc list of
+   dimensions, keyed by the device type.  The first entry will be the
+   default.  Well, that's the plan.  */
+
+#define OACC_FN_ATTRIB "oacc function"
+
+static void
+set_oacc_fn_attrib (tree clauses, tree fn, vec<tree> *args)
+{
+  /* Must match GOMP_DIM ordering.  */
+  static const omp_clause_code ids[] = 
+    {OMP_CLAUSE_NUM_GANGS, OMP_CLAUSE_NUM_WORKERS, OMP_CLAUSE_VECTOR_LENGTH};
+  unsigned ix;
+  tree dims[GOMP_DIM_MAX];
+  tree attr = NULL_TREE;
+  unsigned non_const = 0;
+
+  for (ix = GOMP_DIM_MAX; ix--;)
+    {
+      tree clause = find_omp_clause (clauses, ids[ix]);
+      tree dim;
+
+      if (!clause)
+	dim = integer_one_node;
+      else
+	dim = OMP_CLAUSE_EXPR (clause, ids[ix]);
+      dims[ix] = dim;
+      if (TREE_CODE (dim) != INTEGER_CST)
+	{
+	  dim = integer_zero_node;
+	  non_const |= GOMP_DIM_MASK (ix);
+	}
+      attr = tree_cons (NULL_TREE, dim, attr);
+    }
+
+  /* Add the attributes.  */
+  DECL_ATTRIBUTES (fn) =
+    tree_cons (get_identifier (OACC_FN_ATTRIB), attr, DECL_ATTRIBUTES (fn));
+
+  if (non_const)
+    {
+      /* Push a dynamic argument set.  */
+      args->safe_push (oacc_launch_pack (GOMP_LAUNCH_DIM,
+					 NULL_TREE, non_const));
+      for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
+	if (non_const & GOMP_DIM_MASK (ix))
+	  args->safe_push (dims[ix]);
+    }
+}
+
+tree
+get_oacc_fn_attrib (tree fn)
+{
+  return lookup_attribute (OACC_FN_ATTRIB, DECL_ATTRIBUTES (fn));
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -9728,6 +9808,7 @@ expand_omp_target (struct omp_region *re
     }
 
   gimple g;
+  bool tagging = false;
   /* The maximum number used by any start_ix, without varargs.  */
   auto_vec<tree, 12> args;
   args.quick_push (device);
@@ -9767,45 +9848,20 @@ expand_omp_target (struct omp_region *re
     case BUILT_IN_GOACC_KERNELS_INTERNAL:
     case BUILT_IN_GOACC_PARALLEL:
       {
-	tree t_num_gangs, t_num_workers, t_vector_length;
-
-	/* Default values for num_gangs, num_workers, and vector_length.  */
-	t_num_gangs = t_num_workers = t_vector_length
-	  = fold_convert_loc (gimple_location (entry_stmt),
-			      integer_type_node, integer_one_node);
-	/* ..., but if present, use the value specified by the respective
-	   clause.  */
-	c = find_omp_clause (clauses, OMP_CLAUSE_NUM_GANGS);
-	if (c)
-	  t_num_gangs = OMP_CLAUSE_NUM_GANGS_EXPR (c);
-	c = find_omp_clause (clauses, OMP_CLAUSE_NUM_WORKERS);
-	if (c)
-	  t_num_workers = OMP_CLAUSE_NUM_WORKERS_EXPR (c);
-	c = find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH);
-	if (c)
-	  t_vector_length = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	 
-	args.quick_push (t_num_gangs);
-	args.quick_push (t_num_workers);
-	args.quick_push (t_vector_length);
 	args.quick_push (gimple_omp_target_ganglocal_size (entry_stmt));
+	set_oacc_fn_attrib (clauses, child_fn, &args);
+	tagging = true;
       }
       /* FALLTHRU */
     case BUILT_IN_GOACC_ENTER_EXIT_DATA:
     case BUILT_IN_GOACC_UPDATE:
       {
-	tree t_async;
-	int t_wait_idx;
+	tree t_async = NULL_TREE;
 
 	c = find_omp_clause (clauses, OMP_CLAUSE_DEVICE_TYPE);
 	if (c)
 	  sorry ("device_type clause is not supported yet");
 
-	/* Default values for t_async.  */
-	t_async = fold_convert_loc (gimple_location (entry_stmt),
-				    integer_type_node,
-				    build_int_cst (integer_type_node,
-						   GOMP_ASYNC_SYNC));
 	/* ..., but if present, use the value specified by the respective
 	   clause, making sure that is of the correct type.  */
 	c = find_omp_clause (clauses, OMP_CLAUSE_ASYNC);
@@ -9813,42 +9869,59 @@ expand_omp_target (struct omp_region *re
 	  t_async = fold_convert_loc (OMP_CLAUSE_LOCATION (c),
 				      integer_type_node,
 				      OMP_CLAUSE_ASYNC_EXPR (c));
+	else if (!tagging)
+	  /* Default values for t_async.  */
+	  t_async = fold_convert_loc (gimple_location (entry_stmt),
+				      integer_type_node,
+				      build_int_cst (integer_type_node,
+						     GOMP_ASYNC_SYNC));
+	if (t_async && !tagging)
+	  {
+	    args.safe_push (t_async);
+	    t_async = NULL_TREE;
+	  }
 
-	args.quick_push (t_async);
-	/* Save the index, and... */
-	t_wait_idx = args.length ();
-	/* ... push a default value.  */
-	args.quick_push (fold_convert_loc (gimple_location (entry_stmt),
-					   integer_type_node,
-					   integer_zero_node));
+	/* Save the argument index, and... */
+	unsigned t_wait_idx = args.length ();
+	unsigned num_waits = 0;
 	c = find_omp_clause (clauses, OMP_CLAUSE_WAIT);
-	if (c)
-	  {
-	    int n = 0;
+	if (!tagging || c || t_async)
+	  /* ... push a placeholder.  */
+	  args.safe_push (integer_zero_node);
 
-	    for (; c; c = OMP_CLAUSE_CHAIN (c))
-	      {
-		if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
-		  {
-		    args.safe_push (fold_convert_loc (OMP_CLAUSE_LOCATION (c),
-						      integer_type_node,
-						      OMP_CLAUSE_WAIT_EXPR (c)));
-		    n++;
-		  }
-	      }
+	if (tagging && t_async)
+	  args.safe_push (t_async);
+	
+	for (; c; c = OMP_CLAUSE_CHAIN (c))
+	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
+	    {
+	      args.safe_push (fold_convert_loc (OMP_CLAUSE_LOCATION (c),
+						integer_type_node,
+						OMP_CLAUSE_WAIT_EXPR (c)));
+	      num_waits++;
+	    }
 
-	    /* Now that we know the number, replace the default value.  */
-	    args.ordered_remove (t_wait_idx);
-	    args.quick_insert (t_wait_idx,
-			       fold_convert_loc (gimple_location (entry_stmt),
-						 integer_type_node,
-						 build_int_cst (integer_type_node, n)));
+	if (!tagging || num_waits || t_async)
+	  {
+	    tree len;
+
+	    /* Now that we know the number, update the placeholder.  */
+	    if (tagging)
+	      len = oacc_launch_pack (GOMP_LAUNCH_ASYNC_WAIT,
+				      NULL_TREE, num_waits);
+	    else
+	      len = build_int_cst (integer_type_node, num_waits);
+	    len = fold_convert_loc (gimple_location (entry_stmt),
+				    unsigned_type_node, len);
+	    args[t_wait_idx] = len;
 	  }
       }
       break;
     default:
       gcc_unreachable ();
     }
+  if (tagging)
+    args.safe_push (oacc_launch_pack (GOMP_LAUNCH_END, NULL_TREE, 0));
 
   g = gimple_build_call_vec (builtin_decl_explicit (start_ix), args);
   gimple_set_location (g, gimple_location (entry_stmt));
Index: gcc/omp-low.h
===================================================================
--- gcc/omp-low.h	(revision 226312)
+++ gcc/omp-low.h	(working copy)
@@ -41,6 +41,7 @@ extern bool make_gimple_omp_edges (basic
 extern void omp_finish_file (void);
 extern bool gimple_stmt_omp_data_i_init_p (gimple);
 extern basic_block loop_get_oacc_kernels_region_entry (struct loop *);
+extern tree get_oacc_fn_attrib (tree);
 
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
Index: gcc/lto/lto-lang.c
===================================================================
--- gcc/lto/lto-lang.c	(revision 226312)
+++ gcc/lto/lto-lang.c	(working copy)
@@ -164,9 +164,6 @@ enum lto_builtin_type
 				NAME,
 #define DEF_FUNCTION_TYPE_VAR_7(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7) NAME,
-#define DEF_FUNCTION_TYPE_VAR_12(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-				 NAME,
 #define DEF_POINTER_TYPE(NAME, TYPE) NAME,
 #include "builtin-types.def"
 #undef DEF_PRIMITIVE_TYPE
@@ -186,7 +183,6 @@ enum lto_builtin_type
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   BT_LAST
 };
@@ -674,10 +670,6 @@ lto_define_builtins (tree va_list_ref_ty
 #define DEF_FUNCTION_TYPE_VAR_7(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7)				\
   def_fn_type (ENUM, RETURN, 1, 7, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6, ARG7);
-#define DEF_FUNCTION_TYPE_VAR_12(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-  def_fn_type (ENUM, RETURN, 1, 12, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6,      \
-	       ARG7, ARG8, ARG9, ARG10, ARG11, ARG12);
 #define DEF_POINTER_TYPE(ENUM, TYPE) \
   builtin_types[(int) ENUM] = build_pointer_type (builtin_types[(int) TYPE]);
 
@@ -700,7 +692,6 @@ lto_define_builtins (tree va_list_ref_ty
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   builtin_types[(int) BT_LAST] = NULL_TREE;
 
Index: gcc/builtin-types.def
===================================================================
--- gcc/builtin-types.def	(revision 226312)
+++ gcc/builtin-types.def	(working copy)
@@ -596,10 +596,9 @@ DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_
 			 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR,
 			 BT_PTR, BT_INT, BT_INT)
 
-DEF_FUNCTION_TYPE_VAR_12 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
-			  BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
-			  BT_PTR, BT_PTR, BT_PTR, BT_INT, BT_INT, BT_INT,
-			  BT_SIZE, BT_INT, BT_INT)
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
+			 BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			 BT_PTR, BT_PTR, BT_PTR, BT_SIZE)
 
 DEF_POINTER_TYPE (BT_PTR_FN_VOID_VAR, BT_FN_VOID_VAR)
 DEF_FUNCTION_TYPE_3 (BT_FN_PTR_PTR_FN_VOID_VAR_PTR_SIZE,
Index: gcc/omp-builtins.def
===================================================================
--- gcc/omp-builtins.def	(revision 226312)
+++ gcc/omp-builtins.def	(working copy)
@@ -45,11 +45,11 @@ DEF_GOACC_BUILTIN_FNSPEC (BUILT_IN_GOACC
 			  ATTR_NOTHROW_LIST, "...rrr")
 DEF_GOACC_BUILTIN_FNSPEC (BUILT_IN_GOACC_KERNELS_INTERNAL,
 			  "GOACC_kernels_internal",
-			  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
+			  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
 			  ATTR_FNSPEC_DOT_DOT_DOT_DOT_r_r_r_NOTHROW_LIST,
 			  ATTR_NOTHROW_LIST, "....rrr")
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel",
-		   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
+		   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
 		   ATTR_NOTHROW_LIST)
 DEF_GOACC_BUILTIN_FNSPEC (BUILT_IN_GOACC_UPDATE, "GOACC_update",
 			  BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
Index: gcc/c-family/c-common.c
===================================================================
--- gcc/c-family/c-common.c	(revision 226312)
+++ gcc/c-family/c-common.c	(working copy)
@@ -5455,9 +5455,6 @@ enum c_builtin_type
 				NAME,
 #define DEF_FUNCTION_TYPE_VAR_7(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7) NAME,
-#define DEF_FUNCTION_TYPE_VAR_12(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-				 NAME,
 #define DEF_POINTER_TYPE(NAME, TYPE) NAME,
 #include "builtin-types.def"
 #undef DEF_PRIMITIVE_TYPE
@@ -5477,7 +5474,6 @@ enum c_builtin_type
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   BT_LAST
 };
@@ -5573,10 +5569,6 @@ c_define_builtins (tree va_list_ref_type
 #define DEF_FUNCTION_TYPE_VAR_7(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7)				\
   def_fn_type (ENUM, RETURN, 1, 7, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6, ARG7);
-#define DEF_FUNCTION_TYPE_VAR_12(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-  def_fn_type (ENUM, RETURN, 1, 12, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6,      \
-	       ARG7, ARG8, ARG9, ARG10, ARG11, ARG12);
 #define DEF_POINTER_TYPE(ENUM, TYPE) \
   builtin_types[(int) ENUM] = build_pointer_type (builtin_types[(int) TYPE]);
 
@@ -5599,7 +5591,6 @@ c_define_builtins (tree va_list_ref_type
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   builtin_types[(int) BT_LAST] = NULL_TREE;
 
Index: gcc/fortran/types.def
===================================================================
--- gcc/fortran/types.def	(revision 226312)
+++ gcc/fortran/types.def	(working copy)
@@ -222,7 +222,6 @@ DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_
 			 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR,
 			 BT_PTR, BT_INT, BT_INT)
 
-DEF_FUNCTION_TYPE_VAR_12 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
-			  BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
-			  BT_PTR, BT_PTR, BT_PTR, BT_INT, BT_INT, BT_INT,
-			  BT_SIZE, BT_INT, BT_INT)
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
+			 BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			 BT_PTR, BT_PTR, BT_PTR, BT_SIZE)
Index: gcc/fortran/f95-lang.c
===================================================================
--- gcc/fortran/f95-lang.c	(revision 226312)
+++ gcc/fortran/f95-lang.c	(working copy)
@@ -662,9 +662,6 @@ gfc_init_builtin_functions (void)
 #define DEF_FUNCTION_TYPE_VAR_2(NAME, RETURN, ARG1, ARG2) NAME,
 #define DEF_FUNCTION_TYPE_VAR_7(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7) NAME,
-#define DEF_FUNCTION_TYPE_VAR_12(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-				 NAME,
 #define DEF_POINTER_TYPE(NAME, TYPE) NAME,
 #include "types.def"
 #undef DEF_PRIMITIVE_TYPE
@@ -680,7 +677,6 @@ gfc_init_builtin_functions (void)
 #undef DEF_FUNCTION_TYPE_VAR_0
 #undef DEF_FUNCTION_TYPE_VAR_2
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
     BT_LAST
   };
@@ -1133,23 +1129,6 @@ gfc_init_builtin_functions (void)
 					builtin_types[(int) ARG6],	\
 					builtin_types[(int) ARG7],	\
 					NULL_TREE);
-#define DEF_FUNCTION_TYPE_VAR_12(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-  builtin_types[(int) ENUM]						\
-    = build_varargs_function_type_list (builtin_types[(int) RETURN],   	\
-					builtin_types[(int) ARG1],     	\
-					builtin_types[(int) ARG2],     	\
-					builtin_types[(int) ARG3],	\
-					builtin_types[(int) ARG4],	\
-					builtin_types[(int) ARG5],	\
-					builtin_types[(int) ARG6],	\
-					builtin_types[(int) ARG7],	\
-					builtin_types[(int) ARG8],	\
-					builtin_types[(int) ARG9],	\
-					builtin_types[(int) ARG10],	\
-					builtin_types[(int) ARG11],	\
-					builtin_types[(int) ARG12],	\
-					NULL_TREE);
 #define DEF_POINTER_TYPE(ENUM, TYPE)			\
   builtin_types[(int) ENUM]				\
     = build_pointer_type (builtin_types[(int) TYPE]);
@@ -1167,7 +1146,6 @@ gfc_init_builtin_functions (void)
 #undef DEF_FUNCTION_TYPE_VAR_0
 #undef DEF_FUNCTION_TYPE_VAR_2
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   builtin_types[(int) BT_LAST] = NULL_TREE;
 
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 226312)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -59,6 +59,7 @@
 #include "dominance.h"
 #include "cfg.h"
 #include "omp-low.h"
+#include "gomp-constants.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -3015,9 +3016,49 @@ nvptx_cannot_copy_insn_p (rtx_insn *insn
 static void
 nvptx_record_offload_symbol (tree decl)
 {
-  fprintf (asm_out_file, "//:%s_MAP %s\n",
-	   TREE_CODE (decl) == VAR_DECL ? "VAR" : "FUNC",
-	   IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
+  switch (TREE_CODE (decl))
+    {
+    case VAR_DECL:
+      fprintf (asm_out_file, "//:VAR_MAP \"%s\"\n",
+	       IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
+      break;
+
+    case FUNCTION_DECL:
+      {
+	tree attr = get_oacc_fn_attrib (decl);
+	tree dims = NULL_TREE;
+	unsigned ix;
+	
+	if (attr)
+	  dims = TREE_VALUE (attr);
+	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
+		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
+
+	for (ix = 0; ix != GOMP_DIM_MAX; ix++)
+	  {
+	    unsigned HOST_WIDE_INT dim = 0;
+	    if (dims)
+	      {
+		tree cst = TREE_VALUE (dims);
+
+		/* When device_type support is added an ealier pass
+		   should have massaged the attribute to be
+		   ptx-specific.  */
+		gcc_assert (TREE_CODE (cst) == INTEGER_CST);
+
+		dim = TREE_INT_CST_LOW (cst);
+		dims = TREE_CHAIN (dims);
+	      }
+	    fprintf (asm_out_file, ", " HOST_WIDE_INT_PRINT_HEX, dim);
+	  }
+	
+	fprintf (asm_out_file, "\n");
+      }
+      break;
+  
+    default:
+      gcc_unreachable ();
+    }
 }
 
 /* Implement TARGET_ASM_FILE_START.  Write the kinds of things ptxas expects
Index: gcc/config/nvptx/mkoffload.c
===================================================================
--- gcc/config/nvptx/mkoffload.c	(revision 226312)
+++ gcc/config/nvptx/mkoffload.c	(working copy)
@@ -227,6 +227,8 @@ process (FILE *in, FILE *out)
 {
   size_t len;
   const char *input = read_file (in, &len);
+  const char *comma;
+  id_map const *id;
 
   fprintf (out, "static const char ptx_code[] = \n \"");
   for (size_t i = 0; i < len; i++)
@@ -267,14 +269,18 @@ process (FILE *in, FILE *out)
     }
   fprintf (out, "\";\n\n");
 
-  fprintf (out, "static const char *const var_mappings[] = {\n");
-  for (id_map *id = var_ids; id; id = id->next)
-    fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
-  fprintf (out, "};\n\n");
-  fprintf (out, "static const char *const func_mappings[] = {\n");
-  for (id_map *id = func_ids; id; id = id->next)
-    fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
-  fprintf (out, "};\n\n");
+  fprintf (out, "static const char *const var_mappings[] = {");
+  for (comma = "", id = var_ids; id; comma = ",", id = id->next)
+    fprintf (out, "%s\n\t%s", comma, id->ptx_name);
+  fprintf (out, "\n};\n\n");
+
+  fprintf (out, "static const struct nvptx_fn {\n"
+	   "  const char *name;\n"
+	   "  unsigned short dim[%d];\n"
+	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
+  for (comma = "", id = func_ids; id; comma = ",", id = id->next)
+    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+  fprintf (out, "\n};\n\n");
 
   fprintf (out,
 	   "static const struct nvptx_tdata {\n"
@@ -282,7 +288,7 @@ process (FILE *in, FILE *out)
 	   "  __SIZE_TYPE__ ptx_len;\n"
 	   "  const char *const *var_names;\n"
 	   "  __SIZE_TYPE__ var_num;\n"
-	   "  const char *const *fn_names;\n"
+	   "  const struct nvptx_fn *fn_names;\n"
 	   "  __SIZE_TYPE__ fn_num;\n"
 	   "} target_data = {\n"
 	   "  ptx_code, sizeof (ptx_code),\n"

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-07-28 17:11 [gomp4] Redesign oacc_parallel launch API Nathan Sidwell
@ 2015-07-28 17:46 ` Nathan Sidwell
  2015-07-29  9:57 ` Thomas Schwinge
  2015-08-06 16:33 ` Cesar Philippidis
  2 siblings, 0 replies; 9+ messages in thread
From: Nathan Sidwell @ 2015-07-28 17:46 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Thomas Schwinge

Oh, one more thing.  I placed constants for the 3 launch dimensions into 
gomp-constants.h, as they are needed by both library and compiler.  Working on a 
patch to remove the current set of constants from omp-low.h

nathan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-07-28 17:11 [gomp4] Redesign oacc_parallel launch API Nathan Sidwell
  2015-07-28 17:46 ` Nathan Sidwell
@ 2015-07-29  9:57 ` Thomas Schwinge
  2015-07-29 12:43   ` Nathan Sidwell
  2015-08-06 16:33 ` Cesar Philippidis
  2 siblings, 1 reply; 9+ messages in thread
From: Thomas Schwinge @ 2015-07-29  9:57 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: Jakub Jelinek, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 786 bytes --]

Hi Nathan!

On Tue, 28 Jul 2015 12:52:02 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> I've committed this patch to the gomp4 branch to redo the launch API.  I'll post 
> a version for trunk once the versioning patch gets approved & committed.

Thanks!


(I have not yet looked at the patch in detail.)  There is one regression:

    PASS: libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  execution test

    libgomp: Trying to map into device [0x10f7930..0x10f7a30) object when [0x10f7930..0x10f7a30) is already mapped

Likewise for the other torture testing flags.


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-07-29  9:57 ` Thomas Schwinge
@ 2015-07-29 12:43   ` Nathan Sidwell
  2015-07-29 22:14     ` Nathan Sidwell
  0 siblings, 1 reply; 9+ messages in thread
From: Nathan Sidwell @ 2015-07-29 12:43 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Jakub Jelinek, GCC Patches

On 07/29/15 05:22, Thomas Schwinge wrote:
> Hi Nathan!
>
> On Tue, 28 Jul 2015 12:52:02 -0400, Nathan Sidwell <nathan@acm.org> wrote:
>> I've committed this patch to the gomp4 branch to redo the launch API.  I'll post
>> a version for trunk once the versioning patch gets approved & committed.
>
> Thanks!
>
>
> (I have not yet looked at the patch in detail.)  There is one regression:
>
>      PASS: libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  (test for excess errors)
>      [-PASS:-]{+FAIL:+} libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  execution test
>
>      libgomp: Trying to map into device [0x10f7930..0x10f7a30) object when [0x10f7930..0x10f7a30) is already mapped
>
> Likewise for the other torture testing flags.


Investigating ...  (I've seen those failures be intermittent)

nathan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-07-29 12:43   ` Nathan Sidwell
@ 2015-07-29 22:14     ` Nathan Sidwell
  2015-07-30 14:46       ` Nathan Sidwell
  2015-07-30 15:05       ` Thomas Schwinge
  0 siblings, 2 replies; 9+ messages in thread
From: Nathan Sidwell @ 2015-07-29 22:14 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Jakub Jelinek, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 639 bytes --]

On 07/29/15 08:24, Nathan Sidwell wrote:
> On 07/29/15 05:22, Thomas Schwinge wrote:

>> Likewise for the other torture testing flags.
>
>
> Investigating ...  (I've seen those failures be intermittent)

Interestingly the fails go away with an unoptimized libgomp.  I've observed 
something vaguely like that before.  The observed failure mode was getting stuck 
inside the driver library opening the device.  Which is very strange.


Anyway, I've committed the attached to gomp4 branch, which separates the ASYNC 
and WAIT tags, for a slightly better interface.  It doesn't fixup the failure 
thought.  Still thinking about that.

nathan

[-- Attachment #2: gomp4-launch-2.patch --]
[-- Type: text/x-patch, Size: 5157 bytes --]

2015-07-29  Nathan Sidwell  <nathan@codesourcery.com>

	include/
	* gomp-constants.h (GOMP_LAUNCH_ASYNC_WAIT): Replace with ...
	(GOMP_LAUNCH_ASYNC, GOMP_LAUNCH_WAIT): ... these.
	(GOMP_LAUNCH_OP_MAX): New.

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_wait): Add debug print.
	* oacc-parallel.c (GOACC_parallel_keyed): Process separate ASYNC
	and WAIT tags.
	(GOACC_parallel): Adjust forwarding.

	gcc/
	* omp-low.c (expand_omp_target): Emit separate ASYNC and WAIT tags.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 226346)
+++ gcc/omp-low.c	(working copy)
@@ -9875,23 +9875,35 @@ expand_omp_target (struct omp_region *re
 				      integer_type_node,
 				      build_int_cst (integer_type_node,
 						     GOMP_ASYNC_SYNC));
-	if (t_async && !tagging)
+	if (tagging && t_async)
 	  {
-	    args.safe_push (t_async);
-	    t_async = NULL_TREE;
+	    unsigned HOST_WIDE_INT i_async;
+
+	    if (TREE_CODE (t_async) == INTEGER_CST)
+	      {
+		/* See if we can pack the async arg in to the tag's
+		   operand.  */
+		i_async = TREE_INT_CST_LOW (t_async);
+
+		if (i_async < GOMP_LAUNCH_OP_MAX)
+		  t_async = NULL_TREE;
+	      }
+	    if (t_async)
+	      i_async = GOMP_LAUNCH_OP_MAX;
+	    args.safe_push (oacc_launch_pack
+			    (GOMP_LAUNCH_ASYNC, NULL_TREE, i_async));
 	  }
+	if (t_async)
+	  args.safe_push (t_async);
 
 	/* Save the argument index, and... */
 	unsigned t_wait_idx = args.length ();
 	unsigned num_waits = 0;
 	c = find_omp_clause (clauses, OMP_CLAUSE_WAIT);
-	if (!tagging || c || t_async)
+	if (!tagging || c)
 	  /* ... push a placeholder.  */
 	  args.safe_push (integer_zero_node);
 
-	if (tagging && t_async)
-	  args.safe_push (t_async);
-	
 	for (; c; c = OMP_CLAUSE_CHAIN (c))
 	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
 	    {
@@ -9901,13 +9913,13 @@ expand_omp_target (struct omp_region *re
 	      num_waits++;
 	    }
 
-	if (!tagging || num_waits || t_async)
+	if (!tagging || num_waits)
 	  {
 	    tree len;
 
 	    /* Now that we know the number, update the placeholder.  */
 	    if (tagging)
-	      len = oacc_launch_pack (GOMP_LAUNCH_ASYNC_WAIT,
+	      len = oacc_launch_pack (GOMP_LAUNCH_WAIT,
 				      NULL_TREE, num_waits);
 	    else
 	      len = build_int_cst (integer_type_node, num_waits);
Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c	(revision 226346)
+++ libgomp/oacc-parallel.c	(working copy)
@@ -268,11 +268,20 @@ GOACC_parallel_keyed (int device, void (
 	  }
 	  break;
 
-	case GOMP_LAUNCH_ASYNC_WAIT:
+	case GOMP_LAUNCH_ASYNC:
+	  {
+	    /* Small constant values are encoded in the operand.  */
+	    async = GOMP_LAUNCH_OP (tag);
+
+	    if (async == GOMP_LAUNCH_OP_MAX)
+	      async = va_arg (ap, unsigned);
+	    break;
+	  }
+
+	case GOMP_LAUNCH_WAIT:
 	  {
 	    unsigned num_waits = GOMP_LAUNCH_OP (tag);
 
-	    async = va_arg (ap, unsigned);
 	    if (num_waits)
 	      goacc_wait (async, num_waits, &ap);
 	    break;
@@ -357,8 +366,9 @@ GOACC_parallel (int device, void (*fn) (
 			GOMP_LAUNCH_PACK (GOMP_LAUNCH_DIM, 0,
 					  GOMP_DIM_MASK (GOMP_DIM_MAX) - 1),
 			num_gangs, num_workers, vector_length,
-			GOMP_LAUNCH_PACK (GOMP_LAUNCH_ASYNC_WAIT,
-					  0, num_waits),
+			GOMP_LAUNCH_PACK (GOMP_LAUNCH_ASYNC, 0,
+					  GOMP_LAUNCH_OP_MAX), async,
+			GOMP_LAUNCH_PACK (GOMP_LAUNCH_WAIT, 0, num_waits),
 			async, waits[0], waits[1], waits[2], waits[3],
 			waits[4], waits[5], waits[6], waits[7], waits[8]);
 }
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 226346)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -1346,6 +1346,8 @@ nvptx_wait (int async)
   if (!s)
     GOMP_PLUGIN_fatal ("unknown async %d", async);
 
+  GOMP_PLUGIN_debug (0, "  %s: waiting on async=%d\n", __FUNCTION__, async);
+
   r = cuStreamSynchronize (s->stream);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
Index: include/gomp-constants.h
===================================================================
--- include/gomp-constants.h	(revision 226346)
+++ include/gomp-constants.h	(working copy)
@@ -131,7 +131,8 @@ enum gomp_map_kind
 /* Varadic launch arguments.  */
 #define GOMP_LAUNCH_END 	0  /* End of args, no dev or op */
 #define GOMP_LAUNCH_DIM		1  /* Launch dimensions, op = mask */
-#define GOMP_LAUNCH_ASYNC_WAIT	2  /* Async & Waits, op = num waits.  */
+#define GOMP_LAUNCH_ASYNC	2  /* Async, op = cst val if not MAX  */
+#define GOMP_LAUNCH_WAIT	3  /* Waits, op = num waits.  */
 #define GOMP_LAUNCH_CODE_SHIFT	28
 #define GOMP_LAUNCH_DEVICE_SHIFT 16
 #define GOMP_LAUNCH_OP_SHIFT 0
@@ -142,6 +143,7 @@ enum gomp_map_kind
 #define GOMP_LAUNCH_CODE(X) (((X) >> GOMP_LAUNCH_CODE_SHIFT) & 0xf)
 #define GOMP_LAUNCH_DEVICE(X) (((X) >> GOMP_LAUNCH_DEVICE_SHIFT) & 0xfff)
 #define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
+#define GOMP_LAUNCH_OP_MAX 0xffff
 
 /* Versions of libgomp and device-specific plugins.  */
 #define GOMP_VERSION	0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-07-29 22:14     ` Nathan Sidwell
@ 2015-07-30 14:46       ` Nathan Sidwell
  2015-07-30 15:05       ` Thomas Schwinge
  1 sibling, 0 replies; 9+ messages in thread
From: Nathan Sidwell @ 2015-07-30 14:46 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Jakub Jelinek, GCC Patches

On 07/29/15 17:45, Nathan Sidwell wrote:
> On 07/29/15 08:24, Nathan Sidwell wrote:
>> On 07/29/15 05:22, Thomas Schwinge wrote:
>
>>> Likewise for the other torture testing flags.
>>
>>
>> Investigating ...  (I've seen those failures be intermittent)
>
> Interestingly the fails go away with an unoptimized libgomp.  I've observed
> something vaguely like that before.  The observed failure mode was getting stuck
> inside the driver library opening the device.  Which is very strange.
>

I am no longer observing the failures.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-07-29 22:14     ` Nathan Sidwell
  2015-07-30 14:46       ` Nathan Sidwell
@ 2015-07-30 15:05       ` Thomas Schwinge
  1 sibling, 0 replies; 9+ messages in thread
From: Thomas Schwinge @ 2015-07-30 15:05 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: Jakub Jelinek, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1234 bytes --]

Hi Nathan!

On Wed, 29 Jul 2015 17:45:32 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> On 07/29/15 08:24, Nathan Sidwell wrote:
> > On 07/29/15 05:22, Thomas Schwinge wrote:
> 
> >> Likewise for the other torture testing flags.
> >
> >
> > Investigating ...  (I've seen those failures be intermittent)

(For me, they were persistent, within the handful of testsuite runs I'd
done.)

> Interestingly the fails go away with an unoptimized libgomp.  I've observed 
> something vaguely like that before.  The observed failure mode was getting stuck 
> inside the driver library opening the device.  Which is very strange.

Uh...  :-/

> Anyway, I've committed the attached to gomp4 branch, which separates the ASYNC 
> and WAIT tags, for a slightly better interface.

Thanks!

At first, I saw a number of spurious FAILs and timeouts with different
test cases each time I ran the testsuite, which disappeared once I
manually triggered a full rebuild -- is there some build dependency
missing somewhere...

> It doesn't fixup the failure 
> thought.  Still thinking about that.

Hmm, for me the failures now seem to be done; again, within the handful
of testsuite runs I've done.


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-07-28 17:11 [gomp4] Redesign oacc_parallel launch API Nathan Sidwell
  2015-07-28 17:46 ` Nathan Sidwell
  2015-07-29  9:57 ` Thomas Schwinge
@ 2015-08-06 16:33 ` Cesar Philippidis
  2015-08-06 19:51   ` Nathan Sidwell
  2 siblings, 1 reply; 9+ messages in thread
From: Cesar Philippidis @ 2015-08-06 16:33 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1891 bytes --]

On 07/28/2015 09:52 AM, Nathan Sidwell wrote:
> I've committed this patch to the gomp4 branch to redo the launch API. 
> I'll post a version for trunk once the versioning patch gets approved &
> committed.
> 
> This changes the API in a number of ways, allowing device-specific
> knowledge to be moved into the device compiler and out of the host
> compiler.
> 
> Firstly, we attach a tuple of launch dimensions as an attribute to the
> offloaded function's 'oacc function' attribute.  These are the constant
> launch dimensions.  Dynamic dimensions get a zero for their slot in this
> list.  Further this list can be extended in the future to an alist keyed
> by device_type.
> 
> Dynamic dimensions are computed on the host.  however they are passed
> via varadic args to the GOACC_parallel function (which is renamed).  The
> varadic args are passed using key/value representation, and 3 keys are
> currently defined:
> END -- end of the varadic list
> DIM - set of runtime-computed dimensions.  Only the dynamic ones are
> passed.
> ASYNC_WAIT - an async and a set of waits (possibly zero).
> 
> I have arranged for the key to have a slot that can later be filled by
> device_type, and hence support multiple device types.
> 
> The constant dimensions can be used in expansion of the GOACC_nid
> function in the device compiler.  The device compiler could also process
> that list to select the device_type slot that is appropriate.
> 
> For PTX the backend is augmented to emit the launch dimensions into the
> target data, from whence the ptx plugin can pick them up and overwrite
> with any dynamic ones passed in from the launch function.

Looking at set_oacc_fn_attrib, it appears that const values are also
considered dynamic. See the attached test case more more info. Is that
the expected behavior? If not, I could take a look at this after I
finished my reduction patch.

Cesar

[-- Attachment #2: vlength.c --]
[-- Type: text/x-csrc, Size: 250 bytes --]

#include <stdio.h>

const int vl = 32;

int
main ()
{
  unsigned int red = 0;

#pragma acc parallel loop vector_length (vl) vector reduction (+:red) copy (red)
  for (int i = 0; i < 100; i++)
    red ++;

  printf ("red = %d\n", red);

  return 0;
}

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [gomp4] Redesign oacc_parallel launch API
  2015-08-06 16:33 ` Cesar Philippidis
@ 2015-08-06 19:51   ` Nathan Sidwell
  0 siblings, 0 replies; 9+ messages in thread
From: Nathan Sidwell @ 2015-08-06 19:51 UTC (permalink / raw)
  To: Cesar Philippidis, GCC Patches; +Cc: Jakub Jelinek, Thomas Schwinge

On 08/06/15 18:33, Cesar Philippidis wrote:

> Looking at set_oacc_fn_attrib, it appears that const values are also
> considered dynamic. See the attached test case more more info. Is that
> the expected behavior? If not, I could take a look at this after I
> finished my reduction patch.

It's annoying that the offload  call is happening too early for that kind of 
constant propagation.  But I guess it might have been propagated by the time we 
get to oacc_xform.  And hence that could optimize there.

Anyway, a thing to notice but not get distracted by.

nathan

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-08-06 19:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-28 17:11 [gomp4] Redesign oacc_parallel launch API Nathan Sidwell
2015-07-28 17:46 ` Nathan Sidwell
2015-07-29  9:57 ` Thomas Schwinge
2015-07-29 12:43   ` Nathan Sidwell
2015-07-29 22:14     ` Nathan Sidwell
2015-07-30 14:46       ` Nathan Sidwell
2015-07-30 15:05       ` Thomas Schwinge
2015-08-06 16:33 ` Cesar Philippidis
2015-08-06 19:51   ` Nathan Sidwell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).