[patch] adjust default nvptx launch geometry for OpenACC offloaded regions

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
@ 2018-06-20 21:59 Cesar Philippidis
  2018-06-20 22:16 ` Tom de Vries
  2018-06-29 17:16 ` Cesar Philippidis
  0 siblings, 2 replies; 14+ messages in thread
From: Cesar Philippidis @ 2018-06-20 21:59 UTC (permalink / raw)
  To: gcc-patches, Jakub Jelinek; +Cc: tdevries

[-- Attachment #1: Type: text/plain, Size: 1221 bytes --]

At present, the nvptx libgomp plugin does not take into account the
amount of shared resources on GPUs (mostly shared-memory are register
usage) when selecting the default num_gangs and num_workers. In certain
situations, an OpenACC offloaded function can fail to launch if the GPU
does not have sufficient shared resources to accommodate all of the
threads in a CUDA block. This typically manifests when a PTX function
uses a lot of registers and num_workers is set too large, although it
can also happen if the shared-memory has been exhausted by the threads
in a vector.

This patch resolves that issue by adjusting num_workers based the amount
of shared resources used by each threads. If worker parallelism has been
requested, libgomp will spawn as many workers as possible up to 32.
Without this patch, libgomp would always default to launching 32 workers
when worker parallelism is used.

Besides for the worker parallelism, this patch also includes some
heuristics on selecting num_gangs. Before, the plugin would launch two
gangs per GPU multiprocessor. Now it follows the formula contained in
the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.

Is this patch OK for trunk?

Thanks,
Cesar

[-- Attachment #2: trunk-default-par.diff --]
[-- Type: text/x-patch, Size: 15455 bytes --]

2018-06-20  Cesar Philippidis  <cesar@codesourcery.com>

        gcc/
        * config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Delete define.
        (PTX_DEFAULT_RUNTIME_DIM): New define.
        (nvptx_goacc_validate_dims): Use it to allow the runtime to
        dynamically allocate num_workers and num_gangs.
        (nvptx_dim_limit): Don't impose an arbritary num_workers.

        libgomp/
        * plugin/plugin-nvptx.c (struct ptx_device): Add
        max_threads_per_block, warp_size, max_threads_per_multiprocessor,
        max_shared_memory_per_multiprocessor, binary_version,
        register_allocation_unit_size, register_allocation_granularity,
        compute_capability_major, compute_capability_minor members.
        (nvptx_open_device): Probe driver for those values.  Adjust
        regs_per_sm and max_shared_memory_per_multiprocessor for K80
        hardware. Dynamically allocate default num_workers.
        (nvptx_exec): Don't probe the CUDA runtime for the hardware
        info.  Use the new variables inside targ_fn_descriptor and
        ptx_device instead.  (GOMP_OFFLOAD_load_image): Set num_gangs,
        register_allocation_{unit_size,granularity}.  Adjust the
        default num_gangs.  Add diagnostic when the hardware cannot
        support the requested num_workers.
        * plugin/cuda/cuda.h (CUdevice_attribute): Add
        CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
        CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5608bee..c1946e7 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5165,7 +5165,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
 #define PTX_WORKER_LENGTH 32
-#define PTX_GANG_DEFAULT  0 /* Defer to runtime.  */
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
@@ -5214,9 +5214,9 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
     {
       dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
+	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
-	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
       changed = true;
     }
 
@@ -5230,9 +5230,6 @@ nvptx_dim_limit (int axis)
 {
   switch (axis)
     {
-    case GOMP_DIM_WORKER:
-      return PTX_WORKER_LENGTH;
-
     case GOMP_DIM_VECTOR:
       return PTX_VECTOR_LENGTH;
 
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825..c7d50db 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -69,6 +69,8 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,
+  CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;
 
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 89326e5..ada1df2 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -409,11 +409,25 @@ struct ptx_device
   bool map;
   bool concur;
   bool mkern;
-  int  mode;
+  int mode;
+  int compute_capability_major;
+  int compute_capability_minor;
   int clock_khz;
   int num_sms;
   int regs_per_block;
   int regs_per_sm;
+  int max_threads_per_block;
+  int warp_size;
+  int max_threads_per_multiprocessor;
+  int max_shared_memory_per_multiprocessor;
+
+  int binary_version;
+
+  /* register_allocation_unit_size and register_allocation_granularity
+     were extracted from the "Register Allocation Granularity" in
+     Nvidia's CUDA Occupancy Calculator spreadsheet.  */
+  int register_allocation_unit_size;
+  int register_allocation_granularity;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -725,6 +739,9 @@ nvptx_open_device (int n)
   ptx_dev->ord = n;
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
+  ptx_dev->binary_version = 0;
+  ptx_dev->register_allocation_unit_size = 0;
+  ptx_dev->register_allocation_granularity = 0;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -765,6 +782,14 @@ nvptx_open_device (int n)
   ptx_dev->mode = pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
+  ptx_dev->compute_capability_major = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
+  ptx_dev->compute_capability_minor = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
 		  &pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev);
   ptx_dev->mkern = pi;
 
@@ -794,13 +819,28 @@ nvptx_open_device (int n)
   ptx_dev->regs_per_sm = pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
+  ptx_dev->max_threads_per_block = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
 		  &pi, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev);
+  ptx_dev->warp_size = pi;
   if (pi != 32)
     {
       GOMP_PLUGIN_error ("Only warp size 32 is supported");
       return NULL;
     }
 
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
+  ptx_dev->max_threads_per_multiprocessor = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi,
+		  CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR,
+		  dev);
+  ptx_dev->max_shared_memory_per_multiprocessor = pi;
+
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
@@ -809,6 +849,39 @@ nvptx_open_device (int n)
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
+  GOMP_PLUGIN_debug (0, "Nvidia device %d:\n\tGPU_OVERLAP = %d\n"
+		     "\tCAN_MAP_HOST_MEMORY = %d\n\tCONCURRENT_KERNELS = %d\n"
+		     "\tCOMPUTE_MODE = %d\n\tINTEGRATED = %d\n"
+		     "\tCU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = %d\n"
+		     "\tCU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = %d\n"
+		     "\tINTEGRATED = %d\n"
+		     "\tMAX_THREADS_PER_BLOCK = %d\n\tWARP_SIZE = %d\n"
+		     "\tMULTIPROCESSOR_COUNT = %d\n"
+		     "\tMAX_THREADS_PER_MULTIPROCESSOR = %d\n"
+		     "\tMAX_REGISTERS_PER_MULTIPROCESSOR = %d\n"
+		     "\tMAX_SHARED_MEMORY_PER_MULTIPROCESSOR = %d\n",
+		     ptx_dev->ord, ptx_dev->overlap, ptx_dev->map,
+		     ptx_dev->concur, ptx_dev->mode, ptx_dev->mkern,
+		     ptx_dev->compute_capability_major,
+		     ptx_dev->compute_capability_minor,
+		     ptx_dev->mkern, ptx_dev->max_threads_per_block,
+		     ptx_dev->warp_size, ptx_dev->num_sms,
+		     ptx_dev->max_threads_per_multiprocessor,
+		     ptx_dev->regs_per_sm,
+		     ptx_dev->max_shared_memory_per_multiprocessor);
+
+  /* K80 (SM_37) boards contain two physical GPUs.  Consequntly they
+     report 2x larger values for MAX_REGISTERS_PER_MULTIPROCESSOR and
+     MAX_SHARED_MEMORY_PER_MULTIPROCESSOR.  Those values need to be
+     adjusted on order to allow the nvptx_exec to select an
+     appropriate num_workers.  */
+  if (ptx_dev->compute_capability_major == 3
+      && ptx_dev->compute_capability_minor == 7)
+    {
+      ptx_dev->regs_per_sm /= 2;
+      ptx_dev->max_shared_memory_per_multiprocessor /= 2;
+    }
+
   if (!init_streams_for_device (ptx_dev, async_engines))
     return NULL;
 
@@ -1120,6 +1193,14 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   void *hp, *dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   const char *maybe_abort_msg = "(perhaps abort was called)";
+  int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
+  int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
+  int dev_size = nvptx_thread ()->ptx_dev->num_sms;
+  int warp_size = nvptx_thread ()->ptx_dev->warp_size;
+  int rf_size = nvptx_thread ()->ptx_dev->regs_per_sm;
+  int reg_unit_size = nvptx_thread ()->ptx_dev->register_allocation_unit_size;
+  int reg_granularity
+    = nvptx_thread ()->ptx_dev->register_allocation_granularity;
 
   function = targ_fn->fn;
 
@@ -1138,71 +1219,92 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
        seen_zero = 1;
     }
 
-  if (seen_zero)
-    {
-      /* See if the user provided GOMP_OPENACC_DIM environment
-	 variable to specify runtime defaults. */
-      static int default_dims[GOMP_DIM_MAX];
+  /* Calculate the optimal number of gangs for the current device.  */
+  int reg_used = targ_fn->regs_per_thread;
+  int reg_per_warp = ((reg_used * warp_size + reg_unit_size - 1)
+		      / reg_unit_size) * reg_unit_size;
+  int threads_per_sm = (rf_size / reg_per_warp / reg_granularity)
+    * reg_granularity * warp_size;
+  int threads_per_block = threads_per_sm > block_size
+    ? block_size : threads_per_sm;
 
-      pthread_mutex_lock (&ptx_dev_lock);
-      if (!default_dims[0])
-	{
-	  for (int i = 0; i < GOMP_DIM_MAX; ++i)
-	    default_dims[i] = GOMP_PLUGIN_acc_default_dim (i);
-
-	  int warp_size, block_size, dev_size, cpu_size;
-	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
-	  /* 32 is the default for known hardware.  */
-	  int gang = 0, worker = 32, vector = 32;
-	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
-
-	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
-	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
-	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
-	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
-
-	  if (CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &block_size, cu_tpb,
-				 dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &warp_size, cu_ws,
-				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &dev_size, cu_mpc,
-				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &cpu_size, cu_tpm,
-				    dev) == CUDA_SUCCESS)
-	    {
-	      GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
-				 " dev_size=%d, cpu_size=%d\n",
-				 warp_size, block_size, dev_size, cpu_size);
-	      gang = (cpu_size / block_size) * dev_size;
-	      worker = block_size / warp_size;
-	      vector = warp_size;
-	    }
+  threads_per_block /= warp_size;
 
-	  /* There is no upper bound on the gang size.  The best size
-	     matches the hardware configuration.  Logical gangs are
-	     scheduled onto physical hardware.  To maximize usage, we
-	     should guess a large number.  */
-	  if (default_dims[GOMP_DIM_GANG] < 1)
-	    default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
-	  /* The worker size must not exceed the hardware.  */
-	  if (default_dims[GOMP_DIM_WORKER] < 1
-	      || (default_dims[GOMP_DIM_WORKER] > worker && gang))
-	    default_dims[GOMP_DIM_WORKER] = worker;
-	  /* The vector size must exactly match the hardware.  */
-	  if (default_dims[GOMP_DIM_VECTOR] < 1
-	      || (default_dims[GOMP_DIM_VECTOR] != vector && gang))
-	    default_dims[GOMP_DIM_VECTOR] = vector;
-
-	  GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n",
-			     default_dims[GOMP_DIM_GANG],
-			     default_dims[GOMP_DIM_WORKER],
-			     default_dims[GOMP_DIM_VECTOR]);
-	}
-      pthread_mutex_unlock (&ptx_dev_lock);
+  if (threads_per_sm > cpu_size)
+    threads_per_sm = cpu_size;
 
+  /* Set default launch geometry.  */
+  static int default_dims[GOMP_DIM_MAX];
+  pthread_mutex_lock (&ptx_dev_lock);
+  if (!default_dims[0])
+    {
+      /* 32 is the default for known hardware.  */
+      int gang = 0, worker = 32, vector = 32;
+
+      gang = (cpu_size / block_size) * dev_size;
+      vector = warp_size;
+
+      /* If the user hasn't specified the number of gangs, determine
+	 it dynamically based on the hardware configuration.  */
+      if (default_dims[GOMP_DIM_GANG] == 0)
+	default_dims[GOMP_DIM_GANG] = -1;
+      /* The worker size must not exceed the hardware.  */
+      if (default_dims[GOMP_DIM_WORKER] < 1
+	  || (default_dims[GOMP_DIM_WORKER] > worker && gang))
+	default_dims[GOMP_DIM_WORKER] = -1;
+      /* The vector size must exactly match the hardware.  */
+      if (default_dims[GOMP_DIM_VECTOR] < 1
+	  || (default_dims[GOMP_DIM_VECTOR] != vector && gang))
+	default_dims[GOMP_DIM_VECTOR] = vector;
+
+      GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n",
+			 default_dims[GOMP_DIM_GANG],
+			 default_dims[GOMP_DIM_WORKER],
+			 default_dims[GOMP_DIM_VECTOR]);
+    }
+  pthread_mutex_unlock (&ptx_dev_lock);
+
+  if (seen_zero)
+    {
       for (i = 0; i != GOMP_DIM_MAX; i++)
-	if (!dims[i])
-	  dims[i] = default_dims[i];
+ 	if (!dims[i])
+	  {
+	    if (default_dims[i] > 0)
+	      dims[i] = default_dims[i];
+	    else
+	      switch (i) {
+	      case GOMP_DIM_GANG:
+		/* The constant 2 was emperically.  The justification
+		   behind it is to prevent the hardware from idling by
+		   throwing twice the amount of work that it can
+		   physically handle.  */
+		dims[i] = (reg_granularity > 0)
+		  ? 2 * threads_per_sm / warp_size * dev_size
+		  : 2 * dev_size;
+		break;
+	      case GOMP_DIM_WORKER:
+		dims[i] = threads_per_block;
+		break;
+	      case GOMP_DIM_VECTOR:
+		dims[i] = warp_size;
+		break;
+	      default:
+		abort ();
+	      }
+	  }
+    }
+
+  /* Check if the accelerator has sufficient hardware resources to
+     launch the offloaded kernel.  */
+  if (dims[GOMP_DIM_WORKER] > 1)
+    {
+      if (reg_granularity > 0 && dims[GOMP_DIM_WORKER] > threads_per_block)
+	GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
+			   "to launch '%s'; recompile the program with "
+			   "'num_workers = %d' on that offloaded region or "
+			   "'-fopenacc-dim=-:%d'.\n",
+			   targ_fn->launch->fn, threads_per_block,
+			   threads_per_block);
     }
 
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
@@ -1870,6 +1972,39 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       targ_fns->regs_per_thread = nregs;
       targ_fns->max_threads_per_block = mthrs;
 
+      if (!dev->binary_version)
+	{
+	  int val;
+	  CUDA_CALL_ERET (-1, cuFuncGetAttribute, &val,
+			  CU_FUNC_ATTRIBUTE_BINARY_VERSION, function);
+	  dev->binary_version = val;
+
+	  /* These values were obtained from the CUDA Occupancy Calculator
+	     spreadsheet.  */
+	  if (dev->binary_version == 20
+	      || dev->binary_version == 21)
+	    {
+	    dev->register_allocation_unit_size = 128;
+	    dev->register_allocation_granularity = 2;
+	    }
+	  else if (dev->binary_version == 60)
+	    {
+	      dev->register_allocation_unit_size = 256;
+	      dev->register_allocation_granularity = 2;
+	    }
+	  else if (dev->binary_version <= 70)
+	    {
+	      dev->register_allocation_unit_size = 256;
+	      dev->register_allocation_granularity = 4;
+	    }
+	  else
+	    {
+	      /* Fallback to -1 to for unknown targets.  */
+	      dev->register_allocation_unit_size = -1;
+	      dev->register_allocation_granularity = -1;
+	    }
+	}
+
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
     }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-06-20 21:59 [patch] adjust default nvptx launch geometry for OpenACC offloaded regions Cesar Philippidis
@ 2018-06-20 22:16 ` Tom de Vries
  2018-06-21 13:58   ` Cesar Philippidis
  2018-06-29 17:16 ` Cesar Philippidis
  1 sibling, 1 reply; 14+ messages in thread
From: Tom de Vries @ 2018-06-20 22:16 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Jakub Jelinek

On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
> Now it follows the formula contained in
> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.

Any reason we're not using the cuda runtime functions to get the
occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
to determine launch configuration in nvptx ) ?

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-06-20 22:16 ` Tom de Vries
@ 2018-06-21 13:58   ` Cesar Philippidis
  2018-07-02 14:14     ` Tom de Vries
  0 siblings, 1 reply; 14+ messages in thread
From: Cesar Philippidis @ 2018-06-21 13:58 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek

On 06/20/2018 03:15 PM, Tom de Vries wrote:
> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>> Now it follows the formula contained in
>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
> 
> Any reason we're not using the cuda runtime functions to get the
> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
> to determine launch configuration in nvptx ) ?

There are two reasons:

  1) cuda_occupancy.h depends on the CUDA runtime to extract the device
     properties instead of the CUDA driver API. However, we can always
     teach libgomp how to populate the cudaDeviceProp struct using the
     driver API.

  2) CUDA is not always present on the build host, and that's why
     libgomp maintains its own cuda.h. So at the very least, this
     functionality would be good to have in libgomp as a fallback
     implementation; its not good to have program fail due to
     insufficient hardware resources errors when it is avoidable.

Cesar

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-06-21 13:58   ` Cesar Philippidis
@ 2018-07-02 14:14     ` Tom de Vries
  2018-07-02 14:39       ` Cesar Philippidis
  2018-07-11 19:13       ` Cesar Philippidis
  0 siblings, 2 replies; 14+ messages in thread
From: Tom de Vries @ 2018-07-02 14:14 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

On 06/21/2018 03:58 PM, Cesar Philippidis wrote:
> On 06/20/2018 03:15 PM, Tom de Vries wrote:
>> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>>> Now it follows the formula contained in
>>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>>
>> Any reason we're not using the cuda runtime functions to get the
>> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
>> to determine launch configuration in nvptx ) ?
> 
> There are two reasons:
> 
>   1) cuda_occupancy.h depends on the CUDA runtime to extract the device
>      properties instead of the CUDA driver API. However, we can always
>      teach libgomp how to populate the cudaDeviceProp struct using the
>      driver API.
> 
>   2) CUDA is not always present on the build host, and that's why
>      libgomp maintains its own cuda.h. So at the very least, this
>      functionality would be good to have in libgomp as a fallback
>      implementation;

Libgomp maintains its own cuda.h to "allow building GCC with PTX
offloading even without CUDA being installed" (
https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00980.html ).

The libgomp nvptx plugin however uses the cuda driver API to launch
kernels etc, so we can assume that's always available at launch time.
And according to the "CUDA Pro Tip: Occupancy API Simplifies Launch
Configuration", the occupancy API is also available in the driver API.

What we cannot assume to be available is the occupancy API pre cuda-6.5.
So it's fine to have a fallback for that (properly isolated in utility
functions), but for cuda 6.5 and up we want to use the occupancy API.

>      its not good to have program fail due to
>      insufficient hardware resources errors when it is avoidable.
>

Right, in fact there are two separate things you're trying to address
here: launch failure and occupancy heuristic, so split the patch.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-07-02 14:14     ` Tom de Vries
@ 2018-07-02 14:39       ` Cesar Philippidis
  2018-07-11 19:13       ` Cesar Philippidis
  1 sibling, 0 replies; 14+ messages in thread
From: Cesar Philippidis @ 2018-07-02 14:39 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

On 07/02/2018 07:14 AM, Tom de Vries wrote:
> On 06/21/2018 03:58 PM, Cesar Philippidis wrote:
>> On 06/20/2018 03:15 PM, Tom de Vries wrote:
>>> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>>>> Now it follows the formula contained in
>>>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>>>
>>> Any reason we're not using the cuda runtime functions to get the
>>> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
>>> to determine launch configuration in nvptx ) ?
>>
>> There are two reasons:
>>
>>   1) cuda_occupancy.h depends on the CUDA runtime to extract the device
>>      properties instead of the CUDA driver API. However, we can always
>>      teach libgomp how to populate the cudaDeviceProp struct using the
>>      driver API.
>>
>>   2) CUDA is not always present on the build host, and that's why
>>      libgomp maintains its own cuda.h. So at the very least, this
>>      functionality would be good to have in libgomp as a fallback
>>      implementation;
> 
> Libgomp maintains its own cuda.h to "allow building GCC with PTX
> offloading even without CUDA being installed" (
> https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00980.html ).
> 
> The libgomp nvptx plugin however uses the cuda driver API to launch
> kernels etc, so we can assume that's always available at launch time.
> And according to the "CUDA Pro Tip: Occupancy API Simplifies Launch
> Configuration", the occupancy API is also available in the driver API.

Thanks for the info. I was not aware that the CUDA driver API had a
thread occupancy calculator (it' described in section 4.18).

> What we cannot assume to be available is the occupancy API pre cuda-6.5.
> So it's fine to have a fallback for that (properly isolated in utility
> functions), but for cuda 6.5 and up we want to use the occupancy API.

That seems reasonable. I'll run some experiments with that. In the
meantime, would it be OK to make this fallback the default, then add
support for the driver occupancy calculator as a follow up?

>>      its not good to have program fail due to
>>      insufficient hardware resources errors when it is avoidable.
>>
> 
> Right, in fact there are two separate things you're trying to address
> here: launch failure and occupancy heuristic, so split the patch.

ACK. I'll split those changes into separate patches.

By the way, do you have any preferences on how to break up the nvptx
vector length changes for trunk submission? I was planning on breaking
it down into four components - generic ME changes, tests, nvptx
reductions and the rest. Those two nvptx compoinents are large, so I'll
probably break them down to smaller patches, but I'm not sure if it's
worthwhile to make them independent from one another with the use of a
lot of stub functions.

Cesar

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-07-02 14:14     ` Tom de Vries
  2018-07-02 14:39       ` Cesar Philippidis
@ 2018-07-11 19:13       ` Cesar Philippidis
  2018-07-26 11:58         ` Tom de Vries
                           ` (4 more replies)
  1 sibling, 5 replies; 14+ messages in thread
From: Cesar Philippidis @ 2018-07-11 19:13 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 3116 bytes --]

On 07/02/2018 07:14 AM, Tom de Vries wrote:
> On 06/21/2018 03:58 PM, Cesar Philippidis wrote:
>> On 06/20/2018 03:15 PM, Tom de Vries wrote:
>>> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>>>> Now it follows the formula contained in
>>>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>>>
>>> Any reason we're not using the cuda runtime functions to get the
>>> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
>>> to determine launch configuration in nvptx ) ?
>>
>> There are two reasons:
>>
>>   1) cuda_occupancy.h depends on the CUDA runtime to extract the device
>>      properties instead of the CUDA driver API. However, we can always
>>      teach libgomp how to populate the cudaDeviceProp struct using the
>>      driver API.
>>
>>   2) CUDA is not always present on the build host, and that's why
>>      libgomp maintains its own cuda.h. So at the very least, this
>>      functionality would be good to have in libgomp as a fallback
>>      implementation;
> 
> Libgomp maintains its own cuda.h to "allow building GCC with PTX
> offloading even without CUDA being installed" (
> https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00980.html ).
> 
> The libgomp nvptx plugin however uses the cuda driver API to launch
> kernels etc, so we can assume that's always available at launch time.
> And according to the "CUDA Pro Tip: Occupancy API Simplifies Launch
> Configuration", the occupancy API is also available in the driver API.
> 
> What we cannot assume to be available is the occupancy API pre cuda-6.5.
> So it's fine to have a fallback for that (properly isolated in utility
> functions), but for cuda 6.5 and up we want to use the occupancy API.

Here's revision 2 to the patch. I replaced all of my thread occupancy
heuristics with calls to the CUDA driver as you suggested. The
performance is worse than my heuristics, but that's to be expected
because the CUDA driver only guarantees the minimal launch geometry to
to fully utilize the hardware, and not the optimal value. I'll
reintroduce my heuristics later as a follow up patch. The major
advantage of the CUDA thread occupancy calculator is that it allows the
runtime to select sensible default num_workers to avoid those annoying
runtime failures due to insufficient GPU hardware resources.

One thing that may stick out in this patch is how it probes for the
driver version instead of the API version. It turns out that the API
version corresponds to the SM version declared in the PTX sources,
whereas the driver version corresponds to the latest version of CUDA
supported by the driver. At least that's the case with driver version
396.24.

>>      its not good to have program fail due to
>>      insufficient hardware resources errors when it is avoidable.
>>
> 
> Right, in fact there are two separate things you're trying to address
> here: launch failure and occupancy heuristic, so split the patch.

That hunk was small, so I included it with this patch. Although if you
insist, I can remove it.

Is this patch OK for trunk? I tested it x86_64 with nvptx offloading.

Cesar

[-- Attachment #2: trunk-libgomp-default-par.diff --]
[-- Type: text/x-patch, Size: 8118 bytes --]

2018-07-XX  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	gcc/
	* config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Rename to ...
	(PTX_DEFAULT_RUNTIME_DIM): ... this.
	(nvptx_goacc_validate_dims): Set default worker and gang dims to
	PTX_DEFAULT_RUNTIME_DIM.
	(nvptx_dim_limit): Ignore GOMP_DIM_WORKER;

	libgomp/
	* plugin/cuda/cuda.h (CUoccupancyB2DSize): Declare.
	(cuOccupancyMaxPotentialBlockSizeWithFlags): Likewise.
	* plugin/plugin-nvptx.c (struct ptx_device): Add driver_version member.
	(nvptx_open_device): Set it.
	(nvptx_exec): Use the CUDA driver to both determine default num_gangs
	and num_workers, and error if the hardware doesn't have sufficient
	resources to launch a kernel.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5608bee8a8d..c1946e75f42 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5165,7 +5165,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
 #define PTX_WORKER_LENGTH 32
-#define PTX_GANG_DEFAULT  0 /* Defer to runtime.  */
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
@@ -5214,9 +5214,9 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
     {
       dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
+	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
-	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
       changed = true;
     }
 
@@ -5230,9 +5230,6 @@ nvptx_dim_limit (int axis)
 {
   switch (axis)
     {
-    case GOMP_DIM_WORKER:
-      return PTX_WORKER_LENGTH;
-
     case GOMP_DIM_VECTOR:
       return PTX_VECTOR_LENGTH;
 
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index 4799825bda2..1ee59db172c 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -44,6 +44,7 @@ typedef void *CUevent;
 typedef void *CUfunction;
 typedef void *CUlinkState;
 typedef void *CUmodule;
+typedef size_t (*CUoccupancyB2DSize)(int);
 typedef void *CUstream;
 
 typedef enum {
@@ -170,6 +171,9 @@ CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
 CUresult cuModuleLoad (CUmodule *, const char *);
 CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
+CUresult cuOccupancyMaxPotentialBlockSizeWithFlags(int *, int *, CUfunction,
+						   CUoccupancyB2DSize, size_t,
+						   int, unsigned int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 89326e57741..5022e462a3d 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -94,6 +94,7 @@ CUDA_ONE_CALL (cuModuleGetGlobal)	\
 CUDA_ONE_CALL (cuModuleLoad)		\
 CUDA_ONE_CALL (cuModuleLoadData)	\
 CUDA_ONE_CALL (cuModuleUnload)		\
+CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \
 CUDA_ONE_CALL (cuStreamCreate)		\
 CUDA_ONE_CALL (cuStreamDestroy)		\
 CUDA_ONE_CALL (cuStreamQuery)		\
@@ -414,6 +415,7 @@ struct ptx_device
   int num_sms;
   int regs_per_block;
   int regs_per_sm;
+  int driver_version;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -725,6 +727,7 @@ nvptx_open_device (int n)
   ptx_dev->ord = n;
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
+  ptx_dev->driver_version = 0;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -806,6 +809,9 @@ nvptx_open_device (int n)
   if (r != CUDA_SUCCESS)
     async_engines = 1;
 
+  CUDA_CALL_ERET (NULL, cuDriverGetVersion, &pi);
+  ptx_dev->driver_version = pi;
+
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
@@ -1120,6 +1126,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   void *hp, *dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   const char *maybe_abort_msg = "(perhaps abort was called)";
+  int dev_size = nvthd->ptx_dev->num_sms;
 
   function = targ_fn->fn;
 
@@ -1140,8 +1147,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
   if (seen_zero)
     {
-      /* See if the user provided GOMP_OPENACC_DIM environment
-	 variable to specify runtime defaults. */
+      /* Specify runtime defaults. */
       static int default_dims[GOMP_DIM_MAX];
 
       pthread_mutex_lock (&ptx_dev_lock);
@@ -1150,23 +1156,20 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	  for (int i = 0; i < GOMP_DIM_MAX; ++i)
 	    default_dims[i] = GOMP_PLUGIN_acc_default_dim (i);
 
-	  int warp_size, block_size, dev_size, cpu_size;
+	  int warp_size, block_size, cpu_size;
 	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
 	  /* 32 is the default for known hardware.  */
 	  int gang = 0, worker = 32, vector = 32;
-	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
+	  CUdevice_attribute cu_tpb, cu_ws, cu_tpm;
 
 	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
 	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
-	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
 	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
 
 	  if (CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &block_size, cu_tpb,
 				 dev) == CUDA_SUCCESS
 	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &warp_size, cu_ws,
 				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &dev_size, cu_mpc,
-				    dev) == CUDA_SUCCESS
 	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &cpu_size, cu_tpm,
 				    dev) == CUDA_SUCCESS)
 	    {
@@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 			     default_dims[GOMP_DIM_VECTOR]);
 	}
       pthread_mutex_unlock (&ptx_dev_lock);
+      int vectors = default_dims[GOMP_DIM_VECTOR];
+      int workers = default_dims[GOMP_DIM_WORKER];
+      int gangs = default_dims[GOMP_DIM_GANG];
+
+      if (nvptx_thread()->ptx_dev->driver_version > 6050)
+	{
+	  int grids, blocks;
+	  CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
+			    &blocks, function, NULL, 0,
+			    dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
+	  GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
+			     "grid = %d, block = %d\n", grids, blocks);
+
+	  gangs = grids * dev_size;
+	  workers = blocks / vectors;
+	}
 
       for (i = 0; i != GOMP_DIM_MAX; i++)
 	if (!dims[i])
-	  dims[i] = default_dims[i];
+	  {
+	    switch (i)
+	      {
+	      case GOMP_DIM_GANG:
+		/* The constant 2 was emperically.  The justification
+		   behind it is to prevent the hardware from idling by
+		   throwing twice the amount of work that it can
+		   physically handle.  */
+		dims[i] = gangs;
+		break;
+	      case GOMP_DIM_WORKER:
+		dims[i] = workers;
+		break;
+	      case GOMP_DIM_VECTOR:
+		dims[i] = vectors;
+		break;
+	      default:
+		abort ();
+	      }
+	  }
     }
 
+  /* Check if the accelerator has sufficient hardware resources to
+     launch the offloaded kernel.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]
+      > targ_fn->max_threads_per_block)
+    GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources to"
+		       " launch '%s' with num_workers = %d and vector_length ="
+		       " %d; recompile the program with 'num_workers = x and"
+		       " vector_length = y' on that offloaded region or "
+		       "'-fopenacc-dim=-:x:y' where x * y <= %d.\n",
+		       targ_fn->launch->fn, dims[GOMP_DIM_WORKER],
+		       dims[GOMP_DIM_VECTOR], targ_fn->max_threads_per_block);
+
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-07-11 19:13       ` Cesar Philippidis
@ 2018-07-26 11:58         ` Tom de Vries
  2018-07-26 12:13         ` [libgomp, nvptx] Move device property sampling from nvptx_exec to nvptx_open Tom de Vries
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Tom de Vries @ 2018-07-26 11:58 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

> Content-Type: text/x-patch; name="trunk-libgomp-default-par.diff"
> Content-Transfer-Encoding: 7bit
> Content-Disposition: attachment; filename="trunk-libgomp-default-par.diff"

From https://gcc.gnu.org/contribute.html#patches :
...
We prefer patches posted as plain text or as MIME parts of type
text/x-patch or text/plain, disposition inline, encoded as 7bit or 8bit.
It is strongly discouraged to post patches as MIME parts of type
application/whatever, disposition attachment or encoded as base64 or
quoted-printable.
...

Please post with content-disposition inline instead of attachment (or,
as plain text).

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [libgomp, nvptx] Move device property sampling from nvptx_exec to nvptx_open
  2018-07-11 19:13       ` Cesar Philippidis
  2018-07-26 11:58         ` Tom de Vries
@ 2018-07-26 12:13         ` Tom de Vries
  2018-07-26 12:45         ` [patch] adjust default nvptx launch geometry for OpenACC offloaded regions Tom de Vries
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Tom de Vries @ 2018-07-26 12:13 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 2024 bytes --]

> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
> index 89326e57741..5022e462a3d 100644
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -1120,6 +1126,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>    void *hp, *dp;
>    struct nvptx_thread *nvthd = nvptx_thread ();
>    const char *maybe_abort_msg = "(perhaps abort was called)";
> +  int dev_size = nvthd->ptx_dev->num_sms;
>  
>    function = targ_fn->fn;
>  
> @@ -1150,23 +1156,20 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>  	  for (int i = 0; i < GOMP_DIM_MAX; ++i)
>  	    default_dims[i] = GOMP_PLUGIN_acc_default_dim (i);
>  
> -	  int warp_size, block_size, dev_size, cpu_size;
> +	  int warp_size, block_size, cpu_size;
>  	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
>  	  /* 32 is the default for known hardware.  */
>  	  int gang = 0, worker = 32, vector = 32;
> -	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
> +	  CUdevice_attribute cu_tpb, cu_ws, cu_tpm;
>  
>  	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
>  	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
> -	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
>  	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
>  
>  	  if (CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &block_size, cu_tpb,
>  				 dev) == CUDA_SUCCESS
>  	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &warp_size, cu_ws,
>  				    dev) == CUDA_SUCCESS
> -	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &dev_size, cu_mpc,
> -				    dev) == CUDA_SUCCESS
>  	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &cpu_size, cu_tpm,
>  				    dev) == CUDA_SUCCESS)
>  	    {

This is a good idea (and should have been an independent patch of course).

Furthermore, it's better to move the remaining cuDeviceGetAttribute
calls to nvptx_open, as was already suggested by Thomas here (
https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01020.html ).

Committed to trunk.

- Tom

[-- Attachment #2: 0001-libgomp-nvptx-Move-device-property-sampling-from-nvptx_exec-to-nvptx_open.patch --]
[-- Type: text/x-patch, Size: 3829 bytes --]

[libgomp, nvptx] Move device property sampling from nvptx_exec to nvptx_open

Move sampling of device properties from nvptx_exec to nvptx_open, and assume
the sampling always succeeds.  This simplifies the default dimension
initialization code in nvptx_open.

2018-07-26  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	* plugin/plugin-nvptx.c (struct ptx_device): Add warp_size,
	max_threads_per_block and max_threads_per_multiprocessor fields.
	(nvptx_open_device): Initialize new fields.
	(nvptx_exec): Use num_sms, and new fields.

---
 libgomp/plugin/plugin-nvptx.c | 53 +++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 27 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 89326e57741..5d9b5151e95 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -414,6 +414,9 @@ struct ptx_device
   int num_sms;
   int regs_per_block;
   int regs_per_sm;
+  int warp_size;
+  int max_threads_per_block;
+  int max_threads_per_multiprocessor;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -800,6 +803,15 @@ nvptx_open_device (int n)
       GOMP_PLUGIN_error ("Only warp size 32 is supported");
       return NULL;
     }
+  ptx_dev->warp_size = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute, &pi,
+		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
+  ptx_dev->max_threads_per_block = pi;
+
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute, &pi,
+		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
+  ptx_dev->max_threads_per_multiprocessor = pi;
 
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
@@ -1150,33 +1162,20 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	  for (int i = 0; i < GOMP_DIM_MAX; ++i)
 	    default_dims[i] = GOMP_PLUGIN_acc_default_dim (i);
 
-	  int warp_size, block_size, dev_size, cpu_size;
-	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
-	  /* 32 is the default for known hardware.  */
-	  int gang = 0, worker = 32, vector = 32;
-	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
-
-	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
-	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
-	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
-	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
-
-	  if (CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &block_size, cu_tpb,
-				 dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &warp_size, cu_ws,
-				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &dev_size, cu_mpc,
-				    dev) == CUDA_SUCCESS
-	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &cpu_size, cu_tpm,
-				    dev) == CUDA_SUCCESS)
-	    {
-	      GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
-				 " dev_size=%d, cpu_size=%d\n",
-				 warp_size, block_size, dev_size, cpu_size);
-	      gang = (cpu_size / block_size) * dev_size;
-	      worker = block_size / warp_size;
-	      vector = warp_size;
-	    }
+	  int gang, worker, vector;
+	  {
+	    int warp_size = nvthd->ptx_dev->warp_size;
+	    int block_size = nvthd->ptx_dev->max_threads_per_block;
+	    int cpu_size = nvthd->ptx_dev->max_threads_per_multiprocessor;
+	    int dev_size = nvthd->ptx_dev->num_sms;
+	    GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
+			       " dev_size=%d, cpu_size=%d\n",
+			       warp_size, block_size, dev_size, cpu_size);
+
+	    gang = (cpu_size / block_size) * dev_size;
+	    worker = block_size / warp_size;
+	    vector = warp_size;
+	  }
 
 	  /* There is no upper bound on the gang size.  The best size
 	     matches the hardware configuration.  Logical gangs are

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-07-11 19:13       ` Cesar Philippidis
  2018-07-26 11:58         ` Tom de Vries
  2018-07-26 12:13         ` [libgomp, nvptx] Move device property sampling from nvptx_exec to nvptx_open Tom de Vries
@ 2018-07-26 12:45         ` Tom de Vries
  2018-07-26 14:27         ` Cesar Philippidis
  2018-07-30 10:16         ` Tom de Vries
  4 siblings, 0 replies; 14+ messages in thread
From: Tom de Vries @ 2018-07-26 12:45 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1564 bytes --]

>> Right, in fact there are two separate things you're trying to address
>> here: launch failure and occupancy heuristic, so split the patch.

> That hunk was small, so I included it with this patch. Although if you
> insist, I can remove it.

Please, for future reference, always assume that I insist instead of
asking me, unless you have an argument to present why that is not a good
idea. And just to be clear here: "small" is not such an argument.

Please keep in mind ( https://gcc.gnu.org/contribute.html#patches ):
...
Don't mix together changes made for different reasons. Send them
individually.
...

> +  /* Check if the accelerator has sufficient hardware resources to
> +     launch the offloaded kernel.  */
> +  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]
> +      > targ_fn->max_threads_per_block)
> +    GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources to"
> +		       " launch '%s' with num_workers = %d and vector_length ="
> +		       " %d; recompile the program with 'num_workers = x and"
> +		       " vector_length = y' on that offloaded region or "
> +		       "'-fopenacc-dim=-:x:y' where x * y <= %d.\n",
> +		       targ_fn->launch->fn, dims[GOMP_DIM_WORKER],
> +		       dims[GOMP_DIM_VECTOR], targ_fn->max_threads_per_block);
> +

This is copied from the state on an openacc branch where vector-length
is variable, and the error message text doesn't make sense on current
trunk for that reason. Also, it suggests a syntax for fopenacc-dim
that's not supported on trunk.

Committed as attached.

Thanks,
- Tom

[-- Attachment #2: 0002-libgomp-nvptx-Add-error-with-recompilation-hint-for-launch-failure.patch --]
[-- Type: text/x-patch, Size: 1866 bytes --]

[libgomp, nvptx] Add error with recompilation hint for launch failure

Currently, when a kernel is lauched with too many workers, it results in a cuda
launch failure.  This is triggered f.i. for parallel-loop-1.c at -O0 on a Quadro
M1200.

This patch detects this situation, and errors out with a hint on how to fix it.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-07-26  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	* plugin/plugin-nvptx.c (nvptx_exec): Error if the hardware doesn't have
	sufficient resources to launch a kernel, and give a hint on how to fix
	it.

---
 libgomp/plugin/plugin-nvptx.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 5d9b5151e95..3a4077a1315 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1204,6 +1204,21 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	  dims[i] = default_dims[i];
     }
 
+  /* Check if the accelerator has sufficient hardware resources to
+     launch the offloaded kernel.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]
+      > targ_fn->max_threads_per_block)
+    {
+      int suggest_workers
+	= targ_fn->max_threads_per_block / dims[GOMP_DIM_VECTOR];
+      GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources to"
+			 " launch '%s' with num_workers = %d; recompile the"
+			 " program with 'num_workers = %d' on that offloaded"
+			 " region or '-fopenacc-dim=:%d'",
+			 targ_fn->launch->fn, dims[GOMP_DIM_WORKER],
+			 suggest_workers, suggest_workers);
+    }
+
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-07-11 19:13       ` Cesar Philippidis
                           ` (2 preceding siblings ...)
  2018-07-26 12:45         ` [patch] adjust default nvptx launch geometry for OpenACC offloaded regions Tom de Vries
@ 2018-07-26 14:27         ` Cesar Philippidis
  2018-07-26 15:18           ` Tom de Vries
  2018-07-30 10:16         ` Tom de Vries
  4 siblings, 1 reply; 14+ messages in thread
From: Cesar Philippidis @ 2018-07-26 14:27 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

Hi Tom,

I see that you're reviewing the libgomp changes. Please disregard the
following hunk:

On 07/11/2018 12:13 PM, Cesar Philippidis wrote:
> @@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>  			     default_dims[GOMP_DIM_VECTOR]);
>  	}
>        pthread_mutex_unlock (&ptx_dev_lock);
> +      int vectors = default_dims[GOMP_DIM_VECTOR];
> +      int workers = default_dims[GOMP_DIM_WORKER];
> +      int gangs = default_dims[GOMP_DIM_GANG];
> +
> +      if (nvptx_thread()->ptx_dev->driver_version > 6050)
> +	{
> +	  int grids, blocks;
> +	  CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
> +			    &blocks, function, NULL, 0,
> +			    dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
> +	  GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
> +			     "grid = %d, block = %d\n", grids, blocks);
> +
> +	  gangs = grids * dev_size;
> +	  workers = blocks / vectors;
> +	}

I revisited this change yesterday and I noticed it was setting gangs
incorrectly. Basically, gangs should be set as follows

  gangs = grids * (blocks / warp_size);

or to be more closer to og8 as

  gangs = 2 * grids * (blocks / warp_size);

The use of that magic constant 2 is to prevent thread starvation. That's
a similar concept behind make -j<2*#threads>.

Anyway, I'm still experimenting with that change. There are still some
discrepancies between the way that I select num_workers and how the
driver does. The driver appears to be a little bit more conservative,
but according to the thread occupancy calculator, that should yield
greater performance on GPUs.

I just wanted to give you a heads up because you seem to be working on this.

Thanks for all of your reviews!

By the way, are you now maintainer of the libgomp nvptx plugin?

Cesar

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-07-26 14:27         ` Cesar Philippidis
@ 2018-07-26 15:18           ` Tom de Vries
  0 siblings, 0 replies; 14+ messages in thread
From: Tom de Vries @ 2018-07-26 15:18 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

On 07/26/2018 04:27 PM, Cesar Philippidis wrote:
> Hi Tom,
> 
> I see that you're reviewing the libgomp changes. Please disregard the
> following hunk:
> 
> On 07/11/2018 12:13 PM, Cesar Philippidis wrote:
>> @@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>  			     default_dims[GOMP_DIM_VECTOR]);
>>  	}
>>        pthread_mutex_unlock (&ptx_dev_lock);
>> +      int vectors = default_dims[GOMP_DIM_VECTOR];
>> +      int workers = default_dims[GOMP_DIM_WORKER];
>> +      int gangs = default_dims[GOMP_DIM_GANG];
>> +
>> +      if (nvptx_thread()->ptx_dev->driver_version > 6050)
>> +	{
>> +	  int grids, blocks;
>> +	  CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
>> +			    &blocks, function, NULL, 0,
>> +			    dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
>> +	  GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
>> +			     "grid = %d, block = %d\n", grids, blocks);
>> +
>> +	  gangs = grids * dev_size;
>> +	  workers = blocks / vectors;
>> +	}
> 
> I revisited this change yesterday and I noticed it was setting gangs
> incorrectly. Basically, gangs should be set as follows
> 
>   gangs = grids * (blocks / warp_size);
> 
> or to be more closer to og8 as
> 
>   gangs = 2 * grids * (blocks / warp_size);
> 
> The use of that magic constant 2 is to prevent thread starvation. That's
> a similar concept behind make -j<2*#threads>.
> 
> Anyway, I'm still experimenting with that change. There are still some
> discrepancies between the way that I select num_workers and how the
> driver does. The driver appears to be a little bit more conservative,
> but according to the thread occupancy calculator, that should yield
> greater performance on GPUs.
> 
> I just wanted to give you a heads up because you seem to be working on this.
> 

Ack, thanks for letting me know.

> Thanks for all of your reviews!
> 
> By the way, are you now maintainer of the libgomp nvptx plugin?

I'm not sure if that's a separate thing.

AFAIU the responsibilities of the nvptx maintainer are:
- the nvptx backend (under supervision of the global maintainers)
- and anything nvptx-y in all other components (under supervision of the
  component and global maintainers)

So, I'd say I'm on the hook to review patches for the nvptx plugin in
libgomp.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-07-11 19:13       ` Cesar Philippidis
                           ` (3 preceding siblings ...)
  2018-07-26 14:27         ` Cesar Philippidis
@ 2018-07-30 10:16         ` Tom de Vries
  4 siblings, 0 replies; 14+ messages in thread
From: Tom de Vries @ 2018-07-30 10:16 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 466 bytes --]

On 07/11/2018 09:13 PM, Cesar Philippidis wrote:
> 2018-07-XX  Cesar Philippidis  <cesar@codesourcery.com>
> 	    Tom de Vries  <tdevries@suse.de>
> 
> 	gcc/
> 	* config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Rename to ...
> 	(PTX_DEFAULT_RUNTIME_DIM): ... this.
> 	(nvptx_goacc_validate_dims): Set default worker and gang dims to
> 	PTX_DEFAULT_RUNTIME_DIM.
> 	(nvptx_dim_limit): Ignore GOMP_DIM_WORKER;

That's an independent patch.

Committed at below.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-offloading-Determine-default-workers-at-runtime.patch --]
[-- Type: text/x-patch, Size: 1911 bytes --]

[nvptx, offloading] Determine default workers at runtime

Currently, if the user doesn't specify the number of workers for an openacc
region, the compiler hardcodes it to a default value.

This patch removes this functionality, such that the libgomp runtime can decide
on a default value.

2018-07-27  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Rename to ...
	(PTX_DEFAULT_RUNTIME_DIM): ... this.
	(nvptx_goacc_validate_dims): Set default worker and gang dims to
	PTX_DEFAULT_RUNTIME_DIM.
	(nvptx_dim_limit): Ignore GOMP_DIM_WORKER.

---
 gcc/config/nvptx/nvptx.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5608bee8a8d..c1946e75f42 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5165,7 +5165,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
 #define PTX_WORKER_LENGTH 32
-#define PTX_GANG_DEFAULT  0 /* Defer to runtime.  */
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
@@ -5214,9 +5214,9 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
     {
       dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
+	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
-	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
       changed = true;
     }
 
@@ -5230,9 +5230,6 @@ nvptx_dim_limit (int axis)
 {
   switch (axis)
     {
-    case GOMP_DIM_WORKER:
-      return PTX_WORKER_LENGTH;
-
     case GOMP_DIM_VECTOR:
       return PTX_VECTOR_LENGTH;
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-06-20 21:59 [patch] adjust default nvptx launch geometry for OpenACC offloaded regions Cesar Philippidis
  2018-06-20 22:16 ` Tom de Vries
@ 2018-06-29 17:16 ` Cesar Philippidis
  2018-06-30 11:36   ` Cesar Philippidis
  1 sibling, 1 reply; 14+ messages in thread
From: Cesar Philippidis @ 2018-06-29 17:16 UTC (permalink / raw)
  To: gcc-patches, Jakub Jelinek; +Cc: tdevries

Ping.

Ceasr

On 06/20/2018 02:59 PM, Cesar Philippidis wrote:
> At present, the nvptx libgomp plugin does not take into account the
> amount of shared resources on GPUs (mostly shared-memory are register
> usage) when selecting the default num_gangs and num_workers. In certain
> situations, an OpenACC offloaded function can fail to launch if the GPU
> does not have sufficient shared resources to accommodate all of the
> threads in a CUDA block. This typically manifests when a PTX function
> uses a lot of registers and num_workers is set too large, although it
> can also happen if the shared-memory has been exhausted by the threads
> in a vector.
> 
> This patch resolves that issue by adjusting num_workers based the amount
> of shared resources used by each threads. If worker parallelism has been
> requested, libgomp will spawn as many workers as possible up to 32.
> Without this patch, libgomp would always default to launching 32 workers
> when worker parallelism is used.
> 
> Besides for the worker parallelism, this patch also includes some
> heuristics on selecting num_gangs. Before, the plugin would launch two
> gangs per GPU multiprocessor. Now it follows the formula contained in
> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
> 
> Is this patch OK for trunk?
> 
> Thanks,
> Cesar
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] adjust default nvptx launch geometry for OpenACC offloaded regions
  2018-06-29 17:16 ` Cesar Philippidis
@ 2018-06-30 11:36   ` Cesar Philippidis
  0 siblings, 0 replies; 14+ messages in thread
From: Cesar Philippidis @ 2018-06-30 11:36 UTC (permalink / raw)
  To: gcc-patches, Jakub Jelinek; +Cc: tdevries

On 06/29/2018 10:12 AM, Cesar Philippidis wrote:
> Ping.

While porting the vector length patches to trunk, I realized that I
mistakenly removed support for the environment variable GOMP_OPENACC_DIM
in this patch (thanks for adding those test case Tom!). I'll post an
updated version of this patch once I got the vector length patches
working with it.

Cesar

> On 06/20/2018 02:59 PM, Cesar Philippidis wrote:
>> At present, the nvptx libgomp plugin does not take into account the
>> amount of shared resources on GPUs (mostly shared-memory are register
>> usage) when selecting the default num_gangs and num_workers. In certain
>> situations, an OpenACC offloaded function can fail to launch if the GPU
>> does not have sufficient shared resources to accommodate all of the
>> threads in a CUDA block. This typically manifests when a PTX function
>> uses a lot of registers and num_workers is set too large, although it
>> can also happen if the shared-memory has been exhausted by the threads
>> in a vector.
>>
>> This patch resolves that issue by adjusting num_workers based the amount
>> of shared resources used by each threads. If worker parallelism has been
>> requested, libgomp will spawn as many workers as possible up to 32.
>> Without this patch, libgomp would always default to launching 32 workers
>> when worker parallelism is used.
>>
>> Besides for the worker parallelism, this patch also includes some
>> heuristics on selecting num_gangs. Before, the plugin would launch two
>> gangs per GPU multiprocessor. Now it follows the formula contained in
>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with CUDA.
>>
>> Is this patch OK for trunk?
>>
>> Thanks,
>> Cesar
>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-07-30 10:16 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-20 21:59 [patch] adjust default nvptx launch geometry for OpenACC offloaded regions Cesar Philippidis
2018-06-20 22:16 ` Tom de Vries
2018-06-21 13:58   ` Cesar Philippidis
2018-07-02 14:14     ` Tom de Vries
2018-07-02 14:39       ` Cesar Philippidis
2018-07-11 19:13       ` Cesar Philippidis
2018-07-26 11:58         ` Tom de Vries
2018-07-26 12:13         ` [libgomp, nvptx] Move device property sampling from nvptx_exec to nvptx_open Tom de Vries
2018-07-26 12:45         ` [patch] adjust default nvptx launch geometry for OpenACC offloaded regions Tom de Vries
2018-07-26 14:27         ` Cesar Philippidis
2018-07-26 15:18           ` Tom de Vries
2018-07-30 10:16         ` Tom de Vries
2018-06-29 17:16 ` Cesar Philippidis
2018-06-30 11:36   ` Cesar Philippidis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).