public inbox for gcc-patches@gcc.gnu.org
* [og7] vector_length extension part 1: generalize function and variable names
@ 2018-03-01 21:17 Cesar Philippidis
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                   ` (4 more replies)
  0 siblings, 5 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-01 21:17 UTC
  To: gcc-patches; +Cc: Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 1053 bytes --]

I'm in the process of adding support for larger vector_lengths in the
nvptx BE. To reduce the size of the final patch, I've separated all of
the miscellaneous function and variable renaming into this patch. Once
the nvptx BE is extended to support vector lengths that span multiple
CUDA warps, vectors will in certain respects act like workers with
regard to state propagation and reductions (i.e., large vectors will
use shared memory to propagate state from vector-single to
vector-partitioned mode, and also for reductions). To that end, this
patch renames worker functions and variables from worker-something to
shared-something. Likewise, vector-specific functions have been renamed
to warp-something.

This patch also introduces a new populate_offload_attrs function. At
present, that function is only used in nvptx_reorg, and it is actually
overkill for that. However, later on it will be used in other places,
including nvptx_validate_dims and the nvptx reduction handling code.
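
As a rough, standalone illustration of what populate_offload_attrs
computes, the sketch below models the same logic without any GCC tree
machinery. The names offload_attrs_model, model_offload_attrs and the
DIM_* enumerators are hypothetical stand-ins, the "allowed" array
approximates the TREE_PURPOSE check in the real code, and a size of 0
is taken to mean "defer to runtime". It is a sketch under those
assumptions, not the code in the patch.

```cpp
#include <cassert>

// Constants mirroring the #defines this patch moves to the top of nvptx.c.
constexpr int PTX_WARP_SIZE = 32;
constexpr int PTX_VECTOR_LENGTH = 32;
constexpr int PTX_CTA_SIZE = 1024;

// Hypothetical stand-ins for GOMP_DIM_GANG/WORKER/VECTOR and GOMP_DIM_MAX.
enum { DIM_GANG, DIM_WORKER, DIM_VECTOR, DIM_MAX };

// Hypothetical mirror of struct offload_attrs.
struct offload_attrs_model
{
  unsigned mask;
  int num_gangs;
  int num_workers;
  int vector_length;
  int max_workers;
};

// sizes[i] == 0 models a dimension deferred to the runtime.
// allowed[i] == false models a dimension whose TREE_PURPOSE is integer
// zero, i.e. one that must not be partitioned.
inline offload_attrs_model
model_offload_attrs (const int sizes[DIM_MAX], const bool allowed[DIM_MAX],
                     bool is_entrypoint)
{
  offload_attrs_model oa = {};

  // A dimension is partitioned unless its size is exactly 1 or it is
  // explicitly disallowed.
  for (int ix = 0; ix != DIM_MAX; ix++)
    if (sizes[ix] != 1 && allowed[ix])
      oa.mask |= 1u << ix;

  oa.num_gangs = sizes[DIM_GANG];
  oa.num_workers = sizes[DIM_WORKER];
  oa.vector_length = sizes[DIM_VECTOR];

  // Unspecified vector length: routines get one warp, offload entry
  // points get the default PTX_VECTOR_LENGTH.
  if (oa.vector_length == 0)
    oa.vector_length = is_entrypoint ? PTX_VECTOR_LENGTH : PTX_WARP_SIZE;

  // Unspecified worker count: as many vectors as fit in one CTA.
  assert (oa.vector_length > 0);
  oa.max_workers = (oa.num_workers == 0
                    ? PTX_CTA_SIZE / oa.vector_length
                    : oa.num_workers);
  return oa;
}
```

With these assumptions, a vector_length of 128 and an unspecified
worker count gives max_workers = 1024 / 128 = 8.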

This patch has been committed to openacc-gcc-7-branch.

Cesar

[-- Attachment #2: og7-vl-part1.diff --]
[-- Type: text/x-patch, Size: 22274 bytes --]

2018-03-01  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (PTX_VECTOR_LENGTH, PTX_WORKER_LENGTH,
	PTX_DEFAULT_RUNTIME_DIM): Move to the top of the file.
	(PTX_WARP_SIZE): Define.
	(PTX_CTA_SIZE): Define.
	(worker_bcast_size): Rename to oacc_bcast_size.
	(worker_bcast_align): Rename to oacc_bcast_align.
	(worker_bcast_sym): Rename to oacc_bcast_sym.
	(nvptx_option_override): Update usage of oacc_bcast_*.
	(nvptx_gen_vcast): Rename to nvptx_gen_warp_bcast.
	(struct wcast_data_t): Rename to broadcast_data_t.
	(nvptx_gen_wcast): Rename to nvptx_gen_shared_bcast.  Update to use
	oacc_bcast_* variables.
	(struct offload_attrs): New.
	(propagator_fn): Add bool argument.
	(nvptx_propagate): New bool argument.  Pass bool argument to fn.
	(vprop_gen): Rename to warp_prop_gen.  Update call to
	nvptx_gen_warp_bcast.
	(nvptx_vpropagate): Rename to nvptx_warp_propagate.  Update call to
	nvptx_propagate.
	(wprop_gen): Rename to shared_prop_gen.  Update usage of oacc_bcast_*
	variables and call to nvptx_gen_shared_bcast.
	(nvptx_wpropagate): Rename to nvptx_shared_propagate.  Update usage
	of oacc_bcast_* variables and call to nvptx_propagate.
	(nvptx_wsync): Rename to nvptx_cta_sync.
	(nvptx_single): Update usage of oacc_bcast_* vars and calls to
	nvptx_gen_warp_bcast, nvptx_gen_shared_bcast and nvptx_cta_sync.
	(nvptx_process_pars): Likewise.
	(nvptx_neuter_pars): Whitespace.
	(populate_offload_attrs): New function.
	(nvptx_reorg): Use it to extract partitioning mask.
	(write_worker_buffer): Rename to write_shared_buffer.
	(nvptx_file_end): Update calls to write_shared_buffer.
	(nvptx_expand_worker_addr): Rename to nvptx_expand_shared_addr.
	(nvptx_expand_builtin): Update call to nvptx_expand_shared_addr.
	(nvptx_simt_vf): Return PTX_WARP_SIZE instead of PTX_VECTOR_LENGTH.
	(nvptx_get_worker_red_addr): Rename to nvptx_get_shared_red_addr.
	(nvptx_goacc_reduction_setup): Update call to
	nvptx_get_shared_red_addr.
	(nvptx_goacc_reduction_fini): Likewise.
	(nvptx_goacc_reduction_teardown): Likewise.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 1d510a7bb7d..b16cf59575c 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -77,6 +77,14 @@
 
 #define WORKAROUND_PTXJIT_BUG 1
 
+/* Define dimension sizes for known hardware.  */
+#define PTX_VECTOR_LENGTH 32
+//#define PTX_VECTOR_LENGTH 128
+#define PTX_WORKER_LENGTH 32
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
+#define PTX_WARP_SIZE 32
+#define PTX_CTA_SIZE 1024
+
 /* The various PTX memory areas an object might reside in.  */
 enum nvptx_data_area
 {
@@ -118,14 +126,15 @@ struct tree_hasher : ggc_cache_ptr_hash<tree_node>
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
-/* Buffer needed to broadcast across workers.  This is used for both
-   worker-neutering and worker broadcasting.  It is shared by all
-   functions emitted.  The buffer is placed in shared memory.  It'd be
-   nice if PTX supported common blocks, because then this could be
-   shared across TUs (taking the largest size).  */
-static unsigned worker_bcast_size;
-static unsigned worker_bcast_align;
-static GTY(()) rtx worker_bcast_sym;
+/* Buffer needed to broadcast across workers and vectors.  This is
+   used for both worker-neutering and worker broadcasting, and
+   vector-neutering and broadcasting when vector_length > 32.  It is
+   shared by all functions emitted.  The buffer is placed in shared
+   memory.  It'd be nice if PTX supported common blocks, because then
+   this could be shared across TUs (taking the largest size).  */
+static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_align;
+static GTY(()) rtx oacc_bcast_sym;
 
 /* Buffer needed for worker reductions.  This has to be distinct from
    the worker broadcast array, as both may be live concurrently.  */
@@ -198,9 +207,9 @@ nvptx_option_override (void)
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
 
-  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_bcast");
-  SET_SYMBOL_DATA_AREA (worker_bcast_sym, DATA_AREA_SHARED);
-  worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast");
+  SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED);
+  oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
@@ -1737,14 +1746,14 @@ nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, nvptx_shuffle_kind kind)
    across the vectors of a single warp.  */
 
 static rtx
-nvptx_gen_vcast (rtx reg)
+nvptx_gen_warp_bcast (rtx reg)
 {
   return nvptx_gen_shuffle (reg, reg, const0_rtx, SHUFFLE_IDX);
 }
 
 /* Structure used when generating a worker-level spill or fill.  */
 
-struct wcast_data_t
+struct broadcast_data_t
 {
   rtx base;  /* Register holding base addr of buffer.  */
   rtx ptr;  /* Iteration var,  if needed.  */
@@ -1768,7 +1777,8 @@ enum propagate_mask
    how many loop iterations will be executed (0 for not a loop).  */
    
 static rtx
-nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+nvptx_gen_shared_bcast (rtx reg, propagate_mask pm, unsigned rep,
+			broadcast_data_t *data, bool vector)
 {
   rtx  res;
   machine_mode mode = GET_MODE (reg);
@@ -1782,7 +1792,7 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
 	start_sequence ();
 	if (pm & PM_read)
 	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
-	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	emit_insn (nvptx_gen_shared_bcast (tmp, pm, rep, data, vector));
 	if (pm & PM_write)
 	  emit_insn (gen_rtx_SET (reg, gen_rtx_NE (BImode, tmp, const0_rtx)));
 	res = get_insns ();
@@ -1798,10 +1808,11 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
 	  {
 	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
 
-	    if (align > worker_bcast_align)
-	      worker_bcast_align = align;
+	    if (align > oacc_bcast_align)
+	      oacc_bcast_align = align;
 	    data->offset = (data->offset + align - 1) & ~(align - 1);
 	    addr = data->base;
+	    gcc_assert (data->base != NULL);
 	    if (data->offset)
 	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
 	  }
@@ -2809,8 +2820,16 @@ nvptx_propagate_unified (rtx_insn *unified)
   gcc_assert (ok);
 }
 
-/* Loop structure of the function.  The entire function is described as
-   a NULL loop.  */
+/* Offloading function attributes.  */
+
+struct offload_attrs
+{
+  unsigned mask;
+  int num_gangs;
+  int num_workers;
+  int vector_length;
+  int max_workers;
+};
 
 struct parallel
 {
@@ -3746,11 +3765,11 @@ nvptx_find_sese (auto_vec<basic_block> &blocks, bb_pair_vec_t &regions)
    regions and (b) only propagating stack entries that are used.  The
    latter might be quite hard to determine.  */
 
-typedef rtx (*propagator_fn) (rtx, propagate_mask, unsigned, void *);
+typedef rtx (*propagator_fn) (rtx, propagate_mask, unsigned, void *, bool);
 
 static bool
 nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
-		 propagate_mask rw, propagator_fn fn, void *data)
+		 propagate_mask rw, propagator_fn fn, void *data, bool vector)
 {
   bitmap live = DF_LIVE_IN (block);
   bitmap_iterator iterator;
@@ -3785,7 +3804,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	  
 	  emit_insn (gen_rtx_SET (idx, GEN_INT (fs)));
 	  /* Allow worker function to initialize anything needed.  */
-	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  rtx init = fn (tmp, PM_loop_begin, fs, data, vector);
 	  if (init)
 	    emit_insn (init);
 	  emit_label (label);
@@ -3794,7 +3813,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	}
       if (rw & PM_read)
 	emit_insn (gen_rtx_SET (tmp, gen_rtx_MEM (DImode, ptr)));
-      emit_insn (fn (tmp, rw, fs, data));
+      emit_insn (fn (tmp, rw, fs, data, vector));
       if (rw & PM_write)
 	emit_insn (gen_rtx_SET (gen_rtx_MEM (DImode, ptr), tmp));
       if (fs)
@@ -3802,7 +3821,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	  emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, idx, const0_rtx)));
 	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
 	  emit_insn (gen_br_true_uni (pred, label));
-	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  rtx fini = fn (tmp, PM_loop_end, fs, data, vector);
 	  if (fini)
 	    emit_insn (fini);
 	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
@@ -3822,7 +3841,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 
 	if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
 	  {
-	    rtx bcast = fn (reg, rw, 0, data);
+	    rtx bcast = fn (reg, rw, 0, data, vector);
 
 	    insn = emit_insn_after (bcast, insn);
 	    empty = false;
@@ -3831,16 +3850,17 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
   return empty;
 }
 
-/* Worker for nvptx_vpropagate.  */
+/* Worker for nvptx_warp_propagate.  */
 
 static rtx
-vprop_gen (rtx reg, propagate_mask pm,
-	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+warp_prop_gen (rtx reg, propagate_mask pm,
+	       unsigned ARG_UNUSED (count), void *ARG_UNUSED (data),
+	       bool ARG_UNUSED (vector))
 {
   if (!(pm & PM_read_write))
     return 0;
   
-  return nvptx_gen_vcast (reg);
+  return nvptx_gen_warp_bcast (reg);
 }
 
 /* Propagate state that is live at start of BLOCK across the vectors
@@ -3848,25 +3868,27 @@ vprop_gen (rtx reg, propagate_mask pm,
    IS_CALL and return as for nvptx_propagate.  */
 
 static bool
-nvptx_vpropagate (bool is_call, basic_block block, rtx_insn *insn)
+nvptx_warp_propagate (bool is_call, basic_block block, rtx_insn *insn)
 {
-  return nvptx_propagate (is_call, block, insn, PM_read_write, vprop_gen, 0);
+  return nvptx_propagate (is_call, block, insn, PM_read_write,
+			  warp_prop_gen, 0, false);
 }
 
-/* Worker for nvptx_wpropagate.  */
+/* Worker for nvptx_shared_propagate.  */
 
 static rtx
-wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+shared_prop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_,
+		 bool vector)
 {
-  wcast_data_t *data = (wcast_data_t *)data_;
+  broadcast_data_t *data = (broadcast_data_t *)data_;
 
   if (pm & PM_loop_begin)
     {
       /* Starting a loop, initialize pointer.    */
       unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
 
-      if (align > worker_bcast_align)
-	worker_bcast_align = align;
+      if (align > oacc_bcast_align)
+	oacc_bcast_align = align;
       data->offset = (data->offset + align - 1) & ~(align - 1);
 
       data->ptr = gen_reg_rtx (Pmode);
@@ -3880,7 +3902,7 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
       return clobber;
     }
   else
-    return nvptx_gen_wcast (reg, pm, rep, data);
+    return nvptx_gen_shared_bcast (reg, pm, rep, data, vector);
 }
 
 /* Spill or fill live state that is live at start of BLOCK.  PRE_P
@@ -3889,34 +3911,36 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
    INSN.  IS_CALL and return as for nvptx_propagate.  */
 
 static bool
-nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
+nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
+			rtx_insn *insn, bool vector)
 {
-  wcast_data_t data;
+  broadcast_data_t data;
 
   data.base = gen_reg_rtx (Pmode);
   data.offset = 0;
   data.ptr = NULL_RTX;
 
   bool empty = nvptx_propagate (is_call, block, insn,
-				pre_p ? PM_read : PM_write, wprop_gen, &data);
+				pre_p ? PM_read : PM_write, shared_prop_gen,
+				&data, vector);
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, worker_bcast_sym);
+      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
       emit_insn_after (init, insn);
 
-      if (worker_bcast_size < data.offset)
-	worker_bcast_size = data.offset;
+      if (oacc_bcast_size < data.offset)
+	oacc_bcast_size = data.offset;
     }
   return empty;
 }
 
-/* Emit a worker-level synchronization barrier.  We use different
+/* Emit a CTA-level synchronization barrier.  We use different
    markers for before and after synchronizations.  */
 
 static rtx
-nvptx_wsync (bool after)
+nvptx_cta_sync (bool after)
 {
   return gen_nvptx_barsync (GEN_INT (after));
 }
@@ -4153,31 +4177,33 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  emit_insn_before (gen_rtx_SET (tmp, pvar), label);
 	  emit_insn_before (gen_rtx_SET (pvar, tmp), tail);
 #endif
-	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	  emit_insn_before (nvptx_gen_warp_bcast (pvar), tail);
 	}
       else
 	{
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
-	  wcast_data_t data;
+	  broadcast_data_t data;
 
-	  data.base = worker_bcast_sym;
+	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (worker_bcast_size < GET_MODE_SIZE (SImode))
-	    worker_bcast_size = GET_MODE_SIZE (SImode);
+	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
+	    oacc_bcast_size = GET_MODE_SIZE (SImode);
 
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
+						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_wsync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (false), tail);
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
+						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_wsync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (true), tail);
 	}
 
       extract_insn (tail);
@@ -4289,19 +4315,21 @@ nvptx_process_pars (parallel *par)
   
   if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
     {
-      nvptx_wpropagate (false, is_call, par->forked_block, par->forked_insn);
-      bool empty = nvptx_wpropagate (true, is_call,
-				     par->forked_block, par->fork_insn);
+      nvptx_shared_propagate (false, is_call, par->forked_block,
+			      par->forked_insn, false);
+      bool empty = nvptx_shared_propagate (true, is_call,
+					   par->forked_block, par->fork_insn,
+					   false);
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_after (nvptx_wsync (false), par->forked_insn);
-	  emit_insn_before (nvptx_wsync (true), par->joining_insn);
+	  emit_insn_after (nvptx_cta_sync (false), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (true), par->joining_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
-    nvptx_vpropagate (is_call, par->forked_block, par->forked_insn);
+    nvptx_warp_propagate (is_call, par->forked_block, par->forked_insn);
 
   /* Now do siblings.  */
   if (par->next)
@@ -4380,12 +4408,62 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
     }
 
   if (skip_mask)
-      nvptx_skip_par (skip_mask, par);
+    nvptx_skip_par (skip_mask, par);
   
   if (par->next)
     nvptx_neuter_pars (par->next, modes, outer);
 }
 
+static void
+populate_offload_attrs (offload_attrs *oa)
+{
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+  tree dims = TREE_VALUE (attr);
+  unsigned ix;
+
+  oa->mask = 0;
+
+  for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+    {
+      tree t = TREE_VALUE (dims);
+      int size = (t == NULL_TREE) ? 0 : TREE_INT_CST_LOW (t);
+      tree allowed = TREE_PURPOSE (dims);
+
+      if (size != 1 && !(allowed && integer_zerop (allowed)))
+	oa->mask |= GOMP_DIM_MASK (ix);
+
+      switch (ix)
+	{
+	case GOMP_DIM_GANG:
+	  oa->num_gangs = size;
+	  break;
+
+	case GOMP_DIM_WORKER:
+	  oa->num_workers = size;
+	  break;
+
+	case GOMP_DIM_VECTOR:
+	  oa->vector_length = size;
+	  break;
+	}
+    }
+
+  if (oa->vector_length == 0)
+    {
+      /* FIXME: Need a more graceful way to handle large vector
+	 lengths in OpenACC routines.  */
+      if (!lookup_attribute ("omp target entrypoint",
+			     DECL_ATTRIBUTES (current_function_decl)))
+	oa->vector_length = PTX_WARP_SIZE;
+      else
+	oa->vector_length = PTX_VECTOR_LENGTH;
+    }
+  if (oa->num_workers == 0)
+    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+  else
+    oa->max_workers = oa->num_workers;
+}
+
 /* PTX-specific reorganization
    - Split blocks at fork and join instructions
    - Compute live registers
@@ -4435,27 +4513,19 @@ nvptx_reorg (void)
     {
       /* If we determined this mask before RTL expansion, we could
 	 elide emission of some levels of forks and joins.  */
-      unsigned mask = 0;
-      tree dims = TREE_VALUE (attr);
-      unsigned ix;
+      offload_attrs oa;
 
-      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
-	{
-	  int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
-	  tree allowed = TREE_PURPOSE (dims);
+      populate_offload_attrs (&oa);
 
-	  if (size != 1 && !(allowed && integer_zerop (allowed)))
-	    mask |= GOMP_DIM_MASK (ix);
-	}
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
-      gcc_assert (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
-		  || (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
+      gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+		  || (oa.mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
 
       /* Discover & process partitioned regions.  */
       parallel *pars = nvptx_discover_pars (&bb_insn_map);
       nvptx_process_pars (pars);
-      nvptx_neuter_pars (pars, mask, 0);
+      nvptx_neuter_pars (pars, oa.mask, 0);
       delete pars;
     }
 
@@ -4629,10 +4699,11 @@ nvptx_file_start (void)
   fputs ("// END PREAMBLE\n", asm_out_file);
 }
 
-/* Emit a declaration for a worker-level buffer in .shared memory.  */
+/* Emit a declaration for a worker and vector-level buffer in .shared
+   memory.  */
 
 static void
-write_worker_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
+write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 {
   const char *name = XSTR (sym, 0);
 
@@ -4653,16 +4724,16 @@ nvptx_file_end (void)
     nvptx_record_fndecl (decl);
   fputs (func_decls.str().c_str(), asm_out_file);
 
-  if (worker_bcast_size)
-    write_worker_buffer (asm_out_file, worker_bcast_sym,
-			 worker_bcast_align, worker_bcast_size);
+  if (oacc_bcast_size)
+    write_shared_buffer (asm_out_file, oacc_bcast_sym,
+			 oacc_bcast_align, oacc_bcast_size);
 
   if (worker_red_size)
-    write_worker_buffer (asm_out_file, worker_red_sym,
+    write_shared_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
   if (gangprivate_shared_size)
-    write_worker_buffer (asm_out_file, gangprivate_shared_sym,
+    write_shared_buffer (asm_out_file, gangprivate_shared_sym,
 			 gangprivate_shared_align, gangprivate_shared_size);
 
   if (need_softstack_decl)
@@ -4713,7 +4784,7 @@ nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
 /* Worker reduction address expander.  */
 
 static rtx
-nvptx_expand_worker_addr (tree exp, rtx target,
+nvptx_expand_shared_addr (tree exp, rtx target,
 			  machine_mode ARG_UNUSED (mode), int ignore)
 {
   if (ignore)
@@ -4870,7 +4941,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       return nvptx_expand_shuffle (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_WORKER_ADDR:
-      return nvptx_expand_worker_addr (exp, target, mode, ignore);
+      return nvptx_expand_shared_addr (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_CMP_SWAP:
     case NVPTX_BUILTIN_CMP_SWAPLL:
@@ -4882,18 +4953,13 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
     default: gcc_unreachable ();
     }
 }
-\f
-/* Define dimension sizes for known hardware.  */
-#define PTX_VECTOR_LENGTH 32
-#define PTX_WORKER_LENGTH 32
-#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
 static int
 nvptx_simt_vf ()
 {
-  return PTX_VECTOR_LENGTH;
+  return PTX_WARP_SIZE;
 }
 
 /* Validate compute dimensions of an OpenACC offload or routine, fill
@@ -5007,7 +5073,7 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
    data at that location.  */
 
 static tree
-nvptx_get_worker_red_addr (tree type, tree offset)
+nvptx_get_shared_red_addr (tree type, tree offset)
 {
   machine_mode mode = TYPE_MODE (type);
   tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
@@ -5468,7 +5534,7 @@ nvptx_goacc_reduction_setup (gcall *call)
     {
       /* Store incoming value to worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5597,7 +5663,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 	{
 	  /* Get reduction buffer address.  */
 	  tree offset = gimple_call_arg (call, 5);
-	  tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
 	  tree ptr = make_ssa_name (TREE_TYPE (call));
 
 	  gimplify_assign (ptr, call, &seq);
@@ -5645,7 +5711,7 @@ nvptx_goacc_reduction_teardown (gcall *call)
     {
       /* Read the worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_worker_red_addr(TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);


* [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-01 21:17 [og7] vector_length extension part 1: generalize function and variable names Cesar Philippidis
@ 2018-03-02 16:55 ` Cesar Philippidis
  2018-03-21 17:16   ` Tom de Vries
                     ` (10 more replies)
  2018-03-02 17:51 ` [og7] vector_length extension part 3: reductions Cesar Philippidis
                   ` (3 subsequent siblings)
  4 siblings, 11 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-02 16:55 UTC
  To: gcc-patches; +Cc: Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

The attached patch generalizes the worker state propagation and
synchronization code to handle large vectors. When the vector_length is
larger than a CUDA warp, the nvptx BE will now use shared memory to
spill and fill vector state when transitioning from vector-single mode
to vector-partitioned mode.
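
The decision of when shared memory is needed can be sketched as a small
standalone predicate. This is modeled on the nvptx_needs_shared_bcast
helper added by this patch, but the names below and the explicit
vector_length parameter are hypothetical (the real helper reads the
length from cfun->machine). The mask bits assume GOMP_DIM_WORKER is 1
and GOMP_DIM_VECTOR is 2.

```cpp
// Assumed bit layout: GOMP_DIM_MASK (x) is 1u << x, with worker = 1 and
// vector = 2.  These names are illustrative, not GCC's.
constexpr unsigned MODEL_WARP_SIZE = 32;
constexpr unsigned MODEL_MASK_WORKER = 1u << 1;
constexpr unsigned MODEL_MASK_VECTOR = 1u << 2;

// Return true if the parallelism in MASK has to propagate state through
// shared memory: any worker partitioning, or vector partitioning whose
// length is not exactly one warp.  A warp-sized vector can use shuffles
// instead.
inline bool
model_needs_shared_bcast (unsigned mask, unsigned vector_length)
{
  bool worker = (mask & MODEL_MASK_WORKER) != 0;
  bool large_vector = (mask & MODEL_MASK_VECTOR) != 0
                      && vector_length != MODEL_WARP_SIZE;

  return worker || large_vector;
}
```

For example, a vector-only region with vector_length 32 keeps using
warp shuffles, while the same region with vector_length 128 takes the
shared-memory path.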

In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn
have been extended to take a barrier ID and a thread count. The idea
here is to assign one barrier to each logical vector. Worker-single
synchronization is controlled by barrier 0. Therefore, the vector
barrier ID is set to tid.y+1 (because there's one vector unit per
worker) in nvptx_init_oacc_workers and placed into a register stored in
cfun->machine->sync_bar. If no workers are present, then the barrier ID
falls back to 0.
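
The barrier numbering can be written out as plain arithmetic. The
sketch below is host-side C++ rather than PTX, and the function names
are hypothetical. The second helper mirrors the mad.lo.u64 emitted by
nvptx_init_oacc_workers, which places each vector's broadcast
partition at (tid.y + 1) * partition bytes past the start of
__oacc_bcast.

```cpp
#include <cstdint>

// Barrier 0 is reserved for worker-single synchronization, so the vector
// owned by worker tid.y uses barrier tid.y + 1.  Without workers there is
// only one vector and it falls back to barrier 0.
inline unsigned
model_vector_barrier_id (bool has_workers, unsigned tid_y)
{
  return has_workers ? tid_y + 1 : 0;
}

// Address of the shared-memory broadcast partition for worker tid.y,
// given the base address of the buffer and the per-vector partition
// size in bytes.
inline std::uint64_t
model_bcast_partition_base (std::uint64_t bcast_base, unsigned tid_y,
                            unsigned partition_size)
{
  return bcast_base
         + (static_cast<std::uint64_t> (tid_y) + 1) * partition_size;
}
```

Under this numbering, worker 0's vector synchronizes on barrier 1,
worker 3's on barrier 4, and so on.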

As a follow-up patch will show, the nvptx BE falls back to using
vector_length = 32 when a vector loop is nested inside a worker loop.
This is because I observed that the PTX JIT does not reliably generate
SASS code that keeps warps convergent in large vectors. While it works
for 99% of the libgomp test cases, the ones that fail usually deadlock
because the PTX JIT generates BRA instructions for the vector code
instead of SSY/SYNC. At this point, I'm not sure if the nvptx BE is
generating bad code, or if there is a bug in the PTX JIT. Hopefully,
Volta's warp sync functionality will resolve this problem regardless.

These changes are relatively straightforward and noncontroversial. I'll
commit this patch to openacc-gcc-7-branch once the other patches are
ready. There will be three more patches in this series.

Cesar

[-- Attachment #2: og7-vl-part2-sns.diff --]
[-- Type: text/x-patch, Size: 16344 bytes --]

2018-03-02  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
	(nvptx_init_axis_predicate): Initialize vector_red_partition.
	(nvptx_init_oacc_workers): New function.
	(nvptx_declare_function_name): Emit a .maxntid directive hint and
	call nvptx_init_oacc_workers.
	(MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define.
	(nvptx_mach_max_workers): New function.
	(nvptx_mach_vector_length): New function.
	(nvptx_needs_shared_bcast): New function.
	(nvptx_find_par): Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_shared_propagate): Initialize vector bcast partition and
	synchronization state.
	(nvptx_cta_sync): Change arguments to take in a lock and thread count.
	Update call to gen_nvptx_barsync.
	(nvptx_single): Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_process_pars): Likewise.
	(populate_offload_attrs): Handle the situation where the default
	runtime geometry has not been initialized yet for reductions.
	(nvptx_reorg): Set function-specific axis_dim's.
	* config/nvptx/nvptx.h (struct machine_function): Add axis_dims,
	bcast_partition, red_partition and sync_bar members.
	* config/nvptx/nvptx.md (nvptx_barsync): Adjust operands.

From 0a1dd1d85e47feeaa6f7a2e070baba69dadea444 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Fri, 2 Mar 2018 07:39:25 -0800
Subject: [PATCH] bar and sync

---
 gcc/config/nvptx/nvptx.c  | 226 ++++++++++++++++++++++++++++++++++++++++------
 gcc/config/nvptx/nvptx.h  |   8 ++
 gcc/config/nvptx/nvptx.md |  10 +-
 3 files changed, 214 insertions(+), 30 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9d77176c638..507c8671704 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -133,6 +133,7 @@ static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
    memory.  It'd be nice if PTX supported common blocks, because then
    this could be shared across TUs (taking the largest size).  */
 static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_partition;
 static unsigned oacc_bcast_align;
 static GTY(()) rtx oacc_bcast_sym;
 
@@ -1104,8 +1105,53 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
 {
   fprintf (file, "\t{\n");
   fprintf (file, "\t\t.reg.u32\t%%%s;\n", name);
-  fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_red;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%%s, %%tid.%s;\n", name, name);
   fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tid.y;\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_red, __vector_red;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_red; "
+	       "// vector reduction buffer\n",
+	       REGNO (cfun->machine->red_partition),
+	       vector_red_partition);
+    }
+  fprintf (file, "\t}\n");
+}
+
+/* Emit code to initialize OpenACC worker broadcast and synchronization
+   registers.  */
+
+static void
+nvptx_init_oacc_workers (FILE *file)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%tidy;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n");
+      fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; "
+	       "// vector broadcast offset\n",
+	       REGNO (cfun->machine->bcast_partition),
+	       oacc_bcast_partition);
+    }
+  if (cfun->machine->sync_bar)
+    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
+	     "// vector synchronization barrier\n",
+	     REGNO (cfun->machine->sync_bar));
   fprintf (file, "\t}\n");
 }
 
@@ -1231,6 +1277,13 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
      stream, in order to share the prototype writing code.  */
   std::stringstream s;
   write_fn_proto (s, true, name, decl);
+
+  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */
+  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
+      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
+      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
+	<< cfun->machine->axis_dim[1] << ", 1\n";
+
   s << "{\n";
 
   bool return_in_mem = write_return_type (s, false, result_type);
@@ -1341,6 +1394,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->unisimt_predicate
       || (cfun->machine->has_simtreg && !crtl->is_leaf))
     nvptx_init_unisimt_predicate (file);
+  if (cfun->machine->bcast_partition || cfun->machine->sync_bar)
+    nvptx_init_oacc_workers (file);
 }
 
 /* Output code for switching uniform-simt state.  ENTERING indicates whether
@@ -2849,6 +2904,26 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
+static int
+nvptx_mach_max_workers ()
+{
+  return cfun->machine->axis_dim[MACH_MAX_WORKERS];
+}
+
+static int
+nvptx_mach_vector_length ()
+{
+  return cfun->machine->axis_dim[MACH_VECTOR_LENGTH];
+}
+
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  */
+
 struct parallel
 {
   /* Parent parallel.  */
@@ -2996,6 +3071,19 @@ nvptx_split_blocks (bb_insn_map_t *map)
     }
 }
 
+/* Return true if MASK contains parallelism that requires shared
+   memory to broadcast.  */
+
+static bool
+nvptx_needs_shared_bcast (unsigned mask)
+{
+  bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER);
+  bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    && nvptx_mach_vector_length () != PTX_WARP_SIZE;
+
+  return worker || large_vector;
+}
+
 /* BLOCK is a basic block containing a head or tail instruction.
    Locate the associated prehead or pretail instruction, which must be
    in the single predecessor block.  */
@@ -3071,7 +3159,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    par = new parallel (par, mask);
 	    par->forked_block = block;
 	    par->forked_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->fork_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
 	  }
@@ -3086,7 +3174,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    gcc_assert (par->mask == mask);
 	    par->join_block = block;
 	    par->join_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->joining_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
 	    par = par->parent;
@@ -3944,23 +4032,45 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
+      rtx bcast_sym = oacc_bcast_sym;
+
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
+      if (vector && nvptx_mach_max_workers () > 1)
+	{
+	  if (!cfun->machine->bcast_partition)
+	    {
+	      /* It would be nice to place this register in
+		 DATA_AREA_SHARED.  */
+	      cfun->machine->bcast_partition = gen_reg_rtx (DImode);
+	    }
+	  if (!cfun->machine->sync_bar)
+	    cfun->machine->sync_bar = gen_reg_rtx (SImode);
+
+	  bcast_sym = cfun->machine->bcast_partition;
+	}
+
+      rtx init = gen_rtx_SET (data.base, bcast_sym);
       emit_insn_after (init, insn);
 
-      if (oacc_bcast_size < data.offset)
-	oacc_bcast_size = data.offset;
+      if (oacc_bcast_partition < data.offset)
+	{
+	  int psize = data.offset;
+	  psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
+	  oacc_bcast_partition = psize;
+	  oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	}
     }
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  THREADS is the
+   number of threads controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4115,13 +4225,23 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    pred = gen_reg_rtx (BImode);
 	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
 	  }
-	
+
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
 	else
 	  br = gen_br_true_uni (pred, label);
-	emit_insn_before (br, head);
+
+	if (recog_memoized (head) == CODE_FOR_nvptx_forked
+	    && recog_memoized (NEXT_INSN (head)) == CODE_FOR_nvptx_barsync)
+	  {
+	    head = NEXT_INSN (head);
+	    emit_insn_after (br, head);
+	  }
+	else if (recog_memoized (head) == CODE_FOR_nvptx_barsync)
+	  emit_insn_after (br, head);
+	else
+	  emit_insn_before (br, head);
 
 	LABEL_NUSES (label)++;
 	if (tail_branch)
@@ -4135,7 +4255,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
     {
       rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
 
-      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask
+	  && nvptx_mach_vector_length () == PTX_WARP_SIZE)
 	{
 	  /* Vector mode only, do a shuffle.  */
 #if WORKAROUND_PTXJIT_BUG
@@ -4202,26 +4323,55 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  unsigned size = GET_MODE_SIZE (SImode);
+	  bool vector = true;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
+
+	  if (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask)
+	    vector = false;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
-	    oacc_bcast_size = GET_MODE_SIZE (SImode);
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->bcast_partition)
+	    data.base = cfun->machine->bcast_partition;
+
+	  gcc_assert (data.base != NULL);
+
+	  if (oacc_bcast_partition < size)
+	    {
+	      int psize = size;
+	      psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
+	      oacc_bcast_partition = psize;
+	      oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	    }
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
-						    false),
+						    vector),
 			    before);
+
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->sync_bar)
+	    {
+	      barrier = cfun->machine->sync_bar;
+	      threads = nvptx_mach_vector_length ();
+	    }
+
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
-						    false), tail);
+						    vector),
+			    tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4330,20 +4480,32 @@ nvptx_process_pars (parallel *par)
     }
 
   bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0;
-  
-  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+  bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER));
+  bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+		      && nvptx_mach_vector_length () > PTX_WARP_SIZE);
+
+  if (worker || large_vector)
     {
       nvptx_shared_propagate (false, is_call, par->forked_block,
-			      par->forked_insn, false);
+			      par->forked_insn, !worker);
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
-					   false);
+					   !worker);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
+
+      if (!worker && cfun->machine->sync_bar)
+	{
+	  barrier = cfun->machine->sync_bar;
+	  threads = nvptx_mach_vector_length ();
+	}
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_after (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->joining_insn);
+	  emit_insn_after (nvptx_cta_sync (barrier, threads), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads),
+			    par->joining_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
@@ -4469,15 +4631,20 @@ populate_offload_attrs (offload_attrs *oa)
   if (oa->vector_length == 0)
     {
       /* FIXME: Need a more graceful way to handle large vector
-	 lengths in OpenACC routines.  */
+	 lengths in OpenACC routines and also -fopenacc-dims.  */
       if (!lookup_attribute ("omp target entrypoint",
 			     DECL_ATTRIBUTES (current_function_decl)))
 	oa->vector_length = PTX_WARP_SIZE;
-      else
+      else if (PTX_VECTOR_LENGTH != PTX_WARP_SIZE)
 	oa->vector_length = PTX_VECTOR_LENGTH;
     }
   if (oa->num_workers == 0)
-    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    {
+      if (oa->vector_length == 0)
+	oa->max_workers = PTX_WORKER_LENGTH;
+      else
+	oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    }
   else
     oa->max_workers = oa->num_workers;
 }
@@ -4535,6 +4702,9 @@ nvptx_reorg (void)
 
       populate_offload_attrs (&oa);
 
+      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
+
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
       gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 8a14507c88a..99943025a50 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -226,6 +226,14 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.   */
+  rtx bcast_partition; /* Register containing the size of each
+			  vector's partition of share-memory used to
+			  broadcast state.  */
+  rtx red_partition; /* Similar to bcast_partition, except for vector
+			reductions.  */
+  rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 28ae263c867..ac2731233dd 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1418,10 +1418,16 @@
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (!REG_P (operands[0]))
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_nounroll"
-- 
2.14.3
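
For readers following the broadcast buffer arithmetic in the patch
above, the sketch below restates it in plain C. It is an illustration
only, not part of the patch. The PTX emitted by the first hunk computes
(%tid.y + 1) * oacc_bcast_partition plus the address of __oacc_bcast,
and nvptx_shared_propagate sizes the buffer as the aligned partition
size times (max_workers + 1). All function names here are hypothetical.

```c
/* Round N up to a multiple of ALIGN; ALIGN must be a power of two.  */
static unsigned
align_up (unsigned n, unsigned align)
{
  return (n + align - 1) & ~(align - 1);
}

/* Byte offset, relative to the start of the broadcast buffer, of the
   partition used by the vector whose worker index is TID_Y.  Partition 0
   is skipped, matching the "+ 1" in the emitted PTX.  */
static unsigned
bcast_partition_offset (unsigned tid_y, unsigned partition_size)
{
  return (tid_y + 1) * partition_size;
}

/* Total buffer size: one partition per worker plus one extra.  */
static unsigned
bcast_buffer_size (unsigned data_size, unsigned align, unsigned max_workers)
{
  return align_up (data_size, align) * (max_workers + 1);
}
```

For example, 20 bytes of live state with 8-byte alignment and 8 workers
gives 24-byte partitions and a 216-byte buffer.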



* [og7] vector_length extension part 3: reductions
  2018-03-01 21:17 [og7] vector_length extension part 1: generalize function and variable names Cesar Philippidis
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
@ 2018-03-02 17:51 ` Cesar Philippidis
  2018-04-05 14:07   ` Tom de Vries
  2018-04-05 16:26   ` Tom de Vries
  2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-02 17:51 UTC (permalink / raw)
  To: gcc-patches; +Cc: Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 2551 bytes --]

This patch teaches the nvptx BE how to process vector reductions with
large vector lengths. The original vector reduction finalizer won't work
because it uses warp shuffle operations. Now that vectors may contain
multiple warps, they need to store the partial reductions into
shared-memory like workers. Once the reduction variable is placed in
shared-memory, it will use the same atomic finalizer to update it as the
workers.

Much like the shared-memory spill-and-fill vector state propagation
extension, the nvptx BE needs to reserve enough shared-memory for each
worker that may encounter a vector reduction. That's why the reduction
functions have been augmented with an offload_attrs argument. The
offload_attrs contains a max_workers field. Unlike vector_length, which
is fixed as a compile-time constant, num_workers can be altered
dynamically at runtime. Given that the size of a CUDA block is fixed,
max_workers is set to max_block_size / vector_length. This will be
discussed further in the next patch.
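
To make the max_workers rule concrete, here is a minimal sketch in C.
It is not taken from the patch: CTA_SIZE stands in for PTX_CTA_SIZE and
its value of 1024 is an assumption, and max_workers_for is a
hypothetical helper that ignores the vector_length == 0 case handled by
populate_offload_attrs.

```c
/* Assumed maximum CUDA block size; stand-in for PTX_CTA_SIZE.  */
#define CTA_SIZE 1024

/* Hypothetical helper: an explicit NUM_WORKERS wins; otherwise the
   block is divided evenly among vectors of VECTOR_LENGTH threads.  */
static int
max_workers_for (int num_workers, int vector_length)
{
  if (num_workers != 0)
    return num_workers;
  return CTA_SIZE / vector_length;
}
```

Under these assumptions, a vector_length of 128 with no explicit
num_workers leaves room for 8 workers, and a vector_length of 32 leaves
room for 32.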

Effectively, the nvptx BE will now maintain a shared-memory reduction
buffer, named vector_red_sym, that contains max_workers logical
reduction partitions, where each partition contains enough shared-memory
for all of the reductions used by a single vector. By design, OpenACC
reductions are expanded relatively early during oaccdevlow. Because
accessing the reduction partition is a common operation, the partition
offset is placed in a register stored in cfun->machine->red_partition and
initialized in nvptx_init_axis_predicate. Due to how late that register
becomes available, nvptx_expand_shared_addr emits a
gen_nvptx_red_partition instruction to acquire the shared-memory address
for the reduction variable indirectly.
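
A rough sketch of the partition arithmetic, in plain C, may help here.
It is an illustration under assumptions, not the patch's code: the
message does not say how the partition register is computed, so
slot_position (indexing partitions by worker number) is a guess, and
all three function names are hypothetical. Note also that the patch
rounds the product of the unaligned size and max_workers, whereas this
sketch rounds each partition first.

```c
/* Round N up to a multiple of ALIGN; ALIGN must be a power of two.  */
static unsigned
round_up (unsigned n, unsigned align)
{
  return (n + align - 1) & ~(align - 1);
}

/* Size of one partition: large enough to hold a reduction variable
   of SIZE bytes placed at byte OFFSET.  */
static unsigned
partition_size (unsigned offset, unsigned size, unsigned align)
{
  return round_up (offset + size, align);
}

/* Assumed layout: byte position, within the whole buffer, of the
   variable at OFFSET inside the partition owned by WORKER.  */
static unsigned
slot_position (unsigned worker, unsigned part_size, unsigned offset)
{
  return worker * part_size + offset;
}
```

With an 8-byte aligned variable of 4 bytes at offset 8, each partition
is 16 bytes, and worker 3 would find that variable at byte 56.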

You may notice a hack in nvptx_declare_function_name. I observed that
sometimes GCC will mark red_partition as dead and not emit PTX code to
declare it. That's why nvptx_declare_function_name manually inserts it
into regno_reg_rtx prior to declaring all of the PTX registers. I think
there might be something wrong with the nvptx_red_partition instruction.
Tom, can you take a look at it?

Ultimately, I suspect that large vectors would greatly benefit from
a new parallel tree reduction finalizer. Whereas the atomic finalizer
may have been suitable for a maximum of 32 workers, vector_length can be
up to 1024 threads, and a sequential finalizer will be slow. However,
that's a project for another day.
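
To illustrate why a tree-shaped finalizer scales better, the following
host-side C sketch sums partial results pairwise. It is only a model of
the idea, not a proposed implementation: tree_reduce_sum is a
hypothetical name, the inner loop stands in for threads that would run
concurrently on the device, and no synchronization is shown.

```c
/* Sum N partial results pairwise in place; N must be a power of two.
   Each pass of the outer loop models one parallel step.  Returns the
   number of steps taken; the total ends up in VALS[0].  */
static int
tree_reduce_sum (int *vals, int n)
{
  int steps = 0;

  for (int stride = n / 2; stride > 0; stride /= 2)
    {
      /* On a GPU these additions would be done by STRIDE threads.  */
      for (int i = 0; i < stride; i++)
	vals[i] += vals[i + stride];
      steps++;
    }
  return steps;
}
```

The step count grows with log2 of the vector length, so 1024 partial
results need 10 steps, compared with 1024 serialized updates for a
sequential finalizer.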

I'll commit this patch to openacc-gcc-7-branch after Tom reviews the new
nvptx_red_partition insn.

Cesar

[-- Attachment #2: og7-vl-part3-reductions.diff --]
[-- Type: text/x-patch, Size: 14258 bytes --]

2018-03-02  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx-protos.h (nvptx_output_red_partition): Declare.
	* config/nvptx/nvptx.c (vector_red_size, vector_red_align,
	vector_red_partition, vector_red_sym): New global variables.
	(nvptx_option_override): Initialize vector_red_sym.
	(nvptx_declare_function_name): Restore red_partition register.
	(nvptx_file_end): Emit code to declare the vector reduction variables.
	(nvptx_output_red_partition): New function.
	(nvptx_expand_shared_addr): Add vector argument. Use it to handle
	large vector reductions.
	(enum nvptx_builtins): Add NVPTX_BUILTIN_VECTOR_ADDR.
	(nvptx_init_builtins): Add VECTOR_ADDR.
	(nvptx_expand_builtin): Update call to nvptx_expand_shared_addr.
	Handle nvptx_expand_shared_addr.
	(nvptx_get_shared_red_addr): Add vector argument and handle large
	vectors.
	(nvptx_goacc_reduction_setup): Add offload_attrs argument and handle
	large vectors.
	(nvptx_goacc_reduction_init): Likewise.
	(nvptx_goacc_reduction_fini): Likewise.
	(nvptx_goacc_reduction_teardown): Likewise.
	(nvptx_goacc_reduction): Update calls to nvptx_goacc_reduction_{setup,
	init,fini,teardown}.
	* config/nvptx/nvptx.md (UNSPECV_RED_PART): New unspecv.
	(nvptx_red_partition): New insn.

From 3834101d5144666f30d8798e983e276bd2c66636 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Fri, 2 Mar 2018 07:36:11 -0800
Subject: [PATCH] reductions

---
 gcc/config/nvptx/nvptx-protos.h |   1 +
 gcc/config/nvptx/nvptx.c        | 146 +++++++++++++++++++++++++++++++---------
 gcc/config/nvptx/nvptx.md       |  12 ++++
 3 files changed, 128 insertions(+), 31 deletions(-)

diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index 16b316f12b8..326c38c5dc7 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -55,5 +55,6 @@ extern const char *nvptx_output_return (void);
 extern const char *nvptx_output_set_softstack (unsigned);
 extern const char *nvptx_output_simt_enter (rtx, rtx, rtx);
 extern const char *nvptx_output_simt_exit (rtx);
+extern const char *nvptx_output_red_partition (rtx, rtx);
 #endif
 #endif
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 4a48d44f44c..9d77176c638 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -142,6 +142,14 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Buffer needed for vector reductions, when vector_length >
+   PTX_WARP_SIZE.  This has to be distinct from the worker broadcast
+   array, as both may be live concurrently.  */
+static unsigned vector_red_size;
+static unsigned vector_red_align;
+static unsigned vector_red_partition;
+static GTY(()) rtx vector_red_sym;
+
 /* Shared memory block for gang-private variables.  */
 static unsigned gangprivate_shared_size;
 static unsigned gangprivate_shared_align;
@@ -215,6 +223,10 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  vector_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__vector_red");
+  SET_SYMBOL_DATA_AREA (vector_red_sym, DATA_AREA_SHARED);
+  vector_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
   SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
   gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
@@ -1296,6 +1308,12 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 	fprintf (file, "\t.local.align 8 .b8 %%simtstack_ar["
 		HOST_WIDE_INT_PRINT_DEC "];\n", simtsz);
     }
+
+  /* Restore the vector reduction partition register, if necessary.  */
+  if (cfun->machine->red_partition)
+    regno_reg_rtx[REGNO (cfun->machine->red_partition)]
+      = cfun->machine->red_partition;
+
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < maxregs; i++)
@@ -4732,6 +4750,10 @@ nvptx_file_end (void)
     write_shared_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (vector_red_size)
+    write_shared_buffer (asm_out_file, vector_red_sym,
+			 vector_red_align, vector_red_size);
+
   if (gangprivate_shared_size)
     write_shared_buffer (asm_out_file, gangprivate_shared_sym,
 			 gangprivate_shared_align, gangprivate_shared_size);
@@ -4781,33 +4803,78 @@ nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
   return target;
 }
 
-/* Worker reduction address expander.  */
+const char *
+nvptx_output_red_partition (rtx dst, rtx offset)
+{
+  const char *zero_offset = "\t\tmov.u64\t%%r%d, %%r%d; // vred buffer\n";
+  const char *with_offset = "\t\tadd.u64\t%%r%d, %%r%d, %d; // vred buffer\n";
+
+  if (offset == const0_rtx)
+    fprintf (asm_out_file, zero_offset, REGNO (dst),
+	     REGNO (cfun->machine->red_partition));
+  else
+    fprintf (asm_out_file, with_offset, REGNO (dst),
+	     REGNO (cfun->machine->red_partition), UINTVAL (offset));
+
+  return "";
+}
+
+/* Shared-memory reduction address expander.  */
 
 static rtx
 nvptx_expand_shared_addr (tree exp, rtx target,
-			  machine_mode ARG_UNUSED (mode), int ignore)
+			  machine_mode ARG_UNUSED (mode), int ignore,
+			  int vector)
 {
   if (ignore)
     return target;
 
   unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
-  if (align > worker_red_align)
-    worker_red_align = align;
-
   unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
   unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
-  if (size + offset > worker_red_size)
-    worker_red_size = size + offset;
-
   rtx addr = worker_red_sym;
-  if (offset)
+
+  if (vector)
     {
-      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
-      addr = gen_rtx_CONST (Pmode, addr);
+      offload_attrs oa;
+      unsigned new_size = size + offset;
+
+      populate_offload_attrs (&oa);
+
+      new_size = (new_size * oa.max_workers + align - 1) & ~(align - 1);
+
+      if (align > vector_red_align)
+	vector_red_align = align;
+
+      if (cfun->machine->red_partition == NULL)
+	cfun->machine->red_partition = gen_reg_rtx (Pmode);
+
+      if (new_size > vector_red_size)
+	{
+	  int partition_size = (size + offset + align - 1) & ~(align -1);
+	  vector_red_size = new_size;
+	  vector_red_partition = partition_size;
+	}
+
+      addr = gen_reg_rtx (Pmode);
+      emit_insn (gen_nvptx_red_partition (addr, GEN_INT (offset)));
     }
+  else
+    {
+      if (align > worker_red_align)
+	worker_red_align = align;
 
-  emit_move_insn (target, addr);
+      if (size + offset > worker_red_size)
+	worker_red_size = size + offset;
 
+      if (offset)
+	{
+	  addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
+	  addr = gen_rtx_CONST (Pmode, addr);
+	}
+   }
+
+  emit_move_insn (target, addr);
   return target;
 }
 
@@ -4874,6 +4941,7 @@ enum nvptx_builtins
   NVPTX_BUILTIN_SHUFFLE,
   NVPTX_BUILTIN_SHUFFLELL,
   NVPTX_BUILTIN_WORKER_ADDR,
+  NVPTX_BUILTIN_VECTOR_ADDR,
   NVPTX_BUILTIN_CMP_SWAP,
   NVPTX_BUILTIN_CMP_SWAPLL,
   NVPTX_BUILTIN_COND_UNI,
@@ -4912,6 +4980,8 @@ nvptx_init_builtins (void)
   DEF (SHUFFLELL, "shufflell", (LLUINT, LLUINT, UINT, UINT, NULL_TREE));
   DEF (WORKER_ADDR, "worker_addr",
        (PTRVOID, ST, UINT, UINT, NULL_TREE));
+  DEF (VECTOR_ADDR, "vector_addr",
+       (PTRVOID, ST, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAP, "cmp_swap", (UINT, PTRVOID, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAPLL, "cmp_swapll", (LLUINT, PTRVOID, LLUINT, LLUINT, NULL_TREE));
   DEF (COND_UNI, "cond_uni", (integer_type_node, integer_type_node, NULL_TREE));
@@ -4941,7 +5011,10 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       return nvptx_expand_shuffle (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_WORKER_ADDR:
-      return nvptx_expand_shared_addr (exp, target, mode, ignore);
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, false);
+
+    case NVPTX_BUILTIN_VECTOR_ADDR:
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, true);
 
     case NVPTX_BUILTIN_CMP_SWAP:
     case NVPTX_BUILTIN_CMP_SWAPLL:
@@ -5197,10 +5270,13 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
    data at that location.  */
 
 static tree
-nvptx_get_shared_red_addr (tree type, tree offset)
+nvptx_get_shared_red_addr (tree type, tree offset, bool vector)
 {
+  enum nvptx_builtins addr_dim = NVPTX_BUILTIN_WORKER_ADDR;
+  if (vector)
+    addr_dim = NVPTX_BUILTIN_VECTOR_ADDR;
   machine_mode mode = TYPE_MODE (type);
-  tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
+  tree fndecl = nvptx_builtin_decl (addr_dim, true);
   tree size = build_int_cst (unsigned_type_node, GET_MODE_SIZE (mode));
   tree align = build_int_cst (unsigned_type_node,
 			      GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT);
@@ -5631,7 +5707,7 @@ nvptx_adjust_reduction_type (tree var, tree type, gimple_seq *seq)
 /* NVPTX implementation of GOACC_REDUCTION_SETUP.  */
 
 static void
-nvptx_goacc_reduction_setup (gcall *call)
+nvptx_goacc_reduction_setup (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5654,11 +5730,13 @@ nvptx_goacc_reduction_setup (gcall *call)
 	}
     }
   
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Store incoming value to worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+					     level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5677,7 +5755,7 @@ nvptx_goacc_reduction_setup (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_INIT. */
 
 static void
-nvptx_goacc_reduction_init (gcall *call)
+nvptx_goacc_reduction_init (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5691,7 +5769,7 @@ nvptx_goacc_reduction_init (gcall *call)
   
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     {
       /* Initialize vector-non-zeroes to INIT_VAL (OP).  */
       tree tid = make_ssa_name (integer_type_node);
@@ -5763,7 +5841,7 @@ nvptx_goacc_reduction_init (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_FINI.  */
 
 static void
-nvptx_goacc_reduction_fini (gcall *call)
+nvptx_goacc_reduction_fini (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5777,17 +5855,18 @@ nvptx_goacc_reduction_fini (gcall *call)
 
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     r = nvptx_vector_reduction (gimple_location (call), &gsi, var, op);
   else
     {
       tree accum = NULL_TREE;
 
-      if (level == GOMP_DIM_WORKER)
+      if (level == GOMP_DIM_WORKER || level == GOMP_DIM_VECTOR)
 	{
 	  /* Get reduction buffer address.  */
 	  tree offset = gimple_call_arg (call, 5);
-	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+						 level == GOMP_DIM_VECTOR);
 	  tree ptr = make_ssa_name (TREE_TYPE (call));
 
 	  gimplify_assign (ptr, call, &seq);
@@ -5822,7 +5901,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_TEARDOWN.  */
 
 static void
-nvptx_goacc_reduction_teardown (gcall *call)
+nvptx_goacc_reduction_teardown (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5831,11 +5910,13 @@ nvptx_goacc_reduction_teardown (gcall *call)
   gimple_seq seq = NULL;
   
   push_gimplify_context (true);
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Read the worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset,
+					    level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5870,23 +5951,26 @@ static void
 nvptx_goacc_reduction (gcall *call)
 {
   unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
 
   switch (code)
     {
     case IFN_GOACC_REDUCTION_SETUP:
-      nvptx_goacc_reduction_setup (call);
+      nvptx_goacc_reduction_setup (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_INIT:
-      nvptx_goacc_reduction_init (call);
+      nvptx_goacc_reduction_init (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_FINI:
-      nvptx_goacc_reduction_fini (call);
+      nvptx_goacc_reduction_fini (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_TEARDOWN:
-      nvptx_goacc_reduction_teardown (call);
+      nvptx_goacc_reduction_teardown (call, &oa);
       break;
 
     default:
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index ac7b7cc8440..28ae263c867 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -66,6 +66,8 @@
 
    UNSPECV_SIMT_ENTER
    UNSPECV_SIMT_EXIT
+
+   UNSPECV_RED_PART
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -1427,3 +1429,13 @@
   ""
   "\\t.pragma \\\"nounroll\\\";"
   [(set_attr "predicable" "false")])
+
+(define_insn "nvptx_red_partition"
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=R")
+	(unspec_volatile [(match_operand:DI 1 "const_int_operand")]
+	 UNSPECV_RED_PART))]
+  ""
+  {
+    return nvptx_output_red_partition (operands[0], operands[1]);
+  }
+  [(set_attr "predicable" "false")])
-- 
2.14.3



* [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-01 21:17 [og7] vector_length extension part 1: generalize function and variable names Cesar Philippidis
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
  2018-03-02 17:51 ` [og7] vector_length extension part 3: reductions Cesar Philippidis
@ 2018-03-02 19:18 ` Cesar Philippidis
  2018-03-21 15:55   ` Tom de Vries
                     ` (4 more replies)
  2018-03-02 20:47 ` [og7] vector_length extension part 5: libgomp and tests Cesar Philippidis
  2018-03-09 15:29 ` [og7] vector_length extension part 1: generalize function and variable names Thomas Schwinge
  4 siblings, 5 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-02 19:18 UTC (permalink / raw)
  To: gcc-patches; +Cc: Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 3114 bytes --]

The attached patch adjusts the existing goacc validate_dims target hook
and introduces a new goacc adjust_parallelism target hook. Now that
vector length is no longer hard-coded to 32, there are four different
ways to set it:

  1) compiler default
  2) explicitly via the vector_length clause
  3) compile time using -fopenacc-dim or the GOMP_OPENACC_DIM
     environment variable
  4) fallback to vector_length = 32 due to insufficient parallelism

The compiler default is activated in the absence of 2) and 3). It is
controlled by the macro PTX_VECTOR_LENGTH in nvptx.c. While working on
this patch set, I had it set to 128 to get more test coverage. But in
order to maintain backwards compatibility with acc routines (which is
still a work in progress), I've kept the default vector length at 32.
Besides, large vector reductions are expected to run slower until the
parallel reduction finalizer is ready.
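
The choice among the four sources listed above can be sketched in C as
follows. This is an illustration only: choose_vector_length and its
parameters are hypothetical, DEFAULT_VECTOR_LENGTH stands in for
PTX_VECTOR_LENGTH, and giving the vector_length clause precedence over
-fopenacc-dim is an assumption based on that option supplying defaults.

```c
#define WARP_SIZE 32			/* CUDA warp size.  */
#define DEFAULT_VECTOR_LENGTH 32	/* Stand-in for PTX_VECTOR_LENGTH.  */

/* CLAUSE_VL is the vector_length clause value, DEFAULT_DIMS_VL the
   value from -fopenacc-dim; zero means "not specified".  NEEDS_WARP is
   nonzero when the back end must fall back to warp-sized vectors.  */
static int
choose_vector_length (int clause_vl, int default_dims_vl, int needs_warp)
{
  if (needs_warp)
    return WARP_SIZE;		/* 4) insufficient parallelism.  */
  if (clause_vl > 0)
    return clause_vl;		/* 2) explicit clause.  */
  if (default_dims_vl > 0)
    return default_dims_vl;	/* 3) -fopenacc-dim.  */
  return DEFAULT_VECTOR_LENGTH;	/* 1) compiler default.  */
}
```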

The new default_dims argument to validate_dims is necessary to
accommodate option 3) from above. validate_dims is called after
oaccdevlow has assigned parallelism to each acc loop.

Prior to this patch, oaccdevlow automatically assigned parallelism to
acc loops using oacc_loop_fixed_partitions and
oacc_loop_auto_partitions. Both of those functions were
processor-agnostic. In the case of nvptx, due to the current limitations
in this patch set, the nvptx BE needs to fall back to using a
vector_length of 32 whenever a vector loop is nested inside a worker
loop. By supplying the parallelism mask for both the current loop and
the outer loops, the goacc adjust_parallelism hook allows the back ends
to fine tune any parallelism as necessary.

Inside the nvptx BE, nvptx_goacc_adjust_parallelism uses a new "nvptx vl
warp" function attribute to denote that the offloaded function must
fall back to using a vector length of 32. Later,
nvptx_goacc_validate_dims uses the attribute to adjust vector_length
accordingly.
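
The nesting test described above can be sketched in C like this. It is
a simplified illustration, not the hook itself: the DIM_* constants are
local stand-ins for GOMP_DIM_MASK (GOMP_DIM_*), the function name is
hypothetical, and only the "vector loop inside a worker loop" case
mentioned in this message is covered.

```c
/* Local stand-ins for the gang/worker/vector parallelism masks.  */
enum
{
  DIM_GANG = 1 << 0,
  DIM_WORKER = 1 << 1,
  DIM_VECTOR = 1 << 2
};

/* Return nonzero if a loop partitioned as THIS_MASK, nested inside
   loops partitioned as OUTER_MASK, would force the offloaded function
   back to a vector length of 32.  */
static int
needs_warp_sized_vectors (unsigned outer_mask, unsigned this_mask)
{
  return (this_mask & DIM_VECTOR) != 0 && (outer_mask & DIM_WORKER) != 0;
}
```

In the patch, a positive result corresponds to tagging the function with
the "nvptx vl warp" attribute, which nvptx_goacc_validate_dims consults
later.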

Going forward, in addition to adding a new parallel reduction finalizer,
the nvptx BE would benefit from merging synchronization and reduction
code for combined worker-vector loops, e.g.

  #pragma acc loop worker vector

At present, GCC partitions acc loops with internal function markers for
each level of parallelism associated with the loop. If a loop has both
worker and vector level parallelism, it will have a dummy outer worker
loop, and dummy inner vector loop. On CUDA hardware, there's no strong
difference between workers and vectors as CUDA blocks are a loose
collection of warps. Therefore, it would make more sense to merge the
two loops together into a special WV loop. That would at least require
some changes in the BE in addition to oacc_loop_{auto,fixed}_partitions.
There were some problems in the past where CUDA hardware would lock up
because of the synchronization requirements for those two levels of
parallelism. Merging them ought to simplify the synchronization code and
enable the PTX JIT to generate better code.
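
For what it's worth, the idea behind a merged WV loop can be modelled
as treating workers and vector lanes as one flat thread index. The
sketch below is only an illustration of that index arithmetic; the
helper names are hypothetical and the cyclic distribution of
iterations is an assumption, not something this patch set implements.

```c
#include <assert.h>

/* Hypothetical helper: position of a thread in a merged worker-vector
   (WV) index space, given its worker number and vector lane.  */
static int
wv_thread_id (int worker, int lane, int vector_length)
{
  return worker * vector_length + lane;
}

/* Total number of threads in the merged index space.  */
static int
wv_size (int num_workers, int vector_length)
{
  return num_workers * vector_length;
}

/* Number of iterations thread TID executes when a loop of N iterations
   is distributed cyclically over SIZE threads, i.e. thread TID runs
   iterations TID, TID + SIZE, TID + 2 * SIZE, ...  */
static int
wv_iterations (int n, int tid, int size)
{
  assert (tid >= 0 && tid < size);
  return tid < n ? (n - tid + size - 1) / size : 0;
}
```

With a single index space there is one partitioning decision and one
set of barriers per loop, instead of separate ones for the worker and
vector levels.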

Overall, the changes in this patch are mild. I'll apply it to
openacc-gcc-7-branch after Tom approves the reduction patch.

Cesar


[-- Attachment #2: og7-vl-part4-hooks.diff --]
[-- Type: text/x-patch, Size: 16663 bytes --]

2018-03-02  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (NVPTX_GOACC_VL_WARP): Define.
	(nvptx_goacc_needs_vl_warp): New function.
	(nvptx_goacc_validate_dims): Add new default_dims argument and take
	larger vector lengths into account.
	(nvptx_adjust_parallelism): New function.
	(TARGET_GOACC_ADJUST_PARALLELISM): Define.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Add placeholder for TARGET_GOACC_ADJUST_PARALLELISM.
	* omp-offload.c (oacc_parse_default_dims): Update usage of the
	targetm.goacc.validate_dims hook.
	(oacc_validate_dims): Add default_dims argument.
	(oacc_loop_fixed_partitions): Use the adjust_parallelism hook to
	modify this_mask.
	(oacc_loop_auto_partitions): Use the adjust_parallelism hook to
	modify this_mask and loop->mask.
	(execute_oacc_device_lower): Update call to oacc_validate_dims.
	(default_goacc_adjust_parallelism): New function.
	* target.def (validate_dims): Add new default_dims argument.
	(adjust_parallelism): New hook.
	* targhooks.h (default_goacc_validate_dims): Add new argument.
	(default_goacc_adjust_parallelism): Declare.

From 1ee16b267dfbb0a148e8ec3b83ca463c21cbac1d Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Fri, 2 Mar 2018 10:08:23 -0800
Subject: [PATCH] New target hooks

---
 gcc/config/nvptx/nvptx.c | 139 +++++++++++++++++++++++++++++++++++++++++++++--
 gcc/doc/tm.texi          |  15 +++--
 gcc/doc/tm.texi.in       |   2 +
 gcc/omp-offload.c        |  35 ++++++++++--
 gcc/target.def           |  17 ++++--
 gcc/targhooks.h          |   3 +-
 6 files changed, 190 insertions(+), 21 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5642941c6a3..507c8671704 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5205,14 +5205,36 @@ nvptx_simt_vf ()
   return PTX_WARP_SIZE;
 }
 
+#define NVPTX_GOACC_VL_WARP "nvptx vl warp"
+
+/* Return true if the offloaded function needs a vector_length of
+   PTX_WARP_SIZE.  */
+
+static bool
+nvptx_goacc_needs_vl_warp ()
+{
+  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,
+				DECL_ATTRIBUTES (current_function_decl));
+  return attr != NULL_TREE;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  If
    DECL is null, we are validating the default dimensions.  */
 
 static bool
-nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
+nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level,
+			   int default_dims[])
 {
+  int default_vector_length = PTX_VECTOR_LENGTH;
+
+  /* For capability reasons, fallback to vl = 32 for runtime values.  */
+  if (dims[GOMP_DIM_VECTOR] == 0)
+    default_vector_length = PTX_WARP_SIZE;
+  else if (default_dims)
+      default_vector_length = default_dims[GOMP_DIM_VECTOR];
+
   /* Detect if a function is unsuitable for offloading.  */
   if (!flag_offload_force && decl)
     {
@@ -5237,18 +5259,20 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 
   bool changed = false;
 
-  /* The vector size must be 32, unless this is a SEQ routine.  */
+  /* The vector size must be a positive multiple of the warp size,
+     unless this is a SEQ routine.  */
   if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
       && dims[GOMP_DIM_VECTOR] >= 0
-      && dims[GOMP_DIM_VECTOR] != PTX_VECTOR_LENGTH)
+      && (dims[GOMP_DIM_VECTOR] % 32 != 0
+	  || dims[GOMP_DIM_VECTOR] == 0))
     {
       if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
 		    dims[GOMP_DIM_VECTOR]
 		    ? G_("using vector_length (%d), ignoring %d")
 		    : G_("using vector_length (%d), ignoring runtime setting"),
-		    PTX_VECTOR_LENGTH, dims[GOMP_DIM_VECTOR]);
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+		    default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
       changed = true;
     }
 
@@ -5262,16 +5286,77 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
       changed = true;
     }
 
+  /* Ensure that num_worker * vector_length < cta size.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+    {
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
+  /* vector_length must not exceed PTX_CTA_SIZE.  */
+  if (dims[GOMP_DIM_VECTOR] >= PTX_CTA_SIZE)
+    {
+      int new_vector = PTX_CTA_SIZE;
+      if (default_dims)
+	new_vector = default_vector_length;
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  new_vector, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = new_vector;
+      changed = true;
+    }
+
+  /* Set vector_length to default_vector_length if there are a sufficient
+     number of free threads in the CTA.  */
+  if (dims[GOMP_DIM_WORKER] > 0 && dims[GOMP_DIM_VECTOR] <= 0)
+    {
+      if (dims[GOMP_DIM_WORKER] * default_vector_length <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = default_vector_length;
+      else if (dims[GOMP_DIM_WORKER] * PTX_WARP_SIZE <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      else
+	error_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
+		  "vector_length must be at least 32");
+      changed = true;
+    }
+
+  /* Specify a default vector_length.  */
+  if (dims[GOMP_DIM_VECTOR] < 0)
+    {
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
+      changed = true;
+    }
+
+  if (nvptx_goacc_needs_vl_warp () && dims[GOMP_DIM_VECTOR] != PTX_WARP_SIZE)
+    {
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
   if (!decl)
     {
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+      bool new_vector = false;
+      if (dims[GOMP_DIM_VECTOR] <= 1)
+	{
+	  dims[GOMP_DIM_VECTOR] = default_vector_length;
+	  new_vector = true;
+	}
       if (dims[GOMP_DIM_WORKER] < 0)
 	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
+      if (new_vector
+	  && dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
       changed = true;
     }
 
+  gcc_assert (dims[GOMP_DIM_VECTOR] != 0);
+  gcc_assert (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] <= PTX_CTA_SIZE);
+
   return changed;
 }
 
@@ -5291,6 +5376,45 @@ nvptx_dim_limit (int axis)
   return 0;
 }
 
+/* Adjust the parallelism available to a loop given vector_length
+   associated with the offloaded function.  */
+
+static unsigned
+nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
+{
+  if (nvptx_goacc_needs_vl_warp ())
+    return inner_mask;
+
+  bool wv = (inner_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
+
+  if (oa.vector_length == PTX_WARP_SIZE)
+    return inner_mask;
+
+  /* FIXME: This is overly conservative; worker and vector loop will
+     eventually be combined.  */
+  if (wv)
+    return inner_mask & ~GOMP_DIM_MASK (GOMP_DIM_WORKER);
+
+  /* It's difficult to guarantee that warps in large vector_lengths
+     will remain convergent when a vector loop is nested inside a
+     worker loop.  Therefore, fallback to setting vector_length to
+     PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
+     sm_70+ targets.  */
+  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+      && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
+			      DECL_ATTRIBUTES (current_function_decl));
+      DECL_ATTRIBUTES (current_function_decl) = attr;
+    }
+
+  return inner_mask;
+}
+
 /* Determine whether fork & joins are needed.  */
 
 static bool
@@ -6180,6 +6304,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_GOACC_DIM_LIMIT
 #define TARGET_GOACC_DIM_LIMIT nvptx_dim_limit
 
+#undef TARGET_GOACC_ADJUST_PARALLELISM
+#define TARGET_GOACC_ADJUST_PARALLELISM nvptx_adjust_parallelism
+
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 0fcb9c64bf4..3028e438ddd 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5865,7 +5865,7 @@ to use it.
 Return number of threads in SIMT thread group on the target.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
+@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level}, int *@var{default_dims})
 This hook should check the launch dimensions provided for an OpenACC
 compute region, or routine.  Defaulted values are represented as -1
 and non-constant values as 0.  The @var{fn_level} is negative for the
@@ -5873,9 +5873,10 @@ function corresponding to the compute region.  For a routine is is the
 outermost level at which partitioned execution may be spawned.  The hook
 should verify non-default values.  If DECL is NULL, global defaults
 are being validated and unspecified defaults should be filled in.
-Diagnostics should be issued as appropriate.  Return
-true, if changes have been made.  You must override this hook to
-provide dimensions larger than 1.
+Diagnostics should be issued as appropriate.  The @var{default_dims}
+contain the user-specified default dims.  Return true, if changes have
+been made.  You must override this hook to provide dimensions larger
+than 1.
 @end deftypefn
 
 @deftypefn {Target Hook} int TARGET_GOACC_DIM_LIMIT (int @var{axis})
@@ -5883,6 +5884,12 @@ This hook should return the maximum size of a particular dimension,
 or zero if unbounded.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_GOACC_ADJUST_PARALLELISM (unsigned @var{this_mask}, unsigned @var{outer_mask})
+This hook allows the accelerator compiler to remove any unused
+parallelism exposed in the current loop @var{THIS_MASK}, and the
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
 This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN
 function calls to target-specific gimple, or indicate whether they
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4187da139a9..fc73ad13e0a 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4298,6 +4298,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_DIM_LIMIT
 
+@hook TARGET_GOACC_ADJUST_PARALLELISM
+
 @hook TARGET_GOACC_FORK_JOIN
 
 @hook TARGET_GOACC_REDUCTION
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index ba3f4317f4e..f15ce6b8f8d 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -615,8 +615,8 @@ oacc_parse_default_dims (const char *dims)
     }
 
   /* Allow the backend to validate the dimensions.  */
-  targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1);
-  targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2);
+  targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1, NULL);
+  targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2, NULL);
 }
 
 /* Validate and update the dimensions for offloaded FN.  ATTRS is the
@@ -626,7 +626,8 @@ oacc_parse_default_dims (const char *dims)
    function.  */
 
 static void
-oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
+oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used,
+		    int * ARG_UNUSED (default_dims))
 {
   tree purpose[GOMP_DIM_MAX];
   unsigned ix;
@@ -675,7 +676,8 @@ oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
 		      axes[ix], axes[ix]);
     }
 
-  bool changed = targetm.goacc.validate_dims (fn, dims, level);
+  bool changed = targetm.goacc.validate_dims (fn, dims, level,
+					      oacc_default_dims);
 
   /* Default anything left to 1 or a partitioned default.  */
   for (ix = 0; ix != GOMP_DIM_MAX; ix++)
@@ -1258,6 +1260,13 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
 	}
     }
 
+  /* FIXME: Ideally, we should be coalescing parallelism here if the
+     hardware supports it.  E.g. Instead of partitioning a loop
+     across worker and vector axes, sometimes the hardware can
+     execute those loops together without resorting to placing
+     extra thread barriers.  */
+  this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
+
   mask_all |= this_mask;
 
   if (loop->flags & OLF_TILE)
@@ -1349,6 +1358,7 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	  this_mask ^= loop->e_mask;
 	}
 
+      this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
       loop->mask |= this_mask;
     }
 
@@ -1397,6 +1407,8 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	}
 
       loop->mask |= this_mask;
+      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
+
       if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,
 		    tiling
@@ -1604,7 +1616,8 @@ execute_oacc_device_lower ()
     }
 
   int dims[GOMP_DIM_MAX];
-  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask);
+  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask,
+		      NULL);
 
   if (dump_file)
     {
@@ -1746,7 +1759,8 @@ execute_oacc_device_lower ()
 
 bool
 default_goacc_validate_dims (tree ARG_UNUSED (decl), int *dims,
-			     int ARG_UNUSED (fn_level))
+			     int ARG_UNUSED (fn_level),
+			     int * ARG_UNUSED (default_dims))
 {
   bool changed = false;
 
@@ -1774,6 +1788,15 @@ default_goacc_dim_limit (int ARG_UNUSED (axis))
 #endif
 }
 
+/* Default adjustment of loop parallelism is not required.  */
+
+unsigned
+default_goacc_adjust_parallelism (unsigned this_mask,
+				  unsigned ARG_UNUSED (outer_mask))
+{
+  return this_mask;
+}
+
 namespace {
 
 const pass_data pass_data_oacc_device_lower =
diff --git a/gcc/target.def b/gcc/target.def
index b302d3639da..aa7da2c1b2c 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1683,10 +1683,11 @@ function corresponding to the compute region.  For a routine is is the\n\
 outermost level at which partitioned execution may be spawned.  The hook\n\
 should verify non-default values.  If DECL is NULL, global defaults\n\
 are being validated and unspecified defaults should be filled in.\n\
-Diagnostics should be issued as appropriate.  Return\n\
-true, if changes have been made.  You must override this hook to\n\
-provide dimensions larger than 1.",
-bool, (tree decl, int *dims, int fn_level),
+Diagnostics should be issued as appropriate.  The @var{default_dims}\n\
+contain the user-specified default dims.  Return true, if changes have\n\
+been made.  You must override this hook to provide dimensions larger\n\
+than 1.",
+bool, (tree decl, int *dims, int fn_level, int *default_dims),
 default_goacc_validate_dims)
 
 DEFHOOK
@@ -1696,6 +1697,14 @@ or zero if unbounded.",
 int, (int axis),
 default_goacc_dim_limit)
 
+DEFHOOK
+(adjust_parallelism,
+"This hook allows the accelerator compiler to remove any unused\n\
+parallelism exposed in the current loop @var{THIS_MASK}, and the\n\
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.",
+unsigned, (unsigned this_mask, unsigned outer_mask),
+default_goacc_adjust_parallelism)
+
 DEFHOOK
 (fork_join,
 "This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN\n\
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 18070df7839..b60c72a38f1 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -111,10 +111,11 @@ extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
-extern bool default_goacc_validate_dims (tree, int [], int);
+extern bool default_goacc_validate_dims (tree, int [], int, int []);
 extern int default_goacc_dim_limit (int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
 extern void default_goacc_reduction (gcall *);
+extern unsigned default_goacc_adjust_parallelism (unsigned, unsigned);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
-- 
2.14.3


* [og7] vector_length extension part 5: libgomp and tests
  2018-03-01 21:17 [og7] vector_length extension part 1: generalize function and variable names Cesar Philippidis
                   ` (2 preceding siblings ...)
  2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
@ 2018-03-02 20:47 ` Cesar Philippidis
  2018-03-16 13:50   ` Thomas Schwinge
                     ` (2 more replies)
  2018-03-09 15:29 ` [og7] vector_length extension part 1: generalize function and variable names Thomas Schwinge
  4 siblings, 3 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-02 20:47 UTC (permalink / raw)
  To: gcc-patches, Schwinge, Thomas; +Cc: Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 947 bytes --]

The attached patch is the last one in the vector length extension
series. It consists of some tweaks to the libgomp nvptx plugin to
accommodate larger vectors along with two test cases.
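
In case it helps review, the runtime arithmetic boils down to the
sketch below. It is a simplified model of what nvptx_exec does for
dimensions left to the runtime, not the plugin code itself: struct
launch and default_launch are hypothetical names, and a dimension of 0
is taken to mean "not fixed at compile time".

```c
#include <assert.h>

/* Hypothetical result type: workers and vector lanes per block.  */
struct launch
{
  int workers;
  int vectors;
};

/* Pick run-time defaults.  The vector length decided at compile time
   is honored; the number of workers is whatever still fits in a block
   once the vector length is known.  */
static struct launch
default_launch (int threads_per_block, int warp_size,
		int dim_worker, int dim_vector)
{
  struct launch l;

  l.vectors = dim_vector > 0 ? dim_vector : warp_size;
  l.workers = dim_worker > 0 ? dim_worker : threads_per_block / l.vectors;

  /* A defaulted worker count never oversubscribes the block.  */
  assert (dim_worker > 0 || l.workers * l.vectors <= threads_per_block);
  return l;
}
```

The previous code divided threads_per_block by the warp size up front,
which only gives the right worker count when vector_length is 32.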

I only added two test cases because there's really not much interesting
going on with longer vector lengths. We should eventually add more test
cases to handle situations where the nvptx BE falls back to using a
shorter vector length. But right now GCC just makes those changes
silently. There is precedent for the nvptx BE to emit a warning when
vector length != 32, but that might be too verbose. On one hand, it
could be argued that the compiler should error if it cannot satisfy the
user's request. On the other hand, falling back to a smaller vector
length ensures correctness.

Thomas, do you have any thoughts on the warnings/errors or lack thereof?

I'll apply this patch to openacc-gcc-7-branch once the reduction changes
have been approved.

Cesar

[-- Attachment #2: og7-vl-part5-runtime.diff --]
[-- Type: text/x-patch, Size: 6368 bytes --]

2018-03-02  Cesar Philippidis  <cesar@codesourcery.com>

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_exec): Adjust calculations of
	workers and vectors.
	* testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: New test.
	* testsuite/libgomp.oacc-fortran/gemm.f90: New test.


From 37e83baa6ae7e867e6203d7554e8381660488421 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Fri, 2 Mar 2018 06:54:25 -0800
Subject: [PATCH] runtime changes and tests

---
 libgomp/plugin/plugin-nvptx.c                      |  10 +-
 .../libgomp.oacc-c-c++-common/vred2d-128.c         |  55 +++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    | 108 +++++++++++++++++++++
 3 files changed, 169 insertions(+), 4 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bdc0c30e1f5..9b4768f0e59 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -734,8 +734,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   int threads_per_block = threads_per_sm > block_size
     ? block_size : threads_per_sm;
 
-  threads_per_block /= warp_size;
-
   if (threads_per_sm > cpu_size)
     threads_per_sm = cpu_size;
 
@@ -802,6 +800,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
   if (seen_zero)
     {
+      int vectors = dims[GOMP_DIM_VECTOR] > 0
+	? dims[GOMP_DIM_VECTOR] : warp_size;
+      int workers = threads_per_block / vectors;
+
       for (i = 0; i != GOMP_DIM_MAX; i++)
 	if (!dims[i])
 	  {
@@ -819,10 +821,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 		  : 2 * dev_size;
 		break;
 	      case GOMP_DIM_WORKER:
-		dims[i] = threads_per_block;
+		dims[i] = workers;
 		break;
 	      case GOMP_DIM_VECTOR:
-		dims[i] = warp_size;
+		dims[i] = vectors;
 		break;
 	      default:
 		abort ();
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
new file mode 100644
index 00000000000..0fbd30e77b9
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
@@ -0,0 +1,55 @@
+/* Test large vector lengths.  */
+
+#include <assert.h>
+
+#define n 10000
+int a1[n], a2[n];
+
+#define gentest(name, outer, inner)		\
+  void name ()					\
+  {						\
+  long i, j, t1, t2, t3;			\
+  _Pragma(outer)				\
+  for (i = 0; i < n; i++)			\
+    {						\
+      t1 = 0;					\
+      t2 = 0;					\
+      _Pragma(inner)				\
+      for (j = i; j < n; j++)			\
+	{					\
+	  t1++;					\
+	  t2--;					\
+	}					\
+      a1[i] = t1;				\
+      a2[i] = t2;				\
+    }						\
+  for (i = 0; i < n; i++)			\
+    {						\
+      assert (a1[i] == n-i);			\
+      assert (a2[i] == -(n-i));			\
+    }						\
+  }						\
+  
+gentest (test1, "acc parallel loop gang vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test2, "acc parallel loop gang vector_length (128)",
+	 "acc loop worker vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test3, "acc parallel loop gang worker vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test4, "acc parallel loop",
+	 "acc loop reduction(+:t1) reduction(-:t2)")
+
+
+int
+main ()
+{
+  test1 ();
+  test2 ();
+  test3 ();
+  test4 ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
new file mode 100644
index 00000000000..ad67dce5cad
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
@@ -0,0 +1,108 @@
+! Exercise three levels of parallelism using SGEMM from BLAS.
+
+! { dg-additional-options "-fopenacc-dim=-:-:128" }
+
+! Implicitly set vector_length to 128 using -fopenacc-dim.
+subroutine openacc_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n))
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm
+
+! Explicitly set vector_length to 128 using a vector_length clause.
+subroutine openacc_sgemm_128 (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n)) vector_length (128)
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm_128
+
+subroutine host_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  do j = 1, n
+     do i = 1, m
+        temp = 0.0
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine host_sgemm
+
+program main
+  integer, parameter :: M = 100, N = 50, K = 2000
+  real :: a(K, M), b(K, N), c(M, N), d (M, N), e (M, N)
+  real alpha, beta
+  integer i, j
+
+  a(:,:) = 1.0
+  b(:,:) = 0.25
+
+  c(:,:) = 0.0
+  d(:,:) = 0.0
+  e(:,:) = 0.0
+
+  alpha = 1.05
+  beta = 1.25
+
+  call openacc_sgemm (M, N, K, alpha, a, b, beta, c)
+  call openacc_sgemm_128 (M, N, K, alpha, a, b, beta, d)
+  call host_sgemm (M, N, K, alpha, a, b, beta, e)
+
+  do i = 1, m
+     do j = 1, n
+        if (c(i,j) /= e(i,j)) call abort
+        if (d(i,j) /= e(i,j)) call abort
+     end do
+  end do
+end program main
-- 
2.14.3


* Re: [og7] vector_length extension part 1: generalize function and variable names
  2018-03-01 21:17 [og7] vector_length extension part 1: generalize function and variable names Cesar Philippidis
                   ` (3 preceding siblings ...)
  2018-03-02 20:47 ` [og7] vector_length extension part 5: libgomp and tests Cesar Philippidis
@ 2018-03-09 15:29 ` Thomas Schwinge
  2018-03-09 15:31   ` Cesar Philippidis
  4 siblings, 1 reply; 50+ messages in thread
From: Thomas Schwinge @ 2018-03-09 15:29 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: Tom de Vries, gcc-patches

Hi!

On Thu, 1 Mar 2018 13:17:01 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
> To reduce the size of the final patch,
> I've separated all of the misc. function and variable renaming into this
> patch.

Yes, please always do such refactoring changes independently of other
functionality changes.


> This patch also introduces a new populate_offload_attrs function.

I'm seeing:

    [...]/gcc/config/nvptx/nvptx.c: In function 'void nvptx_reorg()':
    [...]/gcc/config/nvptx/nvptx.c:4451:3: warning: 'oa.offload_attrs::vector_length' may be used uninitialized in this function [-Wmaybe-uninitialized]
       if (oa->vector_length == 0)
       ^
    [...]/gcc/config/nvptx/nvptx.c:4516:21: note: 'oa.offload_attrs::vector_length' was declared here
           offload_attrs oa;
                         ^

That must be "populate_offload_attrs" inlined into "nvptx_reorg".  I
can't tell yet why it complains about "vector_length" only but not about
the others.

For reference:

> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c

> +/* Offloading function attributes.  */
> +
> +struct offload_attrs
> +{
> +  unsigned mask;
> +  int num_gangs;
> +  int num_workers;
> +  int vector_length;
> +  int max_workers;
> +};

> +static void
> +populate_offload_attrs (offload_attrs *oa)
> +{
> +  tree attr = oacc_get_fn_attrib (current_function_decl);
> +  tree dims = TREE_VALUE (attr);
> +  unsigned ix;
> +
> +  oa->mask = 0;
> +
> +  for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
> +    {
> +      tree t = TREE_VALUE (dims);
> +      int size = (t == NULL_TREE) ? 0 : TREE_INT_CST_LOW (t);
> +      tree allowed = TREE_PURPOSE (dims);
> +
> +      if (size != 1 && !(allowed && integer_zerop (allowed)))
> +	oa->mask |= GOMP_DIM_MASK (ix);
> +
> +      switch (ix)
> +	{
> +	case GOMP_DIM_GANG:
> +	  oa->num_gangs = size;
> +	  break;
> +
> +	case GOMP_DIM_WORKER:
> +	  oa->num_workers = size;
> +	  break;
> +
> +	case GOMP_DIM_VECTOR:
> +	  oa->vector_length = size;
> +	  break;
> +	}
> +    }
> +
> +  if (oa->vector_length == 0)
> +    {
> +      /* FIXME: Need a more graceful way to handle large vector
> +	 lengths in OpenACC routines.  */
> +      if (!lookup_attribute ("omp target entrypoint",
> +			     DECL_ATTRIBUTES (current_function_decl)))
> +	oa->vector_length = PTX_WARP_SIZE;
> +      else
> +	oa->vector_length = PTX_VECTOR_LENGTH;
> +    }
> +  if (oa->num_workers == 0)
> +    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
> +  else
> +    oa->max_workers = oa->num_workers;
> +}

> @@ -4435,27 +4513,19 @@ nvptx_reorg (void)
>      {
>        /* If we determined this mask before RTL expansion, we could
>  	 elide emission of some levels of forks and joins.  */
> -      unsigned mask = 0;
> -      tree dims = TREE_VALUE (attr);
> -      unsigned ix;
> +      offload_attrs oa;
>  
> -      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
> -	{
> -	  int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
> -	  tree allowed = TREE_PURPOSE (dims);
> +      populate_offload_attrs (&oa);
>  
> -	  if (size != 1 && !(allowed && integer_zerop (allowed)))
> -	    mask |= GOMP_DIM_MASK (ix);
> -	}
>        /* If there is worker neutering, there must be vector
>  	 neutering.  Otherwise the hardware will fail.  */
> -      gcc_assert (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
> -		  || (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
> +      gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
> +		  || (oa.mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
>  
>        /* Discover & process partitioned regions.  */
>        parallel *pars = nvptx_discover_pars (&bb_insn_map);
>        nvptx_process_pars (pars);
> -      nvptx_neuter_pars (pars, mask, 0);
> +      nvptx_neuter_pars (pars, oa.mask, 0);
>        delete pars;
>      }


Grüße
 Thomas

* Re: [og7] vector_length extension part 1: generalize function and variable names
  2018-03-09 15:29 ` [og7] vector_length extension part 1: generalize function and variable names Thomas Schwinge
@ 2018-03-09 15:31   ` Cesar Philippidis
  0 siblings, 0 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-09 15:31 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Tom de Vries, gcc-patches

On 03/09/2018 07:29 AM, Thomas Schwinge wrote:

> On Thu, 1 Mar 2018 13:17:01 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
>> To reduce the size of the final patch,
>> I've separated all of the misc. function and variable renaming into this
>> patch.
> 
> Yes, please always do such refactoring changes independently of other
> functionality changes.
> 
> 
>> This patch also introduces a new populate_offload_attrs function.
> 
> I'm seeing:
> 
>     [...]/gcc/config/nvptx/nvptx.c: In function 'void nvptx_reorg()':
>     [...]/gcc/config/nvptx/nvptx.c:4451:3: warning: 'oa.offload_attrs::vector_length' may be used uninitialized in this function [-Wmaybe-uninitialized]
>        if (oa->vector_length == 0)
>        ^
>     [...]/gcc/config/nvptx/nvptx.c:4516:21: note: 'oa.offload_attrs::vector_length' was declared here
>            offload_attrs oa;
>                          ^
> 
> That must be "populate_offload_attrs" inlined into "nvptx_reorg".  I
> can't tell yet why it complains about "vector_length" only but not about
> the others.

I got lazy and just merged that function as-is. That warning will go
away once the rest of the vector length changes are in.

Cesar

* Re: [og7] vector_length extension part 5: libgomp and tests
  2018-03-02 20:47 ` [og7] vector_length extension part 5: libgomp and tests Cesar Philippidis
@ 2018-03-16 13:50   ` Thomas Schwinge
  2018-03-27 13:00   ` Tom de Vries
  2018-04-05 16:36   ` Tom de Vries
  2 siblings, 0 replies; 50+ messages in thread
From: Thomas Schwinge @ 2018-03-16 13:50 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: Tom de Vries, gcc-patches

Hi!

On Fri, 2 Mar 2018 12:47:23 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
> The attached patch is the last one in the vector length extension
> series. It consists of some tweaks to the libgomp nvptx plugin to
> accommodate larger vectors along with two test cases.
> 
> I only added two test cases because there's really not much interesting
> going on with longer vector lengths. We should eventually add more test
> cases to handle situations where the nvptx BE falls back to using a
> shorter vector length. But right now GCC just makes those changes
> silently. There is precedent for the nvptx BE to emit a warning when
> vector length != 32, but that might be too verbose. On one hand, it
> could be argued that the compiler should error if it cannot satisfy the
> user's request. On the other hand, falling back to a smaller vector
> length ensures correctness.
> 
> Thomas, do you have any thoughts on the warnings/errors or lack thereof?

Yeah, warning when "vector_length([bigger than 32])" is reduced to
"vector_length(32)" is probably too verbose/not useful.  But, it'd be
useful to have in "-fopt-info-omp"?


Grüße
 Thomas

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
@ 2018-03-21 15:55   ` Tom de Vries
  2018-03-21 20:28     ` Cesar Philippidis
  2018-03-26 14:25   ` Tom de Vries
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-21 15:55 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

On 03/02/2018 08:18 PM, Cesar Philippidis wrote:

> og7-vl-part4-hooks.diff

> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index 5642941c6a3..507c8671704 100644
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -5205,14 +5205,36 @@ nvptx_simt_vf ()
>     return PTX_WARP_SIZE;
>   }
>   
> +#define NVPTX_GOACC_VL_WARP "nvptx vl warp"
> +
> +/* Return true if the offloaded function needs a vector_length of
> +   PTX_WARP_SIZE.  */
> +
> +static bool
> +nvptx_goacc_needs_vl_warp ()
> +{
> +  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,
> +				DECL_ATTRIBUTES (current_function_decl));
> +  return attr == NULL_TREE;
> +}
> +

I just wrote an example using "#pragma acc parallel vector_length (128)" 
and looked at the generated code. I found that the actual vector_length 
was still 32. I tracked this back to this function returning true.

I think we need "return attr != NULL_TREE".
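
To make the inverted check concrete, here is a minimal standalone sketch
(not GCC code). It assumes a plain list of attribute names in place of
DECL_ATTRIBUTES, and "lookup" and "needs_vl_warp" are hypothetical
stand-ins for lookup_attribute and nvptx_goacc_needs_vl_warp, with
nullptr playing the role of NULL_TREE:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for DECL_ATTRIBUTES: a plain list of attribute names.
using attribute_list = std::vector<std::string>;

// Hypothetical stand-in for lookup_attribute: returns a pointer to the
// matching entry, or nullptr (playing the role of NULL_TREE) if absent.
static const std::string *
lookup (const attribute_list &attrs, const std::string &name)
{
  for (const std::string &a : attrs)
    if (a == name)
      return &a;
  return nullptr;
}

// Proposed predicate: true only when the "nvptx vl warp" marker is present.
// With "== nullptr" instead, every unmarked function would report that it
// needs a warp-sized vector_length.
static bool
needs_vl_warp (const attribute_list &attrs)
{
  return lookup (attrs, "nvptx vl warp") != nullptr;
}
```

With this form, a function that carries no "nvptx vl warp" marker no
longer reports that it needs a warp-sized vector_length.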

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
@ 2018-03-21 17:16   ` Tom de Vries
  2018-03-22  8:05     ` Cesar Philippidis
  2018-03-22 14:24   ` Tom de Vries
                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-21 17:16 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,
> have been extended to take a barrier ID and a thread count. The idea
> here is to assign one barrier for each logical vector. Worker-single
> synchronization is controlled by barrier 0. Therefore, the vector
> barrier ID is set to tid.y+1 (because there's one vector unit per
> worker) in nvptx_init_oacc_workers and placed into a register stored in
> cfun->machine->sync_bar. If no workers are present, then the barrier ID
> falls back to 0.

I compiled a worker loop before and after the patch series, and observed 
this change:
...
@@ -70,7 +71,7 @@
   $L2:
    // joining 2;
   $L5:
-  bar.sync 1;
+  bar.sync 0;
    // join 2;
    ret;
  }
...

AFAICT from your explanation above, that change is intentional.

Changing the code generation scheme for workers is fine, but obviously 
that should be a minimal, separate patch that we can bisect back to.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-21 15:55   ` Tom de Vries
@ 2018-03-21 20:28     ` Cesar Philippidis
  0 siblings, 0 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-21 20:28 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches

On 03/21/2018 08:49 AM, Tom de Vries wrote:
> On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> 
>> og7-vl-part4-hooks.diff
> 
>> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
>> index 5642941c6a3..507c8671704 100644
>> --- a/gcc/config/nvptx/nvptx.c
>> +++ b/gcc/config/nvptx/nvptx.c
>> @@ -5205,14 +5205,36 @@ nvptx_simt_vf ()
>>     return PTX_WARP_SIZE;
>>   }
>>   +#define NVPTX_GOACC_VL_WARP "nvptx vl warp"
>> +
>> +/* Return true if the offloaded function needs a vector_length of
>> +   PTX_WARP_SIZE.  */
>> +
>> +static bool
>> +nvptx_goacc_needs_vl_warp ()
>> +{
>> +  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,
>> +                DECL_ATTRIBUTES (current_function_decl));
>> +  return attr == NULL_TREE;
>> +}
>> +
> 
> I just wrote an example using "#pragma acc parallel vector_length (128)"
> and looked at the generated code. I found that the actual vector_length
> was still 32. I tracked this back to this function returning true.
> 
> I think we need "return attr != NULL_TREE".

Yes. Good catch. I've added another test case for this.

Thanks,
Cesar

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-21 17:16   ` Tom de Vries
@ 2018-03-22  8:05     ` Cesar Philippidis
  2018-03-22 14:16       ` Tom de Vries
  0 siblings, 1 reply; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-22  8:05 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1322 bytes --]

On 03/21/2018 10:10 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,
>> have been extended to take a barrier ID and a thread count. The idea
>> here is to assign one barrier for each logical vector. Worker-single
>> synchronization is controlled by barrier 0. Therefore, the vector
>> barrier ID is set to tid.y+1 (because there's one vector unit per
>> worker) in nvptx_init_oacc_workers and placed into a register stored in
>> cfun->machine->sync_bar. If no workers are present, then the barrier ID
>> falls back to 0.
> 
> I compiled a worker loop before and after the patch series, and observed
> this change:
> ...
> @@ -70,7 +71,7 @@
>   $L2:
>    // joining 2;
>   $L5:
> -  bar.sync 1;
> +  bar.sync 0;
>    // join 2;
>    ret;
>  }
> ...
> 
> AFAICT from your explanation above, that change is intentional.
> 
> Changing the code generation scheme for workers is fine, but obviously
> that should be a minimal, separate patch that we can bisect back to.

That sounds reasonable. I'll apply this patch to og7 once testing has
completed. While all of the functionality it introduces is unnecessary
without the vector length changes, at least it can be applied independently.
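
For readers skimming the nvptx.md hunk in the attachment, the new output
template picks between two bar.sync forms depending on whether the
barrier operand is a register. A rough standalone sketch of that
selection follows; "barrier_operand" and "format_barsync" are
hypothetical names used only for illustration, and the register name in
the usage note is made up:

```cpp
#include <sstream>
#include <string>

// Hypothetical model of operand 0 of nvptx_barsync: either a constant
// barrier number or a register holding the barrier number.
struct barrier_operand
{
  bool is_register;
  std::string text;   // e.g. "0" for a constant, or a register name
};

// Mirrors the template choice: a constant barrier prints "bar.sync N;",
// a register barrier additionally prints the thread count.
static std::string
format_barsync (const barrier_operand &barrier, int threads)
{
  std::ostringstream s;
  s << "\tbar.sync\t" << barrier.text;
  if (barrier.is_register)
    s << ", " << threads;
  s << ";";
  return s.str ();
}
```

For example, a constant barrier 0 yields "bar.sync 0;" regardless of the
thread count, while a register barrier such as "%r25" with 128 threads
yields "bar.sync %r25, 128;".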

Cesar

[-- Attachment #2: og7-new-barsync.diff --]
[-- Type: text/x-patch, Size: 3608 bytes --]

Update bar.sync usage

2018-03-21  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_cta_sync): Change arguments to take
	in a lock and thread count.  Update call to gen_nvptx_barsync.
	(nvptx_single): Update call to nvptx_cta_sync.
	(nvptx_process_pars): Likewise.
	* config/nvptx/nvptx.md (nvptx_barsync): Adjust operands.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59fed7..029628f8a0e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3936,13 +3936,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  THREADS is the
+   number of threads controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4192,6 +4193,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
@@ -4204,14 +4207,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
 						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4328,12 +4331,14 @@ nvptx_process_pars (parallel *par)
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
 					   false);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 2b4bcb3a45b..e638a13c366 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1421,10 +1421,16 @@
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (!REG_P (operands[0]))
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_nounroll"

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22  8:05     ` Cesar Philippidis
@ 2018-03-22 14:16       ` Tom de Vries
  2018-03-22 14:35         ` Cesar Philippidis
  0 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-22 14:16 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1467 bytes --]

On 03/22/2018 04:59 AM, Cesar Philippidis wrote:
> On 03/21/2018 10:10 AM, Tom de Vries wrote:
>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,
>>> have been extended to take a barrier ID and a thread count. The idea
>>> here is to assign one barrier for each logical vector. Worker-single
>>> synchronization is controlled by barrier 0. Therefore, the vector
>>> barrier ID is set to tid.y+1 (because there's one vector unit per
>>> worker) in nvptx_init_oacc_workers and placed into a register stored in
>>> cfun->machine->sync_bar. If no workers are present, then the barrier ID
>>> falls back to 0.
>>
>> I compiled a worker loop before and after the patch series, and observed
>> this change:
>> ...
>> @@ -70,7 +71,7 @@
>>    $L2:
>>     // joining 2;
>>    $L5:
>> -  bar.sync 1;
>> +  bar.sync 0;
>>     // join 2;
>>     ret;
>>   }
>> ...
>>
>> AFAICT from your explanation above, that change is intentional.
>>
>> Changing the code generation scheme for workers is fine, but obviously
>> that should be a minimal, separate patch that we can bisect back to.
> 
> That sounds reasonable. I'll apply this patch to og7 once testing has
> completed. While all of the functionality it introduces is unnecessary

In other words, the patch is not minimal.

Thanks,
- Tom

> without the vector length changes, at least it can be applied independently.
> 

[-- Attachment #2: tmp.patch --]
[-- Type: text/x-patch, Size: 2365 bytes --]

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59..16d846e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3936,13 +3936,13 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock);
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4192,6 +4192,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  rtx barrier = GEN_INT (0);
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
@@ -4204,14 +4205,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
 						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier), tail);
 	}
 
       extract_insn (tail);
@@ -4328,12 +4329,13 @@ nvptx_process_pars (parallel *par)
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
 					   false);
+      rtx barrier = GEN_INT (0);
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
  2018-03-21 17:16   ` Tom de Vries
@ 2018-03-22 14:24   ` Tom de Vries
  2018-03-22 15:18     ` Cesar Philippidis
  2018-03-22 15:04   ` Tom de Vries
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-22 14:24 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Thomas Schwinge

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:

> 	(nvptx_declare_function_name): Emit a .maxntid directive hint and
> 	call nvptx_init_oacc_workers.

> +
> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */
> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
> +	<< cfun->machine->axis_dim[1] << ", 1\n";
> +

This change:
...
  // BEGIN FUNCTION DEF: main$_omp_fn$0
  .entry main$_omp_fn$0 (.param .u64 %in_ar0)
+  .maxntid 32, 32, 1
...
needs to be an individual patch.


 > +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */

'Help' is too strongly formulated, given that there's no clear link 
between the semantics of the directive, and the observed effect.

Use "seems to have the effect" or some such formulation.

Also, list in the comment a JIT driver version, an sm_ version, and a
testcase for which this is required.

Also, guard it with WORKAROUND_PTXJIT_BUG_3 (_2 is already taken in trunk.)

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 14:16       ` Tom de Vries
@ 2018-03-22 14:35         ` Cesar Philippidis
  0 siblings, 0 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-22 14:35 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 642 bytes --]

On 03/22/2018 06:43 AM, Tom de Vries wrote:
> On 03/22/2018 04:59 AM, Cesar Philippidis wrote:
>> On 03/21/2018 10:10 AM, Tom de Vries wrote:

>>> Changing the code generation scheme for workers is fine, but obviously
>>> that should be a minimal, separate patch that we can bisect back to.
>>
>> That sounds reasonable. I'll apply this patch to og7 once testing has
>> completed. While all of the functionality it introduces is unnecessary
> 
> In other words, the patch is not minimal.

My intention was to reduce the size of the final vector length patch.
But I can commit this patch after testing as it's equivalent at this point.

Cesar

[-- Attachment #2: og7-minimal-barriers.diff --]
[-- Type: text/x-patch, Size: 935 bytes --]

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59fed7..eff87732c4b 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4211,7 +4211,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (false), tail);
 	}
 
       extract_insn (tail);
@@ -4333,7 +4333,7 @@ nvptx_process_pars (parallel *par)
 	{
 	  /* Insert begin and end synchronizations.  */
 	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
  2018-03-21 17:16   ` Tom de Vries
  2018-03-22 14:24   ` Tom de Vries
@ 2018-03-22 15:04   ` Tom de Vries
  2018-03-22 17:14     ` Cesar Philippidis
  2018-03-22 17:47   ` Tom de Vries
                     ` (7 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-22 15:04 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Thomas Schwinge

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> The attached patch generalizes the worker state propagation and
> synchronization code to handle large vectors. When the vector_length is
> larger than a CUDA warp, the nvptx BE will now use shared-memory to
> spill-and-fill vector state when transitioning from vector-single mode
> to vector partitioned.

I've compiled this test-case:
...
int
main (void)
{
   int a[10];
#pragma acc parallel loop worker
   for (int i = 0; i < 10; i++)
     a[i] = i;

   return 0;
}
...

without and with the patch series, and observed the following difference 
in generated ptx:
...
-.shared .align 8 .u8 __oacc_bcast[8];
+.shared .align 8 .u8 __oacc_bcast[264];
...

Why is the example using 33 times more shared memory space with the 
patch series applied?

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 14:24   ` Tom de Vries
@ 2018-03-22 15:18     ` Cesar Philippidis
  2018-03-22 16:20       ` Tom de Vries
  0 siblings, 1 reply; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-22 15:18 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1553 bytes --]

On 03/22/2018 07:23 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> 
>>     (nvptx_declare_function_name): Emit a .maxntid directive hint and
>>     call nvptx_init_oacc_workers.
> 
>> +
>> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */
>> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
>> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
>> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
>> +    << cfun->machine->axis_dim[1] << ", 1\n";
>> +
> 
> This change:
> ...
>  // BEGIN FUNCTION DEF: main$_omp_fn$0
>  .entry main$_omp_fn$0 (.param .u64 %in_ar0)
> +  .maxntid 32, 32, 1
> ...
> needs to be an individual patch.

cfun->machine->axis_dim is new with the vector length changes,
so I hard-coded .maxntid to '32, 32, 1' for og7 as an interim solution.

>> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */
> 
> 'Help' is too strongly formulated, given that there's no clear link
> between the semantics of the directive, and the observed effect.
> 
> Use "seems to have the effect" or some such formulation.
> 
> Also, list in the comment a JIT driver version, and sm_ version and a
> testcase for which this is required.
> 
> Also, guard it with WORKAROUND_PTXJIT_BUG_3 (_2 is already taken in trunk.)

Sounds reasonable. I'll commit the patch to og7 once the regression
testing has completed.

Thanks,
Cesar

[-- Attachment #2: 0001-add-.maxntid-hint.patch --]
[-- Type: text/x-patch, Size: 1289 bytes --]

From b89ec8060de3affb94b580be3260381028d4c183 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Thu, 22 Mar 2018 08:05:53 -0700
Subject: [PATCH] add .maxntid hint

---
 gcc/config/nvptx/nvptx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index eff87732c4b..9fb2bcd6852 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -76,6 +76,7 @@
 #include "target-def.h"
 
 #define WORKAROUND_PTXJIT_BUG 1
+#define WORKAROUND_PTXJIT_BUG_3 1
 
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
@@ -1219,6 +1220,15 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
      stream, in order to share the prototype writing code.  */
   std::stringstream s;
   write_fn_proto (s, true, name, decl);
+
+#if WORKAROUND_PTXJIT_BUG_3
+  /* Emitting a .maxntid seems to have the effect of encouraging the
+     PTX JIT to emit SYNC branches.  */
+  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
+      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
+      s << ".maxntid 32, 32, 1\n";
+#endif
+
   s << "{\n";
 
   bool return_in_mem = write_return_type (s, false, result_type);
-- 
2.14.3


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 15:18     ` Cesar Philippidis
@ 2018-03-22 16:20       ` Tom de Vries
  2018-03-22 17:26         ` Cesar Philippidis
  0 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-22 16:20 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Thomas Schwinge

On 03/22/2018 04:11 PM, Cesar Philippidis wrote:
> On 03/22/2018 07:23 AM, Tom de Vries wrote:
>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>
>>>      (nvptx_declare_function_name): Emit a .maxntid directive hint and
>>>      call nvptx_init_oacc_workers.
>>> +
>>> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */
>>> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
>>> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
>>> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
>>> +    << cfun->machine->axis_dim[1] << ", 1\n";
>>> +
>> This change:
>> ...
>>   // BEGIN FUNCTION DEF: main$_omp_fn$0
>>   .entry main$_omp_fn$0 (.param .u64 %in_ar0)
>> +  .maxntid 32, 32, 1
>> ...
>> needs to be an individual patch.
> cfun->machine->axis_dims is something new to the vector length changes,
> so I hard-coded .maxntid to size '32, 32, 1' for og7 as an interim solution.
> 

That's obviously not good enough.

When I compile this test-case:
...
int
main (void)
{
   int a[10];
#pragma acc parallel num_workers (16)
#pragma acc loop worker
   for (int i = 0; i < 10; i++)
     a[i] = i;

   return 0;
}
...

I get:
...
  .maxntid 32, 16, 1
...

That's the change you need to isolate.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 15:04   ` Tom de Vries
@ 2018-03-22 17:14     ` Cesar Philippidis
  0 siblings, 0 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-22 17:14 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1503 bytes --]

On 03/22/2018 07:44 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> The attached patch generalizes the worker state propagation and
>> synchronization code to handle large vectors. When the vector_length is
>> larger than a CUDA warp, the nvptx BE will now use shared-memory to
>> spill-and-fill vector state when transitioning from vector-single mode
>> to vector partitioned.
> 
> I've compiled this test-case:
> ...
> int
> main (void)
> {
>   int a[10];
> #pragma acc parallel loop worker
>   for (int i = 0; i < 10; i++)
>     a[i] = i;
> 
>   return 0;
> }
> ...
> 
> without and with the patch series, and observed the following difference
> in generated ptx:
> ...
> -.shared .align 8 .u8 __oacc_bcast[8];
> +.shared .align 8 .u8 __oacc_bcast[264];
> ...
> 
> Why is the example using 33 times more shared memory space with the
> patch series applied?

Because the nvptx BE wasn't taking into account that vector_length = 32
doesn't need to use shared-memory to broadcast variables.

That magic value of 33 was derived from nvptx_mach_max_workers () + 1.
When vector_length > 32, there needs to be nvptx_mach_max_workers ()
partitions for vector state propagation. There also needs to be a
shared-memory buffer for worker-state propagation, because I found
situations where some threads were still spilling and filling workers
before vector 0 transitioned to vector-partitioned mode.

The attached, untested, patch should resolve that issue.
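
As a worked example of the arithmetic, here is a small standalone sketch
of the sizing rule in the attached patch. "bcast_buffer_size" is a
hypothetical helper, and the numbers in the usage note assume 8 bytes of
broadcast state, an alignment of 8, and nvptx_mach_max_workers ()
returning 32 (which is consistent with the factor of 33 above):

```cpp
#include <cassert>

// Hypothetical helper mirroring the sizing logic in the patch.
// data_size: bytes of state to broadcast; align: power-of-two alignment.
static int
bcast_buffer_size (int data_size, int align, int vector_length,
		   int max_workers)
{
  const int warp_size = 32;	// Stands in for PTX_WARP_SIZE.

  assert (align > 0 && (align & (align - 1)) == 0);

  // Round the partition size up to the alignment.
  int psize = (data_size + align - 1) & ~(align - 1);

  // One partition suffices unless vectors span more than one warp; then
  // one partition per worker plus one for worker-state propagation.
  int pnum = 1;
  if (vector_length > warp_size)
    pnum = max_workers + 1;

  return psize * pnum;
}
```

Under those assumptions, bcast_buffer_size (8, 8, 32, 32) gives 8 and
bcast_buffer_size (8, 8, 128, 32) gives 264, matching the two
__oacc_bcast sizes in the ptx diff above.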

Cesar

[-- Attachment #2: og7-vvl-bcast-mem.diff --]
[-- Type: text/x-patch, Size: 1218 bytes --]

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 3102c79bf96..f81fb0113d5 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4061,9 +4061,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
       if (oacc_bcast_partition < data.offset)
 	{
 	  int psize = data.offset;
+	  int pnum = 1;
+
+	  if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+	    pnum = nvptx_mach_max_workers () + 1;
+
 	  psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
 	  oacc_bcast_partition = psize;
-	  oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	  oacc_bcast_size = psize * pnum;
 	}
     }
   return empty;
@@ -4348,9 +4353,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  if (oacc_bcast_partition < size)
 	    {
 	      int psize = size;
+	      int pnum = 1;
+
+	      if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+		pnum = nvptx_mach_max_workers () + 1;
+
 	      psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
 	      oacc_bcast_partition = psize;
-	      oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	      oacc_bcast_size = psize * pnum;
 	    }
 
 	  data.offset = 0;

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 16:20       ` Tom de Vries
@ 2018-03-22 17:26         ` Cesar Philippidis
  2018-03-22 17:58           ` Tom de Vries
  2018-03-23 14:35           ` Tom de Vries
  0 siblings, 2 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-22 17:26 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 560 bytes --]

On 03/22/2018 09:18 AM, Tom de Vries wrote:

> That's obviously not good enough.
> 
> When I compile this test-case:
> ...
> int
> main (void)
> {
>   int a[10];
> #pragma acc parallel num_workers (16)
> #pragma acc loop worker
>   for (int i = 0; i < 10; i++)
>     a[i] = i;
> 
>   return 0;
> }
> ...
> 
> I get:
> ...
>  .maxntid 32, 16, 1
> ...
> 
> That's the change you need to isolate.

I attached an updated patch which incorporates the
cfun->machine->axis_dim changes. It now generates more precise arguments
for maxntid.
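
For reference, a minimal standalone sketch of how the directive is
assembled from the two axis dimensions. "format_maxntid" is a
hypothetical helper; its two parameters stand in for
cfun->machine->axis_dim[0] and cfun->machine->axis_dim[1]:

```cpp
#include <sstream>
#include <string>

// Builds the .maxntid line from the vector length (x axis) and the
// number of workers (y axis); the z axis is always 1 here.
static std::string
format_maxntid (int vector_length, int num_workers)
{
  std::ostringstream s;
  s << ".maxntid " << vector_length << ", " << num_workers << ", 1\n";
  return s.str ();
}
```

With a vector length of 32 and num_workers (16), as in the test case
above, this produces ".maxntid 32, 16, 1".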

Cesar

[-- Attachment #2: 0001-emit-.maxntid-hint.patch --]
[-- Type: text/x-patch, Size: 2725 bytes --]

From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <cesar@codesourcery.com>
Date: Thu, 22 Mar 2018 08:05:53 -0700
Subject: [PATCH] emit .maxntid hint

---
 gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++
 gcc/config/nvptx/nvptx.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index eff87732c4b..3958f71e995 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -76,6 +76,7 @@
 #include "target-def.h"
 
 #define WORKAROUND_PTXJIT_BUG 1
+#define WORKAROUND_PTXJIT_BUG_3 1
 
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
@@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
      stream, in order to share the prototype writing code.  */
   std::stringstream s;
   write_fn_proto (s, true, name, decl);
+
+#if WORKAROUND_PTXJIT_BUG_3
+  /* Emitting a .maxntid seems to have the effect of encouraging the
+     PTX JIT to emit SYNC branches.  */
+  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
+      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
+      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
+	<< cfun->machine->axis_dim[1] << ", 1\n";
+#endif
+
   s << "{\n";
 
   bool return_in_mem = write_return_type (s, false, result_type);
@@ -2831,6 +2842,11 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
 struct parallel
 {
   /* Parent parallel.  */
@@ -4525,6 +4541,9 @@ nvptx_reorg (void)
 
       populate_offload_attrs (&oa);
 
+      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
+
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
       gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 8a14507c88a..958516da604 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -226,6 +226,8 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.   */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
-- 
2.14.3


* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (2 preceding siblings ...)
  2018-03-22 15:04   ` Tom de Vries
@ 2018-03-22 17:47   ` Tom de Vries
  2018-03-22 17:48     ` Cesar Philippidis
  2018-03-23 13:14   ` Tom de Vries
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-22 17:47 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> +  rtx red_partition; /* Similar to bcast_partition, except for vector
> +			reductions.  */

Shouldn't this be in "[og7] vector_length extension part 3: reductions"?

Thanks,
- Tom

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 17:47   ` Tom de Vries
@ 2018-03-22 17:48     ` Cesar Philippidis
  2018-03-22 18:00       ` Tom de Vries
  0 siblings, 1 reply; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-22 17:48 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches

On 03/22/2018 10:39 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> +  rtx red_partition; /* Similar to bcast_partition, except for vector
>> +            reductions.  */
> 
> Shouldn't this be in "[og7] vector_length extension part 3: reductions"?

Maybe. But keep in mind, with the exception of the bar.sync and maxntid
changes you requested, I don't think the vector length patch makes sense
to go in as individual hunks. Maybe I could split out the new
TARGET_GOACC_ADJUST_PARALLELISM hook in part 4 into a separate patch.
But, at the same time, if something isn't being used, what's the point
of going through that extra work?

Cesar

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 17:26         ` Cesar Philippidis
@ 2018-03-22 17:58           ` Tom de Vries
  2018-03-22 19:32             ` Cesar Philippidis
  2018-03-23 14:35           ` Tom de Vries
  1 sibling, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-22 17:58 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Thomas Schwinge

On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
> On 03/22/2018 09:18 AM, Tom de Vries wrote:
> 
>> That's obviously not good enough.
>>
>> When I compile this test-case:
>> ...
>> int
>> main (void)
>> {
>>    int a[10];
>> #pragma acc parallel num_workers (16)
>> #pragma acc loop worker
>>    for (int i = 0; i < 10; i++)
>>      a[i] = i;
>>
>>    return 0;
>> }
>> ...
>>
>> I get:
>> ...
>>   .maxntid 32, 16, 1
>> ...
>>
>> That's the change you need to isolate.
> 
> I attached an updated patch which incorporates the
> cfun->machine->axis_dim changes. It now generates more precise arguments
> for maxntid.

I'll try this out.

Still, this doesn't address my request: "Also, list in the comment a JIT 
driver version, and sm_ version and a testcase for which this is required"

Thanks,
- Tom

> 
> Cesar
> 
> 
> 0001-emit-.maxntid-hint.patch
> 
> 
>  From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001
> From: Cesar Philippidis <cesar@codesourcery.com>
> Date: Thu, 22 Mar 2018 08:05:53 -0700
> Subject: [PATCH] emit .maxntid hint
> 
> ---
>   gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++
>   gcc/config/nvptx/nvptx.h |  2 ++
>   2 files changed, 21 insertions(+)
> 
> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index eff87732c4b..3958f71e995 100644
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -76,6 +76,7 @@
>   #include "target-def.h"
>   
>   #define WORKAROUND_PTXJIT_BUG 1
> +#define WORKAROUND_PTXJIT_BUG_3 1
>   
>   /* Define dimension sizes for known hardware.  */
>   #define PTX_VECTOR_LENGTH 32
> @@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
>        stream, in order to share the prototype writing code.  */
>     std::stringstream s;
>     write_fn_proto (s, true, name, decl);
> +
> +#if WORKAROUND_PTXJIT_BUG_3
> +  /* Emitting a .maxntid seems to have the effect of encouraging the
> +     PTX JIT to emit SYNC branches.  */
> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
> +	<< cfun->machine->axis_dim[1] << ", 1\n";
> +#endif
> +
>     s << "{\n";
>   
>     bool return_in_mem = write_return_type (s, false, result_type);
> @@ -2831,6 +2842,11 @@ struct offload_attrs
>     int max_workers;
>   };
>   
> +/* Define entries for cfun->machine->axis_dim.  */
> +
> +#define MACH_VECTOR_LENGTH 0
> +#define MACH_MAX_WORKERS 1
> +
>   struct parallel
>   {
>     /* Parent parallel.  */
> @@ -4525,6 +4541,9 @@ nvptx_reorg (void)
>   
>         populate_offload_attrs (&oa);
>   
> +      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
> +      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
> +
>         /* If there is worker neutering, there must be vector
>   	 neutering.  Otherwise the hardware will fail.  */
>         gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
> diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
> index 8a14507c88a..958516da604 100644
> --- a/gcc/config/nvptx/nvptx.h
> +++ b/gcc/config/nvptx/nvptx.h
> @@ -226,6 +226,8 @@ struct GTY(()) machine_function
>     int return_mode; /* Return mode of current fn.
>   		      (machine_mode not defined yet.) */
>     rtx axis_predicate[2]; /* Neutering predicates.  */
> +  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
> +		      vector_length, dim[1] is num_workers.   */
>     rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
>     rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
>     rtx unisimt_location; /* Mask location for -muniform-simt.  */
> 

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 17:48     ` Cesar Philippidis
@ 2018-03-22 18:00       ` Tom de Vries
  0 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-22 18:00 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

On 03/22/2018 06:47 PM, Cesar Philippidis wrote:
> On 03/22/2018 10:39 AM, Tom de Vries wrote:
>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>> +  rtx red_partition; /* Similar to bcast_partition, except for vector
>>> +            reductions.  */
>>
>> Shouldn't this be in "[og7] vector_length extension part 3: reductions"?
> 
> Maybe. But keep in mind, with the exception of the bar.sync and maxntid
> changes you requested, I don't think the vector length patch makes sense
> to go in as individual hunks. Maybe I could split out the new
> TARGET_GOACC_ADJUST_PARALLELISM hook in part 4 into a separate patch.
> But, at the same time, if something isn't being used, what's the point
> of going through that extra work?

Because patches that are split into logically consistent parts are easy 
to review, and easy to analyze and fix or undo when bisected back to. 
And yes, that's extra work.

Thanks,
- Tom

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 17:58           ` Tom de Vries
@ 2018-03-22 19:32             ` Cesar Philippidis
  2018-03-23  8:56               ` Tom de Vries
  0 siblings, 1 reply; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-22 19:32 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 1534 bytes --]

On 03/22/2018 10:51 AM, Tom de Vries wrote:
> On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
>> On 03/22/2018 09:18 AM, Tom de Vries wrote:
>>
>>> That's obviously not good enough.
>>>
>>> When I compile this test-case:
>>> ...
>>> int
>>> main (void)
>>> {
>>>    int a[10];
>>> #pragma acc parallel num_workers (16)
>>> #pragma acc loop worker
>>>    for (int i = 0; i < 10; i++)
>>>      a[i] = i;
>>>
>>>    return 0;
>>> }
>>> ...
>>>
>>> I get:
>>> ...
>>>   .maxntid 32, 16, 1
>>> ...
>>>
>>> That's the change you need to isolate.
>>
>> I attached an updated patch which incorporates the
>> cfun->machine->axis_dim changes. It now generates more precise arguments
>> for maxntid.
> 
> I'll try this out.
> 
> Still, this doesn't address my request: "Also, list in the comment a JIT
> driver version, and sm_ version and a testcase for which this is required"

I attached the test case where it used to fail without maxntid. But
after looking at it again, the maxntid directive was probably masking that
other PTX JIT bug involving abort and exiting threads that you fixed.
And in fact, the test case works without the maxntid patch on my sm_60 GPU.

I'm going to retest the variable vector length changes without it and
see if it's still necessary. On one hand, maxntid should be fairly
innocuous, but I don't like how it can mask other PTX JIT bugs. At this
point, I'm leaning towards dropping it if it does not impact the libgomp
regression test suite anymore. What do you want to do?

Cesar

[-- Attachment #2: dcp-3a.c --]
[-- Type: text/x-csrc, Size: 729 bytes --]

/* This test was failing with nvptx offloading without the .maxntid
   PTX directive.  */

int i;
int main(void)
{
  int j, v;
  i = -1;
  j = -2;
  v = 0;

  j = -2;
  v = 0;
#pragma acc parallel present_or_copyout (v) copyout (i, j) vector_length(128)
  {
    i = 2;
    j = 1;
    if (i != 2 || j != 1)
      __builtin_abort ();
    v = 1;
  }
  if (v != 1 || i != 2 || j != 1)
    __builtin_abort ();
  i = -1;
  j = -2;
  v = 0;
#pragma acc parallel present_or_copyout (v) copy (i, j) vector_length(128)
  {
    if (i != -1 || j != -2)
      __builtin_abort ();
    i = 2;
    j = 1;
    if (i != 2 || j != 1)
      __builtin_abort ();
    v = 1;
  }
  if (v != 1 || i != 2 || j != 1)
    __builtin_abort ();

  return 0;
}

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 19:32             ` Cesar Philippidis
@ 2018-03-23  8:56               ` Tom de Vries
  0 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-23  8:56 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Thomas Schwinge

On 03/22/2018 08:04 PM, Cesar Philippidis wrote:
> I'm going to retest the variable vector length changes without it and
> see if it's still necessary. On one hand, maxntid should be fairly
> innocuous, but I don't like how it can mask other PTX JIT bugs. At this
> point, I'm leaning towards dropping it if it does not impact the libgomp
> regression test suite anymore. What do you want to do?

If there is no observable difference in tests passing/failing, then we 
should drop it.

Thanks,
- Tom

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (3 preceding siblings ...)
  2018-03-22 17:47   ` Tom de Vries
@ 2018-03-23 13:14   ` Tom de Vries
  2018-03-23 13:16   ` Tom de Vries
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-23 13:14 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 367 bytes --]

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> @@ -4115,13 +4225,23 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
>   	    pred = gen_reg_rtx (BImode);
>   	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
>   	  }
> -	
> +

It's fine to clean up whitespace, but please do that in separate patches.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Fix-whitespace-in-nvptx_single.patch --]
[-- Type: text/x-patch, Size: 665 bytes --]

[nvptx] Fix whitespace in nvptx_single

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_single): Fix whitespace.

---
 gcc/config/nvptx/nvptx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59..50d7319 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4100,7 +4100,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    pred = gen_reg_rtx (BImode);
 	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
 	  }
-	
+
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (4 preceding siblings ...)
  2018-03-23 13:14   ` Tom de Vries
@ 2018-03-23 13:16   ` Tom de Vries
  2018-03-23 14:18   ` Tom de Vries
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-23 13:16 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 404 bytes --]

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> +/* Loop structure of the function.  The entire function is described as
> +   a NULL loop.  */
> +
>   struct parallel
>   {
>     /* Parent parallel.  */

You dropped this comment in "vector_length extension part 1: generalize 
function and variable names".

It's good to add it back, but that needs to be a separate patch.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0002-nvptx-Re-add-removed-struct-parallel-comment.patch --]
[-- Type: text/x-patch, Size: 599 bytes --]

[nvptx] Re-add removed struct parallel comment

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (struct parallel): Re-add comment.

---
 gcc/config/nvptx/nvptx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 50d7319..9873449 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2831,6 +2831,9 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  */
+
 struct parallel
 {
   /* Parent parallel.  */

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (5 preceding siblings ...)
  2018-03-23 13:16   ` Tom de Vries
@ 2018-03-23 14:18   ` Tom de Vries
  2018-03-23 16:30   ` Tom de Vries
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-23 14:18 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1082 bytes --]

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
> index 28ae263c867..ac2731233dd 100644
> --- a/gcc/config/nvptx/nvptx.md
> +++ b/gcc/config/nvptx/nvptx.md
> @@ -1418,10 +1418,16 @@
>     [(set_attr "atomic" "true")])
>   
>   (define_insn "nvptx_barsync"
> -  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
> +  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
> +		     (match_operand:SI 1 "const_int_operand")]
>   		    UNSPECV_BARSYNC)]
>     ""
> -  "\\tbar.sync\\t%0;"
> +  {
> +    if (!REG_P (operands[0]))
> +      return "\\tbar.sync\\t%0;";
> +    else
> +      return "\\tbar.sync\\t%0, %1;";
> +  }
>     [(set_attr "predicable" "false")])

This is wrong. The first operand can be a register or a constant, and 
the second operand is independent. Whether or not we print the second 
operand is independent of whether the first is a register.

In this patch I've reserved INTVAL (operands[1]) == 0 for the "no second 
operand" case.
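
The operand handling described above can be sketched outside of the
machine description roughly as follows.  This is only an illustration:
format_barsync is a hypothetical helper, and the barrier operand is
passed as already-printed text rather than as an rtx.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the nvptx_barsync output template.  BARRIER
   is the printed form of operand 0 (a constant or a register name),
   THREADS is operand 1, where 0 is reserved for "no second operand".
   Returns nonzero if the text fit in LEN bytes.  */
static int
format_barsync (char *buf, size_t len, const char *barrier, int threads)
{
  assert (buf != NULL && len > 0 && barrier != NULL);
  int n;
  if (threads == 0)
    n = snprintf (buf, len, "\tbar.sync\t%s;", barrier);
  else
    n = snprintf (buf, len, "\tbar.sync\t%s, %d;", barrier, threads);
  return n >= 0 && (size_t) n < len;
}
```

The point of the sketch is that the choice between the one-operand and
two-operand forms depends only on the thread count, so a register
barrier without a thread count and a constant barrier with one are both
representable.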

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Add-thread-count-parm-to-bar.sync.patch --]
[-- Type: text/x-patch, Size: 3765 bytes --]

[nvptx] Add thread count parm to bar.sync

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.md (nvptx_barsync): Add and handle operand.
	* config/nvptx/nvptx.c (nvptx_cta_sync): Change arguments to take in a
	lock and thread count.  Update call to gen_nvptx_barsync.
	(nvptx_single, nvptx_process_pars): Update calls to nvptx_cta_sync.

---
 gcc/config/nvptx/nvptx.c  | 22 ++++++++++++++--------
 gcc/config/nvptx/nvptx.md | 10 ++++++++--
 3 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 12441cb..32f2efb 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3939,13 +3939,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  THREADS is the
+   number of threads controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4195,6 +4196,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
@@ -4207,14 +4210,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
 						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4331,12 +4334,15 @@ nvptx_process_pars (parallel *par)
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
 					   false);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads),
+			    par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 2b4bcb3a..2609222 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1421,10 +1421,16 @@
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (INTVAL (operands[1]) == 0)
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_nounroll"

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-22 17:26         ` Cesar Philippidis
  2018-03-22 17:58           ` Tom de Vries
@ 2018-03-23 14:35           ` Tom de Vries
  1 sibling, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-23 14:35 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 3795 bytes --]

On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
> On 03/22/2018 09:18 AM, Tom de Vries wrote:
> 
>> That's obviously not good enough.
>>
>> When I compile this test-case:
>> ...
>> int
>> main (void)
>> {
>>    int a[10];
>> #pragma acc parallel num_workers (16)
>> #pragma acc loop worker
>>    for (int i = 0; i < 10; i++)
>>      a[i] = i;
>>
>>    return 0;
>> }
>> ...
>>
>> I get:
>> ...
>>   .maxntid 32, 16, 1
>> ...
>>
>> That's the change you need to isolate.
> 
> I attached an updated patch which incorporates the
> cfun->machine->axis_dim changes. It now generates more precise arguments
> for maxntid.
> 

Even with maxntid dropped, axis_dim is still used elsewhere in the patch 
series, so we can split off the introduction of axis_dim and helper 
functions into a separate patch.

Committed.

Thanks,
- Tom

> Cesar
> 
> 
> 0001-emit-.maxntid-hint.patch
> 
> 
>  From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001
> From: Cesar Philippidis <cesar@codesourcery.com>
> Date: Thu, 22 Mar 2018 08:05:53 -0700
> Subject: [PATCH] emit .maxntid hint
> 
> ---
>   gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++
>   gcc/config/nvptx/nvptx.h |  2 ++
>   2 files changed, 21 insertions(+)
> 
> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index eff87732c4b..3958f71e995 100644
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -76,6 +76,7 @@
>   #include "target-def.h"
>   
>   #define WORKAROUND_PTXJIT_BUG 1
> +#define WORKAROUND_PTXJIT_BUG_3 1
>   
>   /* Define dimension sizes for known hardware.  */
>   #define PTX_VECTOR_LENGTH 32
> @@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
>        stream, in order to share the prototype writing code.  */
>     std::stringstream s;
>     write_fn_proto (s, true, name, decl);
> +
> +#if WORKAROUND_PTXJIT_BUG_3
> +  /* Emitting a .maxntid seems to have the effect of encouraging the
> +     PTX JIT to emit SYNC branches.  */
> +  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
> +      && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
> +      s << ".maxntid " << cfun->machine->axis_dim[0] << ", "
> +	<< cfun->machine->axis_dim[1] << ", 1\n";
> +#endif
> +
>     s << "{\n";
>   
>     bool return_in_mem = write_return_type (s, false, result_type);
> @@ -2831,6 +2842,11 @@ struct offload_attrs
>     int max_workers;
>   };
>   
> +/* Define entries for cfun->machine->axis_dim.  */
> +
> +#define MACH_VECTOR_LENGTH 0
> +#define MACH_MAX_WORKERS 1
> +
>   struct parallel
>   {
>     /* Parent parallel.  */
> @@ -4525,6 +4541,9 @@ nvptx_reorg (void)
>   
>         populate_offload_attrs (&oa);
>   
> +      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
> +      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
> +
>         /* If there is worker neutering, there must be vector
>   	 neutering.  Otherwise the hardware will fail.  */
>         gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
> diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
> index 8a14507c88a..958516da604 100644
> --- a/gcc/config/nvptx/nvptx.h
> +++ b/gcc/config/nvptx/nvptx.h
> @@ -226,6 +226,8 @@ struct GTY(()) machine_function
>     int return_mode; /* Return mode of current fn.
>   		      (machine_mode not defined yet.) */
>     rtx axis_predicate[2]; /* Neutering predicates.  */
> +  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
> +		      vector_length, dim[1] is num_workers.   */
>     rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
>     rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
>     rtx unisimt_location; /* Mask location for -muniform-simt.  */
> 


[-- Attachment #2: 0002-nvptx-Add-axis_dim.patch --]
[-- Type: text/x-patch, Size: 2210 bytes --]

[nvptx] Add axis_dim

2018-03-23  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define.
	(nvptx_mach_max_workers, nvptx_mach_vector_length): New functions.
	(nvptx_reorg): Set function-specific axis_dim's.
	* config/nvptx/nvptx.h (struct machine_function): Add axis_dim.

---
 gcc/config/nvptx/nvptx.c | 20 ++++++++++++++++++++
 gcc/config/nvptx/nvptx.h |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 32f2efb..3cb33ae 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2831,6 +2831,23 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_max_workers ()
+{
+  return cfun->machine->axis_dim[MACH_MAX_WORKERS];
+}
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_vector_length ()
+{
+  return cfun->machine->axis_dim[MACH_VECTOR_LENGTH];
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -4534,6 +4551,9 @@ nvptx_reorg (void)
 
       populate_offload_attrs (&oa);
 
+      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
+
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
       gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 8a14507..784628e 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -226,6 +226,8 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (6 preceding siblings ...)
  2018-03-23 14:18   ` Tom de Vries
@ 2018-03-23 16:30   ` Tom de Vries
  2018-03-30  1:50   ` Tom de Vries
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-23 16:30 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> +  if (cfun->machine->sync_bar)
> +    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
> +	     "// vector synchronization barrier\n",
> +	     REGNO (cfun->machine->sync_bar));

I realize that atm we don't support large vector length when nesting a 
vector loop inside a worker loop, but ... if we did support that, and 
used a vector_length of 64, then with the "Maximum number of threads per 
block" of 1024 we have a possible 16 workers. And when using the maximum 
number of workers, we'll end up using logical barrier 16 (while we only 
have 0..15).

It would be good to have at least an assert detecting this situation.
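
To make the arithmetic concrete, here is a small standalone sketch of
the kind of check I have in mind.  The function names and the two
macros are hypothetical (they are not in nvptx.c); the limits are the
ones quoted above, and the barrier numbering assumes the tid.y + 1
scheme from the hunk.

```c
#include <assert.h>

/* Assumed limits, taken from the numbers quoted in the discussion.  */
#define MAX_THREADS_PER_BLOCK 1024
#define NUM_LOGICAL_BARRIERS 16   /* barriers 0..15 */

/* Largest number of workers that fits in one block for VECTOR_LENGTH.  */
static int
max_workers_for (int vector_length)
{
  assert (vector_length > 0);
  return MAX_THREADS_PER_BLOCK / vector_length;
}

/* Worker W (tid.y == W) synchronizes its vector on barrier W + 1, so the
   highest barrier in use equals NUM_WORKERS.  */
static int
highest_vector_barrier (int num_workers)
{
  assert (num_workers > 0);
  return (num_workers - 1) + 1;
}

/* Nonzero if every per-worker barrier is a valid logical barrier.  */
static int
vector_barriers_fit (int num_workers)
{
  return highest_vector_barrier (num_workers) < NUM_LOGICAL_BARRIERS;
}
```

For vector_length 64 this gives 16 workers and a highest barrier of 16,
which fails the check; 15 workers or fewer pass.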

Thanks,
- Tom

* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
  2018-03-21 15:55   ` Tom de Vries
@ 2018-03-26 14:25   ` Tom de Vries
  2018-03-26 14:37     ` Cesar Philippidis
  2018-03-26 16:52   ` Tom de Vries
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-26 14:25 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
> index ba3f4317f4e..f15ce6b8f8d 100644
> --- a/gcc/omp-offload.c
> +++ b/gcc/omp-offload.c
> @@ -626,7 +626,8 @@ oacc_parse_default_dims (const char *dims)
>      function.  */
>   
>   static void
> -oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
> +oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used,
> +		    int * ARG_UNUSED (default_dims))
>   {
>     tree purpose[GOMP_DIM_MAX];
>     unsigned ix;

> @@ -1604,7 +1616,8 @@ execute_oacc_device_lower ()
>       }
>   
>     int dims[GOMP_DIM_MAX];
> -  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask);
> +  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask,
> +		      NULL);
>   
>     if (dump_file)
>       {

What's the purpose of this unused parameter default_dims, that only ever 
gets to be NULL?

Thanks,
- Tom

* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-26 14:25   ` Tom de Vries
@ 2018-03-26 14:37     ` Cesar Philippidis
  0 siblings, 0 replies; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-26 14:37 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches

On 03/26/2018 07:14 AM, Tom de Vries wrote:
> On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
>> diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
>> index ba3f4317f4e..f15ce6b8f8d 100644
>> --- a/gcc/omp-offload.c
>> +++ b/gcc/omp-offload.c
>> @@ -626,7 +626,8 @@ oacc_parse_default_dims (const char *dims)
>>      function.  */
>>     static void
>> -oacc_validate_dims (tree fn, tree attrs, int *dims, int level,
>> unsigned used)
>> +oacc_validate_dims (tree fn, tree attrs, int *dims, int level,
>> unsigned used,
>> +            int * ARG_UNUSED (default_dims))
>>   {
>>     tree purpose[GOMP_DIM_MAX];
>>     unsigned ix;
> 
>> @@ -1604,7 +1616,8 @@ execute_oacc_device_lower ()
>>       }
>>       int dims[GOMP_DIM_MAX];
>> -  oacc_validate_dims (current_function_decl, attrs, dims, fn_level,
>> used_mask);
>> +  oacc_validate_dims (current_function_decl, attrs, dims, fn_level,
>> used_mask,
>> +              NULL);
>>       if (dump_file)
>>       {
> 
> What's the purpose of this unused parameter default_dims, that only ever
> gets to be NULL?

That's stale and can be removed. In an earlier, and more complicated,
version of the patch I was still trying to get large vector lengths to
work with multiple workers.

I'll remove it from my patch.

Thanks,
Cesar

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
  2018-03-21 15:55   ` Tom de Vries
  2018-03-26 14:25   ` Tom de Vries
@ 2018-03-26 16:52   ` Tom de Vries
  2018-03-27 12:16     ` Tom de Vries
  2018-03-26 17:13   ` Tom de Vries
  2018-04-05 16:32   ` Tom de Vries
  4 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-26 16:52 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 165 bytes --]

On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> introduces a new goacc adjust_parallelism target hook.

That's another separate patch.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-openacc-Add-target-hook-TARGET_GOACC_ADJUST_PARALLELISM.patch --]
[-- Type: text/x-patch, Size: 4859 bytes --]

[openacc] Add target hook TARGET_GOACC_ADJUST_PARALLELISM

2018-03-26  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* doc/tm.texi.in: Add placeholder for TARGET_GOACC_ADJUST_PARALLELISM.
	* doc/tm.texi: Regenerate.
	* omp-offload.c (oacc_loop_fixed_partitions): Use the adjust_parallelism
	hook to modify this_mask.
	(oacc_loop_auto_partitions): Use the adjust_parallelism hook to modify
	this_mask and loop->mask.
	(default_goacc_adjust_parallelism): New function.
	* target.def (adjust_parallelism): New hook.
	* targhooks.h (default_goacc_adjust_parallelism): Declare.

---
 gcc/doc/tm.texi       |  6 ++++++
 gcc/doc/tm.texi.in    |  2 ++
 gcc/omp-offload.c     | 19 +++++++++++++++++++
 gcc/target.def        |  8 ++++++++
 gcc/targhooks.h       |  1 +
 6 files changed, 49 insertions(+)

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 0fcb9c6..271eb4d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5883,6 +5883,12 @@ This hook should return the maximum size of a particular dimension,
 or zero if unbounded.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_GOACC_ADJUST_PARALLELISM (unsigned @var{this_mask}, unsigned @var{outer_mask})
+This hook allows the accelerator compiler to remove any unused
+parallelism exposed in the current loop @var{THIS_MASK}, and the
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
 This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN
 function calls to target-specific gimple, or indicate whether they
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4187da1..fc73ad1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4298,6 +4298,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_DIM_LIMIT
 
+@hook TARGET_GOACC_ADJUST_PARALLELISM
+
 @hook TARGET_GOACC_FORK_JOIN
 
 @hook TARGET_GOACC_REDUCTION
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index ba3f431..aa4de24 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -1258,6 +1258,13 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
 	}
     }
 
+  /* FIXME: Ideally, we should be coalescing parallelism here if the
+     hardware supports it.  E.g. Instead of partitioning a loop
+     across worker and vector axes, sometimes the hardware can
+     execute those loops together without resorting to placing
+     extra thread barriers.  */
+  this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
+
   mask_all |= this_mask;
 
   if (loop->flags & OLF_TILE)
@@ -1349,6 +1356,7 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	  this_mask ^= loop->e_mask;
 	}
 
+      this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
       loop->mask |= this_mask;
     }
 
@@ -1396,7 +1404,9 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 			" to parallelize element loop");
 	}
 
+      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
       loop->mask |= this_mask;
+
       if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,
 		    tiling
@@ -1774,6 +1784,15 @@ default_goacc_dim_limit (int ARG_UNUSED (axis))
 #endif
 }
 
+/* Default adjustment of loop parallelism is not required.  */
+
+unsigned
+default_goacc_adjust_parallelism (unsigned this_mask,
+				  unsigned ARG_UNUSED (outer_mask))
+{
+  return this_mask;
+}
+
 namespace {
 
 const pass_data pass_data_oacc_device_lower =
diff --git a/gcc/target.def b/gcc/target.def
index b302d36..c878fee 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1697,6 +1697,14 @@ int, (int axis),
 default_goacc_dim_limit)
 
 DEFHOOK
+(adjust_parallelism,
+"This hook allows the accelerator compiler to remove any unused\n\
+parallelism exposed in the current loop @var{THIS_MASK}, and the\n\
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.",
+unsigned, (unsigned this_mask, unsigned outer_mask),
+default_goacc_adjust_parallelism)
+
+DEFHOOK
 (fork_join,
 "This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN\n\
 function calls to target-specific gimple, or indicate whether they\n\
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 18070df..f4f6864 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -115,6 +115,7 @@ extern bool default_goacc_validate_dims (tree, int [], int);
 extern int default_goacc_dim_limit (int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
 extern void default_goacc_reduction (gcall *);
+extern unsigned default_goacc_adjust_parallelism (unsigned, unsigned);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
                     ` (2 preceding siblings ...)
  2018-03-26 16:52   ` Tom de Vries
@ 2018-03-26 17:13   ` Tom de Vries
  2018-04-05 16:32   ` Tom de Vries
  4 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-26 17:13 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> The attached patch adjusts the existing goacc validate_dims target hook

This is overkill. All we need is a function
"int oacc_get_default_dim (int dim)".

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-26 16:52   ` Tom de Vries
@ 2018-03-27 12:16     ` Tom de Vries
  0 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-27 12:16 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 613 bytes --]

On 03/26/2018 06:33 PM, Tom de Vries wrote:
> +      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
>         loop->mask |= this_mask;

I committed the above, but the original:
...
> @@ -1397,6 +1407,8 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
>  	}
>  
>        loop->mask |= this_mask;
> +      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
> +
>        if (!loop->mask && noisy)
>  	warning_at (loop->loc, 0,
>  		    tiling
...
has the two loop->mask lines in the reverse order.
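
To illustrate why the order of the two lines matters, here is a minimal
sketch. The hook body is hypothetical (it drops the worker bit whenever
the vector bit is set) and is not the nvptx implementation; the bit
names and helper functions are local stand-ins invented for this example.

```c
/* Local stand-ins for the partitioning bits; illustrative only.  */
#define EX_MASK_WORKER (1u << 1)
#define EX_MASK_VECTOR (1u << 2)

/* Hypothetical adjust_parallelism-style hook: drop the worker bit
   whenever the vector bit is present in THIS_MASK.  */
static unsigned
example_adjust (unsigned this_mask, unsigned outer_mask)
{
  (void) outer_mask;
  if (this_mask & EX_MASK_VECTOR)
    this_mask &= ~EX_MASK_WORKER;
  return this_mask;
}

/* Order as first committed: adjust, then merge in THIS_MASK.
   The hook never sees the bits contributed by THIS_MASK.  */
static unsigned
adjust_then_merge (unsigned loop_mask, unsigned this_mask,
		   unsigned outer_mask)
{
  loop_mask = example_adjust (loop_mask, outer_mask);
  loop_mask |= this_mask;
  return loop_mask;
}

/* Order in the original patch: merge first, then adjust, so the hook
   sees the complete mask.  */
static unsigned
merge_then_adjust (unsigned loop_mask, unsigned this_mask,
		   unsigned outer_mask)
{
  loop_mask |= this_mask;
  loop_mask = example_adjust (loop_mask, outer_mask);
  return loop_mask;
}
```

With loop_mask holding only the worker bit and this_mask holding only
the vector bit, adjust_then_merge keeps both bits, while
merge_then_adjust lets the hook remove the worker bit.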

Fixed in attached patch.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-openacc-Fix-adjust_parallism-usage-in-oacc_loop_auto_partitions.patch --]
[-- Type: text/x-patch, Size: 788 bytes --]

[openacc] Fix adjust_parallism usage in oacc_loop_auto_partitions

2018-03-27  Tom de Vries  <tom@codesourcery.com>

	* omp-offload.c (oacc_loop_auto_partitions): Fix adjust_parallism usage.

---
 gcc/omp-offload.c     | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index aa4de24..ed17160 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -1404,8 +1404,8 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 			" to parallelize element loop");
 	}
 
-      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
       loop->mask |= this_mask;
+      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
 
       if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 5: libgomp and tests
  2018-03-02 20:47 ` [og7] vector_length extension part 5: libgomp and tests Cesar Philippidis
  2018-03-16 13:50   ` Thomas Schwinge
@ 2018-03-27 13:00   ` Tom de Vries
  2018-04-05 16:36   ` Tom de Vries
  2 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-03-27 13:00 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Schwinge, Thomas

[-- Attachment #1: Type: text/plain, Size: 187 bytes --]

On 03/02/2018 09:47 PM, Cesar Philippidis wrote:
> two test cases.

Committed as a separate patch, while ignoring the warnings "using 
vector_length \\(32\\), ignoring 128".

Thanks,
- Tom

[-- Attachment #2: 0002-openacc-Add-vector_length-128-testcases.patch --]
[-- Type: text/x-patch, Size: 4910 bytes --]

[openacc] Add vector_length 128 testcases

2018-03-27  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: New test.
	* testsuite/libgomp.oacc-fortran/gemm.f90: New test.

---
 .../libgomp.oacc-c-c++-common/vred2d-128.c         |  57 +++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    | 109 +++++++++++++++++++++
 3 files changed, 172 insertions(+)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
new file mode 100644
index 0000000..1dc5fe0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
@@ -0,0 +1,57 @@
+/* Test large vector lengths.  */
+
+#include <assert.h>
+
+#define n 10000
+int a1[n], a2[n];
+
+#define gentest(name, outer, inner)		\
+  void name ()					\
+  {						\
+  long i, j, t1, t2, t3;			\
+  _Pragma(outer)				\
+  for (i = 0; i < n; i++)			\
+    {						\
+      t1 = 0;					\
+      t2 = 0;					\
+      _Pragma(inner)				\
+      for (j = i; j < n; j++)			\
+	{					\
+	  t1++;					\
+	  t2--;					\
+	}					\
+      a1[i] = t1;				\
+      a2[i] = t2;				\
+    }						\
+  for (i = 0; i < n; i++)			\
+    {						\
+      assert (a1[i] == n-i);			\
+      assert (a2[i] == -(n-i));			\
+    }						\
+  }						\
+
+gentest (test1, "acc parallel loop gang vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test2, "acc parallel loop gang vector_length (128)",
+	 "acc loop worker vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test3, "acc parallel loop gang worker vector_length (128)",
+	 "acc loop vector reduction(+:t1) reduction(-:t2)")
+
+gentest (test4, "acc parallel loop",
+	 "acc loop reduction(+:t1) reduction(-:t2)")
+
+/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
+
+
+int
+main ()
+{
+  test1 ();
+  test2 ();
+  test3 ();
+  test4 ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
new file mode 100644
index 0000000..62b8a45
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
@@ -0,0 +1,109 @@
+! Exercise three levels of parallelism using SGEMM from BLAS.
+
+! { dg-additional-options "-fopenacc-dim=-:-:128" }
+
+! Implicitly set vector_length to 128 using -fopenacc-dim.
+subroutine openacc_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n))
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm
+
+! Explicitly set vector_length to 128 using a vector_length clause.
+subroutine openacc_sgemm_128 (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n)) vector_length (128)
+  ! { dg-prune-output "using vector_length \\(32\\), ignoring 128" }
+  do j = 1, n
+     !$acc loop
+     do i = 1, m
+        temp = 0.0
+        !$acc loop reduction(+:temp)
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine openacc_sgemm_128
+
+subroutine host_sgemm (m, n, k, alpha, a, b, beta, c)
+  integer :: m, n, k
+  real :: alpha, beta
+  real :: a(k,*), b(k,*), c(m,*)
+
+  integer :: i, j, l
+  real :: temp
+
+  do j = 1, n
+     do i = 1, m
+        temp = 0.0
+        do l = 1, k
+           temp = temp + a(l,i)*b(l,j)
+        end do
+        if(beta == 0.0) then
+           c(i,j) = alpha*temp
+        else
+           c(i,j) = alpha*temp + beta*c(i,j)
+        end if
+     end do
+  end do
+end subroutine host_sgemm
+
+program main
+  integer, parameter :: M = 100, N = 50, K = 2000
+  real :: a(K, M), b(K, N), c(M, N), d (M, N), e (M, N)
+  real alpha, beta
+  integer i, j
+
+  a(:,:) = 1.0
+  b(:,:) = 0.25
+
+  c(:,:) = 0.0
+  d(:,:) = 0.0
+  e(:,:) = 0.0
+
+  alpha = 1.05
+  beta = 1.25
+
+  call openacc_sgemm (M, N, K, alpha, a, b, beta, c)
+  call openacc_sgemm_128 (M, N, K, alpha, a, b, beta, d)
+  call host_sgemm (M, N, K, alpha, a, b, beta, e)
+
+  do i = 1, m
+     do j = 1, n
+        if (c(i,j) /= e(i,j)) call abort
+        if (d(i,j) /= e(i,j)) call abort
+     end do
+  end do
+end program main

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (7 preceding siblings ...)
  2018-03-23 16:30   ` Tom de Vries
@ 2018-03-30  1:50   ` Tom de Vries
  2018-03-30 14:48     ` Tom de Vries
  2018-04-03 14:52   ` [nvptx] Use MAX, MIN, ROUND_UP macros Tom de Vries
  2018-04-03 15:00   ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Tom de Vries
  10 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-30  1:50 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> As a follow up patch will show, the nvptx BE falls back to using
> vector_length = 32 when a vector loop is nested inside a worker loop.

I disabled the fallback, and analyzed the vred2d-128.c illegal memory 
access execution failure.

I minimized that down to this ptx:
...
.shared .align 8 .u8 __oacc_bcast[176];

{
   {
     .reg .u32 %x;
     mov.u32 %x,%tid.x;
     setp.ne.u32 %r86,%x,0;
   }

   {
     .reg .u32 %tidy;
     .reg .u64 %t_bcast;
     .reg .u64 %y64;
     mov.u32 %tidy,%tid.y;
     cvt.u64.u32 %y64,%tidy;
     add.u64 %y64,%y64,1;
     cvta.shared.u64 %t_bcast,__oacc_bcast;
     mad.lo.u64 %r66,%y64,88,%t_bcast;
   }

   @ %r86 bra $L28;
   st.u32 [%r66+80],0;
  $L28:
   ret;
}
...

The ptx is called with 2 workers and 128 vector_length.

So, 2 workers mean %tid.y has values 0 and 1.
Then %y64 has values 1 and 2.
Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
Then the st.u32 accesses __oacc_bcast + (1 * 88) + 80 and __oacc_bcast +
(2 * 88) + 80.

So we're accessing memory at location 256, while the __oacc_bcast is 
only 176 bytes big.
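
The arithmetic above can be checked with a small host-side sketch. The
helper names are hypothetical and not part of nvptx.c; the constants 88
(partition size), 80 (field offset) and 4 (size of a .u32 store) are
read off the minimized ptx, and the "+ 1" mirrors "add.u64 %y64,%y64,1".

```c
/* Hypothetical helper: offset, relative to the start of __oacc_bcast,
   one past the last byte written by "st.u32 [%r66+80]" for worker TIDY.  */
static unsigned
bcast_store_end (unsigned tidy, unsigned partition, unsigned field_offset)
{
  unsigned base = (tidy + 1) * partition;	/* mad.lo.u64 %r66,%y64,88,...  */
  return base + field_offset + 4;		/* 4 bytes for a .u32 store.  */
}

/* Hypothetical helper: bytes needed so that every worker's partition,
   plus the extra partition skipped by the "+ 1", fits in the buffer.  */
static unsigned
bcast_required_size (unsigned partition, unsigned num_workers)
{
  return partition * (num_workers + 1);
}
```

For tid.y == 0 the store ends at offset 172, which fits in 176 bytes.
For tid.y == 1 it starts at 256 and ends at 260, which does not. Two
workers would need 88 * 3 = 264 bytes, while 176 is what one worker
needs.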

I formulated this assert that AFAIU detects this situation in the compiler:
...
@@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
    fprintf (file, "\t}\n");
  }

+static int nvptx_mach_max_workers ();
+
  /* Emit code to initialize OpenACC worker broadcast and synchronization
     registers.  */

@@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
                "// vector broadcast offset\n",
                REGNO (cfun->machine->bcast_partition),
                oacc_bcast_partition);
+      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1) <= oacc_bcast_size);
      }
    if (cfun->machine->sync_bar)
      fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
...

The assert is not triggered when the fallback is used.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-30  1:50   ` Tom de Vries
@ 2018-03-30 14:48     ` Tom de Vries
  2018-03-30 15:06       ` Cesar Philippidis
  0 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-30 14:48 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

On 03/30/2018 03:07 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> As a follow up patch will show, the nvptx BE falls back to using
>> vector_length = 32 when a vector loop is nested inside a worker loop.
> 
> I disabled the fallback, and analyzed the vred2d-128.c illegal memory 
> access execution failure.
> 
> I minimized that down to this ptx:
> ...
> .shared .align 8 .u8 __oacc_bcast[176];
> 
> {
>    {
>      .reg .u32 %x;
>      mov.u32 %x,%tid.x;
>      setp.ne.u32 %r86,%x,0;
>    }
> 
>    {
>      .reg .u32 %tidy;
>      .reg .u64 %t_bcast;
>      .reg .u64 %y64;
>      mov.u32 %tidy,%tid.y;
>      cvt.u64.u32 %y64,%tidy;
>      add.u64 %y64,%y64,1;
>      cvta.shared.u64 %t_bcast,__oacc_bcast;
>      mad.lo.u64 %r66,%y64,88,%t_bcast;
>    }
> 
>    @ %r86 bra $L28;
>    st.u32 [%r66+80],0;
>   $L28:
>    ret;
> }
> ...
> 
> The ptx is called with 2 workers and 128 vector_length.
> 
> So, 2 workers mean %tid.y has values 0 and 1.
> Then %y64 has values 1 and 2.
> Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
> Then the st.u32 accesses __oacc_bcast + (1 * 88) + 80 and __oacc_bcast +
> (2 * 88) + 80.
> 
> So we're accessing memory at location 256, while the __oacc_bcast is 
> only 176 bytes big.
> 
> I formulated this assert that AFAIU detects this situation in the compiler:
> ...
> @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, 
> const char *name)
>     fprintf (file, "\t}\n");
>   }
> 
> +static int nvptx_mach_max_workers ();
> +
>   /* Emit code to initialize OpenACC worker broadcast and synchronization
>      registers.  */
> 
> @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
>                 "// vector broadcast offset\n",
>                 REGNO (cfun->machine->bcast_partition),
>                 oacc_bcast_partition);
> +      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 
> 1) <= oacc_bcast_size);
>       }
>     if (cfun->machine->sync_bar)
>       fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
> ...
> 
> The assert is not triggered when the fallback is used.

I've tracked the problem down to:
...
> -      if (oacc_bcast_size < data.offset)
> -       oacc_bcast_size = data.offset;
> +      if (oacc_bcast_partition < data.offset)
> +       {
> +         int psize = data.offset;
> +         psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
> +         oacc_bcast_partition = psize;
> +         oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
> +       }
...

We hit this if clause for a first compiled function, with num_workers(1).

This sets oacc_bcast_partition and oacc_bcast_size as required for that
function.

Then we hit this if clause for a second compiled function, with 
num_workers (2).

We need oacc_bcast_size updated, but the 'oacc_bcast_partition < 
data.offset' is false, so the update doesn't happen.

I managed to fix this by making the code unconditional, and using MAX to 
update oacc_bcast_partition and oacc_bcast_size.
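
A minimal sketch of the difference, assuming the state is modelled as a
small struct rather than the file-scope variables in nvptx.c. The struct,
the macro names and the data.offset value of 84 are assumptions made for
illustration, and the unconditional variant is one possible reading of
the fix described above, not the committed code.

```c
/* Local stand-ins; illustrative only.  EX_ROUND_UP assumes A is a
   power of two.  */
#define EX_MAX(a, b) ((a) > (b) ? (a) : (b))
#define EX_ROUND_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

struct bcast_state
{
  unsigned partition;	/* Models oacc_bcast_partition.  */
  unsigned size;	/* Models oacc_bcast_size.  */
};

/* Conditional update: the size is only recomputed when the partition
   grows, so a later function with more workers is ignored.  */
static void
update_conditional (struct bcast_state *s, unsigned offset,
		    unsigned align, unsigned max_workers)
{
  if (s->partition < offset)
    {
      unsigned psize = EX_ROUND_UP (offset, align);
      s->partition = psize;
      s->size = psize * (max_workers + 1);
    }
}

/* Unconditional update: both values only ever grow, and the size is
   re-evaluated for every function.  */
static void
update_unconditional (struct bcast_state *s, unsigned offset,
		      unsigned align, unsigned max_workers)
{
  unsigned psize = EX_ROUND_UP (offset, align);
  s->partition = EX_MAX (s->partition, psize);
  s->size = EX_MAX (s->size, s->partition * (max_workers + 1));
}
```

Feeding both variants a first function with one worker and a second with
two workers (same offset, alignment 8) leaves the conditional version at
176 bytes, while the unconditional version grows to 264.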

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-30 14:48     ` Tom de Vries
@ 2018-03-30 15:06       ` Cesar Philippidis
  2018-03-30 15:35         ` Tom de Vries
  0 siblings, 1 reply; 50+ messages in thread
From: Cesar Philippidis @ 2018-03-30 15:06 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches

On 03/30/2018 07:45 AM, Tom de Vries wrote:
> On 03/30/2018 03:07 AM, Tom de Vries wrote:
>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>> As a follow up patch will show, the nvptx BE falls back to using
>>> vector_length = 32 when a vector loop is nested inside a worker loop.
>>
>> I disabled the fallback, and analyzed the vred2d-128.c illegal memory
>> access execution failure.
>>
>> I minimized that down to this ptx:
>> ...
>> .shared .align 8 .u8 __oacc_bcast[176];
>>
>> {
>>    {
>>      .reg .u32 %x;
>>      mov.u32 %x,%tid.x;
>>      setp.ne.u32 %r86,%x,0;
>>    }
>>
>>    {
>>      .reg .u32 %tidy;
>>      .reg .u64 %t_bcast;
>>      .reg .u64 %y64;
>>      mov.u32 %tidy,%tid.y;
>>      cvt.u64.u32 %y64,%tidy;
>>      add.u64 %y64,%y64,1;
>>      cvta.shared.u64 %t_bcast,__oacc_bcast;
>>      mad.lo.u64 %r66,%y64,88,%t_bcast;
>>    }
>>
>>    @ %r86 bra $L28;
>>    st.u32 [%r66+80],0;
>>   $L28:
>>    ret;
>> }
>> ...
>>
>> The ptx is called with 2 workers and 128 vector_length.
>>
>> So, 2 workers mean %tid.y has values 0 and 1.
>> Then %y64 has values 1 and 2.
>> Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
>> Then the st.u32 accesses __oacc_bcast + (1 * 88) + 80 and __oacc_bcast
>> + (2 * 88) + 80.
>>
>> So we're accessing memory at location 256, while the __oacc_bcast is
>> only 176 bytes big.
>>
>> I formulated this assert that AFAIU detects this situation in the
>> compiler:
>> ...
>> @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int
>> regno, const char *name)
>>     fprintf (file, "\t}\n");
>>   }
>>
>> +static int nvptx_mach_max_workers ();
>> +
>>   /* Emit code to initialize OpenACC worker broadcast and synchronization
>>      registers.  */
>>
>> @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
>>                 "// vector broadcast offset\n",
>>                 REGNO (cfun->machine->bcast_partition),
>>                 oacc_bcast_partition);
>> +      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () +
>> 1) <= oacc_bcast_size);
>>       }
>>     if (cfun->machine->sync_bar)
>>       fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
>> ...
>>
>> The assert is not triggered when the fallback is used.
> 
> I've tracked the problem down to:
> ...
>> -      if (oacc_bcast_size < data.offset)
>> -       oacc_bcast_size = data.offset;
>> +      if (oacc_bcast_partition < data.offset)
>> +       {
>> +         int psize = data.offset;
>> +         psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
>> +         oacc_bcast_partition = psize;
>> +         oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
>> +       }
> ...
> 
> We hit this if clause for a first compiled function, with num_workers(1).
> 
> This sets oacc_bcast_partition and oacc_bcast_size as required for that
> function.
> 
> Then we hit this if clause for a second compiled function, with
> num_workers (2).
> 
> We need oacc_bcast_size updated, but the 'oacc_bcast_partition <
> data.offset' is false, so the update doesn't happen.
> 
> I managed to fix this by making the code unconditional, and using MAX to
> update oacc_bcast_partition and oacc_bcast_size.

It looks like that's fallout from this patch
<https://gcc.gnu.org/ml/gcc-patches/2018-03/msg01212.html>. I should
have checked that patch with the vector length fallback disabled.

Cesar

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-30 15:06       ` Cesar Philippidis
@ 2018-03-30 15:35         ` Tom de Vries
  2018-04-05 16:33           ` Tom de Vries
  0 siblings, 1 reply; 50+ messages in thread
From: Tom de Vries @ 2018-03-30 15:35 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

On 03/30/2018 05:00 PM, Cesar Philippidis wrote:
> I should
> have checked that patch with the vector length fallback disabled.

Right. The patch series introduces a lot of code that is not exercised.

I've added an -mlong-vector-in-workers option in my local branch and
added 3 test-cases to exercise the code with fallback disabled every time
I run the libgomp tests.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [nvptx] Use MAX, MIN, ROUND_UP macros
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (8 preceding siblings ...)
  2018-03-30  1:50   ` Tom de Vries
@ 2018-04-03 14:52   ` Tom de Vries
  2018-04-03 15:00   ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Tom de Vries
  10 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-03 14:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: Cesar Philippidis

[-- Attachment #1: Type: text/plain, Size: 783 bytes --]

[ was: [og7] vector_length extension part 2: Generalize state 
propagation and synchronization ]

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> -      if (oacc_bcast_size < data.offset)
> -	oacc_bcast_size = data.offset;

The current state of nvptx.c uses this construct a lot, which is harder 
to read than:
...
   oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
...

This patch replaces all such occurrences with MIN or MAX.


> +	  psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);

Likewise, this pattern occurs a lot, which is equivalent to:
...
	  psize = ROUND_UP (psize, oacc_bcast_align);
...

This patch also replaces all such occurrences with ROUND_UP.
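
As a quick sanity check of that equivalence, here is a small standalone
sketch. Both helpers are local stand-ins written for this example; they
are not GCC's macro definitions. The mask form relies on the alignment
being a power of two, which holds for the alignments used here.

```c
/* Open-coded form from the patch: only valid when ALIGN is a power
   of two.  */
static unsigned
round_up_mask (unsigned x, unsigned align)
{
  return (x + align - 1) & ~(align - 1);
}

/* Division-based rounding: valid for any non-zero ALIGN.  */
static unsigned
round_up_div (unsigned x, unsigned align)
{
  return (x + align - 1) / align * align;
}

/* Return 1 if both forms agree for every X below LIMIT and every
   power-of-two alignment up to MAX_ALIGN, 0 otherwise.  */
static int
round_up_forms_agree (unsigned limit, unsigned max_align)
{
  for (unsigned align = 1; align <= max_align; align *= 2)
    for (unsigned x = 0; x < limit; x++)
      if (round_up_mask (x, align) != round_up_div (x, align))
	return 0;
  return 1;
}
```

The two forms diverge for non-power-of-two alignments (for example
x = 13, align = 12), so the mask form should only be read as a rounding
operation when the alignment is known to be a power of two.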


Built on x86_64 with nvptx accelerator and reg-tested libgomp.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Use-MAX-MIN-ROUND_UP-macros.patch --]
[-- Type: text/x-patch, Size: 4181 bytes --]

[nvptx] Use MAX, MIN, ROUND_UP macros

2018-04-03  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_gen_shared_bcast, shared_prop_gen)
	(nvptx_goacc_expand_accel_var): Use MAX and ROUND_UP.
	(nvptx_assemble_value, nvptx_output_skip): Use MIN.
	(nvptx_shared_propagate, nvptx_single, nvptx_expand_shared_addr): Use
	MAX.

---
 gcc/config/nvptx/nvptx.c | 35 +++++++++++++----------------------
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 38f25ad..d4ff730 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1808,9 +1808,8 @@ nvptx_gen_shared_bcast (rtx reg, propagate_mask pm, unsigned rep,
 	  {
 	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
 
-	    if (align > oacc_bcast_align)
-	      oacc_bcast_align = align;
-	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    oacc_bcast_align = MAX (oacc_bcast_align, align);
+	    data->offset = ROUND_UP (data->offset, align);
 	    addr = data->base;
 	    gcc_assert (data->base != NULL);
 	    if (data->offset)
@@ -1932,8 +1931,7 @@ nvptx_assemble_value (unsigned HOST_WIDE_INT val, unsigned size)
     {
       val >>= part * BITS_PER_UNIT;
       part = init_frag.size - init_frag.offset;
-      if (part > size)
-	part = size;
+      part = MIN (part, size);
 
       unsigned HOST_WIDE_INT partial
 	= val << (init_frag.offset * BITS_PER_UNIT);
@@ -1996,8 +1994,7 @@ nvptx_output_skip (FILE *, unsigned HOST_WIDE_INT size)
   if (init_frag.offset)
     {
       unsigned part = init_frag.size - init_frag.offset;
-      if (part > size)
-	part = (unsigned) size;
+      part = MIN (part, (unsigned)size);
       size -= part;
       nvptx_assemble_value (0, part);
     }
@@ -3912,9 +3909,8 @@ shared_prop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_,
       /* Starting a loop, initialize pointer.    */
       unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
 
-      if (align > oacc_bcast_align)
-	oacc_bcast_align = align;
-      data->offset = (data->offset + align - 1) & ~(align - 1);
+      oacc_bcast_align = MAX (oacc_bcast_align, align);
+      data->offset = ROUND_UP (data->offset, align);
 
       data->ptr = gen_reg_rtx (Pmode);
 
@@ -3955,8 +3951,7 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
       rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
       emit_insn_after (init, insn);
 
-      if (oacc_bcast_size < data.offset)
-	oacc_bcast_size = data.offset;
+      oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
     }
   return empty;
 }
@@ -4224,8 +4219,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
-	    oacc_bcast_size = GET_MODE_SIZE (SImode);
+	  oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode));
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
@@ -4833,13 +4827,11 @@ nvptx_expand_shared_addr (tree exp, rtx target,
     return target;
 
   unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
-  if (align > worker_red_align)
-    worker_red_align = align;
+  worker_red_align = MAX (worker_red_align, align);
 
   unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
   unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
-  if (size + offset > worker_red_size)
-    worker_red_size = size + offset;
+  worker_red_size = MAX (worker_red_size, size + offset);
 
   rtx addr = worker_red_sym;
   if (offset)
@@ -5832,10 +5824,9 @@ nvptx_goacc_expand_accel_var (tree var)
       else
 	{
 	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
-	  gangprivate_shared_size =
-	    (gangprivate_shared_size + align - 1) & ~(align - 1);
-	  if (gangprivate_shared_align < align)
-	    gangprivate_shared_align = align;
+	  gangprivate_shared_size
+	    = ROUND_UP (gangprivate_shared_size, align);
+	  gangprivate_shared_align = MAX (gangprivate_shared_align, align);
 
 	  offset = gangprivate_shared_size;
 	  bool existed = gangprivate_shared_hmap.put (var, offset);


* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
                     ` (9 preceding siblings ...)
  2018-04-03 14:52   ` [nvptx] Use MAX, MIN, ROUND_UP macros Tom de Vries
@ 2018-04-03 15:00   ` Tom de Vries
  2018-04-05 14:06     ` Tom de Vries
  2018-04-05 14:14     ` Tom de Vries
  10 siblings, 2 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-03 15:00 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 639 bytes --]

On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> 	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.

One last thing: this variable needs to be reset to zero for every function.

Without this reset, we can generate different code for a function 
depending on whether another function precedes it.
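
The effect can be reproduced outside the compiler with a minimal sketch. Everything below is a hypothetical stand-in (partition_max, begin_function, record_bcast and second_function_value are illustration names, not nvptx.c code); it only assumes that a file-scope maximum is accumulated while functions are compiled one after another.

```c
#include <assert.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical stand-in for a file-scope maximum such as
   oacc_bcast_partition; this is not the GCC variable itself.  */
static unsigned partition_max;

/* Start compiling a new function.  RESET models whether the
   per-function reset is in place.  */
static void
begin_function (int reset)
{
  if (reset)
    partition_max = 0;
}

/* Record one broadcast of SIZE bytes and return the partition size
   that would be used for the current function.  */
static unsigned
record_bcast (unsigned size)
{
  partition_max = MAX (partition_max, size);
  return partition_max;
}

/* Compile a function needing BIG bytes, then one needing SMALL bytes,
   and return the partition size seen by the second function.  */
static unsigned
second_function_value (unsigned big, unsigned small, int reset)
{
  begin_function (reset);
  record_bcast (big);
  begin_function (reset);
  return record_bcast (small);
}
```

With reset == 0 the second function inherits the larger value left behind by the first one; with reset == 1 it only sees its own requirement.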


> 	(populate_offload_attrs): Handle the situation where the default
> 	runtime geometry has not been initialized yet for reductions.

I've moved this bit to "vector_length extension part 4: target hooks and 
automatic parallelism".


Built on x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0002-nvptx-Generalize-state-propagation-and-synchronization.patch --]
[-- Type: text/x-patch, Size: 10691 bytes --]

[nvptx] Generalize state propagation and synchronization

2018-04-03  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
	(nvptx_option_override): Init oacc_bcast_partition.
	(nvptx_init_oacc_workers): New function.
	(nvptx_declare_function_name): Call nvptx_init_oacc_workers.
	(nvptx_needs_shared_bcast): New function.
	(nvptx_find_par): Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_shared_propagate): Initialize vector bcast partition and
	synchronization state.
	(nvptx_single):  Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_process_pars): Likewise.
	* config/nvptx/nvptx.h (struct machine_function): Add
	bcast_partition and sync_bar members.

---
 gcc/config/nvptx/nvptx.c | 137 ++++++++++++++++++++++++++++++++++++++++++-----
 gcc/config/nvptx/nvptx.h |   4 ++
 2 files changed, 129 insertions(+), 12 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index d4ff730..0b46e13 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -133,6 +133,7 @@ static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
    memory.  It'd be nice if PTX supported common blocks, because then
    this could be shared across TUs (taking the largest size).  */
 static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_partition;
 static unsigned oacc_bcast_align;
 static GTY(()) rtx oacc_bcast_sym;
 
@@ -157,6 +158,8 @@ static bool need_softstack_decl;
 /* True if any function references __nvptx_uni.  */
 static bool need_unisimt_decl;
 
+static int nvptx_mach_max_workers ();
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -210,6 +213,7 @@ nvptx_option_override (void)
   oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast");
   SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED);
   oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  oacc_bcast_partition = 0;
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
@@ -1097,6 +1101,40 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize OpenACC worker broadcast and synchronization
+   registers.  */
+
+static void
+nvptx_init_oacc_workers (FILE *file)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%tidy;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n");
+      fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; "
+	       "// vector broadcast offset\n",
+	       REGNO (cfun->machine->bcast_partition),
+	       oacc_bcast_partition);
+    }
+  /* Verify oacc_bcast_size.  */
+  gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1)
+	      <= oacc_bcast_size);
+  if (cfun->machine->sync_bar)
+    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
+	     "// vector synchronization barrier\n",
+	     REGNO (cfun->machine->sync_bar));
+  fprintf (file, "\t}\n");
+}
+
 /* Emit code to initialize predicate and master lane index registers for
    -muniform-simt code generation variant.  */
 
@@ -1323,6 +1361,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->unisimt_predicate
       || (cfun->machine->has_simtreg && !crtl->is_leaf))
     nvptx_init_unisimt_predicate (file);
+  if (cfun->machine->bcast_partition || cfun->machine->sync_bar)
+    nvptx_init_oacc_workers (file);
 }
 
 /* Output code for switching uniform-simt state.  ENTERING indicates whether
@@ -3000,6 +3040,19 @@ nvptx_split_blocks (bb_insn_map_t *map)
     }
 }
 
+/* Return true if MASK contains parallelism that requires shared
+   memory to broadcast.  */
+
+static bool
+nvptx_needs_shared_bcast (unsigned mask)
+{
+  bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER);
+  bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    && nvptx_mach_vector_length () != PTX_WARP_SIZE;
+
+  return worker || large_vector;
+}
+
 /* BLOCK is a basic block containing a head or tail instruction.
    Locate the associated prehead or pretail instruction, which must be
    in the single predecessor block.  */
@@ -3075,7 +3128,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    par = new parallel (par, mask);
 	    par->forked_block = block;
 	    par->forked_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->fork_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
 	  }
@@ -3090,7 +3143,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    gcc_assert (par->mask == mask);
 	    par->join_block = block;
 	    par->join_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->joining_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
 	    par = par->parent;
@@ -3947,11 +4000,33 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
+      rtx bcast_sym = oacc_bcast_sym;
+
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
+      if (vector && nvptx_mach_max_workers () > 1)
+	{
+	  if (!cfun->machine->bcast_partition)
+	    {
+	      /* It would be nice to place this register in
+		 DATA_AREA_SHARED.  */
+	      cfun->machine->bcast_partition = gen_reg_rtx (DImode);
+	    }
+	  if (!cfun->machine->sync_bar)
+	    cfun->machine->sync_bar = gen_reg_rtx (SImode);
+
+	  bcast_sym = cfun->machine->bcast_partition;
+	}
+
+      rtx init = gen_rtx_SET (data.base, bcast_sym);
       emit_insn_after (init, insn);
 
-      oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
+      unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align);
+      unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			   ? nvptx_mach_max_workers () + 1
+			   : 1);
+
+      oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+      oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
     }
   return empty;
 }
@@ -4146,7 +4221,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
     {
       rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
 
-      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask
+	  && nvptx_mach_vector_length () == PTX_WARP_SIZE)
 	{
 	  /* Vector mode only, do a shuffle.  */
 #if WORKAROUND_PTXJIT_BUG
@@ -4213,23 +4289,51 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  unsigned size = GET_MODE_SIZE (SImode);
+	  bool vector = true;
 	  rtx barrier = GEN_INT (0);
 	  int threads = 0;
 
+	  if (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask)
+	    vector = false;
+
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode));
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->bcast_partition)
+	    data.base = cfun->machine->bcast_partition;
+
+	  gcc_assert (data.base != NULL);
+
+	  unsigned int psize = ROUND_UP (size, oacc_bcast_align);
+	  unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			       ? nvptx_mach_max_workers () + 1
+			       : 1);
+
+	  oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+	  oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
-						    false),
+						    vector),
 			    before);
+
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->sync_bar)
+	    {
+	      barrier = cfun->machine->sync_bar;
+	      threads = nvptx_mach_vector_length ();
+	    }
+
 	  /* Barrier so other workers can see the write.  */
 	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
-						    false), tail);
+						    vector),
+			    tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
@@ -4342,17 +4446,26 @@ nvptx_process_pars (parallel *par)
     }
 
   bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0;
-  
-  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+  bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER));
+  bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+		      && nvptx_mach_vector_length () > PTX_WARP_SIZE);
+
+  if (worker || large_vector)
     {
       nvptx_shared_propagate (false, is_call, par->forked_block,
-			      par->forked_insn, false);
+			      par->forked_insn, !worker);
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
-					   false);
+					   !worker);
       rtx barrier = GEN_INT (0);
       int threads = 0;
 
+      if (!worker && cfun->machine->sync_bar)
+	{
+	  barrier = cfun->machine->sync_bar;
+	  threads = nvptx_mach_vector_length ();
+	}
+
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 784628e..fb9f04b 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -228,6 +228,10 @@ struct GTY(()) machine_function
   rtx axis_predicate[2]; /* Neutering predicates.  */
   int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
 		      vector_length, dim[1] is num_workers.  */
+  rtx bcast_partition; /* Register containing the size of each
+			  vector's partition of shared memory used to
+			  broadcast state.  */
+  rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */


* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-04-03 15:00   ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Tom de Vries
@ 2018-04-05 14:06     ` Tom de Vries
  2018-04-05 14:14     ` Tom de Vries
  1 sibling, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-05 14:06 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 647 bytes --]

On 04/03/2018 05:00 PM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>     * config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
> 
> One last thing: this variable needs to be reset to zero for every function.
> 
> Without this reset, we can generate different code for a function 
> depending on whether another function precedes it.

In the previous commit, I set that variable in nvptx_option_override, 
but as I've found out, that's not enough.

This patch does the init in nvptx_set_current_function.

Built on x86_64 with nvptx accelerator and reg-tested libgomp.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Add-per-function-initialization-of-oacc_broadcast_partition.patch --]
[-- Type: text/x-patch, Size: 640 bytes --]

[nvptx] Add per-function initialization of oacc_bcast_partition

2018-04-05  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_set_current_function): Initialize
	oacc_bcast_partition.

---
 gcc/config/nvptx/nvptx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 0b46e13..009ca59 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5962,6 +5962,7 @@ nvptx_set_current_function (tree fndecl)
 
   gangprivate_shared_hmap.empty ();
   nvptx_previous_fndecl = fndecl;
+  oacc_bcast_partition = 0;
 }
 
 #undef TARGET_OPTION_OVERRIDE


* Re: [og7] vector_length extension part 3: reductions
  2018-03-02 17:51 ` [og7] vector_length extension part 3: reductions Cesar Philippidis
@ 2018-04-05 14:07   ` Tom de Vries
  2018-04-05 16:26   ` Tom de Vries
  1 sibling, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-05 14:07 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 229 bytes --]

On 03/02/2018 06:51 PM, Cesar Philippidis wrote:
> This patch teaches the nvptx BE how to process vector reductions with
> large vector lengths.

Committed test-case exercising large vector length with reductions.
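
The expected result checked by the test-case (478) can be cross-checked with a serial, host-only sketch that uses no OpenACC at all. The function name expected_sum is made up for this illustration; the array initialization (i % 3 and i % 5), the initial value 1 and N are taken from the test-case.

```c
#include <assert.h>

#define N 1024

/* Serial host-side version of the reduction in the test-case:
   a[i] = i % 3, b[i] = i % 5, and the accumulator starts at 1.  */
static unsigned int
expected_sum (void)
{
  unsigned int res = 1;

  for (unsigned int i = 0; i < N; i++)
    res += ((i % 3) + (i % 5)) % 2;

  return res;
}
```

The pattern (i % 3 + i % 5) % 2 repeats every 15 iterations with 7 odd sums per period, so 68 full periods plus the 4 remaining iterations contribute 477, and the initial value brings the total to 478.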

Thanks,
- Tom

[-- Attachment #2: 0002-openacc-Add-vector-length-128-10.c.patch --]
[-- Type: text/x-patch, Size: 1338 bytes --]

[openacc] Add vector-length-128-10.c

2018-04-05  Tom de Vries  <tom@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c: New test.

---
 .../vector-length-128-10.c                         | 40 ++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
new file mode 100644
index 0000000..e46b5cf
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+  unsigned int res = 1;
+  unsigned long long res2 = 1;
+#pragma acc parallel vector_length (128) copyin (a,b) reduction (+:res, res2) copy (res, res2)
+  {
+#pragma acc loop vector reduction (+:res, res2)
+    for (unsigned int i = 0; i < n; i++)
+      {
+	res += ((a[i] + b[i]) % 2);
+	res2 += ((a[i] + b[i]) % 2);
+      }
+  }
+
+  if (res != 478)
+    abort ();
+  if (res2 != 478)
+    abort ();
+
+  return 0;
+}
+/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */


* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-04-03 15:00   ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Tom de Vries
  2018-04-05 14:06     ` Tom de Vries
@ 2018-04-05 14:14     ` Tom de Vries
  1 sibling, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-05 14:14 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

On 04/03/2018 05:00 PM, Tom de Vries wrote:
> +      unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align);
> +      unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
> +			   ? nvptx_mach_max_workers () + 1
> +			   : 1);

This claims too much space for a simple long vector loop. Filed as 
PR85231 - "[og7, openacc, nvptx] Too much shared memory claimed for long 
vector length".
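
To see how the quoted computation scales, here is a small sketch of the same arithmetic. It is an illustration only: bcast_bytes and its parameters are hypothetical names, WARP_SIZE is assumed to match PTX_WARP_SIZE (32), and ROUND_UP is a local stand-in rather than the macro from the GCC headers.

```c
#include <assert.h>

#define WARP_SIZE 32	/* assumed value of PTX_WARP_SIZE */

/* Local stand-in: round X up to the next multiple of F.  */
#define ROUND_UP(x, f) ((((x) + (f) - 1) / (f)) * (f))

/* Bytes of shared memory claimed for broadcasting OFFSET bytes of
   state, following the psize * pnum computation quoted above.  */
static unsigned
bcast_bytes (unsigned offset, unsigned align,
	     unsigned vector_length, unsigned max_workers)
{
  unsigned psize = ROUND_UP (offset, align);
  unsigned pnum = vector_length > WARP_SIZE ? max_workers + 1 : 1;

  return psize * pnum;
}
```

For 12 bytes of state at 8-byte alignment, a warp-sized vector claims 16 bytes, whereas vector length 128 claims 16 * (max_workers + 1) bytes, i.e. 144 bytes when max_workers is 8.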

Thanks,
- Tom


* Re: [og7] vector_length extension part 3: reductions
  2018-03-02 17:51 ` [og7] vector_length extension part 3: reductions Cesar Philippidis
  2018-04-05 14:07   ` Tom de Vries
@ 2018-04-05 16:26   ` Tom de Vries
  1 sibling, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-05 16:26 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 524 bytes --]

On 03/02/2018 06:51 PM, Cesar Philippidis wrote:
> This patch teaches the nvptx BE how to process vector reductions with
> large vector lengths.

As with the "[nvptx] Generalize state propagation and synchronization" 
patch:
- added use of MAX and ROUND_UP
- added missing initialization of vector_red_partition
- added assert checking vector_red_partition and vector_red_size

Also:
- added FIXME for hack in nvptx_declare_function_name
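
As a sanity check on the ROUND_UP change, the sketch below compares the open-coded mask form that the macro replaces with a division-based round-up. The ROUND_UP defined here is a local stand-in written for this example, not copied from the GCC headers, and the helper names are hypothetical. The mask form is only valid for power-of-two alignments, which is what the alignment values in these patches are.

```c
#include <assert.h>

/* Local stand-in: round X up to the next multiple of F.  */
#define ROUND_UP(x, f) ((((x) + (f) - 1) / (f)) * (f))

/* Open-coded form; only valid when ALIGN is a power of two.  */
static unsigned
round_up_mask (unsigned x, unsigned align)
{
  return (x + align - 1) & ~(align - 1);
}

/* Macro-based form.  */
static unsigned
round_up_macro (unsigned x, unsigned align)
{
  return ROUND_UP (x, align);
}

/* Return 1 if both forms agree for every x below LIMIT and every
   power-of-two alignment from 1 to 64, and 0 otherwise.  */
static int
forms_agree (unsigned limit)
{
  for (unsigned align = 1; align <= 64; align *= 2)
    for (unsigned x = 0; x < limit; x++)
      if (round_up_mask (x, align) != round_up_macro (x, align))
	return 0;

  return 1;
}
```

The MAX replacement is analogous: "if (a > b) b = a;" becomes "b = MAX (b, a);".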


Built on x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Handle-large-vector-reductions.patch --]
[-- Type: text/x-patch, Size: 16179 bytes --]

[nvptx] Handle large vector reductions

2018-04-05  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx-protos.h (nvptx_output_red_partition): Declare.
	* config/nvptx/nvptx.c (vector_red_size, vector_red_align,
	vector_red_partition, vector_red_sym): New global variables.
	(nvptx_option_override): Initialize vector_red_sym.
	(nvptx_declare_function_name): Restore red_partition register.
	(nvptx_file_end): Emit code to declare the vector reduction variables.
	(nvptx_output_red_partition): New function.
	(nvptx_expand_shared_addr): Add vector argument. Use it to handle
	large vector reductions.
	(enum nvptx_builtins): Add NVPTX_BUILTIN_VECTOR_ADDR.
	(nvptx_init_builtins): Add VECTOR_ADDR.
	(nvptx_expand_builtin): Update call to nvptx_expand_shared_addr.
	Handle NVPTX_BUILTIN_VECTOR_ADDR.
	(nvptx_get_shared_red_addr): Add vector argument and handle large
	vectors.
	(nvptx_goacc_reduction_setup): Add offload_attrs argument and handle
	large vectors.
	(nvptx_goacc_reduction_init): Likewise.
	(nvptx_goacc_reduction_fini): Likewise.
	(nvptx_goacc_reduction_teardown): Likewise.
	(nvptx_goacc_reduction): Update calls to nvptx_goacc_reduction_{setup,
	init,fini,teardown}.
	(nvptx_init_axis_predicate): Initialize vector_red_partition.
	(nvptx_set_current_function): Init vector_red_partition.
	* config/nvptx/nvptx.md (UNSPECV_RED_PART): New unspecv.
	(nvptx_red_partition): New insn.
	* config/nvptx/nvptx.h (struct machine_function): Add red_partition.

---
 gcc/config/nvptx/nvptx-protos.h |   1 +
 gcc/config/nvptx/nvptx.c        | 154 ++++++++++++++++++++++++++++++++--------
 gcc/config/nvptx/nvptx.h        |   2 +
 gcc/config/nvptx/nvptx.md       |  12 ++++
 4 files changed, 140 insertions(+), 29 deletions(-)

diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index 16b316f..326c38c 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -55,5 +55,6 @@ extern const char *nvptx_output_return (void);
 extern const char *nvptx_output_set_softstack (unsigned);
 extern const char *nvptx_output_simt_enter (rtx, rtx, rtx);
 extern const char *nvptx_output_simt_exit (rtx);
+extern const char *nvptx_output_red_partition (rtx, rtx);
 #endif
 #endif
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 009ca59..51bd69d 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -143,6 +143,14 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Buffer needed for vector reductions, when vector_length >
+   PTX_WARP_SIZE.  This has to be distinct from the worker broadcast
+   array, as both may be live concurrently.  */
+static unsigned vector_red_size;
+static unsigned vector_red_align;
+static unsigned vector_red_partition;
+static GTY(()) rtx vector_red_sym;
+
 /* Shared memory block for gang-private variables.  */
 static unsigned gangprivate_shared_size;
 static unsigned gangprivate_shared_align;
@@ -219,6 +227,11 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  vector_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__vector_red");
+  SET_SYMBOL_DATA_AREA (vector_red_sym, DATA_AREA_SHARED);
+  vector_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  vector_red_partition = 0;
+
   gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
   SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
   gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
@@ -1096,8 +1109,25 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
 {
   fprintf (file, "\t{\n");
   fprintf (file, "\t\t.reg.u32\t%%%s;\n", name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_red;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
   fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name);
   fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tid.y;\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_red, __vector_red;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_red; "
+	       "// vector reduction buffer\n",
+	       REGNO (cfun->machine->red_partition),
+	       vector_red_partition);
+    }
+  /* Verify vector_red_size.  */
+  gcc_assert (vector_red_partition * nvptx_mach_max_workers ()
+	      <= vector_red_size);
   fprintf (file, "\t}\n");
 }
 
@@ -1334,6 +1364,13 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 	fprintf (file, "\t.local.align 8 .b8 %%simtstack_ar["
 		HOST_WIDE_INT_PRINT_DEC "];\n", simtsz);
     }
+
+  /* Restore the vector reduction partition register, if necessary.
+     FIXME: Find out when and why this is necessary, and fix it.  */
+  if (cfun->machine->red_partition)
+    regno_reg_rtx[REGNO (cfun->machine->red_partition)]
+      = cfun->machine->red_partition;
+
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < maxregs; i++)
@@ -4881,6 +4918,10 @@ nvptx_file_end (void)
     write_shared_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (vector_red_size)
+    write_shared_buffer (asm_out_file, vector_red_sym,
+			 vector_red_align, vector_red_size);
+
   if (gangprivate_shared_size)
     write_shared_buffer (asm_out_file, gangprivate_shared_sym,
 			 gangprivate_shared_align, gangprivate_shared_size);
@@ -4930,31 +4971,68 @@ nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
   return target;
 }
 
-/* Worker reduction address expander.  */
+const char *
+nvptx_output_red_partition (rtx dst, rtx offset)
+{
+  const char *zero_offset = "\t\tmov.u64\t%%r%d, %%r%d; // vred buffer\n";
+  const char *with_offset = "\t\tadd.u64\t%%r%d, %%r%d, %d; // vred buffer\n";
+
+  if (offset == const0_rtx)
+    fprintf (asm_out_file, zero_offset, REGNO (dst),
+	     REGNO (cfun->machine->red_partition));
+  else
+    fprintf (asm_out_file, with_offset, REGNO (dst),
+	     REGNO (cfun->machine->red_partition), UINTVAL (offset));
+
+  return "";
+}
+
+/* Shared-memory reduction address expander.  */
 
 static rtx
 nvptx_expand_shared_addr (tree exp, rtx target,
-			  machine_mode ARG_UNUSED (mode), int ignore)
+			  machine_mode ARG_UNUSED (mode), int ignore,
+			  int vector)
 {
   if (ignore)
     return target;
 
   unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
-  worker_red_align = MAX (worker_red_align, align);
-
   unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
   unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
-  worker_red_size = MAX (worker_red_size, size + offset);
-
   rtx addr = worker_red_sym;
-  if (offset)
+
+  if (vector)
     {
-      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
-      addr = gen_rtx_CONST (Pmode, addr);
+      offload_attrs oa;
+
+      populate_offload_attrs (&oa);
+
+      unsigned int psize = ROUND_UP (size + offset, align);
+      unsigned int pnum = oa.max_workers;
+      vector_red_partition = MAX (vector_red_partition, psize);
+      vector_red_size = MAX (vector_red_size, psize * pnum);
+      vector_red_align = MAX (vector_red_align, align);
+
+      if (cfun->machine->red_partition == NULL)
+	cfun->machine->red_partition = gen_reg_rtx (Pmode);
+
+      addr = gen_reg_rtx (Pmode);
+      emit_insn (gen_nvptx_red_partition (addr, GEN_INT (offset)));
     }
+  else
+    {
+      worker_red_align = MAX (worker_red_align, align);
+      worker_red_size = MAX (worker_red_size, size + offset);
 
-  emit_move_insn (target, addr);
+      if (offset)
+	{
+	  addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
+	  addr = gen_rtx_CONST (Pmode, addr);
+	}
+   }
 
+  emit_move_insn (target, addr);
   return target;
 }
 
@@ -5021,6 +5099,7 @@ enum nvptx_builtins
   NVPTX_BUILTIN_SHUFFLE,
   NVPTX_BUILTIN_SHUFFLELL,
   NVPTX_BUILTIN_WORKER_ADDR,
+  NVPTX_BUILTIN_VECTOR_ADDR,
   NVPTX_BUILTIN_CMP_SWAP,
   NVPTX_BUILTIN_CMP_SWAPLL,
   NVPTX_BUILTIN_COND_UNI,
@@ -5059,6 +5138,8 @@ nvptx_init_builtins (void)
   DEF (SHUFFLELL, "shufflell", (LLUINT, LLUINT, UINT, UINT, NULL_TREE));
   DEF (WORKER_ADDR, "worker_addr",
        (PTRVOID, ST, UINT, UINT, NULL_TREE));
+  DEF (VECTOR_ADDR, "vector_addr",
+       (PTRVOID, ST, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAP, "cmp_swap", (UINT, PTRVOID, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAPLL, "cmp_swapll", (LLUINT, PTRVOID, LLUINT, LLUINT, NULL_TREE));
   DEF (COND_UNI, "cond_uni", (integer_type_node, integer_type_node, NULL_TREE));
@@ -5088,7 +5169,10 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       return nvptx_expand_shuffle (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_WORKER_ADDR:
-      return nvptx_expand_shared_addr (exp, target, mode, ignore);
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, false);
+
+    case NVPTX_BUILTIN_VECTOR_ADDR:
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, true);
 
     case NVPTX_BUILTIN_CMP_SWAP:
     case NVPTX_BUILTIN_CMP_SWAPLL:
@@ -5220,10 +5304,13 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
    data at that location.  */
 
 static tree
-nvptx_get_shared_red_addr (tree type, tree offset)
+nvptx_get_shared_red_addr (tree type, tree offset, bool vector)
 {
+  enum nvptx_builtins addr_dim = NVPTX_BUILTIN_WORKER_ADDR;
+  if (vector)
+    addr_dim = NVPTX_BUILTIN_VECTOR_ADDR;
   machine_mode mode = TYPE_MODE (type);
-  tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
+  tree fndecl = nvptx_builtin_decl (addr_dim, true);
   tree size = build_int_cst (unsigned_type_node, GET_MODE_SIZE (mode));
   tree align = build_int_cst (unsigned_type_node,
 			      GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT);
@@ -5654,7 +5741,7 @@ nvptx_adjust_reduction_type (tree var, tree type, gimple_seq *seq)
 /* NVPTX implementation of GOACC_REDUCTION_SETUP.  */
 
 static void
-nvptx_goacc_reduction_setup (gcall *call)
+nvptx_goacc_reduction_setup (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5677,11 +5764,13 @@ nvptx_goacc_reduction_setup (gcall *call)
 	}
     }
   
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Store incoming value to worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+					     level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5700,7 +5789,7 @@ nvptx_goacc_reduction_setup (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_INIT. */
 
 static void
-nvptx_goacc_reduction_init (gcall *call)
+nvptx_goacc_reduction_init (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5714,7 +5803,7 @@ nvptx_goacc_reduction_init (gcall *call)
   
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     {
       /* Initialize vector-non-zeroes to INIT_VAL (OP).  */
       tree tid = make_ssa_name (integer_type_node);
@@ -5786,7 +5875,7 @@ nvptx_goacc_reduction_init (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_FINI.  */
 
 static void
-nvptx_goacc_reduction_fini (gcall *call)
+nvptx_goacc_reduction_fini (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5800,17 +5889,18 @@ nvptx_goacc_reduction_fini (gcall *call)
 
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     r = nvptx_vector_reduction (gimple_location (call), &gsi, var, op);
   else
     {
       tree accum = NULL_TREE;
 
-      if (level == GOMP_DIM_WORKER)
+      if (level == GOMP_DIM_WORKER || level == GOMP_DIM_VECTOR)
 	{
 	  /* Get reduction buffer address.  */
 	  tree offset = gimple_call_arg (call, 5);
-	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+						 level == GOMP_DIM_VECTOR);
 	  tree ptr = make_ssa_name (TREE_TYPE (call));
 
 	  gimplify_assign (ptr, call, &seq);
@@ -5845,7 +5935,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_TEARDOWN.  */
 
 static void
-nvptx_goacc_reduction_teardown (gcall *call)
+nvptx_goacc_reduction_teardown (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5854,11 +5944,13 @@ nvptx_goacc_reduction_teardown (gcall *call)
   gimple_seq seq = NULL;
   
   push_gimplify_context (true);
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Read the worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+					     level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5893,23 +5985,26 @@ static void
 nvptx_goacc_reduction (gcall *call)
 {
   unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
 
   switch (code)
     {
     case IFN_GOACC_REDUCTION_SETUP:
-      nvptx_goacc_reduction_setup (call);
+      nvptx_goacc_reduction_setup (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_INIT:
-      nvptx_goacc_reduction_init (call);
+      nvptx_goacc_reduction_init (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_FINI:
-      nvptx_goacc_reduction_fini (call);
+      nvptx_goacc_reduction_fini (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_TEARDOWN:
-      nvptx_goacc_reduction_teardown (call);
+      nvptx_goacc_reduction_teardown (call, &oa);
       break;
 
     default:
@@ -5962,6 +6057,7 @@ nvptx_set_current_function (tree fndecl)
 
   gangprivate_shared_hmap.empty ();
   nvptx_previous_fndecl = fndecl;
+  vector_red_partition = 0;
   oacc_bcast_partition = 0;
 }
 
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index fb9f04b..6994f18 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -231,6 +231,8 @@ struct GTY(()) machine_function
   rtx bcast_partition; /* Register containing the size of each
 			  vector's partition of share-memory used to
 			  broadcast state.  */
+  rtx red_partition; /* Similar to bcast_partition, except for vector
+			reductions.  */
   rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 2609222..b3604c0 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -66,6 +66,8 @@
 
    UNSPECV_SIMT_ENTER
    UNSPECV_SIMT_EXIT
+
+   UNSPECV_RED_PART
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -1438,3 +1440,13 @@
   ""
   "\\t.pragma \\\"nounroll\\\";"
   [(set_attr "predicable" "false")])
+
+(define_insn "nvptx_red_partition"
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=R")
+	(unspec_volatile [(match_operand:DI 1 "const_int_operand")]
+	 UNSPECV_RED_PART))]
+  ""
+  {
+    return nvptx_output_red_partition (operands[0], operands[1]);
+  }
+  [(set_attr "predicable" "false")])


* Re: [og7] vector_length extension part 4: target hooks and automatic parallelism
  2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
                     ` (3 preceding siblings ...)
  2018-03-26 17:13   ` Tom de Vries
@ 2018-04-05 16:32   ` Tom de Vries
  4 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-05 16:32 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 420 bytes --]

On 03/02/2018 08:18 PM, Cesar Philippidis wrote:
> The attached patch adjusts the existing goacc validate_dims target hook
> and introduces a new goacc adjust_parallelism target hook.

The attached patch now just introduces the nvptx_adjust_parallelism 
target hook implementation, which enables test-cases to start using the 
feature.
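
As a rough illustration of what the hook decides, the mask logic can be sketched as a standalone function. This is only a sketch under stated assumptions: the DIM_* names, WARP_SIZE, and the force_warp out-parameter are stand-ins invented here for GOMP_DIM_*, PTX_WARP_SIZE, and the "nvptx vl warp" function attribute, and vector_length is passed in directly instead of coming from populate_offload_attrs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for GOMP_DIM_* and GOMP_DIM_MASK.  */
enum { DIM_GANG, DIM_WORKER, DIM_VECTOR };
#define DIM_MASK(X) (1u << (X))
#define WARP_SIZE 32

/* Return the parallelism mask a loop may use.  *FORCE_WARP models the
   "nvptx vl warp" attribute: once set, the function is limited to a
   warp-sized vector_length.  */
static unsigned
sketch_adjust_parallelism (unsigned inner_mask, unsigned outer_mask,
                           int vector_length, bool *force_warp)
{
  /* Nothing to adjust when vectors are a single warp wide.  */
  if (*force_warp || vector_length == WARP_SIZE)
    return inner_mask;

  /* A combined worker+vector loop loses its worker partitioning.  */
  if ((inner_mask & DIM_MASK (DIM_WORKER))
      && (inner_mask & DIM_MASK (DIM_VECTOR)))
    return inner_mask & ~DIM_MASK (DIM_WORKER);

  /* A vector loop nested in a worker loop requests the warp-sized
     fallback for the whole function.  */
  if ((inner_mask & DIM_MASK (DIM_VECTOR))
      && (outer_mask & DIM_MASK (DIM_WORKER)))
    *force_warp = true;

  return inner_mask;
}
```

With vector_length 128, a "loop worker vector" keeps only the vector bit, while a vector loop inside a worker loop keeps its mask but marks the function for the 32-lane fallback.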

Built x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0002-nvptx-Enable-large-vectors.patch --]
[-- Type: text/x-patch, Size: 13493 bytes --]

[nvptx] Enable large vectors

2018-04-05  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* omp-offload.c (oacc_get_default_dim): New function.
	* omp-offload.h (oacc_get_default_dim): Declare.
	* config/nvptx/nvptx.c (NVPTX_GOACC_VL_WARP): Define.
	(nvptx_goacc_needs_vl_warp): New function.
	(nvptx_goacc_validate_dims): Take larger vector lengths into
	account.
	(nvptx_adjust_parallelism): New function.
	(TARGET_GOACC_ADJUST_PARALLELISM): Define.
	(populate_offload_attrs): Handle the situation where the default
	runtime geometry has not been initialized yet for reductions.

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c: Expect
	vector length to be 128.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c: Same.
	* testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: Same.
	* testsuite/libgomp.oacc-fortran/gemm.f90: Same.

---
 gcc/config/nvptx/nvptx.c                           | 148 +++++++++++++++++++--
 gcc/omp-offload.c                                  |   7 +
 gcc/omp-offload.h                                  |   2 +
 .../vector-length-128-1.c                          |   5 +-
 .../vector-length-128-10.c                         |   1 -
 .../vector-length-128-2.c                          |   5 +-
 .../libgomp.oacc-c-c++-common/vred2d-128.c         |   2 -
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    |   1 -
 8 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 51bd69d..595413a 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -71,6 +71,7 @@
 #include "fold-const.h"
 #include "intl.h"
 #include "tree-hash-traits.h"
+#include "omp-offload.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -4634,15 +4635,20 @@ populate_offload_attrs (offload_attrs *oa)
   if (oa->vector_length == 0)
     {
       /* FIXME: Need a more graceful way to handle large vector
-	 lengths in OpenACC routines.  */
+	 lengths in OpenACC routines and also -fopenacc-dims.  */
       if (!lookup_attribute ("omp target entrypoint",
 			     DECL_ATTRIBUTES (current_function_decl)))
 	oa->vector_length = PTX_WARP_SIZE;
-      else
+      else if (PTX_VECTOR_LENGTH != PTX_WARP_SIZE)
 	oa->vector_length = PTX_VECTOR_LENGTH;
     }
   if (oa->num_workers == 0)
-    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    {
+      if (oa->vector_length == 0)
+	oa->max_workers = PTX_WORKER_LENGTH;
+      else
+	oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+    }
   else
     oa->max_workers = oa->num_workers;
 }
@@ -5193,6 +5199,19 @@ nvptx_simt_vf ()
   return PTX_WARP_SIZE;
 }
 
+#define NVPTX_GOACC_VL_WARP "nvptx vl warp"
+
+/* Return true if the offloaded function needs a vector_length of
+   PTX_WARP_SIZE.  */
+
+static bool
+nvptx_goacc_needs_vl_warp ()
+{
+  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,
+				DECL_ATTRIBUTES (current_function_decl));
+  return attr != NULL_TREE;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  If
@@ -5201,6 +5220,14 @@ nvptx_simt_vf ()
 static bool
 nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 {
+  int default_vector_length = PTX_VECTOR_LENGTH;
+
+  /* For capability reasons, fallback to vl = 32 for runtime values.  */
+  if (dims[GOMP_DIM_VECTOR] == 0)
+    default_vector_length = PTX_WARP_SIZE;
+  else if (decl)
+    default_vector_length = oacc_get_default_dim (GOMP_DIM_VECTOR);
+
   /* Detect if a function is unsuitable for offloading.  */
   if (!flag_offload_force && decl)
     {
@@ -5225,18 +5252,20 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 
   bool changed = false;
 
-  /* The vector size must be 32, unless this is a SEQ routine.  */
+  /* The vector size must be a positive multiple of the warp size,
+     unless this is a SEQ routine.  */
   if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
       && dims[GOMP_DIM_VECTOR] >= 0
-      && dims[GOMP_DIM_VECTOR] != PTX_VECTOR_LENGTH)
+      && (dims[GOMP_DIM_VECTOR] % 32 != 0
+	  || dims[GOMP_DIM_VECTOR] == 0))
     {
       if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
 		    dims[GOMP_DIM_VECTOR]
 		    ? G_("using vector_length (%d), ignoring %d")
 		    : G_("using vector_length (%d), ignoring runtime setting"),
-		    PTX_VECTOR_LENGTH, dims[GOMP_DIM_VECTOR]);
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+		    default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
       changed = true;
     }
 
@@ -5250,16 +5279,77 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
       changed = true;
     }
 
+  /* Ensure that num_worker * vector_length <= cta size.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+    {
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
+  /* vector_length must not exceed PTX_CTA_SIZE.  */
+  if (dims[GOMP_DIM_VECTOR] >= PTX_CTA_SIZE)
+    {
+      int new_vector = PTX_CTA_SIZE;
+      if (decl)
+	new_vector = default_vector_length;
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  new_vector, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = new_vector;
+      changed = true;
+    }
+
+  /* Set vector_length to default_vector_length if there are a sufficient
+     number of free threads in the CTA.  */
+  if (dims[GOMP_DIM_WORKER] > 0 && dims[GOMP_DIM_VECTOR] <= 0)
+    {
+      if (dims[GOMP_DIM_WORKER] * default_vector_length <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = default_vector_length;
+      else if (dims[GOMP_DIM_WORKER] * PTX_WARP_SIZE <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      else
+	error_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
+		  "vector_length must be at least 32");
+      changed = true;
+    }
+
+  /* Specify a default vector_length.  */
+  if (dims[GOMP_DIM_VECTOR] < 0)
+    {
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
+      changed = true;
+    }
+
+  if (nvptx_goacc_needs_vl_warp () && dims[GOMP_DIM_VECTOR] != PTX_WARP_SIZE)
+    {
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
   if (!decl)
     {
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+      bool new_vector = false;
+      if (dims[GOMP_DIM_VECTOR] <= 1)
+	{
+	  dims[GOMP_DIM_VECTOR] = default_vector_length;
+	  new_vector = true;
+	}
       if (dims[GOMP_DIM_WORKER] < 0)
 	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
+      if (new_vector
+	  && dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
       changed = true;
     }
 
+  gcc_assert (dims[GOMP_DIM_VECTOR] != 0);
+  gcc_assert (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] <= PTX_CTA_SIZE);
+
   return changed;
 }
 
@@ -5279,6 +5369,45 @@ nvptx_dim_limit (int axis)
   return 0;
 }
 
+/* Adjust the parallelism available to a loop given vector_length
+   associated with the offloaded function.  */
+
+static unsigned
+nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
+{
+  if (nvptx_goacc_needs_vl_warp ())
+    return inner_mask;
+
+  bool wv = (inner_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
+
+  if (oa.vector_length == PTX_WARP_SIZE)
+    return inner_mask;
+
+  /* FIXME: This is overly conservative; worker and vector loop will
+     eventually be combined.  */
+  if (wv)
+    return inner_mask & ~GOMP_DIM_MASK (GOMP_DIM_WORKER);
+
+  /* It's difficult to guarantee that warps in large vector_lengths
+     will remain convergent when a vector loop is nested inside a
+     worker loop.  Therefore, fallback to setting vector_length to
+     PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
+     sm_70+ targets.  */
+  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+      && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
+			      DECL_ATTRIBUTES (current_function_decl));
+      DECL_ATTRIBUTES (current_function_decl) = attr;
+    }
+
+  return inner_mask;
+}
+
 /* Determine whether fork & joins are needed.  */
 
 static bool
@@ -6169,6 +6298,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_GOACC_DIM_LIMIT
 #define TARGET_GOACC_DIM_LIMIT nvptx_dim_limit
 
+#undef TARGET_GOACC_ADJUST_PARALLELISM
+#define TARGET_GOACC_ADJUST_PARALLELISM nvptx_adjust_parallelism
+
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index ed17160..66c6212 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -551,6 +551,13 @@ oacc_xform_tile (gcall *call)
 static int oacc_default_dims[GOMP_DIM_MAX];
 static int oacc_min_dims[GOMP_DIM_MAX];
 
+int
+oacc_get_default_dim (int dim)
+{
+  gcc_assert (0 <= dim && dim < GOMP_DIM_MAX);
+  return oacc_default_dims[dim];
+}
+
 /* Parse the default dimension parameter.  This is a set of
    :-separated optional compute dimensions.  Each dimension is either
    a positive integer, or '-' for a dynamic value computed at
diff --git a/gcc/omp-offload.h b/gcc/omp-offload.h
index 528448b..014ee52 100644
--- a/gcc/omp-offload.h
+++ b/gcc/omp-offload.h
@@ -22,6 +22,8 @@ along with GCC; see the file COPYING3.  If not see
 #ifndef GCC_OMP_DEVICE_H
 #define GCC_OMP_DEVICE_H
 
+extern int oacc_get_default_dim (int dim);
+
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
index fab5b0d..18d77cc 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
@@ -33,7 +33,6 @@ main (void)
 
   return 0;
 }
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
-/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
index e46b5cf..0658cfd 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
@@ -37,4 +37,3 @@ main (void)
 
   return 0;
 }
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
index cc6fd55..2ab6499 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
@@ -34,7 +34,6 @@ main (void)
 
   return 0;
 }
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
-/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
index 1dc5fe0..318c0e6 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
@@ -42,8 +42,6 @@ gentest (test3, "acc parallel loop gang worker vector_length (128)",
 gentest (test4, "acc parallel loop",
 	 "acc loop reduction(+:t1) reduction(-:t2)")
 
-/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
-
 
 int
 main ()
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
index 62b8a45..ad67dce 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gemm.f90
@@ -39,7 +39,6 @@ subroutine openacc_sgemm_128 (m, n, k, alpha, a, b, beta, c)
   real :: temp
 
   !$acc parallel loop copy(c(1:m,1:n)) copyin(a(1:k,1:m),b(1:k,1:n)) vector_length (128)
-  ! { dg-prune-output "using vector_length \\(32\\), ignoring 128" }
   do j = 1, n
      !$acc loop
      do i = 1, m


* Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  2018-03-30 15:35         ` Tom de Vries
@ 2018-04-05 16:33           ` Tom de Vries
  0 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-05 16:33 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 548 bytes --]

On 03/30/2018 05:14 PM, Tom de Vries wrote:
> On 03/30/2018 05:00 PM, Cesar Philippidis wrote:
>> I should
>> have checked that patch with the vector length fallback disabled.
> 
> Right. The patch series introduces a lot of code that is not exercised.
> 
> I've added an -mlong-vector-in-workers option in my local branch and 
> added 3 test-cases to exercise the code with fallback disabled every time 
> I run the libgomp tests.
> 

This patch adds that option.
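
For readers skimming the thread, the effect of the option on the fallback can be sketched as a single predicate. This is a minimal sketch, assuming hypothetical WORKER_BIT and VECTOR_BIT constants in place of GOMP_DIM_MASK (GOMP_DIM_WORKER) and GOMP_DIM_MASK (GOMP_DIM_VECTOR), and a plain bool in place of the nvptx_long_vectors_in_workers variable.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the worker and vector dimension masks.  */
#define WORKER_BIT (1u << 1)
#define VECTOR_BIT (1u << 2)

/* Return true when a vector loop nested in a worker loop should fall
   back to a warp-sized vector_length.  The option disables the
   fallback so the large-vector code paths get exercised.  */
static bool
sketch_needs_warp_fallback (unsigned inner_mask, unsigned outer_mask,
                            bool long_vectors_in_workers)
{
  return !long_vectors_in_workers
         && (inner_mask & VECTOR_BIT) != 0
         && (outer_mask & WORKER_BIT) != 0;
}
```

Without the option, the nested case takes the fallback; with it, the requested vector_length is kept.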

Built x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0003-nvptx-Add-mlong-vector-in-workers.patch --]
[-- Type: text/x-patch, Size: 9695 bytes --]

[nvptx] Add -mlong-vector-in-workers

2018-04-05  Tom de Vries  <tom@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_adjust_parallelism): Handle
	nvptx_long_vectors_in_workers.
	* config/nvptx/nvptx.opt (mlong-vector-in-workers): Add option.

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c: New test.

---
 gcc/config/nvptx/nvptx.c                           |  3 +-
 gcc/config/nvptx/nvptx.opt                         |  3 ++
 .../vector-length-128-4.c                          | 41 ++++++++++++++++++++
 .../vector-length-128-5.c                          | 42 +++++++++++++++++++++
 .../vector-length-128-6.c                          | 42 +++++++++++++++++++++
 .../vector-length-128-8.c                          | 44 ++++++++++++++++++++++
 .../vector-length-128-9.c                          | 44 ++++++++++++++++++++++
 7 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 595413a..b5e6dce 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5397,7 +5397,8 @@ nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
      worker loop.  Therefore, fallback to setting vector_length to
      PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
      sm_70+ targets.  */
-  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+  if (nvptx_long_vectors_in_workers == 0
+      && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
       && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
     {
       tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index e2d64bd..f7f37ec 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -62,3 +62,6 @@ Enum(ptx_isa) String(sm_35) Value(PTX_ISA_SM35)
 misa=
 Target RejectNegative ToLower Joined Enum(ptx_isa) Var(ptx_isa_option) Init(PTX_ISA_SM30)
 Specify the version of the ptx ISA to use.
+
+mlong-vector-in-workers
+Target Var(nvptx_long_vectors_in_workers) Undocumented Init(0)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
new file mode 100644
index 0000000..6d43f82
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
@@ -0,0 +1,41 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel num_workers (2) vector_length (128) copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
new file mode 100644
index 0000000..661fdc7
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
@@ -0,0 +1,42 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-fopenacc-dim=-:2:128" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
new file mode 100644
index 0000000..91f611e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
@@ -0,0 +1,42 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-set-target-env-var "GOMP_OPENACC_DIM" ":2:" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel vector_length (128) copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c
new file mode 100644
index 0000000..6246067
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c
@@ -0,0 +1,44 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-additional-options "-fopenacc-dim=-:-:-" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */
+  
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 32\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=32, vectors=32" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c
new file mode 100644
index 0000000..2f8b4b7
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c
@@ -0,0 +1,44 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-additional-options "-fopenacc-dim=-:8:-" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */
+  
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 8, 32\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=32" } */


* Re: [og7] vector_length extension part 5: libgomp and tests
  2018-03-02 20:47 ` [og7] vector_length extension part 5: libgomp and tests Cesar Philippidis
  2018-03-16 13:50   ` Thomas Schwinge
  2018-03-27 13:00   ` Tom de Vries
@ 2018-04-05 16:36   ` Tom de Vries
  2 siblings, 0 replies; 50+ messages in thread
From: Tom de Vries @ 2018-04-05 16:36 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches, Schwinge, Thomas

[-- Attachment #1: Type: text/plain, Size: 307 bytes --]

On 03/02/2018 09:47 PM, Cesar Philippidis wrote:
> 	libgomp/
> 	* plugin/plugin-nvptx.c (nvptx_exec): Adjust calculations of
> 	workers and vectors.

I wrote a test case that triggers this code, and added it to this patch.
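
The adjusted calculation can be sketched in isolation as follows. This is a sketch under assumptions, not the plugin code: the AXIS_* names and the helper are invented here, only the worker and vector axes are filled in, and the 1024 threads per block used in the note below is an assumed device limit.

```c
/* Hypothetical stand-ins for GOMP_DIM_GANG/WORKER/VECTOR/MAX.  */
enum { AXIS_GANG, AXIS_WORKER, AXIS_VECTOR, AXIS_MAX };

/* Fill in zero (runtime-chosen) worker and vector dimensions.  The
   vector length defaults to the warp size; the worker count is then
   whatever still fits in one thread block.  */
static void
sketch_fill_launch_dims (int dims[AXIS_MAX], int threads_per_block,
                         int warp_size)
{
  int vectors = dims[AXIS_VECTOR] > 0 ? dims[AXIS_VECTOR] : warp_size;
  int workers = threads_per_block / vectors;

  if (dims[AXIS_WORKER] == 0)
    dims[AXIS_WORKER] = workers;
  if (dims[AXIS_VECTOR] == 0)
    dims[AXIS_VECTOR] = vectors;
}
```

Assuming 1024 threads per block, dims of (1, 0, 128) become (1, 8, 128), which lines up with the "workers=8, vectors=128" launch expected by the new test.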

Built x86_64 with nvptx accelerator and tested libgomp.

Committed.

Thanks,
- Tom


[-- Attachment #2: 0004-nvptx-Handle-large-vectors-in-libgomp.patch --]
[-- Type: text/x-patch, Size: 3027 bytes --]

[nvptx] Handle large vectors in libgomp

2018-04-05  Cesar Philippidis  <cesar@codesourcery.com>
	    Tom de Vries  <tom@codesourcery.com>

	* plugin/plugin-nvptx.c (nvptx_exec): Adjust calculations of
	workers and vectors.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: New test.

---
 libgomp/plugin/plugin-nvptx.c                      | 10 +++---
 .../vector-length-128-7.c                          | 41 ++++++++++++++++++++++
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bdc0c30..9b4768f 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -734,8 +734,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   int threads_per_block = threads_per_sm > block_size
     ? block_size : threads_per_sm;
 
-  threads_per_block /= warp_size;
-
   if (threads_per_sm > cpu_size)
     threads_per_sm = cpu_size;
 
@@ -802,6 +800,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
   if (seen_zero)
     {
+      int vectors = dims[GOMP_DIM_VECTOR] > 0
+	? dims[GOMP_DIM_VECTOR] : warp_size;
+      int workers = threads_per_block / vectors;
+
       for (i = 0; i != GOMP_DIM_MAX; i++)
 	if (!dims[i])
 	  {
@@ -819,10 +821,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 		  : 2 * dev_size;
 		break;
 	      case GOMP_DIM_WORKER:
-		dims[i] = threads_per_block;
+		dims[i] = workers;
 		break;
 	      case GOMP_DIM_VECTOR:
-		dims[i] = warp_size;
+		dims[i] = vectors;
 		break;
 	      default:
 		abort ();
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
new file mode 100644
index 0000000..60c264c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
@@ -0,0 +1,41 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel vector_length (128) copyin (a,b) copyout (c)
+  {
+#pragma acc loop worker
+    for (unsigned int i = 0; i < 4; i++)
+#pragma acc loop vector
+      for (unsigned int j = 0; j < n / 4; j++)
+	c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
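
The default-dimension calculation changed by the nvptx_exec hunk above can be sketched in isolation as follows. This is only an illustration, not the plugin code: fill_default_dims and the DIM_* enumerators are hypothetical names, the gang dimension is left untouched, and the figure of 1024 threads per block is an assumption chosen to agree with the "workers=8, vectors=128" line the new test expects.

```c
#include <assert.h>

/* Hypothetical stand-ins for GOMP_DIM_GANG, GOMP_DIM_WORKER,
   GOMP_DIM_VECTOR and GOMP_DIM_MAX.  */
enum { DIM_GANG, DIM_WORKER, DIM_VECTOR, DIM_MAX };

/* Fill in the worker and vector dimensions that were left at zero.
   Hypothetical helper mirroring the arithmetic in the patched hunk:
   use the requested vector length when there is one, otherwise one
   warp, and derive the worker count from the threads that remain.  */
static void
fill_default_dims (int dims[DIM_MAX], int threads_per_block, int warp_size)
{
  assert (threads_per_block > 0 && warp_size > 0);

  int vectors = dims[DIM_VECTOR] > 0 ? dims[DIM_VECTOR] : warp_size;
  int workers = threads_per_block / vectors;

  if (!dims[DIM_WORKER])
    dims[DIM_WORKER] = workers;
  if (!dims[DIM_VECTOR])
    dims[DIM_VECTOR] = vectors;
}
```

Under those assumptions, calling fill_default_dims with dims = {1, 0, 128}, threads_per_block = 1024 and warp_size = 32 sets the worker dimension to 1024 / 128 = 8. Dividing by the warp size up front, as the removed line did, would instead give 1024 / 32 = 32 workers regardless of the requested vector length.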


Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
2018-03-01 21:17 [og7] vector_length extension part 1: generalize function and variable names Cesar Philippidis
2018-03-02 16:55 ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Cesar Philippidis
2018-03-21 17:16   ` Tom de Vries
2018-03-22  8:05     ` Cesar Philippidis
2018-03-22 14:16       ` Tom de Vries
2018-03-22 14:35         ` Cesar Philippidis
2018-03-22 14:24   ` Tom de Vries
2018-03-22 15:18     ` Cesar Philippidis
2018-03-22 16:20       ` Tom de Vries
2018-03-22 17:26         ` Cesar Philippidis
2018-03-22 17:58           ` Tom de Vries
2018-03-22 19:32             ` Cesar Philippidis
2018-03-23  8:56               ` Tom de Vries
2018-03-23 14:35           ` Tom de Vries
2018-03-22 15:04   ` Tom de Vries
2018-03-22 17:14     ` Cesar Philippidis
2018-03-22 17:47   ` Tom de Vries
2018-03-22 17:48     ` Cesar Philippidis
2018-03-22 18:00       ` Tom de Vries
2018-03-23 13:14   ` Tom de Vries
2018-03-23 13:16   ` Tom de Vries
2018-03-23 14:18   ` Tom de Vries
2018-03-23 16:30   ` Tom de Vries
2018-03-30  1:50   ` Tom de Vries
2018-03-30 14:48     ` Tom de Vries
2018-03-30 15:06       ` Cesar Philippidis
2018-03-30 15:35         ` Tom de Vries
2018-04-05 16:33           ` Tom de Vries
2018-04-03 14:52   ` [nvptx] Use MAX, MIN, ROUND_UP macros Tom de Vries
2018-04-03 15:00   ` [og7] vector_length extension part 2: Generalize state propagation and synchronization Tom de Vries
2018-04-05 14:06     ` Tom de Vries
2018-04-05 14:14     ` Tom de Vries
2018-03-02 17:51 ` [og7] vector_length extension part 3: reductions Cesar Philippidis
2018-04-05 14:07   ` Tom de Vries
2018-04-05 16:26   ` Tom de Vries
2018-03-02 19:18 ` [og7] vector_length extension part 4: target hooks and automatic parallelism Cesar Philippidis
2018-03-21 15:55   ` Tom de Vries
2018-03-21 20:28     ` Cesar Philippidis
2018-03-26 14:25   ` Tom de Vries
2018-03-26 14:37     ` Cesar Philippidis
2018-03-26 16:52   ` Tom de Vries
2018-03-27 12:16     ` Tom de Vries
2018-03-26 17:13   ` Tom de Vries
2018-04-05 16:32   ` Tom de Vries
2018-03-02 20:47 ` [og7] vector_length extension part 5: libgomp and tests Cesar Philippidis
2018-03-16 13:50   ` Thomas Schwinge
2018-03-27 13:00   ` Tom de Vries
2018-04-05 16:36   ` Tom de Vries
2018-03-09 15:29 ` [og7] vector_length extension part 1: generalize function and variable names Thomas Schwinge
2018-03-09 15:31   ` Cesar Philippidis
