[PATCH 01/11] [nvptx] Update openacc dim macros

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH 01/11] [nvptx] Update openacc dim macros
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
  2018-07-24 20:47 ` [PATCH 04/11] [nvptx] Make nvptx state propagation function names more generic cesar
  2018-07-24 20:47 ` [PATCH 05/11] [nvptx] Fix whitespace in nvptx_single and nvptx_neuter_pars cesar
@ 2018-07-24 20:47 ` cesar
  2018-07-24 20:48 ` [PATCH 07/11] [nvptx] Add thread count parm to bar.sync cesar
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:47 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Cesar Philippidis <cesar@codesourcery.com>

Besides for updating the macros for the NVPTX OpenACC dims, this patch
also renames PTX_GANG_DEFAULT to PTX_DEFAULT_RUNTIME_DIM. I had
originally included the PTX_GANG_DEFAULT hunk in an earlier libgomp
patch, but going forward it makes sense to isolate the nvptx and
libgomp changes when possible, which this patch series does.

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Rename to
	PTX_DEFAULT_RUNTIME_DIM.
	* (PTX_VECTOR_LENGTH, PTX_WORKER_LENGTH,
	PTX_DEFAULT_RUNTIME_DIM): Move to the top of the file.
	(PTX_WARP_SIZE): Define.
	(PTX_CTA_SIZE): Define.
	(nvptx_simt_vf): Return PTX_WARP_SIZE instead of PTX_VECTOR_LENGTH.
	(nvptx_goacc_validate_dims): Use PTX_DEFAULT_RUNTIME_DIM.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5608bee..521f83e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -81,6 +81,13 @@
 #define WORKAROUND_PTXJIT_BUG_2 1
 #define WORKAROUND_PTXJIT_BUG_3 1
 
+/* Define dimension sizes for known hardware.  */
+#define PTX_VECTOR_LENGTH 32
+#define PTX_WORKER_LENGTH 32
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
+#define PTX_WARP_SIZE 32
+#define PTX_CTA_SIZE 1024
+
 /* The various PTX memory areas an object might reside in.  */
 enum nvptx_data_area
 {
@@ -5161,18 +5168,13 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
     default: gcc_unreachable ();
     }
 }
-\f
-/* Define dimension sizes for known hardware.  */
-#define PTX_VECTOR_LENGTH 32
-#define PTX_WORKER_LENGTH 32
-#define PTX_GANG_DEFAULT  0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 
 static int
 nvptx_simt_vf ()
 {
-  return PTX_VECTOR_LENGTH;
+  return PTX_WARP_SIZE;
 }
 
 /* Validate compute dimensions of an OpenACC offload or routine, fill
@@ -5216,7 +5218,7 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
       if (dims[GOMP_DIM_WORKER] < 0)
 	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
       if (dims[GOMP_DIM_GANG] < 0)
-	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
       changed = true;
     }
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 04/11] [nvptx] Make nvptx state propagation function names more generic
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
@ 2018-07-24 20:47 ` cesar
  2018-07-24 20:47 ` [PATCH 05/11] [nvptx] Fix whitespace in nvptx_single and nvptx_neuter_pars cesar
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:47 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Cesar Philippidis <cesar@codesourcery.com>

This patch renames various state propagation functions into somewhat
that reflects their usage in generic worker and vector contexts. E.g.,
whereas before nvptx_wpropagate used to be used exclusively for worker
state propagation, it will eventually be used for any state
propagation using multiple warps. Because variable length vectors will
be able use both shared memory and warp shuffles for broadcasting, the
old vector-specific functions now contain 'warp' in their name.

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_gen_wcast): Rename as
	nvptx_gen_warp_bcast.
	(nvptx_gen_wcast): Rename to nvptx_gen_shared_bcast, add bool
	vector argument, and update call to nvptx_gen_shared_bcast.
	(propagator_fn): Add bool argument.
	(nvptx_propagate): New bool argument, pass bool argument to fn.
	(vprop_gen): Rename to warp_prop_gen, update call to
	nvptx_gen_warp_bcast.
	(nvptx_vpropagate): Rename to nvptx_warp_propagate, update call to
	nvptx_propagate.
	(wprop_gen): Rename to shared_prop_gen, update call to
	nvptx_gen_shared_bcast.
	(nvptx_wpropagate): Rename to nvptx_shared_propagate, update call
	to nvptx_propagate.
	(nvptx_wsync): Rename to nvptx_cta_sync.
	(nvptx_single): Update calls to nvptx_gen_warp_bcast,
	nvptx_gen_shared_bcast and nvptx_cta_sync.
	(nvptx_process_pars): Likewise.
	(write_worker_buffer): Rename as write_shared_buffer.
	(nvptx_file_end): Update calls to write_shared_buffer.
	(nvptx_expand_worker_addr): Rename as nvptx_expand_shared_addr.
	(nvptx_expand_builtin): Update call to nvptx_expand_shared_addr.
	(nvptx_get_worker_red_addr): Rename as nvptx_get_shared_red_addr.
	(nvptx_goacc_reduction_setup): Update call to
	nvptx_get_shared_red_addr.
	(nvptx_goacc_reduction_fini): Likewise.
	(nvptx_goacc_reduction_teardown): Likewise.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 1b83b3c..447425f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1750,7 +1750,7 @@ nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, nvptx_shuffle_kind kind)
    across the vectors of a single warp.  */
 
 static rtx
-nvptx_gen_vcast (rtx reg)
+nvptx_gen_warp_bcast (rtx reg)
 {
   return nvptx_gen_shuffle (reg, reg, const0_rtx, SHUFFLE_IDX);
 }
@@ -1781,7 +1781,8 @@ enum propagate_mask
    how many loop iterations will be executed (0 for not a loop).  */
    
 static rtx
-nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *data)
+nvptx_gen_shared_bcast (rtx reg, propagate_mask pm, unsigned rep,
+			broadcast_data_t *data, bool vector)
 {
   rtx  res;
   machine_mode mode = GET_MODE (reg);
@@ -1795,7 +1796,7 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *dat
 	start_sequence ();
 	if (pm & PM_read)
 	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
-	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	emit_insn (nvptx_gen_shared_bcast (tmp, pm, rep, data, vector));
 	if (pm & PM_write)
 	  emit_insn (gen_rtx_SET (reg, gen_rtx_NE (BImode, tmp, const0_rtx)));
 	res = get_insns ();
@@ -1815,6 +1816,7 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *dat
 	      oacc_bcast_align = align;
 	    data->offset = (data->offset + align - 1) & ~(align - 1);
 	    addr = data->base;
+	    gcc_assert (data->base != NULL);
 	    if (data->offset)
 	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
 	  }
@@ -3816,11 +3818,11 @@ nvptx_find_sese (auto_vec<basic_block> &blocks, bb_pair_vec_t &regions)
    regions and (b) only propagating stack entries that are used.  The
    latter might be quite hard to determine.  */
 
-typedef rtx (*propagator_fn) (rtx, propagate_mask, unsigned, void *);
+typedef rtx (*propagator_fn) (rtx, propagate_mask, unsigned, void *, bool);
 
 static bool
 nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
-		 propagate_mask rw, propagator_fn fn, void *data)
+		 propagate_mask rw, propagator_fn fn, void *data, bool vector)
 {
   bitmap live = DF_LIVE_IN (block);
   bitmap_iterator iterator;
@@ -3855,7 +3857,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	  
 	  emit_insn (gen_rtx_SET (idx, GEN_INT (fs)));
 	  /* Allow worker function to initialize anything needed.  */
-	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  rtx init = fn (tmp, PM_loop_begin, fs, data, vector);
 	  if (init)
 	    emit_insn (init);
 	  emit_label (label);
@@ -3864,7 +3866,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	}
       if (rw & PM_read)
 	emit_insn (gen_rtx_SET (tmp, gen_rtx_MEM (DImode, ptr)));
-      emit_insn (fn (tmp, rw, fs, data));
+      emit_insn (fn (tmp, rw, fs, data, vector));
       if (rw & PM_write)
 	emit_insn (gen_rtx_SET (gen_rtx_MEM (DImode, ptr), tmp));
       if (fs)
@@ -3872,7 +3874,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	  emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, idx, const0_rtx)));
 	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
 	  emit_insn (gen_br_true_uni (pred, label));
-	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  rtx fini = fn (tmp, PM_loop_end, fs, data, vector);
 	  if (fini)
 	    emit_insn (fini);
 	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
@@ -3892,7 +3894,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 
 	if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
 	  {
-	    rtx bcast = fn (reg, rw, 0, data);
+	    rtx bcast = fn (reg, rw, 0, data, vector);
 
 	    insn = emit_insn_after (bcast, insn);
 	    empty = false;
@@ -3901,16 +3903,17 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
   return empty;
 }
 
-/* Worker for nvptx_vpropagate.  */
+/* Worker for nvptx_warp_propagate.  */
 
 static rtx
-vprop_gen (rtx reg, propagate_mask pm,
-	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+warp_prop_gen (rtx reg, propagate_mask pm,
+	       unsigned ARG_UNUSED (count), void *ARG_UNUSED (data),
+	       bool ARG_UNUSED (vector))
 {
   if (!(pm & PM_read_write))
     return 0;
   
-  return nvptx_gen_vcast (reg);
+  return nvptx_gen_warp_bcast (reg);
 }
 
 /* Propagate state that is live at start of BLOCK across the vectors
@@ -3918,15 +3921,17 @@ vprop_gen (rtx reg, propagate_mask pm,
    IS_CALL and return as for nvptx_propagate.  */
 
 static bool
-nvptx_vpropagate (bool is_call, basic_block block, rtx_insn *insn)
+nvptx_warp_propagate (bool is_call, basic_block block, rtx_insn *insn)
 {
-  return nvptx_propagate (is_call, block, insn, PM_read_write, vprop_gen, 0);
+  return nvptx_propagate (is_call, block, insn, PM_read_write,
+			  warp_prop_gen, 0, false);
 }
 
-/* Worker for nvptx_wpropagate.  */
+/* Worker for nvptx_shared_propagate.  */
 
 static rtx
-wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+shared_prop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_,
+		 bool vector)
 {
   broadcast_data_t *data = (broadcast_data_t *)data_;
 
@@ -3950,7 +3955,7 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
       return clobber;
     }
   else
-    return nvptx_gen_wcast (reg, pm, rep, data);
+    return nvptx_gen_shared_bcast (reg, pm, rep, data, vector);
 }
 
 /* Spill or fill live state that is live at start of BLOCK.  PRE_P
@@ -3959,7 +3964,8 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
    INSN.  IS_CALL and return as for nvptx_propagate.  */
 
 static bool
-nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
+nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
+			rtx_insn *insn, bool vector)
 {
   broadcast_data_t data;
 
@@ -3968,7 +3974,8 @@ nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
   data.ptr = NULL_RTX;
 
   bool empty = nvptx_propagate (is_call, block, insn,
-				pre_p ? PM_read : PM_write, wprop_gen, &data);
+				pre_p ? PM_read : PM_write, shared_prop_gen,
+				&data, vector);
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
@@ -3982,11 +3989,11 @@ nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
   return empty;
 }
 
-/* Emit a worker-level synchronization barrier.  We use different
+/* Emit a CTA-level synchronization barrier.  We use different
    markers for before and after synchronizations.  */
 
 static rtx
-nvptx_wsync (bool after)
+nvptx_cta_sync (bool after)
 {
   return gen_nvptx_barsync (GEN_INT (after));
 }
@@ -4341,7 +4348,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  emit_insn_before (gen_rtx_SET (tmp, pvar), label);
 	  emit_insn_before (gen_rtx_SET (pvar, tmp), tail);
 #endif
-	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	  emit_insn_before (nvptx_gen_warp_bcast (pvar), tail);
 	}
       else
 	{
@@ -4356,16 +4363,18 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    oacc_bcast_size = GET_MODE_SIZE (SImode);
 
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
+						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_wsync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (false), tail);
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
+						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_wsync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (true), tail);
 	}
 
       extract_insn (tail);
@@ -4482,19 +4491,21 @@ nvptx_process_pars (parallel *par)
 
   if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
     {
-      nvptx_wpropagate (false, is_call, par->forked_block, par->forked_insn);
-      bool empty = nvptx_wpropagate (true, is_call,
-				     par->forked_block, par->fork_insn);
+      nvptx_shared_propagate (false, is_call, par->forked_block,
+			      par->forked_insn, false);
+      bool empty = nvptx_shared_propagate (true, is_call,
+					   par->forked_block, par->fork_insn,
+					   false);
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_wsync (false), par->forked_insn);
-	  emit_insn_before (nvptx_wsync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
-    nvptx_vpropagate (is_call, par->forked_block, par->forked_insn);
+    nvptx_warp_propagate (is_call, par->forked_block, par->forked_insn);
 
   /* Now do siblings.  */
   if (par->next)
@@ -4997,10 +5008,11 @@ nvptx_file_start (void)
   fputs ("// END PREAMBLE\n", asm_out_file);
 }
 
-/* Emit a declaration for a worker-level buffer in .shared memory.  */
+/* Emit a declaration for a worker and vector-level buffer in .shared
+   memory.  */
 
 static void
-write_worker_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
+write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 {
   const char *name = XSTR (sym, 0);
 
@@ -5022,11 +5034,11 @@ nvptx_file_end (void)
   fputs (func_decls.str().c_str(), asm_out_file);
 
   if (oacc_bcast_size)
-    write_worker_buffer (asm_out_file, oacc_bcast_sym,
+    write_shared_buffer (asm_out_file, oacc_bcast_sym,
 			 oacc_bcast_align, oacc_bcast_size);
 
   if (worker_red_size)
-    write_worker_buffer (asm_out_file, worker_red_sym,
+    write_shared_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
   if (need_softstack_decl)
@@ -5077,7 +5089,7 @@ nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
 /* Worker reduction address expander.  */
 
 static rtx
-nvptx_expand_worker_addr (tree exp, rtx target,
+nvptx_expand_shared_addr (tree exp, rtx target,
 			  machine_mode ARG_UNUSED (mode), int ignore)
 {
   if (ignore)
@@ -5213,7 +5225,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       return nvptx_expand_shuffle (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_WORKER_ADDR:
-      return nvptx_expand_worker_addr (exp, target, mode, ignore);
+      return nvptx_expand_shared_addr (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_CMP_SWAP:
     case NVPTX_BUILTIN_CMP_SWAPLL:
@@ -5323,7 +5335,7 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
    data at that location.  */
 
 static tree
-nvptx_get_worker_red_addr (tree type, tree offset)
+nvptx_get_shared_red_addr (tree type, tree offset)
 {
   machine_mode mode = TYPE_MODE (type);
   tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
@@ -5665,7 +5677,7 @@ nvptx_goacc_reduction_setup (gcall *call)
     {
       /* Store incoming value to worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5807,7 +5819,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 	{
 	  /* Get reduction buffer address.  */
 	  tree offset = gimple_call_arg (call, 5);
-	  tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
 	  tree ptr = make_ssa_name (TREE_TYPE (call));
 
 	  gimplify_assign (ptr, call, &seq);
@@ -5851,7 +5863,7 @@ nvptx_goacc_reduction_teardown (gcall *call)
     {
       /* Read the worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_worker_red_addr(TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 05/11] [nvptx] Fix whitespace in nvptx_single and nvptx_neuter_pars
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
  2018-07-24 20:47 ` [PATCH 04/11] [nvptx] Make nvptx state propagation function names more generic cesar
@ 2018-07-24 20:47 ` cesar
  2018-07-24 20:47 ` [PATCH 01/11] [nvptx] Update openacc dim macros cesar
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:47 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Tom de Vries <tom@codesourcery.com>

This patch only adjusts white space.

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_single): Fix whitespace.
	(nvptx_neuter_pars): Likewise.

(cherry picked from openacc-gcc-7-branch commit
10f697dfcdaa77b842de6e9a62c68b33a49d3c16)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 447425f..4d46d89 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4250,7 +4250,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    pred = gen_reg_rtx (BImode);
 	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
 	  }
-	
+
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
@@ -4584,7 +4584,7 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
     }
 
   if (skip_mask)
-      nvptx_skip_par (skip_mask, par);
+    nvptx_skip_par (skip_mask, par);
   
   if (par->next)
     nvptx_neuter_pars (par->next, modes, outer);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 07/11] [nvptx] Add thread count parm to bar.sync
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (2 preceding siblings ...)
  2018-07-24 20:47 ` [PATCH 01/11] [nvptx] Update openacc dim macros cesar
@ 2018-07-24 20:48 ` cesar
  2018-07-24 20:48 ` [PATCH 10/11] [nvptx] Use MAX, MIN, ROUND_UP macros cesar
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:48 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Tom de Vries <tom@codesourcery.com>

This patch extends the nvptx_barsync pattern to accept an integeral
parameter to specify how many threads utilize the requested
barrier. This is necessary variable length vectors, as each large
vector will require own thread barrier.

2018-XX-YY  Tom de Vries  <tdevries@suse.de>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.md (nvptx_barsync): Add and handle operand.
	* config/nvptx/nvptx.c (nvptx_cta_sync): Change arguments to take in a
	lock and thread count.  Update call to gen_nvptx_barsync.
	(nvptx_single, nvptx_process_pars): Update calls to nvptx_cta_sync.

(cherry picked from openacc-gcc-7-branch commit
49b69b4002496c9f5759d933e0a6233660434e70)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 1f954a6..fa27f71 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3989,13 +3989,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   return empty;
 }
 
-/* Emit a CTA-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier (bar.sync).  LOCK is the
+   barrier number, which is an integer or a register.  THREADS is the
+   number of threads controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4355,6 +4356,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
@@ -4367,14 +4370,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
 						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4496,12 +4499,15 @@ nvptx_process_pars (parallel *par)
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
 					   false);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads),
+			    par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 2988f5d..9538333 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1455,10 +1455,16 @@
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (INTVAL (operands[1]) == 0)
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_expand "memory_barrier"
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 10/11] [nvptx] Use MAX, MIN, ROUND_UP macros
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (3 preceding siblings ...)
  2018-07-24 20:48 ` [PATCH 07/11] [nvptx] Add thread count parm to bar.sync cesar
@ 2018-07-24 20:48 ` cesar
  2018-07-24 20:48 ` [PATCH 08/11] [nvptx] Add axis_dim cesar
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:48 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Tom de Vries <tom@codesourcery.com>

This patch replaces the confusing, in-lined min, max and rounding code
sprinkled throughout the nvptx BE with calls to MIN, MAX, and ROUND_UP
macros.

2018-XX-YY  Tom de Vries  <tdevries@suse.de>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_gen_shared_bcast, shared_prop_gen)
	(nvptx_goacc_expand_accel_var): Use MAX and ROUND_UP.
	(nvptx_assemble_value, nvptx_output_skip): Use MIN.
	(nvptx_shared_propagate, nvptx_single, nvptx_expand_shared_addr): Use
	MAX.

(cherry picked from openacc-gcc-7-branch commit
d3d6411c160071f70f995bbcd92f617aec67ba10)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 09daedd..7d49b4f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1812,9 +1812,8 @@ nvptx_gen_shared_bcast (rtx reg, propagate_mask pm, unsigned rep,
 	  {
 	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
 
-	    if (align > oacc_bcast_align)
-	      oacc_bcast_align = align;
-	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    oacc_bcast_align = MAX (oacc_bcast_align, align);
+	    data->offset = ROUND_UP (data->offset, align);
 	    addr = data->base;
 	    gcc_assert (data->base != NULL);
 	    if (data->offset)
@@ -1936,8 +1935,7 @@ nvptx_assemble_value (unsigned HOST_WIDE_INT val, unsigned size)
     {
       val >>= part * BITS_PER_UNIT;
       part = init_frag.size - init_frag.offset;
-      if (part > size)
-	part = size;
+      part = MIN (part, size);
 
       unsigned HOST_WIDE_INT partial
 	= val << (init_frag.offset * BITS_PER_UNIT);
@@ -2000,8 +1998,7 @@ nvptx_output_skip (FILE *, unsigned HOST_WIDE_INT size)
   if (init_frag.offset)
     {
       unsigned part = init_frag.size - init_frag.offset;
-      if (part > size)
-	part = (unsigned) size;
+      part = MIN (part, (unsigned)size);
       size -= part;
       nvptx_assemble_value (0, part);
     }
@@ -3957,9 +3954,8 @@ shared_prop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_,
       /* Starting a loop, initialize pointer.    */
       unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
 
-      if (align > oacc_bcast_align)
-	oacc_bcast_align = align;
-      data->offset = (data->offset + align - 1) & ~(align - 1);
+      oacc_bcast_align = MAX (oacc_bcast_align, align);
+      data->offset = ROUND_UP (data->offset, align);
 
       data->ptr = gen_reg_rtx (Pmode);
 
@@ -4000,8 +3996,7 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
       rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
       emit_insn_after (init, insn);
 
-      if (oacc_bcast_size < data.offset)
-	oacc_bcast_size = data.offset;
+      oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
     }
   return empty;
 }
@@ -4379,8 +4374,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
-	    oacc_bcast_size = GET_MODE_SIZE (SImode);
+	  oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode));
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
@@ -5122,13 +5116,11 @@ nvptx_expand_shared_addr (tree exp, rtx target,
     return target;
 
   unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
-  if (align > worker_red_align)
-    worker_red_align = align;
+  worker_red_align = MAX (worker_red_align, align);
 
   unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
   unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
-  if (size + offset > worker_red_size)
-    worker_red_size = size + offset;
+  worker_red_size = MAX (worker_red_size, size + offset);
 
   rtx addr = worker_red_sym;
   if (offset)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 08/11] [nvptx] Add axis_dim
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (4 preceding siblings ...)
  2018-07-24 20:48 ` [PATCH 10/11] [nvptx] Use MAX, MIN, ROUND_UP macros cesar
@ 2018-07-24 20:48 ` cesar
  2018-07-24 20:49 ` [PATCH 02/11] [nvptx] Rename worker_bcast variables oacc_bcast cesar
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:48 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Tom de Vries <tom@codesourcery.com>

This patch introduces an axis_dim member to the machine_function
struct. The launch geometry will be queried frequently enough so that
its justified to store that information with each cfun.

2018-XX-YY  Tom de Vries  <tdevries@suse.de>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define.
	(nvptx_mach_max_workers, nvptx_mach_vector_length): New function.
	(nvptx_reorg): Set function-specific axis_dim's.
	* config/nvptx/nvptx.h (struct machine_function): Add axis_dims.

(cherry picked from openacc-gcc-7-branch commit
a36cbbe19af6822abd203d167905b8ca61d95992)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index fa27f71..e3a02d2 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2884,6 +2884,23 @@ struct offload_attrs
   int max_workers;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_max_workers ()
+{
+  return cfun->machine->axis_dim[MACH_MAX_WORKERS];
+}
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_vector_length ()
+{
+  return cfun->machine->axis_dim[MACH_VECTOR_LENGTH];
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -4831,6 +4848,9 @@ nvptx_reorg (void)
 
       populate_offload_attrs (&oa);
 
+      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
+
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
       gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index dfa1e9a..90fb2c9 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -210,6 +210,8 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 06/11] [nvptx] only use one bar.sync barriers in OpenACC offloaded code
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (7 preceding siblings ...)
  2018-07-24 20:49 ` [PATCH 11/11] [nvptx] Generalize state propagation and synchronization cesar
@ 2018-07-24 20:49 ` cesar
  2018-07-24 20:49 ` [PATCH 09/11] [nvptx] Use TARGET_SET_CURRENT_FUNCTION cesar
  2018-07-24 20:49 ` [PATCH 03/11] [nvptx] Consolidate offloaded function attributes into struct offload_attrs cesar
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:49 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Cesar Philippidis <cesar@codesourcery.com>

This patch teaches nvptx_single to always use barrier '0' for CTA
synchronization. This started off as a cosmetic change, but later on
each large vector (i.e. one that larger than a PTX warp) will need to
use its own unique thread barrier to avoid thread divergence.
Consequently, this patch begins the process of teaching the nvptx
state propagator how to use a common thread barrier for each
propagation level.

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_single): Always pass false to
	nvptx_cta_sync.
	(nvptx_process_pars): Likewise.

(cherry picked from openacc-gcc-7-branch commit
ac0a55b8e72363a09f7968474744c51c1fa7720a)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 4d46d89..1f954a6 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4374,7 +4374,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (true), tail);
+	  emit_insn_before (nvptx_cta_sync (false), tail);
 	}
 
       extract_insn (tail);
@@ -4501,7 +4501,7 @@ nvptx_process_pars (parallel *par)
 	{
 	  /* Insert begin and end synchronizations.  */
 	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 11/11] [nvptx] Generalize state propagation and synchronization
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (6 preceding siblings ...)
  2018-07-24 20:49 ` [PATCH 02/11] [nvptx] Rename worker_bcast variables oacc_bcast cesar
@ 2018-07-24 20:49 ` cesar
  2018-07-24 20:49 ` [PATCH 06/11] [nvptx] only use one bar.sync barriers in OpenACC offloaded code cesar
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:49 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Tom de Vries <tom@codesourcery.com>

As the title mentions, this patch generalizes the state propagation and
synchronization code. Note that while the patch makes reference to
large_vectors, they are not enabled in nvptx_goacc_validate_dims.
Therefore, only the worker case is exercised in this patch.

2018-XX-YY  Tom de Vries  <tdevries@suse.de>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
	(nvptx_option_override): Init oacc_bcast_partition.
	(nvptx_init_oacc_workers): New function.
	(nvptx_declare_function_name): Call nvptx_init_oacc_workers.
	(nvptx_needs_shared_bcast): New function.
	(nvptx_find_par): Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_shared_propagate): Initialize vector bcast partition and
	synchronization state.
	(nvptx_single):  Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_process_pars): Likewise.
	(nvptx_set_current_function): Initialize oacc_broadcast_partition.
	* config/nvptx/nvptx.h (struct machine_function): Add
	bcast_partition and sync_bar members.

(cherry picked from openacc-gcc-7-branch commit
628f439f33ed6f689656a1ed8ff74db97e7ec3ed, and commit
293e415e04d6b407e59118253e5fdfe539000cfe)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 7d49b4f..abd47ac 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -136,6 +136,7 @@ static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
    memory.  It'd be nice if PTX supported common blocks, because then
    this could be shared across TUs (taking the largest size).  */
 static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_partition;
 static unsigned oacc_bcast_align;
 static GTY(()) rtx oacc_bcast_sym;
 
@@ -154,6 +155,8 @@ static bool need_softstack_decl;
 /* True if any function references __nvptx_uni.  */
 static bool need_unisimt_decl;
 
+static int nvptx_mach_max_workers ();
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -213,6 +216,7 @@ nvptx_option_override (void)
   oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast");
   SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED);
   oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  oacc_bcast_partition = 0;
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
@@ -1101,6 +1105,40 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize OpenACC worker broadcast and synchronization
+   registers.  */
+
+static void
+nvptx_init_oacc_workers (FILE *file)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%tidy;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n");
+      fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; "
+	       "// vector broadcast offset\n",
+	       REGNO (cfun->machine->bcast_partition),
+	       oacc_bcast_partition);
+    }
+  /* Verify oacc_bcast_size.  */
+  gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1)
+	      <= oacc_bcast_size);
+  if (cfun->machine->sync_bar)
+    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
+	     "// vector synchronization barrier\n",
+	     REGNO (cfun->machine->sync_bar));
+  fprintf (file, "\t}\n");
+}
+
 /* Emit code to initialize predicate and master lane index registers for
    -muniform-simt code generation variant.  */
 
@@ -1327,6 +1365,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->unisimt_predicate
       || (cfun->machine->has_simtreg && !crtl->is_leaf))
     nvptx_init_unisimt_predicate (file);
+  if (cfun->machine->bcast_partition || cfun->machine->sync_bar)
+    nvptx_init_oacc_workers (file);
 }
 
 /* Output code for switching uniform-simt state.  ENTERING indicates whether
@@ -3045,6 +3085,19 @@ nvptx_split_blocks (bb_insn_map_t *map)
     }
 }
 
+/* Return true if MASK contains parallelism that requires shared
+   memory to broadcast.  */
+
+static bool
+nvptx_needs_shared_bcast (unsigned mask)
+{
+  bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER);
+  bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    && nvptx_mach_vector_length () != PTX_WARP_SIZE;
+
+  return worker || large_vector;
+}
+
 /* BLOCK is a basic block containing a head or tail instruction.
    Locate the associated prehead or pretail instruction, which must be
    in the single predecessor block.  */
@@ -3120,7 +3173,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    par = new parallel (par, mask);
 	    par->forked_block = block;
 	    par->forked_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->fork_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
 	  }
@@ -3135,7 +3188,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    gcc_assert (par->mask == mask);
 	    par->join_block = block;
 	    par->join_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->joining_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
 	    par = par->parent;
@@ -3992,11 +4045,33 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
+      rtx bcast_sym = oacc_bcast_sym;
+
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
+      if (vector && nvptx_mach_max_workers () > 1)
+	{
+	  if (!cfun->machine->bcast_partition)
+	    {
+	      /* It would be nice to place this register in
+		 DATA_AREA_SHARED.  */
+	      cfun->machine->bcast_partition = gen_reg_rtx (DImode);
+	    }
+	  if (!cfun->machine->sync_bar)
+	    cfun->machine->sync_bar = gen_reg_rtx (SImode);
+
+	  bcast_sym = cfun->machine->bcast_partition;
+	}
+
+      rtx init = gen_rtx_SET (data.base, bcast_sym);
       emit_insn_after (init, insn);
 
-      oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
+      unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align);
+      unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			   ? nvptx_mach_max_workers () + 1
+			   : 1);
+
+      oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+      oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
     }
   return empty;
 }
@@ -4301,7 +4376,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
     {
       rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
 
-      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask
+	  && nvptx_mach_vector_length () == PTX_WARP_SIZE)
 	{
 	  /* Vector mode only, do a shuffle.  */
 #if WORKAROUND_PTXJIT_BUG
@@ -4368,23 +4444,51 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  unsigned size = GET_MODE_SIZE (SImode);
+	  bool vector = true;
 	  rtx barrier = GEN_INT (0);
 	  int threads = 0;
 
+	  if (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask)
+	    vector = false;
+
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode));
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->bcast_partition)
+	    data.base = cfun->machine->bcast_partition;
+
+	  gcc_assert (data.base != NULL);
+
+	  unsigned int psize = ROUND_UP (size, oacc_bcast_align);
+	  unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			       ? nvptx_mach_max_workers () + 1
+			       : 1);
+
+	  oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+	  oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
-						    false),
+						    vector),
 			    before);
+
+	  if (vector
+	      && nvptx_mach_max_workers () > 1
+	      && cfun->machine->sync_bar)
+	    {
+	      barrier = cfun->machine->sync_bar;
+	      threads = nvptx_mach_vector_length ();
+	    }
+
 	  /* Barrier so other workers can see the write.  */
 	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
-						    false), tail);
+						    vector),
+			    tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
@@ -4502,17 +4606,26 @@ nvptx_process_pars (parallel *par)
     }
 
   bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0;
+  bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER));
+  bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+		      && nvptx_mach_vector_length () > PTX_WARP_SIZE);
 
-  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+  if (worker || large_vector)
     {
       nvptx_shared_propagate (false, is_call, par->forked_block,
-			      par->forked_insn, false);
+			      par->forked_insn, !worker);
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
-					   false);
+					   !worker);
       rtx barrier = GEN_INT (0);
       int threads = 0;
 
+      if (!worker && cfun->machine->sync_bar)
+	{
+	  barrier = cfun->machine->sync_bar;
+	  threads = nvptx_mach_vector_length ();
+	}
+
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
@@ -6013,6 +6126,7 @@ nvptx_set_current_function (tree fndecl)
     return;
 
   nvptx_previous_fndecl = fndecl;
+  oacc_bcast_partition = 0;
 }
 
 #undef TARGET_OPTION_OVERRIDE
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 90fb2c9..b923560 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -212,6 +212,10 @@ struct GTY(()) machine_function
   rtx axis_predicate[2]; /* Neutering predicates.  */
   int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
 		      vector_length, dim[1] is num_workers.  */
+  rtx bcast_partition; /* Register containing the size of each
+			  vector's partition of share-memory used to
+			  broadcast state.  */
+  rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 02/11] [nvptx] Rename worker_bcast variables oacc_bcast.
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (5 preceding siblings ...)
  2018-07-24 20:48 ` [PATCH 08/11] [nvptx] Add axis_dim cesar
@ 2018-07-24 20:49 ` cesar
  2018-07-24 20:49 ` [PATCH 11/11] [nvptx] Generalize state propagation and synchronization cesar
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:49 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Cesar Philippidis <cesar@codesourcery.com>

Eventually, we want the nvptx BE to use a common shared memory buffer
for both worker and vector state propagation (albeit using different
partitions of shared memory for each logical thread). This patch
renames the worker_bcast variables into a more generic oacc_bcast.

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (worker_bcast_size): Rename as
	oacc_bcast_size.
	(worker_bcast_align): Rename as oacc_bcast_align.
	(worker_bcast_sym): Rename as oacc_bcast_sym.
	(nvptx_option_override): Update usage of oacc_bcast_*.
	(struct wcast_data_t): Rename as broadcast_data_t.
	(nvptx_gen_wcast): Update type of data argument and usage of
	oacc_bcast_align.
	(wprop_gen): Update type of data_ and usage of oacc_bcast_align.
	(nvptx_wpropagate): Update type of data and usage of
	oacc_bcast_{sym,size}.
	(nvptx_single): Update type of data and usage of oacc_bcast_size.
	(nvptx_file_end): Update usage of oacc_bcast_{sym,align,size}.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 521f83e..fb3e0c7 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -129,14 +129,15 @@ struct tree_hasher : ggc_cache_ptr_hash<tree_node>
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
-/* Buffer needed to broadcast across workers.  This is used for both
-   worker-neutering and worker broadcasting.  It is shared by all
-   functions emitted.  The buffer is placed in shared memory.  It'd be
-   nice if PTX supported common blocks, because then this could be
-   shared across TUs (taking the largest size).  */
-static unsigned worker_bcast_size;
-static unsigned worker_bcast_align;
-static GTY(()) rtx worker_bcast_sym;
+/* Buffer needed to broadcast across workers and vectors.  This is
+   used for both worker-neutering and worker broadcasting, and
+   vector-neutering and boardcasting when vector_length > 32.  It is
+   shared by all functions emitted.  The buffer is placed in shared
+   memory.  It'd be nice if PTX supported common blocks, because then
+   this could be shared across TUs (taking the largest size).  */
+static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_align;
+static GTY(()) rtx oacc_bcast_sym;
 
 /* Buffer needed for worker reductions.  This has to be distinct from
    the worker broadcast array, as both may be live concurrently.  */
@@ -209,9 +210,9 @@ nvptx_option_override (void)
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
 
-  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_bcast");
-  SET_SYMBOL_DATA_AREA (worker_bcast_sym, DATA_AREA_SHARED);
-  worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast");
+  SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED);
+  oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
@@ -1756,7 +1757,7 @@ nvptx_gen_vcast (rtx reg)
 
 /* Structure used when generating a worker-level spill or fill.  */
 
-struct wcast_data_t
+struct broadcast_data_t
 {
   rtx base;  /* Register holding base addr of buffer.  */
   rtx ptr;  /* Iteration var,  if needed.  */
@@ -1780,7 +1781,7 @@ enum propagate_mask
    how many loop iterations will be executed (0 for not a loop).  */
    
 static rtx
-nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *data)
 {
   rtx  res;
   machine_mode mode = GET_MODE (reg);
@@ -1810,8 +1811,8 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
 	  {
 	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
 
-	    if (align > worker_bcast_align)
-	      worker_bcast_align = align;
+	    if (align > oacc_bcast_align)
+	      oacc_bcast_align = align;
 	    data->offset = (data->offset + align - 1) & ~(align - 1);
 	    addr = data->base;
 	    if (data->offset)
@@ -3916,15 +3917,15 @@ nvptx_vpropagate (bool is_call, basic_block block, rtx_insn *insn)
 static rtx
 wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
 {
-  wcast_data_t *data = (wcast_data_t *)data_;
+  broadcast_data_t *data = (broadcast_data_t *)data_;
 
   if (pm & PM_loop_begin)
     {
       /* Starting a loop, initialize pointer.    */
       unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
 
-      if (align > worker_bcast_align)
-	worker_bcast_align = align;
+      if (align > oacc_bcast_align)
+	oacc_bcast_align = align;
       data->offset = (data->offset + align - 1) & ~(align - 1);
 
       data->ptr = gen_reg_rtx (Pmode);
@@ -3949,7 +3950,7 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
 static bool
 nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
 {
-  wcast_data_t data;
+  broadcast_data_t data;
 
   data.base = gen_reg_rtx (Pmode);
   data.offset = 0;
@@ -3961,11 +3962,11 @@ nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
   if (data.offset)
     {
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, worker_bcast_sym);
+      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
       emit_insn_after (init, insn);
 
-      if (worker_bcast_size < data.offset)
-	worker_bcast_size = data.offset;
+      if (oacc_bcast_size < data.offset)
+	oacc_bcast_size = data.offset;
     }
   return empty;
 }
@@ -4335,13 +4336,13 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	{
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
-	  wcast_data_t data;
+	  broadcast_data_t data;
 
-	  data.base = worker_bcast_sym;
+	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (worker_bcast_size < GET_MODE_SIZE (SImode))
-	    worker_bcast_size = GET_MODE_SIZE (SImode);
+	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
+	    oacc_bcast_size = GET_MODE_SIZE (SImode);
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
@@ -4967,9 +4968,9 @@ nvptx_file_end (void)
     nvptx_record_fndecl (decl);
   fputs (func_decls.str().c_str(), asm_out_file);
 
-  if (worker_bcast_size)
-    write_worker_buffer (asm_out_file, worker_bcast_sym,
-			 worker_bcast_align, worker_bcast_size);
+  if (oacc_bcast_size)
+    write_worker_buffer (asm_out_file, oacc_bcast_sym,
+			 oacc_bcast_align, oacc_bcast_size);
 
   if (worker_red_size)
     write_worker_buffer (asm_out_file, worker_red_sym,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 09/11] [nvptx] Use TARGET_SET_CURRENT_FUNCTION
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (8 preceding siblings ...)
  2018-07-24 20:49 ` [PATCH 06/11] [nvptx] only use one bar.sync barriers in OpenACC offloaded code cesar
@ 2018-07-24 20:49 ` cesar
  2018-07-24 20:49 ` [PATCH 03/11] [nvptx] Consolidate offloaded function attributes into struct offload_attrs cesar
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:49 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Cesar Philippidis <cesar@codesourcery.com>

Chung-Lin had originally defined TARGET_SET_CURRENT_FUNCTION as part
of his gang-local variable patch. But I intend to introduce those
changes at a later time. Eventually the state propagation code will
utilize nvptx_set_current_function to reset the reduction buffer
offset. However, for the time being, this patch only introduces
it as a placeholder.

2018-XX-YY  Chung-Lin Tang  <cltang@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com

	gcc/
	config/nvptx/nvptx.c (nvptx_previous_fndecl): Declare.
	(nvptx_set_current_function): New function.
	(TARGET_SET_CURRENT_FUNCTION): Define.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index e3a02d2..09daedd 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -6012,6 +6012,17 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+static GTY(()) tree nvptx_previous_fndecl;
+
+static void
+nvptx_set_current_function (tree fndecl)
+{
+  if (!fndecl || fndecl == nvptx_previous_fndecl)
+    return;
+
+  nvptx_previous_fndecl = fndecl;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6145,6 +6156,9 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
 #undef TARGET_CAN_CHANGE_MODE_CLASS
 #define TARGET_CAN_CHANGE_MODE_CLASS nvptx_can_change_mode_class
 
+#undef TARGET_SET_CURRENT_FUNCTION
+#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 03/11] [nvptx] Consolidate offloaded function attributes into struct offload_attrs
       [not found] <cover.1532464999.git.cesar@codesourcery.com>
                   ` (9 preceding siblings ...)
  2018-07-24 20:49 ` [PATCH 09/11] [nvptx] Use TARGET_SET_CURRENT_FUNCTION cesar
@ 2018-07-24 20:49 ` cesar
  10 siblings, 0 replies; 11+ messages in thread
From: cesar @ 2018-07-24 20:49 UTC (permalink / raw)
  To: tdevries, gcc-patches

From: Cesar Philippidis <cesar@codesourcery.com>

This patch introduces a new struct offload_attrs, which contains the
details regarding the offload function launch geometry. In addition to
its current usage to neuter worker and vector threads, it will
eventually by used to validate the compile-time launch geometry
requested by the user.

2018-XX-YY  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (struct offload_attrs): New.
	(populate_offload_attrs): New function.
	(nvptx_reorg): Use it to extract partitioning mask.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index fb3e0c7..1b83b3c 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2871,6 +2871,17 @@ nvptx_reorg_uniform_simt ()
     }
 }
 
+/* Offloading function attributes.  */
+
+struct offload_attrs
+{
+  unsigned mask;
+  int num_gangs;
+  int num_workers;
+  int vector_length;
+  int max_workers;
+};
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -4568,6 +4579,56 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
     nvptx_neuter_pars (par->next, modes, outer);
 }
 
+static void
+populate_offload_attrs (offload_attrs *oa)
+{
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+  tree dims = TREE_VALUE (attr);
+  unsigned ix;
+
+  oa->mask = 0;
+
+  for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+    {
+      tree t = TREE_VALUE (dims);
+      int size = (t == NULL_TREE) ? 0 : TREE_INT_CST_LOW (t);
+      tree allowed = TREE_PURPOSE (dims);
+
+      if (size != 1 && !(allowed && integer_zerop (allowed)))
+	oa->mask |= GOMP_DIM_MASK (ix);
+
+      switch (ix)
+	{
+	case GOMP_DIM_GANG:
+	  oa->num_gangs = size;
+	  break;
+
+	case GOMP_DIM_WORKER:
+	  oa->num_workers = size;
+	  break;
+
+	case GOMP_DIM_VECTOR:
+	  oa->vector_length = size;
+	  break;
+	}
+    }
+
+  if (oa->vector_length == 0)
+    {
+      /* FIXME: Need a more graceful way to handle large vector
+	 lengths in OpenACC routines.  */
+      if (!lookup_attribute ("omp target entrypoint",
+			     DECL_ATTRIBUTES (current_function_decl)))
+	oa->vector_length = PTX_WARP_SIZE;
+      else
+	oa->vector_length = PTX_VECTOR_LENGTH;
+    }
+  if (oa->num_workers == 0)
+    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
+  else
+    oa->max_workers = oa->num_workers;
+}
+
 #if WORKAROUND_PTXJIT_BUG_2
 /* Variant of pc_set that only requires JUMP_P (INSN) if STRICT.  This variant
    is needed in the nvptx target because the branches generated for
@@ -4749,27 +4810,19 @@ nvptx_reorg (void)
     {
       /* If we determined this mask before RTL expansion, we could
 	 elide emission of some levels of forks and joins.  */
-      unsigned mask = 0;
-      tree dims = TREE_VALUE (attr);
-      unsigned ix;
+      offload_attrs oa;
 
-      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
-	{
-	  int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
-	  tree allowed = TREE_PURPOSE (dims);
+      populate_offload_attrs (&oa);
 
-	  if (size != 1 && !(allowed && integer_zerop (allowed)))
-	    mask |= GOMP_DIM_MASK (ix);
-	}
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
-      gcc_assert (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
-		  || (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
+      gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+		  || (oa.mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
 
       /* Discover & process partitioned regions.  */
       parallel *pars = nvptx_discover_pars (&bb_insn_map);
       nvptx_process_pars (pars);
-      nvptx_neuter_pars (pars, mask, 0);
+      nvptx_neuter_pars (pars, oa.mask, 0);
       delete pars;
     }
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-07-24 20:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1532464999.git.cesar@codesourcery.com>
2018-07-24 20:47 ` [PATCH 04/11] [nvptx] Make nvptx state propagation function names more generic cesar
2018-07-24 20:47 ` [PATCH 05/11] [nvptx] Fix whitespace in nvptx_single and nvptx_neuter_pars cesar
2018-07-24 20:47 ` [PATCH 01/11] [nvptx] Update openacc dim macros cesar
2018-07-24 20:48 ` [PATCH 07/11] [nvptx] Add thread count parm to bar.sync cesar
2018-07-24 20:48 ` [PATCH 10/11] [nvptx] Use MAX, MIN, ROUND_UP macros cesar
2018-07-24 20:48 ` [PATCH 08/11] [nvptx] Add axis_dim cesar
2018-07-24 20:49 ` [PATCH 02/11] [nvptx] Rename worker_bcast variables oacc_bcast cesar
2018-07-24 20:49 ` [PATCH 11/11] [nvptx] Generalize state propagation and synchronization cesar
2018-07-24 20:49 ` [PATCH 06/11] [nvptx] only use one bar.sync barriers in OpenACC offloaded code cesar
2018-07-24 20:49 ` [PATCH 09/11] [nvptx] Use TARGET_SET_CURRENT_FUNCTION cesar
2018-07-24 20:49 ` [PATCH 03/11] [nvptx] Consolidate offloaded function attributes into struct offload_attrs cesar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).