[PATCH 5/8] nvptx mkoffload: pass -mgomp for OpenMP offloading

public inbox for gcc-patches@gcc.gnu.org

* [PATCH 5/8] nvptx mkoffload: pass -mgomp for OpenMP offloading
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
                   ` (6 preceding siblings ...)
  2016-10-14 16:40 ` [PATCH 6/8] new target hook: TARGET_SIMT_VF Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-17 15:51 ` [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Bernd Schmidt
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC
  To: gcc-patches; +Cc: Nathan Sidwell

This patch wires up use of the alternative -mgomp multilib for OpenMP
offloading via nvptx mkoffload.  This makes OpenACC and OpenMP mutually
exclusive within a single offloading compilation, so I've added a diagnostic
for that.
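
For illustration only, here is a minimal standalone sketch of the check this
patch adds: exactly one of -fopenacc and -fopenmp must appear on the command
line, and -mgomp is passed on only in the OpenMP case.  The names
select_offload_mode, extra_offload_option and enum offload_mode are
hypothetical and exist only in this sketch; mkoffload itself reports the
error via fatal_error rather than returning a code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical result codes for this sketch only.  */
enum offload_mode { MODE_INVALID, MODE_OPENACC, MODE_OPENMP };

/* Scan ARGV for -fopenmp and -fopenacc and report which single
   offloading model was requested.  Both or neither is invalid.  */
enum offload_mode
select_offload_mode (int argc, const char *const *argv)
{
  bool fopenmp = false;
  bool fopenacc = false;

  for (int i = 1; i < argc; i++)
    {
      if (strcmp (argv[i], "-fopenmp") == 0)
        fopenmp = true;
      else if (strcmp (argv[i], "-fopenacc") == 0)
        fopenacc = true;
    }

  /* Same exclusive-or test as in the patch.  */
  if (!(fopenacc ^ fopenmp))
    return MODE_INVALID;
  return fopenmp ? MODE_OPENMP : MODE_OPENACC;
}

/* Return the extra option for the offload compiler, or NULL if none.  */
const char *
extra_offload_option (enum offload_mode mode)
{
  return mode == MODE_OPENMP ? "-mgomp" : NULL;
}
```

With this model, a command line carrying both options, or neither, maps to
MODE_INVALID, which corresponds to the new diagnostic.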

	* config/nvptx/mkoffload.c (main): Check that either OpenACC or OpenMP
	is selected.  Pass -mgomp to offload compiler in OpenMP case.
---
 gcc/config/nvptx/mkoffload.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index c8eed45..d876c7b 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -460,6 +460,7 @@ main (int argc, char **argv)
 
   /* Scan the argument vector.  */
   bool fopenmp = false;
+  bool fopenacc = false;
   for (int i = 1; i < argc; i++)
     {
 #define STR "-foffload-abi="
@@ -476,11 +477,15 @@ main (int argc, char **argv)
 #undef STR
       else if (strcmp (argv[i], "-fopenmp") == 0)
 	fopenmp = true;
+      else if (strcmp (argv[i], "-fopenacc") == 0)
+	fopenacc = true;
       else if (strcmp (argv[i], "-save-temps") == 0)
 	save_temps = true;
       else if (strcmp (argv[i], "-v") == 0)
 	verbose = true;
     }
+  if (!(fopenacc ^ fopenmp))
+    fatal_error (input_location, "either -fopenacc or -fopenmp must be set");
 
   struct obstack argv_obstack;
   obstack_init (&argv_obstack);
@@ -501,6 +506,8 @@ main (int argc, char **argv)
     default:
       gcc_unreachable ();
     }
+  if (fopenmp)
+    obstack_ptr_grow (&argv_obstack, "-mgomp");
 
   for (int ix = 1; ix != argc; ix++)
     {
-- 
1.8.3.1


* [PATCH 3/8] nvptx -muniform-simt
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
                   ` (4 preceding siblings ...)
  2016-10-14 16:40 ` [PATCH 8/8] nvptx: handle OpenMP "omp target entrypoint" Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-14 16:40 ` [PATCH 6/8] new target hook: TARGET_SIMT_VF Alexander Monakov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC
  To: gcc-patches; +Cc: Nathan Sidwell

This patch implements the -muniform-simt code generation option, which is used
to emit code for OpenMP offloading.  The goal is to emit code that can either
execute "normally", or execute in a way that keeps all lanes in a given warp
active, their local state synchronized, and observable effects from execution
happening as if only one lane were active.  The latter mode is how OpenMP
offloaded code runs outside of SIMD regions.

To achieve that, the compiler instruments atomic instructions and calls to
functions provided by the CUDA runtime (malloc, free, vprintf), i.e. those
that GCC itself doesn't compile.

Instrumentation converts an atomic instruction to a predicated atomic
instruction followed by a warp shuffle.  To illustrate,

   atom.op dest, <args>

becomes

  @PRED	atom.op	dest, <args>
   shfl.idx dest, dest, MASTER

where, outside of SIMD regions:

- PRED is true in lane 0, false in lanes 1-31, so the side effect happens once
- MASTER is 0 in all lanes, so the shuffle synchronizes 'dest' among all lanes

and inside of SIMD regions:

- PRED is true in all lanes, so the atomic is done in all lanes independently
- MASTER equals the current lane number, so the shuffle is a no-op.
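
As a rough host-side model of the transformed sequence (a sketch, not code
from the patch), the following C function simulates one warp executing a
predicated atomic add followed by an indexed shuffle.  The function name
warp_atomic_add and the array-per-lane representation are assumptions of
this sketch, and lanes are serialized in lane order, which real hardware
does not promise.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Model "@PRED atom.add dest, [mem], val" followed by
   "shfl.idx dest, dest, MASTER" for one warp.  pred[], master[] and
   dest[] hold per-lane values; *mem is the shared location.  */
void
warp_atomic_add (unsigned *mem, unsigned val, const int pred[WARP_SIZE],
                 const unsigned master[WARP_SIZE], unsigned dest[WARP_SIZE])
{
  unsigned fetched[WARP_SIZE];

  /* Predicated atomic: only lanes with a true predicate touch memory;
     the other lanes keep their old 'dest' value for now.  */
  for (int lane = 0; lane < WARP_SIZE; lane++)
    {
      fetched[lane] = dest[lane];
      if (pred[lane])
        {
          fetched[lane] = *mem;
          *mem += val;
        }
    }

  /* Shuffle: every lane reads the value held by its MASTER lane.  */
  for (int lane = 0; lane < WARP_SIZE; lane++)
    {
      assert (master[lane] < WARP_SIZE);
      dest[lane] = fetched[master[lane]];
    }
}
```

With PRED true only in lane 0 and MASTER zero everywhere, memory is updated
once and all lanes end up with the same fetched value; with PRED true
everywhere and MASTER equal to the lane number, each lane performs and keeps
its own atomic result.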

To keep track of the current state and compute PRED and MASTER, the compiler
uses a shared memory array 'unsigned __nvptx_uni[]' that holds one mask word
per warp.  The mask word is all-zeros outside of SIMD regions and all-ones
inside.  The function prologue uses the mask to compute MASTER and PRED via:

    MASTER = LANE_ID & MASK;
    PRED   = LANE_ID == MASTER;

Calls are handled like atomics.
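
The two formulas above can be checked in isolation with a small C sketch.
The helper names unisimt_master and unisimt_pred are made up for this
example; in the patch the same computation is emitted as PTX in the function
prologue, using tid.x as the lane id and __nvptx_uni[tid.y] as the mask.

```c
#include <stdbool.h>

/* MASK is the per-warp word: 0 outside SIMD regions, all-ones inside.  */

/* MASTER = LANE_ID & MASK: lane 0 outside SIMD regions, the lane's own
   index inside them.  */
unsigned
unisimt_master (unsigned lane_id, unsigned mask)
{
  return lane_id & mask;
}

/* PRED = LANE_ID == MASTER: true only in lane 0 outside SIMD regions,
   true in every lane inside them.  */
bool
unisimt_pred (unsigned lane_id, unsigned mask)
{
  return lane_id == unisimt_master (lane_id, mask);
}
```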

gcc/
	* config/nvptx/nvptx.c (need_unisimt_decl): New variable.  Set it...
	(nvptx_init_unisimt_predicate): ...here (new function) and use it...
	(nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
	(nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New.  Transform code for -muniform-simt.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_unisimt__ when -muniform-simt option is active.
	(struct machine_function): Add unisimt_master, unisimt_predicate
	rtx fields.
	* config/nvptx/nvptx.md (atomic): New attribute.
	(atomic_compare_and_swap<mode>_1): Mark with atomic attribute.
	(atomic_exchange<mode>): Ditto.
	(atomic_fetch_add<mode>): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_<logic><mode>): Ditto.
	* config/nvptx/nvptx.opt (muniform-simt): New option.
	* doc/invoke.texi (-muniform-simt): Document.

gcc/testsuite/
	* gcc.target/nvptx/unisimt.c: New test.

libgcc/
	* config/nvptx/crt0.c (__main): Set up __nvptx_uni.
---
 gcc/config/nvptx/nvptx.c                 | 124 +++++++++++++++++++++++++++++++
 gcc/config/nvptx/nvptx.h                 |   4 +
 gcc/config/nvptx/nvptx.md                |  18 +++--
 gcc/config/nvptx/nvptx.opt               |   4 +
 gcc/doc/invoke.texi                      |  11 +++
 gcc/testsuite/gcc.target/nvptx/unisimt.c |  22 ++++++
 libgcc/config/nvptx/crt0.c               |   4 +
 7 files changed, 182 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/unisimt.c

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 4cdaa1e..65217ab 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -144,6 +144,9 @@ static GTY(()) tree global_lock_var;
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 
+/* True if any function references __nvptx_uni.  */
+static bool need_unisimt_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -1058,6 +1061,34 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize predicate and master lane index registers for
+   -muniform-simt code generation variant.  */
+
+static void
+nvptx_init_unisimt_predicate (FILE *file)
+{
+  int bits = POINTER_SIZE;
+  int master = REGNO (cfun->machine->unisimt_master);
+  int pred = REGNO (cfun->machine->unisimt_predicate);
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32 %%ustmp0;\n");
+  fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits);
+  fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n");
+  fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n",
+	   bits == 64 ? ".wide" : ".lo");
+  fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits);
+  fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits);
+  fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n");
+  /* Compute 'master lane index' as 'tid.x & __nvptx_uni[tid.y]'.  */
+  fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master);
+  /* Compute predicate as 'tid.x == master'.  */
+  fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master);
+  fprintf (file, "\t}\n");
+  need_unisimt_decl = true;
+}
+
 /* Implement ASM_DECLARE_FUNCTION_NAME.  Writes the start of a ptx
    function, including local var decls and copies from the arguments to
    local regs.  */
@@ -1149,6 +1180,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->axis_predicate[1])
     nvptx_init_axis_predicate (file,
 			       REGNO (cfun->machine->axis_predicate[1]), "x");
+  if (cfun->machine->unisimt_predicate)
+    nvptx_init_unisimt_predicate (file);
 }
 
 /* Output instruction that sets soft stack pointer in shared memory to the
@@ -2409,6 +2442,89 @@ nvptx_reorg_subreg (void)
     }
 }
 
+/* Return a SImode "master lane index" register for uniform-simt, allocating on
+   first use.  */
+
+static rtx
+nvptx_get_unisimt_master ()
+{
+  rtx &master = cfun->machine->unisimt_master;
+  return master ? master : master = gen_reg_rtx (SImode);
+}
+
+/* Return a BImode "predicate" register for uniform-simt, similar to above.  */
+
+static rtx
+nvptx_get_unisimt_predicate ()
+{
+  rtx &pred = cfun->machine->unisimt_predicate;
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
+/* Return true if given call insn references one of the functions provided by
+   the CUDA runtime: malloc, free, vprintf.  */
+
+static bool
+nvptx_call_insn_is_syscall_p (rtx_insn *insn)
+{
+  rtx pat = PATTERN (insn);
+  gcc_checking_assert (GET_CODE (pat) == PARALLEL);
+  pat = XVECEXP (pat, 0, 0);
+  if (GET_CODE (pat) == SET)
+    pat = SET_SRC (pat);
+  gcc_checking_assert (GET_CODE (pat) == CALL
+		       && GET_CODE (XEXP (pat, 0)) == MEM);
+  rtx addr = XEXP (XEXP (pat, 0), 0);
+  if (GET_CODE (addr) != SYMBOL_REF)
+    return false;
+  const char *name = XSTR (addr, 0);
+  /* Ordinary malloc/free are redirected to __nvptx_{malloc,free), so only the
+     references with forced assembler name refer to PTX syscalls.  For vprintf,
+     accept both normal and forced-assembler-name references.  */
+  return (!strcmp (name, "vprintf") || !strcmp (name, "*vprintf")
+	  || !strcmp (name, "*malloc")
+	  || !strcmp (name, "*free"));
+}
+
+/* If SET subexpression of INSN sets a register, emit a shuffle instruction to
+   propagate its value from lane MASTER to current lane.  */
+
+static void
+nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
+{
+  rtx reg;
+  if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
+    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+}
+
+/* Adjust code for uniform-simt code generation variant by making atomics and
+   "syscalls" conditionally executed, and inserting shuffle-based propagation
+   for registers being set.  */
+
+static void
+nvptx_reorg_uniform_simt ()
+{
+  rtx_insn *insn, *next;
+
+  for (insn = get_insns (); insn; insn = next)
+    {
+      next = NEXT_INSN (insn);
+      if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn))
+	  && !(NONJUMP_INSN_P (insn)
+	       && GET_CODE (PATTERN (insn)) == PARALLEL
+	       && get_attr_atomic (insn)))
+	continue;
+      rtx pat = PATTERN (insn);
+      rtx master = nvptx_get_unisimt_master ();
+      for (int i = 0; i < XVECLEN (pat, 0); i++)
+	nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+      rtx pred = nvptx_get_unisimt_predicate ();
+      pred = gen_rtx_NE (BImode, pred, const0_rtx);
+      pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
+      validate_change (insn, &PATTERN (insn), pat, false);
+    }
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -3931,6 +4047,9 @@ nvptx_reorg (void)
   /* Replace subregs.  */
   nvptx_reorg_subreg ();
 
+  if (TARGET_UNIFORM_SIMT)
+    nvptx_reorg_uniform_simt ();
+
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -4127,6 +4246,11 @@ nvptx_file_end (void)
       fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",
 	       POINTER_SIZE);
     }
+  if (need_unisimt_decl)
+    {
+      write_var_marker (asm_out_file, false, true, "__nvptx_uni");
+      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n");
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 2c4962a..1702178 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -33,6 +33,8 @@
       builtin_define ("__nvptx__");		\
       if (TARGET_SOFT_STACK)			\
         builtin_define ("__nvptx_softstack__");	\
+      if (TARGET_UNIFORM_SIMT)			\
+        builtin_define ("__nvptx_unisimt__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -215,6 +217,8 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
+  rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
 };
 #endif
 \f
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 5c5c991..35ae71e 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -62,6 +62,9 @@ (define_c_enum "unspecv" [
 (define_attr "subregs_ok" "false,true"
   (const_string "false"))
 
+(define_attr "atomic" "false,true"
+  (const_string "false"))
+
 ;; The nvptx operand predicates, in general, don't permit subregs and
 ;; only literal constants, which differ from the generic ones, which
 ;; permit subregs and symbolc constants (as appropriate)
@@ -1218,7 +1221,8 @@ (define_insn "atomic_compare_and_swap<mode>_1"
    (set (match_dup 1)
 	(unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))]
   ""
-  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;")
+  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;"
+  [(set_attr "atomic" "true")])
 
 (define_insn "atomic_exchange<mode>"
   [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")	;; output
@@ -1229,7 +1233,8 @@ (define_insn "atomic_exchange<mode>"
    (set (match_dup 1)
 	(match_operand:SDIM 2 "nvptx_nonmemory_operand" "Ri"))]	;; input
   ""
-  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;")
+  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;"
+  [(set_attr "atomic" "true")])
 
 (define_insn "atomic_fetch_add<mode>"
   [(set (match_operand:SDIM 1 "memory_operand" "+m")
@@ -1241,7 +1246,8 @@ (define_insn "atomic_fetch_add<mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "atomic" "true")])
 
 (define_insn "atomic_fetch_addsf"
   [(set (match_operand:SF 1 "memory_operand" "+m")
@@ -1253,7 +1259,8 @@ (define_insn "atomic_fetch_addsf"
    (set (match_operand:SF 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "atomic" "true")])
 
 (define_code_iterator any_logic [and ior xor])
 (define_code_attr logic [(and "and") (ior "or") (xor "xor")])
@@ -1269,7 +1276,8 @@ (define_insn "atomic_fetch_<logic><mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   "0"
-  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;"
+  [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
   [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 71ddadb..0d46e1d 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -36,3 +36,7 @@ Optimize partition neutering.
 msoft-stack
 Target Report Mask(SOFT_STACK)
 Use custom stacks instead of local memory for automatic storage.
+
+muniform-simt
+Target Report Mask(UNIFORM_SIMT)
+Generate code that can keep local state uniform across all lanes.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 57dd62f..5150b2f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -20270,6 +20270,17 @@ for OpenMP offloading, but the option is exposed on its own for the purpose
 of testing the compiler; to generate code suitable for linking into programs
 using OpenMP offloading, use option @option{-mgomp}.
 
+@item -muniform-simt
+@opindex muniform-simt
+Generate code that may keep all lanes in each warp active, even when
+observable effects from execution must appear as if only one lane is active.
+This is achieved by instrumenting syscalls and atomic instructions in a
+lightweight way, allowing the compiler to emit code that can switch at run
+time between this and conventional execution modes. This code generation
+variant is used for OpenMP offloading, but the option is exposed on its own
+for the purpose of testing the compiler; to generate code suitable for linking
+into programs using OpenMP offloading, use option @option{-mgomp}.
+
 @end table
 
 @node PDP-11 Options
diff --git a/gcc/testsuite/gcc.target/nvptx/unisimt.c b/gcc/testsuite/gcc.target/nvptx/unisimt.c
new file mode 100644
index 0000000..d268721
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/unisimt.c
@@ -0,0 +1,22 @@
+/* { dg-options "-O2 -muniform-simt" } */
+/* { dg-do run } */
+
+#include <stdarg.h>
+
+static __attribute__((noinline,noclone)) int print (const char *fmt, ...)
+{
+  va_list va;
+  va_start (va, fmt);
+  int r = __builtin_vprintf (fmt, va);
+  va_end (va);
+  return r;
+}
+
+int main()
+{
+  static int v;
+  __sync_fetch_and_add (&v, 1);
+  if (print ("%d", v) != 1) /* { dg-output "^1$" } */
+    __builtin_abort ();
+  return 0;
+}
diff --git a/libgcc/config/nvptx/crt0.c b/libgcc/config/nvptx/crt0.c
index 290dc6c..5a7dbf6 100644
--- a/libgcc/config/nvptx/crt0.c
+++ b/libgcc/config/nvptx/crt0.c
@@ -29,6 +29,9 @@ extern int main (int, void **);
    must match the external declaration emitted by the compiler.  */
 void *__nvptx_stacks[32] __attribute__((shared,nocommon));
 
+/* Likewise for -muniform-simt.  */
+unsigned __nvptx_uni[32] __attribute__((shared,nocommon));
+
 void __attribute__((kernel))
 __main (int *rval_ptr, int argc, void **argv)
 {
@@ -40,6 +43,7 @@ __main (int *rval_ptr, int argc, void **argv)
 
   static char stack[131072] __attribute__((aligned(8)));
   __nvptx_stacks[0] = stack + sizeof stack;
+  __nvptx_uni[0] = 0;
 
   exit (main (argc, argv));
 }
-- 
1.8.3.1


* [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
@ 2016-10-14 16:40 Alexander Monakov
  2016-10-14 16:40 ` [PATCH 4/8] nvptx -mgomp Alexander Monakov
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC
  To: gcc-patches; +Cc: Nathan Sidwell

Hi,

I'm resending the patch series with backend prerequisites for OpenMP
offloading to the NVIDIA PTX ISA.  The patches are rebased on trunk.

Could a global reviewer have a look at patch 6 (new TARGET_SIMT_VF hook) please?

Documentation changes in doc/invoke.texi have already been reviewed
by Sandra Loosemore (thank you!).

Alexander


* [PATCH 6/8] new target hook: TARGET_SIMT_VF
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
                   ` (5 preceding siblings ...)
  2016-10-14 16:40 ` [PATCH 3/8] nvptx -muniform-simt Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-14 16:40 ` [PATCH 5/8] nvptx mkoffload: pass -mgomp for OpenMP offloading Alexander Monakov
  2016-10-17 15:51 ` [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Bernd Schmidt
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC
  To: gcc-patches; +Cc: Nathan Sidwell

This patch adds a new target hook and implements it in a straightforward
manner on NVPTX to indicate that the target is running in SIMT fashion with 32
threads in a synchronous group ("warp").  The hook is intended for use in
OpenMP transforms.
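
To illustrate how such a hook might be consumed, here is a standalone C
sketch.  Only the hook's signature (no arguments, int result), its NULL
default and the value 32 come from this patch; struct simt_hooks,
sketch_nvptx_simt_vf and simt_vf_or_one are hypothetical, and treating a
missing hook as a factor of 1 is purely an assumption of the sketch.

```c
#include <stddef.h>

#define PTX_VECTOR_LENGTH 32

/* Hypothetical stand-in for the hook vector; the hook pointer is NULL on
   targets that do not implement it.  */
struct simt_hooks
{
  int (*vf) (void);
};

/* NVPTX-style implementation: number of threads in a warp.  */
int
sketch_nvptx_simt_vf (void)
{
  return PTX_VECTOR_LENGTH;
}

/* Hypothetical caller: use the hook when present, otherwise assume a
   single thread per group.  */
int
simt_vf_or_one (const struct simt_hooks *hooks)
{
  return hooks->vf != NULL ? hooks->vf () : 1;
}
```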

	* config/nvptx/nvptx.c (nvptx_simt_vf): New.
	(TARGET_SIMT_VF): Define.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_SIMT_VF): New hook.
	* target.def: Define it.
---
 gcc/config/nvptx/nvptx.c | 11 +++++++++++
 gcc/doc/tm.texi          |  4 ++++
 gcc/doc/tm.texi.in       |  2 ++
 gcc/target.def           | 12 ++++++++++++
 4 files changed, 29 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 1c3267f..ef85ef6 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4453,6 +4453,14 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 #define PTX_WORKER_LENGTH 32
 #define PTX_GANG_DEFAULT  32
 
+/* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
+
+static int
+nvptx_simt_vf ()
+{
+  return PTX_VECTOR_LENGTH;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  If
@@ -5221,6 +5229,9 @@ nvptx_goacc_reduction (gcall *call)
 #undef  TARGET_BUILTIN_DECL
 #define TARGET_BUILTIN_DECL nvptx_builtin_decl
 
+#undef TARGET_SIMT_VF
+#define TARGET_SIMT_VF nvptx_simt_vf
+
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index a4a8e49..76477d6 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5836,6 +5836,10 @@ usable.  In that case, the smaller the number is, the more desirable it is
 to use it.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_SIMT_VF (void)
+Return number of threads in SIMT thread group on the target.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
 This hook should check the launch dimensions provided for an OpenACC
 compute region, or routine.  Defaulted values are represented as -1
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 265f1be..36672af 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4289,6 +4289,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_SIMD_CLONE_USABLE
 
+@hook TARGET_SIMT_VF
+
 @hook TARGET_GOACC_VALIDATE_DIMS
 
 @hook TARGET_GOACC_DIM_LIMIT
diff --git a/gcc/target.def b/gcc/target.def
index b6968f7..0018f4d 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1648,6 +1648,18 @@ int, (struct cgraph_node *), NULL)
 
 HOOK_VECTOR_END (simd_clone)
 
+/* Functions relating to OpenMP SIMT vectorization transform.  */
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_SIMT_"
+HOOK_VECTOR (TARGET_SIMT, simt)
+
+DEFHOOK
+(vf,
+"Return number of threads in SIMT thread group on the target.",
+int, (void), NULL)
+
+HOOK_VECTOR_END (simt)
+
 /* Functions relating to openacc.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_GOACC_"
-- 
1.8.3.1


* [PATCH 7/8] nvptx backend: new insns for OpenMP SIMD-via-SIMT
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
  2016-10-14 16:40 ` [PATCH 4/8] nvptx -mgomp Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-14 16:40 ` [PATCH 2/8] nvptx: implement predicated instructions Alexander Monakov
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC
  To: gcc-patches; +Cc: Nathan Sidwell

This patch implements in nvptx.md a few new instruction patterns that are used
for OpenMP SIMD code.
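
As a host-side model of the warp primitives these patterns expand to (a
sketch, not code from the patch), the C functions below simulate
vote.ballot, a count-trailing-zeros on the ballot as used by
omp_simt_last_lane, and a butterfly exchange as used by omp_simt_xchg_bfly.
All function names are invented for the example, and the sum reduction in
warp_sum is only an assumed use of the butterfly exchange.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Model of vote.ballot.b32: bit N of the result is set iff lane N
   supplied a non-zero value.  */
uint32_t
warp_vote_ballot (const int cond[WARP_SIZE])
{
  uint32_t ballot = 0;
  for (int lane = 0; lane < WARP_SIZE; lane++)
    if (cond[lane] != 0)
      ballot |= UINT32_C (1) << lane;
  return ballot;
}

/* Count trailing zero bits; X must be non-zero.  */
int
count_trailing_zeros (uint32_t x)
{
  assert (x != 0);
  int n = 0;
  while ((x & 1) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Model of a butterfly shuffle: lane N reads the value held by lane
   N ^ LANE_MASK.  */
void
warp_xchg_bfly (const int src[WARP_SIZE], int dst[WARP_SIZE],
                unsigned lane_mask)
{
  assert (lane_mask < WARP_SIZE);
  for (unsigned lane = 0; lane < WARP_SIZE; lane++)
    dst[lane] = src[lane ^ lane_mask];
}

/* Assumed use: sum all lanes in log2(WARP_SIZE) butterfly steps.  After
   the last step every lane holds the total; lane 0's copy is returned.  */
int
warp_sum (const int in[WARP_SIZE])
{
  int cur[WARP_SIZE], other[WARP_SIZE];
  for (int lane = 0; lane < WARP_SIZE; lane++)
    cur[lane] = in[lane];
  for (unsigned m = WARP_SIZE / 2; m > 0; m >>= 1)
    {
      warp_xchg_bfly (cur, other, m);
      for (int lane = 0; lane < WARP_SIZE; lane++)
        cur[lane] += other[lane];
    }
  return cur[0];
}
```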

	* config/nvptx/nvptx-protos.h (nvptx_shuffle_kind): Move enum
	declaration from nvptx.c.
	(nvptx_gen_shuffle): Declare.
	* config/nvptx/nvptx.c (nvptx_shuffle_kind): Move to nvptx-protos.h.
	(nvptx_gen_shuffle): Export.
	* config/nvptx/nvptx.md (UNSPEC_VOTE_BALLOT): New unspec.
	(UNSPEC_LANEID): Ditto.
	(UNSPECV_NOUNROLL): Ditto.
	(nvptx_vote_ballot): New pattern.
	(omp_simt_lane): Ditto.
	(omp_simt_last_lane): Ditto.
	(omp_simt_ordered): Ditto.
	(omp_simt_vote_any): Ditto.
	(omp_simt_xchg_bfly): Ditto.
	(omp_simt_xchg_idx): Ditto.
	(nvptx_nounroll): Ditto.
	* target-insns.def (omp_simt_lane): New.
	(omp_simt_last_lane): New.
	(omp_simt_ordered): New.
	(omp_simt_vote_any): New.
	(omp_simt_xchg_bfly): New.
	(omp_simt_xchg_idx): New.
---
 gcc/config/nvptx/nvptx-protos.h | 11 +++++
 gcc/config/nvptx/nvptx.c        | 12 +-----
 gcc/config/nvptx/nvptx.md       | 94 +++++++++++++++++++++++++++++++++++++++++
 gcc/target-insns.def            |  6 +++
 4 files changed, 112 insertions(+), 11 deletions(-)

diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index 647607d..331ec0a 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -21,6 +21,16 @@
 #ifndef GCC_NVPTX_PROTOS_H
 #define GCC_NVPTX_PROTOS_H
 
+/* The kind of shuffe instruction.  */
+enum nvptx_shuffle_kind
+{
+  SHUFFLE_UP,
+  SHUFFLE_DOWN,
+  SHUFFLE_BFLY,
+  SHUFFLE_IDX,
+  SHUFFLE_MAX
+};
+
 extern void nvptx_declare_function_name (FILE *, const char *, const_tree decl);
 extern void nvptx_declare_object_name (FILE *file, const char *name,
 				       const_tree decl);
@@ -36,6 +46,7 @@ extern void nvptx_register_pragmas (void);
 extern void nvptx_expand_oacc_fork (unsigned);
 extern void nvptx_expand_oacc_join (unsigned);
 extern void nvptx_expand_call (rtx, rtx);
+extern rtx nvptx_gen_shuffle (rtx, rtx, rtx, nvptx_shuffle_kind);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
 extern const char *nvptx_output_mov_insn (rtx, rtx);
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index ef85ef6..f9ac380 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -72,16 +72,6 @@
 /* This file should be included last.  */
 #include "target-def.h"
 
-/* The kind of shuffe instruction.  */
-enum nvptx_shuffle_kind
-{
-  SHUFFLE_UP,
-  SHUFFLE_DOWN,
-  SHUFFLE_BFLY,
-  SHUFFLE_IDX,
-  SHUFFLE_MAX
-};
-
 /* The various PTX memory areas an object might reside in.  */
 enum nvptx_data_area
 {
@@ -1455,7 +1445,7 @@ nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
 /* Generate an instruction or sequence to broadcast register REG
    across the vectors of a single warp.  */
 
-static rtx
+rtx
 nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, nvptx_shuffle_kind kind)
 {
   rtx res;
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 35ae71e..91d1129 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -42,6 +42,10 @@ (define_c_enum "unspec" [
 
    UNSPEC_BIT_CONV
 
+   UNSPEC_VOTE_BALLOT
+
+   UNSPEC_LANEID
+
    UNSPEC_SHUFFLE
    UNSPEC_BR_UNIFIED
 ])
@@ -57,6 +61,8 @@ (define_c_enum "unspecv" [
    UNSPECV_FORKED
    UNSPECV_JOINING
    UNSPECV_JOIN
+
+   UNSPECV_NOUNROLL
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -1169,6 +1175,88 @@ (define_insn "nvptx_shuffle<mode>"
   ""
   "%.\\tshfl%S3.b32\\t%0, %1, %2, 31;")
 
+(define_insn "nvptx_vote_ballot"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
+	(unspec:SI [(match_operand:BI 1 "nvptx_register_operand" "R")]
+		   UNSPEC_VOTE_BALLOT))]
+  ""
+  "%.\\tvote.ballot.b32\\t%0, %1;")
+
+;; Patterns for OpenMP SIMD-via-SIMT lowering
+
+;; Implement IFN_GOMP_SIMT_LANE: set operand 0 to lane index
+(define_insn "omp_simt_lane"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+	(unspec:SI [(const_int 0)] UNSPEC_LANEID))]
+  ""
+  "%.\\tmov.u32\\t%0, %%laneid;")
+
+;; Implement IFN_GOMP_SIMT_ORDERED: copy operand 1 to operand 0 and
+;; place a compiler barrier to disallow unrolling/peeling the containing loop
+(define_expand "omp_simt_ordered"
+  [(match_operand:SI 0 "nvptx_register_operand" "=R")
+   (match_operand:SI 1 "nvptx_register_operand" "R")]
+  ""
+{
+  emit_move_insn (operands[0], operands[1]);
+  emit_insn (gen_nvptx_nounroll ());
+  DONE;
+})
+
+;; Implement IFN_GOMP_SIMT_XCHG_BFLY: perform a "butterfly" exchange
+;; across lanes
+(define_expand "omp_simt_xchg_bfly"
+  [(match_operand 0 "nvptx_register_operand" "=R")
+   (match_operand 1 "nvptx_register_operand" "R")
+   (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")]
+  ""
+{
+  emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2],
+				SHUFFLE_BFLY));
+  DONE;
+})
+
+;; Implement IFN_GOMP_SIMT_XCHG_IDX: broadcast value in operand 1
+;; from lane given by index in operand 2 to operand 0 in all lanes
+(define_expand "omp_simt_xchg_idx"
+  [(match_operand 0 "nvptx_register_operand" "=R")
+   (match_operand 1 "nvptx_register_operand" "R")
+   (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")]
+  ""
+{
+  emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2],
+				SHUFFLE_IDX));
+  DONE;
+})
+
+;; Implement IFN_GOMP_SIMT_VOTE_ANY:
+;; set operand 0 to zero iff all lanes supply zero in operand 1
+(define_expand "omp_simt_vote_any"
+  [(match_operand:SI 0 "nvptx_register_operand" "=R")
+   (match_operand:SI 1 "nvptx_register_operand" "R")]
+  ""
+{
+  rtx pred = gen_reg_rtx (BImode);
+  emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx));
+  emit_insn (gen_nvptx_vote_ballot (operands[0], pred));
+  DONE;
+})
+
+;; Implement IFN_GOMP_SIMT_LAST_LANE:
+;; set operand 0 to the lowest lane index that passed non-zero in operand 1
+(define_expand "omp_simt_last_lane"
+  [(match_operand:SI 0 "nvptx_register_operand" "=R")
+   (match_operand:SI 1 "nvptx_register_operand" "R")]
+  ""
+{
+  rtx pred = gen_reg_rtx (BImode);
+  rtx tmp = gen_reg_rtx (SImode);
+  emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx));
+  emit_insn (gen_nvptx_vote_ballot (tmp, pred));
+  emit_insn (gen_ctzsi2 (operands[0], tmp));
+  DONE;
+})
+
 ;; extract parts of a 64 bit object into 2 32-bit ints
 (define_insn "unpack<mode>si2"
   [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
@@ -1285,3 +1373,9 @@ (define_insn "nvptx_barsync"
   ""
   "\\tbar.sync\\t%0;"
   [(set_attr "predicable" "false")])
+
+(define_insn "nvptx_nounroll"
+  [(unspec_volatile [(const_int 0)] UNSPECV_NOUNROLL)]
+  ""
+  "\\t.pragma \\\"nounroll\\\";"
+  [(set_attr "predicable" "false")])
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index a6a040e..e011a5a 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -68,6 +68,12 @@ DEF_TARGET_INSN (oacc_dim_pos, (rtx x0, rtx x1))
 DEF_TARGET_INSN (oacc_dim_size, (rtx x0, rtx x1))
 DEF_TARGET_INSN (oacc_fork, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (oacc_join, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (omp_simt_lane, (rtx x0))
+DEF_TARGET_INSN (omp_simt_last_lane, (rtx x0, rtx x1))
+DEF_TARGET_INSN (omp_simt_ordered, (rtx x0, rtx x1))
+DEF_TARGET_INSN (omp_simt_vote_any, (rtx x0, rtx x1))
+DEF_TARGET_INSN (omp_simt_xchg_bfly, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (omp_simt_xchg_idx, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (prefetch, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (probe_stack, (rtx x0))
 DEF_TARGET_INSN (probe_stack_address, (rtx x0))
-- 
1.8.3.1


* [PATCH 8/8] nvptx: handle OpenMP "omp target entrypoint"
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
                   ` (3 preceding siblings ...)
  2016-10-14 16:40 ` [PATCH 1/8] nvptx -msoft-stack Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-14 16:40 ` [PATCH 3/8] nvptx -muniform-simt Alexander Monakov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC
  To: gcc-patches; +Cc: Nathan Sidwell

This patch implements emission of OpenMP target region entrypoints: the
compiler emits the target function with '$impl' appended to the name, and
under the original name it emits a short entry sequence that sets up shared
memory arrays and calls the target function via 'gomp_nvptx_main' (which is
implemented in libgomp).
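
As a rough illustration of the stack setup done by the entry sequence, here is
a host-side C sketch.  It assumes the formula given in the write_omp_entry
comment further down in this patch; the function name and its parameters are
hypothetical and only stand in for the PTX special registers.

```c
#include <stddef.h>

/* Hypothetical host-side model of the per-warp stack setup performed by
   the kernel entry sequence.  STACK is the base of the preallocated stack
   area and STACKSIZE the size reserved per warp; CTAID_X, NTID_Y and TID_Y
   stand in for %ctaid.x, %ntid.y and %tid.y.  Soft stacks grow downwards,
   so the initial pointer is the end of the warp's slice.  */
static char *
initial_soft_stack (char *stack, size_t stacksize,
		    unsigned ctaid_x, unsigned ntid_y, unsigned tid_y)
{
  size_t warp = (size_t) ctaid_x * ntid_y + tid_y;
  return stack + stacksize * (warp + 1);
}
```

The value computed this way is what the entry code stores into
__nvptx_stacks[tid.y] before calling gomp_nvptx_main.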

	* config/nvptx/nvptx.c (write_as_kernel): Restrict to OpenACC target
	regions.
	(write_omp_entry): New.  Use it...
	(nvptx_declare_function_name): ...here to emit OpenMP target region
	entrypoints.
	(nvptx_record_offload_symbol): Handle NULL attributes.
---
 gcc/config/nvptx/nvptx.c | 82 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 4 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index f9ac380..8d86aa8 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -739,7 +739,10 @@ static bool
 write_as_kernel (tree attrs)
 {
   return (lookup_attribute ("kernel", attrs) != NULL_TREE
-	  || lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE);
+	  || (lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE
+	      && lookup_attribute ("oacc function", attrs) != NULL_TREE));
+  /* For OpenMP target regions, the corresponding kernel entry is emitted from
+     write_omp_entry as a separate function.  */
 }
 
 /* Emit a linker marker for a function decl or defn.  */
@@ -1096,6 +1099,69 @@ nvptx_init_unisimt_predicate (FILE *file)
   need_unisimt_decl = true;
 }
 
+/* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
+
+   extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
+   void __attribute__((kernel)) NAME (void *arg, char *stack, size_t stacksize)
+   {
+     __nvptx_stacks[tid.y] = stack + stacksize * (ctaid.x * ntid.y + tid.y + 1);
+     __nvptx_uni[tid.y] = 0;
+     gomp_nvptx_main (ORIG, arg);
+   }
+   ORIG itself should not be emitted as a PTX .entry function.  */
+
+static void
+write_omp_entry (FILE *file, const char *name, const char *orig)
+{
+  static bool gomp_nvptx_main_declared;
+  if (!gomp_nvptx_main_declared)
+    {
+      gomp_nvptx_main_declared = true;
+      write_fn_marker (func_decls, false, true, "gomp_nvptx_main");
+      func_decls << ".extern .func gomp_nvptx_main (.param.u" << POINTER_SIZE
+        << " %in_ar1, .param.u" << POINTER_SIZE << " %in_ar2);\n";
+    }
+#define ENTRY_TEMPLATE(PS, PS_BYTES, MAD_PS_32) "\
+ (.param.u" PS " %arg, .param.u" PS " %stack, .param.u" PS " %sz)\n\
+{\n\
+	.reg.u32 %r<3>;\n\
+	.reg.u" PS " %R<4>;\n\
+	mov.u32 %r0, %tid.y;\n\
+	mov.u32 %r1, %ntid.y;\n\
+	mov.u32 %r2, %ctaid.x;\n\
+	cvt.u" PS ".u32 %R1, %r0;\n\
+	" MAD_PS_32 " %R1, %r1, %r2, %R1;\n\
+	mov.u" PS " %R0, __nvptx_stacks;\n\
+	" MAD_PS_32 " %R0, %r0, " PS_BYTES ", %R0;\n\
+	ld.param.u" PS " %R2, [%stack];\n\
+	ld.param.u" PS " %R3, [%sz];\n\
+	add.u" PS " %R2, %R2, %R3;\n\
+	mad.lo.u" PS " %R2, %R1, %R3, %R2;\n\
+	st.shared.u" PS " [%R0], %R2;\n\
+	mov.u" PS " %R0, __nvptx_uni;\n\
+	" MAD_PS_32 " %R0, %r0, 4, %R0;\n\
+	mov.u32 %r0, 0;\n\
+	st.shared.u32 [%R0], %r0;\n\
+	mov.u" PS " %R0, \0;\n\
+	ld.param.u" PS " %R1, [%arg];\n\
+	{\n\
+		.param.u" PS " %P<2>;\n\
+		st.param.u" PS " [%P0], %R0;\n\
+		st.param.u" PS " [%P1], %R1;\n\
+		call.uni gomp_nvptx_main, (%P0, %P1);\n\
+	}\n\
+	ret.uni;\n\
+}\n"
+  static const char entry64[] = ENTRY_TEMPLATE ("64", "8", "mad.wide.u32");
+  static const char entry32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32  ");
+#undef ENTRY_TEMPLATE
+  const char *entry_1 = TARGET_ABI64 ? entry64 : entry32;
+  /* Position ENTRY_2 after the embedded nul using strlen of the prefix.  */
+  const char *entry_2 = entry_1 + strlen (entry64) + 1;
+  fprintf (file, ".visible .entry %s%s%s%s", name, entry_1, orig, entry_2);
+  need_softstack_decl = need_unisimt_decl = true;
+}
+
 /* Implement ASM_DECLARE_FUNCTION_NAME.  Writes the start of a ptx
    function, including local var decls and copies from the arguments to
    local regs.  */
@@ -1107,6 +1173,14 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   tree result_type = TREE_TYPE (fntype);
   int argno = 0;
 
+  if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl))
+      && !lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl)))
+    {
+      char *buf = (char *) alloca (strlen (name) + sizeof ("$impl"));
+      sprintf (buf, "%s$impl", name);
+      write_omp_entry (file, name, buf);
+      name = buf;
+    }
   /* We construct the initial part of the function into a string
      stream, in order to share the prototype writing code.  */
   std::stringstream s;
@@ -4176,13 +4250,13 @@ nvptx_record_offload_symbol (tree decl)
     case FUNCTION_DECL:
       {
 	tree attr = get_oacc_fn_attrib (decl);
-	tree dims = TREE_VALUE (attr);
-	unsigned ix;
+	/* OpenMP offloading does not set this attribute.  */
+	tree dims = attr ? TREE_VALUE (attr) : NULL_TREE;
 
 	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
 		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
 
-	for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+	for (; dims; dims = TREE_CHAIN (dims))
 	  {
 	    int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
 
-- 
1.8.3.1

* [PATCH 4/8] nvptx -mgomp
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-14 16:40 ` [PATCH 7/8] nvptx backend: new insns for OpenMP SIMD-via-SIMT Alexander Monakov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell

This patch adds option -mgomp which enables -msoft-stack plus -muniform-simt,
and wires up the corresponding multilib variant.  This codegen convention is
used for OpenMP offloading.
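
The implication between the options can be pictured with a small standalone
C sketch.  The mask values below are made up for illustration (the real ones
are generated from nvptx.opt), and apply_gomp_implication is a hypothetical
name, not a function in the backend.

```c
/* Hypothetical mask values; the real ones are generated from nvptx.opt.  */
enum
{
  MASK_SOFT_STACK = 1 << 0,
  MASK_UNIFORM_SIMT = 1 << 1,
  MASK_GOMP = 1 << 2
};

/* Model of the option override step: -mgomp turns on both underlying
   code generation variants, other flag combinations pass through.  */
static int
apply_gomp_implication (int target_flags)
{
  if (target_flags & MASK_GOMP)
    target_flags |= MASK_SOFT_STACK | MASK_UNIFORM_SIMT;
  return target_flags;
}
```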

	* config/nvptx/nvptx.c (diagnose_openacc_conflict): New.  Use it...
	(nvptx_option_override): ...here.  Handle TARGET_GOMP.
	* config/nvptx/nvptx.opt (mgomp): New option.
	* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
	* doc/invoke.texi (mgomp): Document.

libgcc:
	* config/nvptx/mgomp.c: New file.
	* config/nvptx/t-nvptx: Add mgomp.c.
---
 gcc/config/nvptx/nvptx.c    | 17 +++++++++++++++++
 gcc/config/nvptx/nvptx.opt  |  4 ++++
 gcc/config/nvptx/t-nvptx    |  2 ++
 gcc/doc/invoke.texi         |  6 ++++++
 libgcc/config/nvptx/mgomp.c | 32 ++++++++++++++++++++++++++++++++
 libgcc/config/nvptx/t-nvptx |  3 ++-
 6 files changed, 63 insertions(+), 1 deletion(-)
 create mode 100644 libgcc/config/nvptx/mgomp.c

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 65217ab..1c3267f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -157,6 +157,16 @@ nvptx_init_machine_status (void)
   return p;
 }
 
+/* Issue a diagnostic when option OPTNAME is enabled (as indicated by OPTVAL)
+   and -fopenacc is also enabled.  */
+
+static void
+diagnose_openacc_conflict (bool optval, const char *optname)
+{
+  if (flag_openacc && optval)
+    error ("option %s is not supported together with -fopenacc", optname);
+}
+
 /* Implement TARGET_OPTION_OVERRIDE.  */
 
 static void
@@ -194,6 +204,13 @@ nvptx_option_override (void)
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
+  diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
+  diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
+  diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
+
+  if (TARGET_GOMP)
+    target_flags |= MASK_SOFT_STACK | MASK_UNIFORM_SIMT;
 }
 
 /* Return a ptx type for MODE.  If PROMOTE, then use .u32 for QImode to
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 0d46e1d..cb6194d 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -40,3 +40,7 @@ Use custom stacks instead of local memory for automatic storage.
 muniform-simt
 Target Report Mask(UNIFORM_SIMT)
 Generate code that can keep local state uniform across all lanes.
+
+mgomp
+Target Report Mask(GOMP)
+Generate code for OpenMP offloading: enables -msoft-stack and -muniform-simt.
diff --git a/gcc/config/nvptx/t-nvptx b/gcc/config/nvptx/t-nvptx
index e2580c9..6c1010d 100644
--- a/gcc/config/nvptx/t-nvptx
+++ b/gcc/config/nvptx/t-nvptx
@@ -8,3 +8,5 @@ ALL_HOST_OBJS += mkoffload.o
 mkoffload$(exeext): mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBDEPS)
 	+$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \
 	  mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBS)
+
+MULTILIB_OPTIONS = mgomp
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 5150b2f..6d6247c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -20281,6 +20281,12 @@ variant is used for OpenMP offloading, but the option is exposed on its own
 for the purpose of testing the compiler; to generate code suitable for linking
 into programs using OpenMP offloading, use option @option{-mgomp}.
 
+@item -mgomp
+@opindex mgomp
+Generate code for use in OpenMP offloading: enables the @option{-msoft-stack}
+and @option{-muniform-simt} options, and selects the corresponding multilib
+variant.
+
 @end table
 
 @node PDP-11 Options
diff --git a/libgcc/config/nvptx/mgomp.c b/libgcc/config/nvptx/mgomp.c
new file mode 100644
index 0000000..d8ca581
--- /dev/null
+++ b/libgcc/config/nvptx/mgomp.c
@@ -0,0 +1,32 @@
+/* Define shared memory arrays for -msoft-stack and -muniform-simt.
+
+   Copyright (C) 2015-2016 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* OpenACC offloading does not use these symbols; thus, they are exposed
+   only for the -mgomp multilib.  The same definitions are also provided
+   in crt0.c for the case of non-offloading compilation.  32 is the maximum
+   number of warps in a CTA.  */
+
+#if defined(__nvptx_softstack__) && defined(__nvptx_unisimt__)
+void *__nvptx_stacks[32] __attribute__((shared,nocommon));
+unsigned __nvptx_uni[32] __attribute__((shared,nocommon));
+#endif
diff --git a/libgcc/config/nvptx/t-nvptx b/libgcc/config/nvptx/t-nvptx
index daf252f..c4d20c9 100644
--- a/libgcc/config/nvptx/t-nvptx
+++ b/libgcc/config/nvptx/t-nvptx
@@ -1,4 +1,5 @@
-LIB2ADD=$(srcdir)/config/nvptx/reduction.c
+LIB2ADD=$(srcdir)/config/nvptx/reduction.c \
+	$(srcdir)/config/nvptx/mgomp.c
 
 LIB2ADDEH=
 LIB2FUNCS_EXCLUDE=__main
-- 
1.8.3.1

* [PATCH 1/8] nvptx -msoft-stack
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
                   ` (2 preceding siblings ...)
  2016-10-14 16:40 ` [PATCH 2/8] nvptx: implement predicated instructions Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-14 16:40 ` [PATCH 8/8] nvptx: handle OpenMP "omp target entrypoint" Alexander Monakov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell

This patch implements the '-msoft-stack' code generation variant for NVPTX.
The goal is to avoid relying on the '.local' memory space for placement of
automatic data, and instead have an explicitly maintained stack pointer (which
can be set up to point to preallocated global memory space).  This makes stack
data accessible from all threads and modifiable with atomic instructions.  It
also makes it possible to implement variable-length stack allocation (for
'alloca' and C99 VLAs).

Each warp has its own 'soft stack' pointer.  It lives in the shared memory
array called __nvptx_stacks at index %tid.y (as in OpenACC, OpenMP offloading
is going to use launch geometry such that %tid.y gives the warp index).  It is
retrieved in the function prologue (if the function needs a stack frame) and
may also be written there (if the function is non-leaf, so that its callees see
the updated stack pointer), and restored prior to returning.

Startup code is responsible for setting up the initial soft-stack pointer.  For
-mmainkernel testing that is libgcc's __main; for OpenMP offloading it is the
kernel region entry code.
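
To make the prologue/epilogue protocol concrete, here is a simplified
host-side C model.  It is only a sketch: model_stacks stands in for the
shared-memory __nvptx_stacks array, all names are hypothetical, and the
alignment handling is reduced to a single mask (the real prologue rounds the
frame size up and only realigns when more than the default alignment is
needed).

```c
#include <stdint.h>

/* Stand-in for the shared-memory __nvptx_stacks array, indexed by the
   warp number (%tid.y).  */
static uintptr_t model_stacks[32];

struct soft_frame
{
  uintptr_t *slot;	/* &__nvptx_stacks[tid.y], i.e. %sspslot.  */
  uintptr_t prev;	/* Soft stack pointer on entry, i.e. %sspprev.  */
  uintptr_t frame;	/* Frame base, i.e. %frame.  */
  uintptr_t stack;	/* New stack pointer, i.e. %stack.  */
};

/* Model of the prologue.  ALIGN must be a power of two (in bytes).  The
   slot is updated only for non-leaf functions, so that callees see the
   new stack pointer.  */
static struct soft_frame
soft_prologue (unsigned tid_y, uintptr_t frame_size,
	       uintptr_t outgoing_args_size, uintptr_t align, int is_leaf)
{
  struct soft_frame f;
  f.slot = &model_stacks[tid_y];
  f.prev = *f.slot;
  f.frame = (f.prev - frame_size) & ~(align - 1);
  f.stack = f.frame - outgoing_args_size;
  if (!is_leaf)
    *f.slot = f.stack;
  return f;
}

/* Model of the epilogue: restore the value seen on entry.  For a leaf
   function the slot was never changed, so this store is a no-op.  */
static void
soft_epilogue (const struct soft_frame *f)
{
  *f->slot = f->prev;
}
```

In the model, as in the generated code, a callee always finds the caller's
current stack pointer in its warp's slot and leaves the slot unchanged on
return.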

gcc/:
	* config/nvptx/nvptx-protos.h (nvptx_output_set_softstack): Declare.
	* config/nvptx/nvptx.c (need_softstack_decl): New variable.
	(init_softstack_frame): New.
	(nvptx_declare_function_name): Handle TARGET_SOFT_STACK.
	(nvptx_output_set_softstack): New.
	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
	(nvptx_file_end): Handle need_softstack_decl.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_softstack__ when -msoft-stack is active.
	(STACK_SIZE_MODE): Define.
	(FIXED_REGISTERS): Adjust.
	(SOFTSTACK_SLOT_REGNUM): New.
	(SOFTSTACK_PREV_REGNUM): New.
	(REGISTER_NAMES): Adjust.
	(struct machine_function): New bool field has_softstack.
	* config/nvptx/nvptx.md (UNSPEC_SET_SOFTSTACK): New.
	(epilogue): Emit stack restore if TARGET_SOFT_STACK.
	(allocate_stack): Implement for TARGET_SOFT_STACK.  Remove unused code.
	(allocate_stack_<mode>): Remove unused pattern.
	(set_softstack_insn): New pattern.
	(restore_stack_block): Handle for TARGET_SOFT_STACK.
	* config/nvptx/nvptx.opt (msoft-stack): New option.
	* doc/invoke.texi (msoft-stack): Document.

gcc/testsuite/:
	* gcc.target/nvptx/softstack.c: New test.
	* lib/target-supports.exp (check_effective_target_alloca): Use a
	compile test.

libgcc/:
	* config/nvptx/crt0.c (__main): Set up __nvptx_stacks.
---
 gcc/config/nvptx/nvptx-protos.h            |   1 +
 gcc/config/nvptx/nvptx.c                   | 120 ++++++++++++++++++++++++++---
 gcc/config/nvptx/nvptx.h                   |  15 +++-
 gcc/config/nvptx/nvptx.md                  |  36 ++++++---
 gcc/config/nvptx/nvptx.opt                 |   4 +
 gcc/doc/invoke.texi                        |  12 +++
 gcc/testsuite/gcc.target/nvptx/softstack.c |  23 ++++++
 gcc/testsuite/lib/target-supports.exp      |   5 +-
 libgcc/config/nvptx/crt0.c                 |   8 ++
 9 files changed, 198 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/softstack.c

diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index ec4588e..647607d 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -41,5 +41,6 @@ extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
 extern const char *nvptx_output_mov_insn (rtx, rtx);
 extern const char *nvptx_output_call_insn (rtx_insn *, rtx, rtx);
 extern const char *nvptx_output_return (void);
+extern const char *nvptx_output_set_softstack (unsigned);
 #endif
 #endif
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9e04f5b..0525b17 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -141,6 +141,9 @@ static GTY(()) rtx worker_red_sym;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function references __nvptx_stacks.  */
+static bool need_softstack_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -981,6 +984,67 @@ init_frame (FILE  *file, int regno, unsigned align, unsigned size)
 	   POINTER_SIZE, reg_names[regno], reg_names[regno]);
 }
 
+/* Emit soft stack frame setup sequence.  */
+
+static void
+init_softstack_frame (FILE *file, unsigned alignment, HOST_WIDE_INT size)
+{
+  /* Maintain 64-bit stack alignment.  */
+  unsigned keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
+  size = ROUND_UP (size, keep_align);
+  int bits = POINTER_SIZE;
+  const char *reg_stack = reg_names[STACK_POINTER_REGNUM];
+  const char *reg_frame = reg_names[FRAME_POINTER_REGNUM];
+  const char *reg_sspslot = reg_names[SOFTSTACK_SLOT_REGNUM];
+  const char *reg_sspprev = reg_names[SOFTSTACK_PREV_REGNUM];
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_stack);
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_frame);
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspslot);
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspprev);
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32 %%fstmp0;\n");
+  fprintf (file, "\t\t.reg.u%d %%fstmp1;\n", bits);
+  fprintf (file, "\t\t.reg.u%d %%fstmp2;\n", bits);
+  fprintf (file, "\t\tmov.u32 %%fstmp0, %%tid.y;\n");
+  fprintf (file, "\t\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
+	   bits == 64 ? ".wide" : ".lo", bits / 8);
+  fprintf (file, "\t\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits);
+
+  /* Initialize %sspslot = &__nvptx_stacks[tid.y].  */
+  fprintf (file, "\t\tadd.u%d %s, %%fstmp2, %%fstmp1;\n", bits, reg_sspslot);
+
+  /* Initialize %sspprev = __nvptx_stacks[tid.y].  */
+  fprintf (file, "\t\tld.shared.u%d %s, [%s];\n",
+	   bits, reg_sspprev, reg_sspslot);
+
+  /* Initialize %frame = %sspprev - size.  */
+  fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n",
+	   bits, reg_frame, reg_sspprev, size);
+
+  /* Apply alignment, if larger than 64.  */
+  if (alignment > keep_align)
+    fprintf (file, "\t\tand.b%d %s, %s, %d;\n",
+	     bits, reg_frame, reg_frame, -alignment);
+
+  size = crtl->outgoing_args_size;
+  gcc_assert (size % keep_align == 0);
+
+  /* Initialize %stack.  */
+  fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n",
+	   bits, reg_stack, reg_frame, size);
+
+  /* Usually 'crtl->is_leaf' is computed during register allocator
+     initialization, which is not done on NVPTX.  Compute it now.  */
+  gcc_assert (!crtl->is_leaf);
+  crtl->is_leaf = leaf_function_p ();
+  if (!crtl->is_leaf)
+    fprintf (file, "\t\tst.shared.u%d [%s], %s;\n",
+	     bits, reg_sspslot, reg_stack);
+  fprintf (file, "\t}\n");
+  cfun->machine->has_softstack = true;
+  need_softstack_decl = true;
+}
+
 /* Emit code to initialize the REGNO predicate register to indicate
    whether we are not lane zero on the NAME axis.  */
 
@@ -1042,19 +1106,24 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 
   fprintf (file, "%s", s.str().c_str());
 
-  /* Declare a local var for outgoing varargs.  */
-  if (cfun->machine->has_varadic)
-    init_frame (file, STACK_POINTER_REGNUM,
-		UNITS_PER_WORD, crtl->outgoing_args_size);
-
-  /* Declare a local variable for the frame.  Force its size to be
-     DImode-compatible.  */
   HOST_WIDE_INT sz = get_frame_size ();
-  if (sz || cfun->machine->has_chain)
-    init_frame (file, FRAME_POINTER_REGNUM,
-		crtl->stack_alignment_needed / BITS_PER_UNIT,
-		(sz + GET_MODE_SIZE (DImode) - 1)
-		& ~(HOST_WIDE_INT)(GET_MODE_SIZE (DImode) - 1));
+  bool need_frameptr = sz || cfun->machine->has_chain;
+  int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
+  if (!TARGET_SOFT_STACK)
+    {
+      /* Declare a local var for outgoing varargs.  */
+      if (cfun->machine->has_varadic)
+	init_frame (file, STACK_POINTER_REGNUM,
+		    UNITS_PER_WORD, crtl->outgoing_args_size);
+
+      /* Declare a local variable for the frame.  Force its size to be
+	 DImode-compatible.  */
+      if (need_frameptr)
+	init_frame (file, FRAME_POINTER_REGNUM, alignment,
+		    ROUND_UP (sz, GET_MODE_SIZE (DImode)));
+    }
+  else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca)
+    init_softstack_frame (file, alignment, sz);
 
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
@@ -1082,6 +1151,21 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 			       REGNO (cfun->machine->axis_predicate[1]), "x");
 }
 
+/* Output instruction that sets soft stack pointer in shared memory to the
+   value in register given by SRC_REGNO.  */
+
+const char *
+nvptx_output_set_softstack (unsigned src_regno)
+{
+  if (cfun->machine->has_softstack && !crtl->is_leaf)
+    {
+      fprintf (asm_out_file, "\tst.shared.u%d\t[%s], ",
+	       POINTER_SIZE, reg_names[SOFTSTACK_SLOT_REGNUM]);
+      output_reg (asm_out_file, src_regno, VOIDmode);
+      fprintf (asm_out_file, ";\n");
+    }
+  return "";
+}
 /* Output a return instruction.  Also copy the return value to its outgoing
    location.  */
 
@@ -1121,6 +1205,8 @@ nvptx_function_ok_for_sibcall (tree, tree)
 static rtx
 nvptx_get_drap_rtx (void)
 {
+  if (TARGET_SOFT_STACK && stack_realign_drap)
+    return arg_pointer_rtx;
   return NULL_RTX;
 }
 
@@ -4029,6 +4115,16 @@ nvptx_file_end (void)
   if (worker_red_size)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
+
+  if (need_softstack_decl)
+    {
+      write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
+      /* 32 is the maximum number of warps in a block.  Even though it's an
+         external declaration, emit the array size explicitly; otherwise, it
+         may fail at PTX JIT time if the definition is later in link order.  */
+      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",
+	       POINTER_SIZE);
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 381269e..2c4962a 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -31,6 +31,8 @@
       builtin_assert ("machine=nvptx");		\
       builtin_assert ("cpu=nvptx");		\
       builtin_define ("__nvptx__");		\
+      if (TARGET_SOFT_STACK)			\
+        builtin_define ("__nvptx_softstack__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -79,13 +81,14 @@
 
 #define POINTER_SIZE (TARGET_ABI64 ? 64 : 32)
 #define Pmode (TARGET_ABI64 ? DImode : SImode)
+#define STACK_SIZE_MODE Pmode
 
 /* Registers.  Since ptx is a virtual target, we just define a few
    hard registers for special purposes and leave pseudos unallocated.
    We have to have some available hard registers, to keep gcc setup
    happy.  */
 #define FIRST_PSEUDO_REGISTER 16
-#define FIXED_REGISTERS	    { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
+#define FIXED_REGISTERS	    { 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
 #define CALL_USED_REGISTERS { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }
 
 #define HARD_REGNO_NREGS(REG, MODE)		\
@@ -133,10 +136,17 @@ enum reg_class             {  NO_REGS,    ALL_REGS,	LIM_REG_CLASSES };
 #define FRAME_POINTER_REGNUM 2
 #define ARG_POINTER_REGNUM 3
 #define STATIC_CHAIN_REGNUM 4
+/* This register points to the shared memory location with the current warp's
+   soft stack pointer (__nvptx_stacks[tid.y]).  */
+#define SOFTSTACK_SLOT_REGNUM 5
+/* This register is used to save the previous value of the soft stack pointer
+   in the prologue and restore it when returning.  */
+#define SOFTSTACK_PREV_REGNUM 6
 
 #define REGISTER_NAMES							\
   {									\
-    "%value", "%stack", "%frame", "%args", "%chain", "%hr5", "%hr6", "%hr7", \
+    "%value", "%stack", "%frame", "%args",                              \
+    "%chain", "%sspslot", "%sspprev", "%hr7",                           \
     "%hr8", "%hr9", "%hr10", "%hr11", "%hr12", "%hr13", "%hr14", "%hr15" \
   }
 
@@ -200,6 +210,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool has_softstack; /* Current function has a soft stack frame.  */
   int num_args;	/* Number of args of current call.  */
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index d117343..e91e8ac 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -36,6 +36,8 @@ (define_c_enum "unspec" [
 
    UNSPEC_ALLOCA
 
+   UNSPEC_SET_SOFTSTACK
+
    UNSPEC_DIM_SIZE
 
    UNSPEC_BIT_CONV
@@ -944,6 +946,9 @@ (define_expand "epilogue"
   [(clobber (const_int 0))]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    emit_insn (gen_set_softstack_insn (gen_rtx_REG (Pmode,
+						    SOFTSTACK_PREV_REGNUM)));
   emit_jump_insn (gen_return ());
   DONE;
 })
@@ -972,31 +977,40 @@ (define_expand "allocate_stack"
    (match_operand 1 "nvptx_register_operand")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (stack_pointer_rtx,
+		      gen_rtx_MINUS (Pmode, stack_pointer_rtx, operands[1]));
+      emit_insn (gen_set_softstack_insn (stack_pointer_rtx));
+      emit_move_insn (operands[0], virtual_stack_dynamic_rtx);
+      DONE;
+    }
   /* The ptx documentation specifies an alloca intrinsic (for 32 bit
      only)  but notes it is not implemented.  The assembler emits a
      confused error message.  Issue a blunt one now instead.  */
   sorry ("target cannot support alloca.");
   emit_insn (gen_nop ());
   DONE;
-  if (TARGET_ABI64)
-    emit_insn (gen_allocate_stack_di (operands[0], operands[1]));
-  else
-    emit_insn (gen_allocate_stack_si (operands[0], operands[1]));
-  DONE;
 })
 
-(define_insn "allocate_stack_<mode>"
-  [(set (match_operand:P 0 "nvptx_register_operand" "=R")
-        (unspec:P [(match_operand:P 1 "nvptx_register_operand" "R")]
-                   UNSPEC_ALLOCA))]
-  ""
-  "%.\\tcall (%0), %%alloca, (%1);")
+(define_insn "set_softstack_insn"
+  [(unspec [(match_operand 0 "nvptx_register_operand" "R")]
+	   UNSPEC_SET_SOFTSTACK)]
+  "TARGET_SOFT_STACK"
+{
+  return nvptx_output_set_softstack (REGNO (operands[0]));
+})
 
 (define_expand "restore_stack_block"
   [(match_operand 0 "register_operand" "")
    (match_operand 1 "register_operand" "")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (operands[0], operands[1]);
+      emit_insn (gen_set_softstack_insn (operands[0]));
+    }
   DONE;
 })
 
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 601cf12..71ddadb 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@ Link in code for a __main kernel.
 moptimize
 Target Report Var(nvptx_optimize) Init(-1)
 Optimize partition neutering.
+
+msoft-stack
+Target Report Mask(SOFT_STACK)
+Use custom stacks instead of local memory for automatic storage.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 0241cb5..57dd62f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -20258,6 +20258,18 @@ offloading execution.
 Apply partitioned execution optimizations.  This is the default when any
 level of optimization is selected.
 
+@item -msoft-stack
+@opindex msoft-stack
+Generate code that does not use @code{.local} memory
+directly for stack storage. Instead, a per-warp stack pointer is
+maintained explicitly. This enables variable-length stack allocation (with
+variable-length arrays or @code{alloca}), and when global memory is used for
+underlying storage, makes it possible to access automatic variables from other
+threads, or with atomic instructions. This code generation variant is used
+for OpenMP offloading, but the option is exposed on its own for the purpose
+of testing the compiler; to generate code suitable for linking into programs
+using OpenMP offloading, use option @option{-mgomp}.
+
 @end table
 
 @node PDP-11 Options
diff --git a/gcc/testsuite/gcc.target/nvptx/softstack.c b/gcc/testsuite/gcc.target/nvptx/softstack.c
new file mode 100644
index 0000000..73e60f2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/softstack.c
@@ -0,0 +1,23 @@
+/* { dg-options "-O2 -msoft-stack" } */
+/* { dg-do run } */
+
+static __attribute__((noinline,noclone)) int f(int *p)
+{
+  return __sync_lock_test_and_set(p, 1);
+}
+
+static __attribute__((noinline,noclone)) int g(int n)
+{
+  /* Check that variable-length stack allocation works.  */
+  int v[n];
+  v[0] = 0;
+  /* Check that atomic operations can be applied to auto data.  */
+  return f(v) == 0 && v[0] == 1;
+}
+
+int main()
+{
+  if (!g(1))
+    __builtin_abort();
+  return 0;
+}
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 201ed4b..6faa3d8 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -765,7 +765,10 @@ proc check_effective_target_untyped_assembly {} {
 
 proc check_effective_target_alloca {} {
     if { [istarget nvptx-*-*] } {
-	return 0
+	return [check_no_compiler_messages alloca assembly {
+	    void f (void*);
+	    void g (int n) { f (__builtin_alloca (n)); }
+	}]
     }
     return 1
 }
diff --git a/libgcc/config/nvptx/crt0.c b/libgcc/config/nvptx/crt0.c
index 3b7382d..290dc6c 100644
--- a/libgcc/config/nvptx/crt0.c
+++ b/libgcc/config/nvptx/crt0.c
@@ -24,6 +24,11 @@ int *__exitval_ptr;
 extern void __attribute__((noreturn)) exit (int status);
 extern int main (int, void **);
 
+/* Always setup soft stacks to allow testing with -msoft-stack but without
+   -mgomp.  32 is the maximum number of warps in a CTA: the definition here
+   must match the external declaration emitted by the compiler.  */
+void *__nvptx_stacks[32] __attribute__((shared,nocommon));
+
 void __attribute__((kernel))
 __main (int *rval_ptr, int argc, void **argv)
 {
@@ -33,5 +38,8 @@ __main (int *rval_ptr, int argc, void **argv)
   if (rval_ptr)
     *rval_ptr = 255;
 
+  static char stack[131072] __attribute__((aligned(8)));
+  __nvptx_stacks[0] = stack + sizeof stack;
+
   exit (main (argc, argv));
 }
-- 
1.8.3.1

* [PATCH 2/8] nvptx: implement predicated instructions
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
  2016-10-14 16:40 ` [PATCH 4/8] nvptx -mgomp Alexander Monakov
  2016-10-14 16:40 ` [PATCH 7/8] nvptx backend: new insns for OpenMP SIMD-via-SIMT Alexander Monakov
@ 2016-10-14 16:40 ` Alexander Monakov
  2016-10-14 16:40 ` [PATCH 1/8] nvptx -msoft-stack Alexander Monakov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-14 16:40 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell

This patch wires up generation of predicated instruction forms in nvptx.md and
fixes their handling in nvptx.c.  This is a prerequisite for the following
patch.  On its own it doesn't affect generated code because COND_EXEC
instructions are created by if-conversion only after register allocation,
which is not performed on NVPTX.
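
For reference, a PTX guard is written as '@p' or '@!p' in front of the
instruction.  The following standalone C sketch shows the formatting that the
'%.' operand code is expected to produce after this patch; format_guard is a
hypothetical helper for illustration and the register name used with it is
made up.

```c
#include <stdio.h>

/* Hypothetical helper: write the PTX guard for predicate register REG
   into BUF.  NEGATE selects the "@!" form, which corresponds to an EQ
   condition on the predicate.  Returns what snprintf returns.  */
static int
format_guard (char *buf, size_t len, const char *reg, int negate)
{
  return snprintf (buf, len, "@%s%s", negate ? "!" : "", reg);
}
```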

	* config/nvptx/nvptx.c (nvptx_output_call_insn): Handle COND_EXEC
	patterns.  Emit instruction predicate.
	(nvptx_print_operand): Fix handling of instruction predicates.
	* config/nvptx/nvptx.md (predicable): New attribute.  Generate
	predicated forms via define_cond_exec.
	(br_true): Mark as not predicable.
	(br_false): Ditto.
	(br_true_uni): Ditto.
	(br_false_uni): Ditto.
	(return): Ditto.
	(trap_if_true): Ditto.
	(trap_if_false): Ditto.
	(nvptx_fork): Ditto.
	(nvptx_forked): Ditto.
	(nvptx_joining): Ditto.
	(nvptx_join): Ditto.
	(nvptx_barsync): Ditto.
---
 gcc/config/nvptx/nvptx.c  | 14 ++++++++------
 gcc/config/nvptx/nvptx.md | 43 +++++++++++++++++++++++++++++++------------
 2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 0525b17..4cdaa1e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1927,6 +1927,8 @@ nvptx_output_mov_insn (rtx dst, rtx src)
   return "%.\tcvt%t0%t1\t%0, %1;";
 }
 
+static void nvptx_print_operand (FILE *, rtx, int);
+
 /* Output INSN, which is a call to CALLEE with result RESULT.  For ptx, this
    involves writing .param declarations and in/out copies into them.  For
    indirect calls, also write the .callprototype.  */
@@ -1938,6 +1940,8 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee)
   static int labelno;
   bool needs_tgt = register_operand (callee, Pmode);
   rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) == COND_EXEC)
+    pat = COND_EXEC_CODE (pat);
   int arg_end = XVECLEN (pat, 0);
   tree decl = NULL_TREE;
 
@@ -1982,6 +1986,8 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee)
       fprintf (asm_out_file, ";\n");
     }
 
+  /* The '.' stands for the call's predicate, if any.  */
+  nvptx_print_operand (asm_out_file, NULL_RTX, '.');
   fprintf (asm_out_file, "\t\tcall ");
   if (result != NULL_RTX)
     fprintf (asm_out_file, "(%s_in), ", reg_names[NVPTX_RETURN_REGNUM]);
@@ -2045,8 +2051,6 @@ nvptx_print_operand_punct_valid_p (unsigned char c)
   return c == '.' || c== '#';
 }
 
-static void nvptx_print_operand (FILE *, rtx, int);
-
 /* Subroutine of nvptx_print_operand; used to print a memory reference X to FILE.  */
 
 static void
@@ -2107,12 +2111,10 @@ nvptx_print_operand (FILE *file, rtx x, int code)
       x = current_insn_predicate;
       if (x)
 	{
-	  unsigned int regno = REGNO (XEXP (x, 0));
-	  fputs ("[", file);
+	  fputs ("@", file);
 	  if (GET_CODE (x) == EQ)
 	    fputs ("!", file);
-	  fputs (reg_names [regno], file);
-	  fputs ("]", file);
+	  output_reg (file, REGNO (XEXP (x, 0)), VOIDmode);
 	}
       return;
     }
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index e91e8ac..5c5c991 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -126,6 +126,17 @@ (define_predicate "call_operation"
   return true;
 })
 
+(define_attr "predicable" "false,true"
+  (const_string "true"))
+
+(define_cond_exec
+  [(match_operator 0 "predicate_operator"
+      [(match_operand:BI 1 "nvptx_register_operand" "")
+       (match_operand:BI 2 "const0_operand" "")])]
+  ""
+  ""
+  )
+
 (define_constraint "P0"
   "An integer with the value 0."
   (and (match_code "const_int")
@@ -511,7 +522,8 @@ (define_insn "br_true"
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra\\t%l1;")
+  "%j0\\tbra\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 (define_insn "br_false"
   [(set (pc)
@@ -520,7 +532,8 @@ (define_insn "br_false"
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra\\t%l1;")
+  "%J0\\tbra\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 ;; unified conditional branch
 (define_insn "br_true_uni"
@@ -529,7 +542,8 @@ (define_insn "br_true_uni"
 		       UNSPEC_BR_UNIFIED) (const_int 0))
         (label_ref (match_operand 1 "" "")) (pc)))]
   ""
-  "%j0\\tbra.uni\\t%l1;")
+  "%j0\\tbra.uni\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 (define_insn "br_false_uni"
   [(set (pc) (if_then_else
@@ -537,7 +551,8 @@ (define_insn "br_false_uni"
 		       UNSPEC_BR_UNIFIED) (const_int 0))
         (label_ref (match_operand 1 "" "")) (pc)))]
   ""
-  "%J0\\tbra.uni\\t%l1;")
+  "%J0\\tbra.uni\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -940,7 +955,8 @@ (define_insn "return"
   ""
 {
   return nvptx_output_return ();
-})
+}
+  [(set_attr "predicable" "false")])
 
 (define_expand "epilogue"
   [(clobber (const_int 0))]
@@ -1032,14 +1048,16 @@ (define_insn "trap_if_true"
 		(const_int 0))
 	    (const_int 0))]
   ""
-  "%j0 trap;")
+  "%j0 trap;"
+  [(set_attr "predicable" "false")])
 
 (define_insn "trap_if_false"
   [(trap_if (eq (match_operand:BI 0 "nvptx_register_operand" "R")
 		(const_int 0))
 	    (const_int 0))]
   ""
-  "%J0 trap;")
+  "%J0 trap;"
+  [(set_attr "predicable" "false")])
 
 (define_expand "ctrap<mode>4"
   [(trap_if (match_operator 0 "nvptx_comparison_operator"
@@ -1088,28 +1106,28 @@ (define_insn "nvptx_fork"
 		       UNSPECV_FORK)]
   ""
   "// fork %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_forked"
   [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
 		       UNSPECV_FORKED)]
   ""
   "// forked %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_joining"
   [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
 		       UNSPECV_JOINING)]
   ""
   "// joining %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_join"
   [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
 		       UNSPECV_JOIN)]
   ""
   "// join %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_expand "oacc_fork"
   [(set (match_operand:SI 0 "nvptx_nonmemory_operand" "")
@@ -1257,4 +1275,5 @@ (define_insn "nvptx_barsync"
   [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;")
+  "\\tbar.sync\\t%0;"
+  [(set_attr "predicable" "false")])
-- 
1.8.3.1
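
For readers unfamiliar with PTX guard syntax, the effect of the
nvptx_print_operand change above can be sketched in isolation.  This is a
minimal standalone illustration, not code from the patch: format_guard,
enum pred_code and the register name "%r23" are hypothetical, whereas the
real code reads current_insn_predicate and prints through output_reg.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the two comparison codes a predicate can have.  */
enum pred_code { PRED_NE, PRED_EQ };

/* Write the PTX guard for a predicated instruction into BUF.  REGNAME is the
   already-formatted predicate register; an EQ-against-zero predicate means
   "execute when the register is false", hence the '!'.  Returns the number
   of characters written, as snprintf does.  */
static int
format_guard (char *buf, size_t size, enum pred_code code, const char *regname)
{
  assert (buf != NULL && regname != NULL);
  return snprintf (buf, size, "@%s%s", code == PRED_EQ ? "!" : "", regname);
}
```

With PRED_NE the sketch produces "@%r23" and with PRED_EQ "@!%r23".  The code
removed by the patch wrapped the register name in square brackets instead,
which is not PTX guard syntax.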

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
                   ` (7 preceding siblings ...)
  2016-10-14 16:40 ` [PATCH 5/8] nvptx mkoffload: pass -mgomp for OpenMP offloading Alexander Monakov
@ 2016-10-17 15:51 ` Bernd Schmidt
  2016-10-17 17:08   ` Alexander Monakov
  8 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2016-10-17 15:51 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Nathan Sidwell

On 10/14/2016 06:39 PM, Alexander Monakov wrote:
> I'm resending the patch series with backend prerequisites for OpenMP
> offloading to the NVIDIA PTX ISA.  The patches are rebased on trunk.

What's the status of the branch? Is it expected to work? I'm trying to 
compile the OpenMP version of these benchmarks:
   https://codesign.llnl.gov/lulesh.php

and the resulting binary fails as follows:

libgomp: Link error log error   : Size doesn't match for 
'__nvptx_stacks' in 'Input 8', first specified in 'Input 8'
error   : Multiple definition of '__nvptx_stacks' in 'Input 8', first 
defined in 'Input 8'

I think before merging this work we'll need to have some idea of how 
well it works on real-world code.


Bernd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-17 15:51 ` [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Bernd Schmidt
@ 2016-10-17 17:08   ` Alexander Monakov
  2016-10-18 11:03     ` Bernd Schmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Monakov @ 2016-10-17 17:08 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Nathan Sidwell

On Mon, 17 Oct 2016, Bernd Schmidt wrote:
> On 10/14/2016 06:39 PM, Alexander Monakov wrote:
> > I'm resending the patch series with backend prerequisites for OpenMP
> > offloading to the NVIDIA PTX ISA.  The patches are rebased on trunk.
> 
> What's the status of the branch? Is it expected to work? I'm trying to compile
> the OpenMP version of these benchmarks:
>   https://codesign.llnl.gov/lulesh.php
> 
> and the resulting binary fails as follows:
> 
> libgomp: Link error log error   : Size doesn't match for '__nvptx_stacks' in
> 'Input 8', first specified in 'Input 8'
> error   : Multiple definition of '__nvptx_stacks' in 'Input 8', first defined
> in 'Input 8'

I've just pushed two commits to the branch to fix this issue.  Before those, the
last commit left the branch in a state where an incremental build seemed ok
(because libgcc/libgomp weren't rebuilt with the new cc1), but a from-scratch
build was broken like you've shown.  LULESH is known to work.  I also intend to
perform a trunk merge soon.

> I think before merging this work we'll need to have some idea of how well it
> works on real-world code.

This patchset and the branch lay the foundation, there's more work to be
done, in particular on the performance improvements side. There should be
an agreement on these fundamental bits first, before moving on to fine-tuning.

Alexander

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-17 17:08   ` Alexander Monakov
@ 2016-10-18 11:03     ` Bernd Schmidt
  2016-10-18 16:59       ` Alexander Monakov
  2016-10-19 10:35       ` Alexander Monakov
  0 siblings, 2 replies; 22+ messages in thread
From: Bernd Schmidt @ 2016-10-18 11:03 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Nathan Sidwell

On 10/17/2016 07:06 PM, Alexander Monakov wrote:

> I've just pushed two commits to the branch to fix this issue.  Before those, the
> last commit left the branch in a state where an incremental build seemed ok
> (because libgcc/libgomp weren't rebuilt with the new cc1), but a from-scratch
> build was broken like you've shown.  LULESH is known to work.  I also intend to
> perform a trunk merge soon.

Ok that did work, however...

>> I think before merging this work we'll need to have some idea of how well it
>> works on real-world code.
>
> This patchset and the branch lay the foundation, there's more work to be
> done, in particular on the performance improvements side. There should be
> an agreement on these fundamental bits first, before moving on to fine-tuning.

The performance I saw was lower by a factor of 80 or so compared to 
their CUDA version, and even lower than OpenMP on the host. Does this 
match what you are seeing? Do you have a clear plan how this can be 
improved?

To me this kind of performance doesn't look like something that will be 
fixed by fine-tuning; it leaves me undecided whether the chosen approach 
(what you call the fundamentals) is viable at all. Performance is still 
better than the OpenACC version of the benchmark, but then I think we 
shouldn't repeat the mistakes we made with OpenACC and avoid merging 
something until we're sure it's ready and of benefit to users.

Bernd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-18 11:03     ` Bernd Schmidt
@ 2016-10-18 16:59       ` Alexander Monakov
  2016-10-18 21:07         ` Jakub Jelinek
                           ` (2 more replies)
  2016-10-19 10:35       ` Alexander Monakov
  1 sibling, 3 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-18 16:59 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Nathan Sidwell

On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> The performance I saw was lower by a factor of 80 or so compared to their CUDA
> version, and even lower than OpenMP on the host.

The currently published OpenMP version of LULESH simply doesn't use openmp-simd
anywhere. This should make it obvious that it won't be anywhere near any
reasonable CUDA implementation, and also bound to be below host performance.
Besides, it's common for such benchmark suites to have very different levels of
hand tuning for the native-CUDA implementation vs OpenMP implementation,
sometimes to the point of significant algorithmic differences. So you're
making an invalid comparison here.

Internally at ISP RAS we used a small set of microbenchmarks implemented in
CUDA/OpenACC/OpenMP specifically for the purpose of evaluating the exact same
computations implemented in terms of different APIs. We got close performance in
all three. The biggest issue is visible on short-running OpenMP target regions:
the startup cost (going through libgomp) is non-trivial. That can be improved
with further changes in libgomp port, notably avoiding malloc, shaving off more
code, perhaps inlining more code (e.g. via LTO eventually). There's also
avoidable cuMemAlloc/cuMemFree on the libgomp plugin side.

For example, there's this patch on the branch:

    libgomp: avoid malloc calls in gomp_nvptx_main

    Avoid calling malloc where it's easy to use stack storage instead: device
    malloc is very slow in CUDA.  This cuts about 60-80 microseconds from target
    region entry/exit time, slimming down empty target regions from ~95 to ~17
    microseconds (as measured on a GTX Titan).

(empty CUDA kernel is ~5 microseconds; all figures are taken via nvprof)
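
The general pattern behind that commit (prefer stack storage, fall back to
malloc only for large requests) might look roughly like the sketch below.  It
is not the code on the branch: struct scratch, scratch_get, scratch_release
and the SMALL_SLOTS threshold are all hypothetical names chosen for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical threshold below which no heap allocation is made.  */
enum { SMALL_SLOTS = 16 };

struct scratch
{
  void *stack_slots[SMALL_SLOTS];  /* lives wherever the struct lives */
  void **slots;                    /* points at stack_slots or heap memory */
};

/* Return storage for N pointers, using the embedded array when it fits.  */
static void **
scratch_get (struct scratch *s, size_t n)
{
  assert (s != NULL);
  if (n <= (size_t) SMALL_SLOTS)
    s->slots = s->stack_slots;
  else
    s->slots = malloc (n * sizeof *s->slots);
  return s->slots;
}

/* Free the storage only if it came from malloc.  */
static void
scratch_release (struct scratch *s)
{
  assert (s != NULL);
  if (s->slots != s->stack_slots)
    free (s->slots);
  s->slots = NULL;
}
```

A caller declares a struct scratch as a local variable, so the common small
case never reaches the (slow, on the device) allocator.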

> To me this kind of performance doesn't look like something that will be fixed
> by fine-tuning; it leaves me undecided whether the chosen approach (what you
> call the fundamentals) is viable at all.

If you try to draw conclusions just from comparing the performance you got on
LULESH, without looking at benchmark's source (otherwise you should have
acknowledged the lack of openmp-simd and significant source-level differences
between CUDA and OpenMP implementations, like the use of __shared__ in CUDA
algorithms), I am sorry to say, but that is just ridiculous. The implementation
on the branch is far from ideal, but your method of evaluation is nonsensical.

> Performance is still better than the OpenACC version of the benchmark, but
> then I think we shouldn't repeat the mistakes we made with OpenACC and avoid
> merging something until we're sure it's ready and of benefit to users.

Would you kindly try and keep your commentary constructive. It's frustrating to
me to have to tolerate hostilities like an ad hominem attack, ignored
nvptx-backend-related questions, etc. How can the work get ready if all you do
is passively push back?  Please trust me, I have experience with GPUs and GCC.
There should be a process for getting this gradually reviewed, with fundamental
design decisions acked and patches reviewed before all tweaks and optimizations
are in place. If you suggest that the work needs to proceed on the branch
without any kind of interim review, and then reviewed in one go after satisfying
some unspecified criteria of being "ready and of benefit", that doesn't sound
right to me.

Alexander

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-18 16:59       ` Alexander Monakov
@ 2016-10-18 21:07         ` Jakub Jelinek
  2016-10-19 10:39         ` Bernd Schmidt
  2016-10-19 12:19         ` Jakub Jelinek
  2 siblings, 0 replies; 22+ messages in thread
From: Jakub Jelinek @ 2016-10-18 21:07 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Bernd Schmidt, gcc-patches, Nathan Sidwell

On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their CUDA
> > version, and even lower than OpenMP on the host.
> 
> The currently published OpenMP version of LULESH simply doesn't use openmp-simd
> anywhere. This should make it obvious that it won't be anywhere near any
> reasonable CUDA implementation, and also bound to be below host performance.

Yeah, perhaps just changing some or all #pragma omp distribute parallel for
into #pragma omp distribute parallel for simd could do something (of course,
one should actually analyze what it does, but if it is valid for distribute
without dist_schedule clause, then the loops ought to be without forward or
backward lexical dependencies; teams can't really synchronize, though they
can use some atomics).
That said, the OpenMP port of LULESH doesn't seem to be done very carefully,
e.g. in CalcHourglassControlForElems I see:
      /* Do a check for negative volumes */
      if ( v[i] <= Real_t(0.0) ) {
        vol_error = i;
      }
There is not any kind of explicit mapping of vol_error nor reduction of it,
so while in OpenMP 4.0 it would be just a possible data race (the var would
be map(tofrom: vol_error) by default and shared between teams/threads, so if
more than one thread decides to write it, it is a data race), in OpenMP 4.5
it is implicitly firstprivate(vol_error) and thus the changes to the var
(still racy) would just never be propagated back to the caller.

For the missing simd regions, it might be helpful if we were able to
"autovectorize" into the SIMT, but I guess that might be quite a lot of
work.

	Jakub

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-18 11:03     ` Bernd Schmidt
  2016-10-18 16:59       ` Alexander Monakov
@ 2016-10-19 10:35       ` Alexander Monakov
  1 sibling, 0 replies; 22+ messages in thread
From: Alexander Monakov @ 2016-10-19 10:35 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Nathan Sidwell

On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> [...] but then I think we shouldn't repeat the mistakes we made with OpenACC

I think it would be good if you'd mention for posterity what, specifically,
the mistakes were, in particular if you want those not to be repeated in the
context of OpenMP offloading.

Alexander

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-18 16:59       ` Alexander Monakov
  2016-10-18 21:07         ` Jakub Jelinek
@ 2016-10-19 10:39         ` Bernd Schmidt
  2016-11-11 13:02           ` Bernd Schmidt
  2016-10-19 12:19         ` Jakub Jelinek
  2 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2016-10-19 10:39 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Nathan Sidwell

On 10/18/2016 06:58 PM, Alexander Monakov wrote:
>
> The currently published OpenMP version of LULESH simply doesn't use openmp-simd
> anywhere. This should make it obvious that it won't be anywhere near any
> reasonable CUDA implementation, and also bound to be below host performance.
> Besides, it's common for such benchmark suites to have very different levels of
> hand tuning for the native-CUDA implementation vs OpenMP implementation,
> sometimes to the point of significant algorithmic differences. So you're
> making an invalid comparison here.

The information I have is that the LULESH code is representative of how 
at least some groups on the HPC side expect to write OpenMP code. It's 
the biggest real-world piece of code that I'm aware of that's available 
for testing, so it seemed like a good thing to try. If you have other 
real-world tests available, please let us know. If you can demonstrate 
good performance by modifying LULESH sources, that would also be a good 
step, although maybe not the ideal case. But I think it's not 
unreasonable to look for a demonstration that reasonable performance is 
achievable on something that isn't just a microbenchmark.

I'll refrain from any further comments on the topic. The ptx patches 
don't look unreasonable iff someone else decides that this version of 
OpenMP support should be merged and I'll look into them in more detail 
if that happens. Patch 2/8 is ok now.

Bernd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-18 16:59       ` Alexander Monakov
  2016-10-18 21:07         ` Jakub Jelinek
  2016-10-19 10:39         ` Bernd Schmidt
@ 2016-10-19 12:19         ` Jakub Jelinek
  2 siblings, 0 replies; 22+ messages in thread
From: Jakub Jelinek @ 2016-10-19 12:19 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Bernd Schmidt, gcc-patches, Nathan Sidwell

On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their CUDA
> > version, and even lower than OpenMP on the host.
> 
> The currently published OpenMP version of LULESH simply doesn't use openmp-simd
> anywhere. This should make it obvious that it won't be anywhere near any
> reasonable CUDA implementation, and also bound to be below host performance.
> Besides, it's common for such benchmark suites to have very different levels of
> hand tuning for the native-CUDA implementation vs OpenMP implementation,
> sometimes to the point of significant algorithmic differences. So you're
> making an invalid comparison here.

This is related to the independent clause/construct (or whatever other
names) discussions.  The problem with LULESH's
#pragma omp distribute parallel for
rather than
#pragma omp distribute parallel for simd
is that usually it calls (inline) functions, and distribute parallel for,
even with the implementation defined default for schedule() clause, doesn't
just let the implementation choose whatever distribution between
teams/threads/simd it likes.  For loops which don't call any functions we can
scan the loop body and figure out if it could e.g. through various omp_* calls
observe anything that could reveal how it is distributed among
teams/threads/simd, but for loops that can call other functions that is hard
to do, especially as early as during omp lowering/expansion.
OpenMP 5.0 is likely going to have some clause or whatever that will just
say the loop iterations are completely independent, but until then the
programmer uses more prescriptive pragmas and needs to be careful what
exactly they want.

But, certainly we should collect some OpenMP/OpenACC offloading benchmarks
or write our own and use that to compare GCC with other compilers.

	Jakub

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-10-19 10:39         ` Bernd Schmidt
@ 2016-11-11 13:02           ` Bernd Schmidt
  2016-11-11 15:36             ` Alexander Monakov
  0 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2016-11-11 13:02 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Nathan Sidwell

On 10/19/2016 12:39 PM, Bernd Schmidt wrote:
> I'll refrain from any further comments on the topic. The ptx patches
> don't look unreasonable iff someone else decides that this version of
> OpenMP support should be merged and I'll look into them in more detail
> if that happens. Patch 2/8 is ok now.

Sounds like Jakub has made that decision. So I'll get out of the way and 
just approve all these.


Bernd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-11-11 13:02           ` Bernd Schmidt
@ 2016-11-11 15:36             ` Alexander Monakov
  2016-11-11 15:38               ` Bernd Schmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Monakov @ 2016-11-11 15:36 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Nathan Sidwell

On Fri, 11 Nov 2016, Bernd Schmidt wrote:
> On 10/19/2016 12:39 PM, Bernd Schmidt wrote:
> > I'll refrain from any further comments on the topic. The ptx patches
> > don't look unreasonable iff someone else decides that this version of
> > OpenMP support should be merged and I'll look into them in more detail
> > if that happens. Patch 2/8 is ok now.
> 
> Sounds like Jakub has made that decision. So I'll get out of the way and just
> approve all these.

For the avoidance of doubt, is this a statement of intent, or an actual approval
for the patchset?

After these backend modifications and the rest of libgomp/middle-end changes are
applied, trunk will need the following flip-the-switch patch to allow OpenMP
offloading for NVPTX.  OK?

Thanks.
Alexander

	PR target/67822
	* config/nvptx/mkoffload.c (main): Allow -fopenmp.

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index c8eed45..e99ef37 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -517,8 +524,8 @@ main (int argc, char **argv)
     fatal_error (input_location, "cannot open '%s'", ptx_cfile_name);

   /* PR libgomp/65099: Currently, we only support offloading in 64-bit
-     configurations.  PR target/67822: OpenMP offloading to nvptx fails.  */
-  if (offload_abi == OFFLOAD_ABI_LP64 && !fopenmp)
+     configurations.  */
+  if (offload_abi == OFFLOAD_ABI_LP64)
     {
       ptx_name = make_temp_file (".mkoffload");
       obstack_ptr_grow (&argv_obstack, "-o");

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-11-11 15:36             ` Alexander Monakov
@ 2016-11-11 15:38               ` Bernd Schmidt
  0 siblings, 0 replies; 22+ messages in thread
From: Bernd Schmidt @ 2016-11-11 15:38 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Nathan Sidwell

On 11/11/2016 04:35 PM, Alexander Monakov wrote:

> For the avoidance of doubt, is this a statement of intent, or an actual approval
> for the patchset?
>
> After these backend modifications and the rest of libgomp/middle-end changes are
> applied, trunk will need the following flip-the-switch patch to allow OpenMP
> offloading for NVPTX.  OK?

Ok for everything.


Bernd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
  2016-06-09 16:54 Alexander Monakov
@ 2016-06-09 17:00 ` Jakub Jelinek
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Jelinek @ 2016-06-09 17:00 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Nathan Sidwell

On Thu, Jun 09, 2016 at 07:53:52PM +0300, Alexander Monakov wrote:
> I'm sending updated patch series with backend prerequisites for OpenMP
> offloading to the NVIDIA PTX ISA.  The first patch has already received some
> comments and this version reflects review feedback.  The other patches have
> been adjusted for clarity and re-cut in a more rigorous manner.  All patches are
> rebased onto current trunk.
> 
> Jakub, can you offer wishes/recommendations for sending the rest of
> (middle-end and libgomp) patches?  As you know there's a branch with

Once all the prerequisites are in (I assume the patches depend on the NVPTX
backend patches you've just posted), then I'd prefer if you rebase the rest
to current trunk and post in reasonably reviewable chunks (that can be all
of middle-end changes in one patch, all of libgomp plugin changes, all of
other libgomp changes, or if some of those would be too large, split that a
little bit).

	Jakub

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 0/8] NVPTX offloading to NVPTX: backend patches
@ 2016-06-09 16:54 Alexander Monakov
  2016-06-09 17:00 ` Jakub Jelinek
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Monakov @ 2016-06-09 16:54 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell, Jakub Jelinek

Hi,

I'm sending updated patch series with backend prerequisites for OpenMP
offloading to the NVIDIA PTX ISA.  The first patch has already received some
comments and this version reflects review feedback.  The other patches have
been adjusted for clarity and re-cut in a more rigorous manner.  All patches are
rebased onto current trunk.

Jakub, can you offer wishes/recommendations for sending the rest of
(middle-end and libgomp) patches?  As you know there's a branch with
development history; is that of interest, or would it be easier if I rebased
all stuff anew on current trunk?

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-11-11 15:38 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-14 16:40 [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Alexander Monakov
2016-10-14 16:40 ` [PATCH 4/8] nvptx -mgomp Alexander Monakov
2016-10-14 16:40 ` [PATCH 7/8] nvptx backend: new insns for OpenMP SIMD-via-SIMT Alexander Monakov
2016-10-14 16:40 ` [PATCH 2/8] nvptx: implement predicated instructions Alexander Monakov
2016-10-14 16:40 ` [PATCH 1/8] nvptx -msoft-stack Alexander Monakov
2016-10-14 16:40 ` [PATCH 8/8] nvptx: handle OpenMP "omp target entrypoint" Alexander Monakov
2016-10-14 16:40 ` [PATCH 3/8] nvptx -muniform-simt Alexander Monakov
2016-10-14 16:40 ` [PATCH 6/8] new target hook: TARGET_SIMT_VF Alexander Monakov
2016-10-14 16:40 ` [PATCH 5/8] nvptx mkoffload: pass -mgomp for OpenMP offloading Alexander Monakov
2016-10-17 15:51 ` [PATCH 0/8] NVPTX offloading to NVPTX: backend patches Bernd Schmidt
2016-10-17 17:08   ` Alexander Monakov
2016-10-18 11:03     ` Bernd Schmidt
2016-10-18 16:59       ` Alexander Monakov
2016-10-18 21:07         ` Jakub Jelinek
2016-10-19 10:39         ` Bernd Schmidt
2016-11-11 13:02           ` Bernd Schmidt
2016-11-11 15:36             ` Alexander Monakov
2016-11-11 15:38               ` Bernd Schmidt
2016-10-19 12:19         ` Jakub Jelinek
2016-10-19 10:35       ` Alexander Monakov
  -- strict thread matches above, loose matches on Subject: below --
2016-06-09 16:54 Alexander Monakov
2016-06-09 17:00 ` Jakub Jelinek
