[PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
@ 2016-04-20 17:00 Alexander Monakov
  2016-04-21 14:08 ` Nathan Sidwell
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Monakov @ 2016-04-20 17:00 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell

This patch implements per-warp compiler-defined stacks under -msoft-stack
option, and implements alloca on top of that.  In a few obvious places,
changes from -muniform-simt patch are present in the hunks.

Previously posted here:

[PATCH] nvptx: implement automatic storage in custom stacks
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01519.html

[gomp-nvptx] nvptx backend: implement alloca with -msoft-stack
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01397.html

[gomp-nvptx 7/7] nvptx backend: define STACK_SIZE_MODE
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg01108.html

2016-03-15  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.h (STACK_SIZE_MODE): Define.

2015-12-14  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c (nvptx_declare_function_name): Emit %outargs
	using .local %outargs_ar only if not TARGET_SOFT_STACK.  Emit %outargs
	under TARGET_SOFT_STACK by offsetting from %frame.
	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
	* config/nvptx/nvptx.md (nvptx_register_operand): Allow %outargs under
	TARGET_SOFT_STACK.
	(nvptx_nonimmediate_operand): Ditto.
	(allocate_stack): Implement for TARGET_SOFT_STACK.  Remove unused code.
	(allocate_stack_<mode>): Remove unused pattern.
	(set_softstack_insn): New pattern.
	(restore_stack_block): Handle for TARGET_SOFT_STACK.

2015-12-09  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c: (need_softstack_decl): New variable.
	(nvptx_declare_function_name): Handle TARGET_SOFT_STACK.
	(nvptx_output_return): Emit stack restore if needed.
	(nvptx_file_end): Handle need_softstack_decl.
	* config/nvptx/nvptx.h: (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_softstack__ when -msoft-stack is active.
	(struct machine_function): New bool field using_softstack.
	* config/nvptx/nvptx.opt: (msoft-stack): New option.
	* doc/invoke.texi (msoft-stack): Document.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2d4dad1..e9e4d06 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -139,6 +129,12 @@ static GTY(()) rtx worker_red_sym;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function references __nvptx_stacks.  */
+static bool need_softstack_decl;
+
+/* True if any function references __nvptx_uni.  */
+static bool need_unisimt_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -992,16 +1086,55 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 
   fprintf (file, "%s", s.str().c_str());
 
-  /* Declare a local var for outgoing varargs.  */
-  if (cfun->machine->has_varadic)
-    init_frame (file, STACK_POINTER_REGNUM,
-		UNITS_PER_WORD, crtl->outgoing_args_size);
-
-  /* Declare a local variable for the frame.  */
   HOST_WIDE_INT sz = get_frame_size ();
-  if (sz || cfun->machine->has_chain)
-    init_frame (file, FRAME_POINTER_REGNUM,
-		crtl->stack_alignment_needed / BITS_PER_UNIT, sz);
+  bool need_frameptr = sz || cfun->machine->has_chain;
+  int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
+  if (!TARGET_SOFT_STACK)
+    {
+      /* Declare a local var for outgoing varargs.  */
+      if (cfun->machine->has_varadic)
+	init_frame (file, STACK_POINTER_REGNUM,
+		    UNITS_PER_WORD, crtl->outgoing_args_size);
+
+      /* Declare a local variable for the frame.  */
+      if (need_frameptr)
+	init_frame (file, FRAME_POINTER_REGNUM, alignment, sz);
+    }
+  else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca)
+    {
+      /* Maintain 64-bit stack alignment.  */
+      int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
+      sz = ROUND_UP (sz, keep_align);
+      int bits = POINTER_SIZE;
+      fprintf (file, "\t.reg.u%d %%frame;\n", bits);
+      fprintf (file, "\t.reg.u32 %%fstmp0;\n");
+      fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
+      fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);
+      fprintf (file, "\tmov.u32 %%fstmp0, %%tid.y;\n");
+      fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
+	       bits == 64 ? ".wide" : ".lo", bits / 8);
+      fprintf (file, "\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits);
+      /* fstmp2 = &__nvptx_stacks[tid.y];  */
+      fprintf (file, "\tadd.u%d %%fstmp2, %%fstmp2, %%fstmp1;\n", bits);
+      fprintf (file, "\tld.shared.u%d %%fstmp1, [%%fstmp2];\n", bits);
+      fprintf (file, "\tsub.u%d %%frame, %%fstmp1, "
+	       HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz);
+      if (alignment > keep_align)
+	fprintf (file, "\tand.b%d %%frame, %%frame, %d;\n",
+		 bits, -alignment);
+      fprintf (file, "\t.reg.u%d %%stack;\n", bits);
+      sz = crtl->outgoing_args_size;
+      gcc_assert (sz % keep_align == 0);
+      fprintf (file, "\tsub.u%d %%stack, %%frame, "
+	       HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz);
+      /* crtl->is_leaf is not initialized because RA is not run.  */
+      if (!leaf_function_p ())
+	{
+	  fprintf (file, "\tst.shared.u%d [%%fstmp2], %%stack;\n", bits);
+	  cfun->machine->using_softstack = true;
+	}
+      need_softstack_decl = true;
+    }
 
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
@@ -1037,6 +1172,10 @@ nvptx_output_return (void)
 {
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
+  if (cfun->machine->using_softstack)
+    fprintf (asm_out_file, "\tst.shared.u%d [%%fstmp2], %%fstmp1;\n",
+	     POINTER_SIZE);
+
   if (mode != VOIDmode)
     fprintf (asm_out_file, "\tst.param%s\t[%s_out], %s;\n",
 	     nvptx_ptx_type_from_mode (mode, false),
@@ -1068,6 +1207,8 @@ nvptx_function_ok_for_sibcall (tree, tree)
 static rtx
 nvptx_get_drap_rtx (void)
 {
+  if (TARGET_SOFT_STACK && stack_realign_drap)
+    return arg_pointer_rtx;
   return NULL_RTX;
 }
 
@@ -3939,6 +4183,18 @@ nvptx_file_end (void)
   if (worker_red_size)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
+
+  if (need_softstack_decl)
+    {
+      write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
+      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",
+	       POINTER_SIZE);
+    }
+  if (need_unisimt_decl)
+    {
+      write_var_marker (asm_out_file, false, true, "__nvptx_uni");
+      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n");
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 381269e..6da4d06 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -31,6 +31,10 @@
       builtin_assert ("machine=nvptx");		\
       builtin_assert ("cpu=nvptx");		\
       builtin_define ("__nvptx__");		\
+      if (TARGET_SOFT_STACK)			\
+        builtin_define ("__nvptx_softstack__");	\
+      if (TARGET_UNIFORM_SIMT)			\
+        builtin_define ("__nvptx_unisimt__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -79,6 +83,7 @@
 
 #define POINTER_SIZE (TARGET_ABI64 ? 64 : 32)
 #define Pmode (TARGET_ABI64 ? DImode : SImode)
+#define STACK_SIZE_MODE Pmode
 
 /* Registers.  Since ptx is a virtual target, we just define a few
    hard registers for special purposes and leave pseudos unallocated.
@@ -200,10 +205,13 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */
   int num_args;	/* Number of args of current call.  */
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  rtx unisimt_master; /* Master lane index for "uniform simt" mode.  */
+  rtx unisimt_predicate; /* Predicate register for "uniform simt".  */
 };
 #endif
 \f
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 33a4862..e5650b6 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -961,31 +986,41 @@ (define_expand "allocate_stack"
    (match_operand 1 "nvptx_register_operand")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (stack_pointer_rtx,
+		      gen_rtx_MINUS (Pmode, stack_pointer_rtx, operands[1]));
+      emit_insn (gen_set_softstack_insn (stack_pointer_rtx));
+      emit_move_insn (operands[0], virtual_stack_dynamic_rtx);
+      DONE;
+    }
   /* The ptx documentation specifies an alloca intrinsic (for 32 bit
      only)  but notes it is not implemented.  The assembler emits a
      confused error message.  Issue a blunt one now instead.  */
   sorry ("target cannot support alloca.");
   emit_insn (gen_nop ());
   DONE;
-  if (TARGET_ABI64)
-    emit_insn (gen_allocate_stack_di (operands[0], operands[1]));
-  else
-    emit_insn (gen_allocate_stack_si (operands[0], operands[1]));
-  DONE;
 })
 
-(define_insn "allocate_stack_<mode>"
-  [(set (match_operand:P 0 "nvptx_register_operand" "=R")
-        (unspec:P [(match_operand:P 1 "nvptx_register_operand" "R")]
-                   UNSPEC_ALLOCA))]
-  ""
-  "%.\\tcall (%0), %%alloca, (%1);")
+(define_insn "set_softstack_insn"
+  [(unspec [(match_operand 0 "nvptx_register_operand" "R")] UNSPEC_ALLOCA)]
+  "TARGET_SOFT_STACK"
+{
+  return (cfun->machine->using_softstack
+	  ? "%.\\tst.shared%t0\\t[%%fstmp2], %0;"
+	  : "");
+})
 
 (define_expand "restore_stack_block"
   [(match_operand 0 "register_operand" "")
    (match_operand 1 "register_operand" "")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (operands[0], operands[1]);
+      emit_insn (gen_set_softstack_insn (operands[0]));
+    }
   DONE;
 })
 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-04-20 17:00 [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack) Alexander Monakov
@ 2016-04-21 14:08 ` Nathan Sidwell
  2016-05-20 15:09   ` Alexander Monakov
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-04-21 14:08 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches

On 04/20/16 12:59, Alexander Monakov wrote:
> This patch implements per-warp compiler-defined stacks under -msoft-stack
> option, and implements alloca on top of that.  In a few obvious places,
> changes from -muniform-simt patch are present in the hunks.
>

It'd be better to not  mix fragments of patches, and have a description of how 
soft stacks works.

> +      /* fstmp2 = &__nvptx_stacks[tid.y];  */

?

> +      /* crtl->is_leaf is not initialized because RA is not run.  */

Cryptic comment is cryptic.


> +      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",

Magic constant '32'?

> +  if (need_unisimt_decl)
> +    {
> +      write_var_marker (asm_out_file, false, true, "__nvptx_uni");
> +      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n");
> +    }

Looks like some other patch?



>
>   /* Expander for the shuffle builtins.  */
> diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
> index 381269e..6da4d06 100644
> --- a/gcc/config/nvptx/nvptx.h
> +++ b/gcc/config/nvptx/nvptx.h

> +      if (TARGET_UNIFORM_SIMT)			\
> +        builtin_define ("__nvptx_unisimt__");	\

Likewise.


> +  rtx unisimt_master; /* Master lane index for "uniform simt" mode.  */
> +  rtx unisimt_predicate; /* Predicate register for "uniform simt".  */

Likewise.

Needs testcases.

nathan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-04-21 14:08 ` Nathan Sidwell
@ 2016-05-20 15:09   ` Alexander Monakov
  2016-05-20 16:36     ` Sandra Loosemore
  2016-05-20 19:42     ` Nathan Sidwell
  0 siblings, 2 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-05-20 15:09 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc-patches, Sandra Loosemore

On Thu, 21 Apr 2016, Nathan Sidwell wrote:
> On 04/20/16 12:59, Alexander Monakov wrote:
> > This patch implements per-warp compiler-defined stacks under -msoft-stack
> > option, and implements alloca on top of that.  In a few obvious places,
> > changes from -muniform-simt patch are present in the hunks.
> >
> 
> It'd be better to not  mix fragments of patches, and have a description of how
> soft stacks works.

Added a description in the intro text.

> > +      /* fstmp2 = &__nvptx_stacks[tid.y];  */
> 
> ?

This is what is being computed in the emitted assembly.  Changed the comment
to say,
+      /* Now %fstmp2 holds the value of '&__nvptx_stacks[%tid.y]'.  */

> > +      /* crtl->is_leaf is not initialized because RA is not run.  */
> 
> Cryptic comment is cryptic.

Changed the comment to say,
+      /* Ideally we'd use 'crtl->is_leaf' here, but it is computed during
+         register allocator initialization, which is not done on NVPTX.  */

> > +      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",
> 
> Magic constant '32'?

It's the maximum warp count in a CUDA block, but since it's an external
declaration, it can be omitted altogether; changed to use empty brackets.

> > +  if (need_unisimt_decl)
> > +    {
> > +      write_var_marker (asm_out_file, false, true, "__nvptx_uni");
> > +      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n");
> > +    }
> 
> Looks like some other patch?

Yes, and likewise in other instances. Both the cover letter and the patch
description mentioned that.  Removed in this resubmission.

> Needs testcases.

Added a testcase that enables -msoft-stack explicitely; testing with
RUNTESTFLAGS=--target_board=nvptx-none-run/-msoft-stack gives this
functionality plenty of exercise, otherwise.

Updated patch below.

Alexander

This patch implements '-msoft-stack' code generation variant for NVPTX.  The
goal is to avoid relying on '.local' memory space for placement of automatic
data, and instead have an explicitely-maintained stack pointer (which can be
set up to point to preallocated global memory space).  This allows to have
stack data accessible from all threads and modifiable with atomic
instructions.  This also allows to implement variable-length stack allocation
(for 'alloca' and C99 VLAs).

Each warp has its own 'soft stack' pointer.  It lives in shared memory array
called __nvptx_stacks at index %tid.y (like OpenACC, OpenMP is offloading is
going to use launch geometry such that %tid.y gives the warp index).  It is
retrieved in function prologue (if the function needs a stack frame) and may
also be written there (if the function is non-leaf, so that its callees see
the updated stack pointer), and restored prior to returning.

Startup code is responsible for setting up the initial soft-stack pointer. For
-mmainkernel testing it is libgcc's __main, for OpenMP offloading it's the
kernel region entry code.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* lib/target-supports.exp (check_effective_target_alloca): Use a
	compile test.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* gcc.target/nvptx/softstack.c: New test.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c (nvptx_declare_function_name): Expand comments.
	(nvptx_file_end): Do not emit element count in external declaration of
	__nvptx_stacks.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* doc/invoke.texi (msoft-stack): Rewrite.

2016-03-15  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.h (STACK_SIZE_MODE): Define.

2015-12-14  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c (nvptx_declare_function_name): Emit %outargs
	using .local %outargs_ar only if not TARGET_SOFT_STACK.  Emit %outargs
	under TARGET_SOFT_STACK by offsetting from %frame.
	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
	* config/nvptx/nvptx.md (nvptx_register_operand): Allow %outargs under
	TARGET_SOFT_STACK.
	(nvptx_nonimmediate_operand): Ditto.
	(allocate_stack): Implement for TARGET_SOFT_STACK.  Remove unused code.
	(allocate_stack_<mode>): Remove unused pattern.
	(set_softstack_insn): New pattern.
	(restore_stack_block): Handle for TARGET_SOFT_STACK.

2015-12-09  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c: (need_softstack_decl): New variable.
	(nvptx_declare_function_name): Handle TARGET_SOFT_STACK.
	(nvptx_output_return): Emit stack restore if needed.
	(nvptx_file_end): Handle need_softstack_decl.
	* config/nvptx/nvptx.h: (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_softstack__ when -msoft-stack is active.
	(struct machine_function): New bool field using_softstack.
	* config/nvptx/nvptx.opt: (msoft-stack): New option.
	* doc/invoke.texi (msoft-stack): Document.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2d4dad1..700c4b0 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -139,6 +129,9 @@ static GTY(()) rtx worker_red_sym;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function references __nvptx_stacks.  */
+static bool need_softstack_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -992,16 +1091,56 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 
   fprintf (file, "%s", s.str().c_str());
 
-  /* Declare a local var for outgoing varargs.  */
-  if (cfun->machine->has_varadic)
-    init_frame (file, STACK_POINTER_REGNUM,
-		UNITS_PER_WORD, crtl->outgoing_args_size);
-
-  /* Declare a local variable for the frame.  */
   HOST_WIDE_INT sz = get_frame_size ();
-  if (sz || cfun->machine->has_chain)
-    init_frame (file, FRAME_POINTER_REGNUM,
-		crtl->stack_alignment_needed / BITS_PER_UNIT, sz);
+  bool need_frameptr = sz || cfun->machine->has_chain;
+  int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
+  if (!TARGET_SOFT_STACK)
+    {
+      /* Declare a local var for outgoing varargs.  */
+      if (cfun->machine->has_varadic)
+	init_frame (file, STACK_POINTER_REGNUM,
+		    UNITS_PER_WORD, crtl->outgoing_args_size);
+
+      /* Declare a local variable for the frame.  */
+      if (need_frameptr)
+	init_frame (file, FRAME_POINTER_REGNUM, alignment, sz);
+    }
+  else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca)
+    {
+      /* Maintain 64-bit stack alignment.  */
+      int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
+      sz = ROUND_UP (sz, keep_align);
+      int bits = POINTER_SIZE;
+      fprintf (file, "\t.reg.u%d %%frame;\n", bits);
+      fprintf (file, "\t.reg.u32 %%fstmp0;\n");
+      fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
+      fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);
+      fprintf (file, "\tmov.u32 %%fstmp0, %%tid.y;\n");
+      fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
+	       bits == 64 ? ".wide" : ".lo", bits / 8);
+      fprintf (file, "\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits);
+      fprintf (file, "\tadd.u%d %%fstmp2, %%fstmp2, %%fstmp1;\n", bits);
+      /* Now %fstmp2 holds the value of '&__nvptx_stacks[%tid.y]'.  */
+      fprintf (file, "\tld.shared.u%d %%fstmp1, [%%fstmp2];\n", bits);
+      fprintf (file, "\tsub.u%d %%frame, %%fstmp1, "
+	       HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz);
+      if (alignment > keep_align)
+	fprintf (file, "\tand.b%d %%frame, %%frame, %d;\n",
+		 bits, -alignment);
+      fprintf (file, "\t.reg.u%d %%stack;\n", bits);
+      sz = crtl->outgoing_args_size;
+      gcc_assert (sz % keep_align == 0);
+      fprintf (file, "\tsub.u%d %%stack, %%frame, "
+	       HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz);
+      /* Ideally we'd use 'crtl->is_leaf' here, but it is computed during
+         register allocator initialization, which is not done on NVPTX.  */
+      if (!leaf_function_p ())
+	{
+	  fprintf (file, "\tst.shared.u%d [%%fstmp2], %%stack;\n", bits);
+	  cfun->machine->using_softstack = true;
+	}
+      need_softstack_decl = true;
+    }
 
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
@@ -1037,6 +1178,10 @@ nvptx_output_return (void)
 {
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
+  if (cfun->machine->using_softstack)
+    fprintf (asm_out_file, "\tst.shared.u%d [%%fstmp2], %%fstmp1;\n",
+	     POINTER_SIZE);
+
   if (mode != VOIDmode)
     fprintf (asm_out_file, "\tst.param%s\t[%s_out], %s;\n",
 	     nvptx_ptx_type_from_mode (mode, false),
@@ -1068,6 +1213,8 @@ nvptx_function_ok_for_sibcall (tree, tree)
 static rtx
 nvptx_get_drap_rtx (void)
 {
+  if (TARGET_SOFT_STACK && stack_realign_drap)
+    return arg_pointer_rtx;
   return NULL_RTX;
 }
 
@@ -3939,6 +4189,13 @@ nvptx_file_end (void)
   if (worker_red_size)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
+
+  if (need_softstack_decl)
+    {
+      write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
+      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[];\n",
+	       POINTER_SIZE);
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 381269e..6da4d06 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -31,6 +31,8 @@
       builtin_assert ("machine=nvptx");		\
       builtin_assert ("cpu=nvptx");		\
       builtin_define ("__nvptx__");		\
+      if (TARGET_SOFT_STACK)			\
+        builtin_define ("__nvptx_softstack__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -79,6 +83,7 @@
 
 #define POINTER_SIZE (TARGET_ABI64 ? 64 : 32)
 #define Pmode (TARGET_ABI64 ? DImode : SImode)
+#define STACK_SIZE_MODE Pmode
 
 /* Registers.  Since ptx is a virtual target, we just define a few
    hard registers for special purposes and leave pseudos unallocated.
@@ -200,6 +205,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */
   int num_args;	/* Number of args of current call.  */
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 33a4862..e5650b6 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -961,31 +986,41 @@ (define_expand "allocate_stack"
    (match_operand 1 "nvptx_register_operand")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (stack_pointer_rtx,
+		      gen_rtx_MINUS (Pmode, stack_pointer_rtx, operands[1]));
+      emit_insn (gen_set_softstack_insn (stack_pointer_rtx));
+      emit_move_insn (operands[0], virtual_stack_dynamic_rtx);
+      DONE;
+    }
   /* The ptx documentation specifies an alloca intrinsic (for 32 bit
      only)  but notes it is not implemented.  The assembler emits a
      confused error message.  Issue a blunt one now instead.  */
   sorry ("target cannot support alloca.");
   emit_insn (gen_nop ());
   DONE;
-  if (TARGET_ABI64)
-    emit_insn (gen_allocate_stack_di (operands[0], operands[1]));
-  else
-    emit_insn (gen_allocate_stack_si (operands[0], operands[1]));
-  DONE;
 })
 
-(define_insn "allocate_stack_<mode>"
-  [(set (match_operand:P 0 "nvptx_register_operand" "=R")
-        (unspec:P [(match_operand:P 1 "nvptx_register_operand" "R")]
-                   UNSPEC_ALLOCA))]
-  ""
-  "%.\\tcall (%0), %%alloca, (%1);")
+(define_insn "set_softstack_insn"
+  [(unspec [(match_operand 0 "nvptx_register_operand" "R")] UNSPEC_ALLOCA)]
+  "TARGET_SOFT_STACK"
+{
+  return (cfun->machine->using_softstack
+	  ? "%.\\tst.shared%t0\\t[%%fstmp2], %0;"
+	  : "");
+})
 
 (define_expand "restore_stack_block"
   [(match_operand 0 "register_operand" "")
    (match_operand 1 "register_operand" "")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (operands[0], operands[1]);
+      emit_insn (gen_set_softstack_insn (operands[0]));
+    }
   DONE;
 })
 
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 056b9b2..1a7608b 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@ Link in code for a __main kernel.
 moptimize
 Target Report Var(nvptx_optimize) Init(-1)
 Optimize partition neutering
+
+msoft-stack
+Target Report Mask(SOFT_STACK)
+Use custom stacks instead of local memory for automatic storage.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d281975..f0331e2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -19341,6 +19341,18 @@ offloading execution.
 Apply partitioned execution optimizations.  This is the default when any
 level of optimization is selected.
 
+@item -msoft-stack
+@opindex msoft-stack
+Switch to code generation variant that does not use @code{.local} memory
+directly for stack storage. Instead, a per-warp stack pointer is
+maintained explicitely. This enables variable-length stack allocation (with
+variable-length arrays or @code{alloca}), and when global memory is used for
+underlying storage, makes possible to access automatic variables from other
+threads, or with atomic instructions. This code generation variant is used
+for OpenMP offloading, but the option is exposed on its own for the purpose
+of testing the compiler; to generate code suitable for linking into programs
+using OpenMP offloading, use option @option{-mgomp}.
+
 @end table
 
 @node PDP-11 Options
diff --git a/gcc/testsuite/gcc.target/nvptx/softstack.c b/gcc/testsuite/gcc.target/nvptx/softstack.c
new file mode 100644
index 0000000..73e60f2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/softstack.c
@@ -0,0 +1,23 @@
+/* { dg-options "-O2 -msoft-stack" } */
+/* { dg-do run } */
+
+static __attribute__((noinline,noclone)) int f(int *p)
+{
+  return __sync_lock_test_and_set(p, 1);
+}
+
+static __attribute__((noinline,noclone)) int g(int n)
+{
+  /* Check that variable-length stack allocation works.  */
+  int v[n];
+  v[0] = 0;
+  /* Check that atomic operations can be applied to auto data.  */
+  return f(v) == 0 && v[0] == 1;
+}
+
+int main()
+{
+  if (!g(1))
+    __builtin_abort();
+  return 0;
+}
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index dceabb5..878fdab 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -729,7 +729,10 @@ proc check_effective_target_untyped_assembly {} {
 
 proc check_effective_target_alloca {} {
     if { [istarget nvptx-*-*] } {
-	return 0
+	return [check_no_compiler_messages alloca assembly {
+	    void f (void*);
+	    void g (int n) { f (__builtin_alloca (n)); }
+	}]
     }
     return 1
 }

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-05-20 15:09   ` Alexander Monakov
@ 2016-05-20 16:36     ` Sandra Loosemore
  2016-05-20 19:42     ` Nathan Sidwell
  1 sibling, 0 replies; 10+ messages in thread
From: Sandra Loosemore @ 2016-05-20 16:36 UTC (permalink / raw)
  To: Alexander Monakov, Nathan Sidwell; +Cc: gcc-patches

On 05/20/2016 09:09 AM, Alexander Monakov wrote:

> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index d281975..f0331e2 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -19341,6 +19341,18 @@ offloading execution.
>   Apply partitioned execution optimizations.  This is the default when any
>   level of optimization is selected.
>
> +@item -msoft-stack
> +@opindex msoft-stack
> +Switch to code generation variant that does not use @code{.local} memory

s/Switch to code generation variant/Generate code/

> +directly for stack storage. Instead, a per-warp stack pointer is
> +maintained explicitely. This enables variable-length stack allocation (with

s/explicitely/explicitly/

> +variable-length arrays or @code{alloca}), and when global memory is used for
> +underlying storage, makes possible to access automatic variables from other

s/makes possible/makes it possible/

> +threads, or with atomic instructions. This code generation variant is used
> +for OpenMP offloading, but the option is exposed on its own for the purpose
> +of testing the compiler; to generate code suitable for linking into programs
> +using OpenMP offloading, use option @option{-mgomp}.
> +
>   @end table
>
>   @node PDP-11 Options

The documentation bits are OK with those changes.

-Sandra

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-05-20 15:09   ` Alexander Monakov
  2016-05-20 16:36     ` Sandra Loosemore
@ 2016-05-20 19:42     ` Nathan Sidwell
  2016-05-25  2:32       ` Alexander Monakov
  1 sibling, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-05-20 19:42 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Sandra Loosemore

On 05/20/16 11:09, Alexander Monakov wrote:

> This patch implements '-msoft-stack' code generation variant for NVPTX.  The
> goal is to avoid relying on '.local' memory space for placement of automatic
> data, and instead have an explicitely-maintained stack pointer (which can be
> set up to point to preallocated global memory space).  This allows to have
> stack data accessible from all threads and modifiable with atomic
> instructions.  This also allows to implement variable-length stack allocation
> (for 'alloca' and C99 VLAs).
>
> Each warp has its own 'soft stack' pointer.  It lives in shared memory array
> called __nvptx_stacks at index %tid.y (like OpenACC, OpenMP is offloading is
> going to use launch geometry such that %tid.y gives the warp index).  It is
> retrieved in function prologue (if the function needs a stack frame) and may
> also be written there (if the function is non-leaf, so that its callees see
> the updated stack pointer), and restored prior to returning.
>
> Startup code is responsible for setting up the initial soft-stack pointer. For
> -mmainkernel testing it is libgcc's __main, for OpenMP offloading it's the
> kernel region entry code.

ah,  that's much more understandable,  thanks.  Presumably this doesn't support 
worker-single mode (in OpenACC parlance, I don't know what the OpenMP version of 
that is?)  And neither would it support calls  from vector-partitioned code (I 
think that's SIMD in OpenMP-land?).   It seems like we should reject the 
combination of -msoft-stack -fopenacc?

> 2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

> 2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

> 2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

> 2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

> 2016-03-15  Alexander Monakov  <amonakov@ispras.ru>

> 2015-12-14  Alexander Monakov  <amonakov@ispras.ru>

> 2015-12-09  Alexander Monakov  <amonakov@ispras.ru>

why so many changelogs?  The on-branch development history is irrelvant for 
trunk -- the usual single changelog style should be followed.


> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index 2d4dad1..700c4b0 100644

> @@ -992,16 +1091,56 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)


> +  else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca)
> +    {
> +      /* Maintain 64-bit stack alignment.  */

This block needs a more descriptive comment -- it appears to be doing a great 
deal more than maintaining 64-bit stack alignment!

> +      int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
> +      sz = ROUND_UP (sz, keep_align);
> +      int bits = POINTER_SIZE;
> +      fprintf (file, "\t.reg.u%d %%frame;\n", bits);
> +      fprintf (file, "\t.reg.u32 %%fstmp0;\n");
> +      fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
> +      fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);

Some of these register names appear to be long lived -- and referenced in other 
functions.  It would be better to give those more descriptive names, or even 
give them hard-regs.   You should  certainly  do so for those that are already 
hard regs (%frame & %stack) -- is it more feasible to augment init_frame to 
initialize them?

> @@ -1037,6 +1178,10 @@ nvptx_output_return (void)
>  {
>    machine_mode mode = (machine_mode)cfun->machine->return_mode;
>
> +  if (cfun->machine->using_softstack)
> +    fprintf (asm_out_file, "\tst.shared.u%d [%%fstmp2], %%fstmp1;\n",

See note above about obscure reg names.

   Since ptx is a virtual target, we just define a few
>     hard registers for special purposes and leave pseudos unallocated.
> @@ -200,6 +205,7 @@ struct GTY(()) machine_function
>    bool is_varadic;  /* This call is varadic  */
>    bool has_varadic;  /* Current function has a varadic call.  */
>    bool has_chain; /* Current function has outgoing static chain.  */
> +  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */

Comment should describe what the attribute is, not what it implies.  In this 
case I think it's /*  Current function has   a soft stack frame.  */


> diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
> index 33a4862..e5650b6 100644
> --- a/gcc/config/nvptx/nvptx.md
> +++ b/gcc/config/nvptx/nvptx.md


> +(define_insn "set_softstack_insn"
> +  [(unspec [(match_operand 0 "nvptx_register_operand" "R")] UNSPEC_ALLOCA)]
> +  "TARGET_SOFT_STACK"
> +{
> +  return (cfun->machine->using_softstack
> +	  ? "%.\\tst.shared%t0\\t[%%fstmp2], %0;"
> +	  : "");
> +})

Is this alloca related (UNSPEC_ALLOCA) or restore related (invoked in 
restore_stack_block), or stack setting (as insn name suggests).  Things seem 
inconsistently named.  Comments would be good.

>
>  (define_expand "restore_stack_block"
>    [(match_operand 0 "register_operand" "")
>     (match_operand 1 "register_operand" "")]
>    ""
>  {
> +  if (TARGET_SOFT_STACK)
> +    {
> +      emit_move_insn (operands[0], operands[1]);
> +      emit_insn (gen_set_softstack_insn (operands[0]));

Is it necessary to store the soft stack here?  Only restore_stack_block seems 
defined, and not save_stack_block.  Why the asymmetry?


> diff --git a/gcc/testsuite/gcc.target/nvptx/softstack.c b/gcc/testsuite/gcc.target/nvptx/softstack.c
> new file mode 100644
> index 0000000..73e60f2
> --- /dev/null


This test is ok, but probably wise to check varadic generates the expected code?

Are changes needed in nvptx's  crt0.s (or an auxilary file in libc or libgcc) to 
make this work for testing?  Have you tested an nvptx-elf target with this flag on?

nathan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-05-20 19:42     ` Nathan Sidwell
@ 2016-05-25  2:32       ` Alexander Monakov
  2016-05-25 13:33         ` Nathan Sidwell
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Monakov @ 2016-05-25  2:32 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc-patches, Sandra Loosemore

On Fri, 20 May 2016, Nathan Sidwell wrote:
> ah,  that's much more understandable,  thanks.  Presumably this doesn't
> support worker-single mode (in OpenACC parlance, I don't know what the OpenMP
> version of that is?)

I don't see why you have concerns.  In OpenMP, what OpenACC calls
'worker-single mode' should correspond to execution of a sequential region
(outside of any 'parallel' region). The region is executed by the initial
thread (warp 0), while other warps, having formed a thread pool, are suspended
on that thread pool's barrier.  When the initial thread reaches the parallel
region, it unblocks the warps in the pool.  The other warps may need data that
is allocated on warp 0 stack, so here it's essential that soft-stacks can
exist on global memory and thus be world-readable.

> And neither would it support calls  from vector-partitioned code (I think
> that's SIMD in OpenMP-land?).

Actually it would: the plan is to switch soft-stack pointer to a region of
.local memory when entering OpenMP SIMD region.  This makes soft-stacks use
lane-private storage inside of SIMD regions (but then it's, of course, no
longer world-readable and not modifiable by atomics).

> It seems like we should reject the combination of -msoft-stack -fopenacc?

Possibly; the doc text makes it explicit that the option is exposed only for
the purpose of testing the compiler, anyway.

> why so many changelogs?  The on-branch development history is irrelvant for
> trunk -- the usual single changelog style should be followed.

OK, if branch history is not interesting for review, I can squash it; I'll
have to do that for the final commit anyway.

> > +  else if (need_frameptr || cfun->machine->has_varadic ||
> > cfun->calls_alloca)
> > +    {
> > +      /* Maintain 64-bit stack alignment.  */
> 
> This block needs a more descriptive comment -- it appears to be doing a great
> deal more than maintaining 64-bit stack alignment!

The comment is just for the line that follows, not the whole block.

> > +      int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
> > +      sz = ROUND_UP (sz, keep_align);
> > +      int bits = POINTER_SIZE;
> > +      fprintf (file, "\t.reg.u%d %%frame;\n", bits);
> > +      fprintf (file, "\t.reg.u32 %%fstmp0;\n");
> > +      fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
> > +      fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);
> 
> Some of these register names appear to be long lived -- and referenced in
> other functions.  It would be better to give those more descriptive names, or
> even give them hard-regs.

That's just %fstmp2 (pointer into __nvptx_stacks) and %fstmp1 (previous stack
pointer that we need to restore). I can rename them to %ssloc and %ssold
(better names welcome), but I don't see a value in making them hard-regs --
there's no interface with the middle-end that would be interested in those.

> You should  certainly  do so for those that are already hard regs (%frame &
> %stack)

Sorry, do what? They are already hard regs, and have descriptive names.

> -- is it more feasible to augment init_frame to initialize them?

I don't think so. The whole block could be moved to a separate function though.

>   Since ptx is a virtual target, we just define a few
> >     hard registers for special purposes and leave pseudos unallocated.
> > @@ -200,6 +205,7 @@ struct GTY(()) machine_function
> >    bool is_varadic;  /* This call is varadic  */
> >    bool has_varadic;  /* Current function has a varadic call.  */
> >    bool has_chain; /* Current function has outgoing static chain.  */
> > +  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */
> 
> Comment should describe what the attribute is, not what it implies.  In this
> case I think it's /*  Current function has   a soft stack frame.  */

Yes; note it's false when current function is leaf, so the description should
be more like "Current function has a soft stack frame that needs restoring".

> > diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
> > index 33a4862..e5650b6 100644
> > --- a/gcc/config/nvptx/nvptx.md
> > +++ b/gcc/config/nvptx/nvptx.md
> 
> 
> > +(define_insn "set_softstack_insn"
> > +  [(unspec [(match_operand 0 "nvptx_register_operand" "R")] UNSPEC_ALLOCA)]
> > +  "TARGET_SOFT_STACK"
> > +{
> > +  return (cfun->machine->using_softstack
> > +	  ? "%.\\tst.shared%t0\\t[%%fstmp2], %0;"
> > +	  : "");
> > +})
> 
> Is this alloca related (UNSPEC_ALLOCA) or restore related (invoked in
> restore_stack_block), or stack setting (as insn name suggests).  Things seem
> inconsistently named.  Comments would be good.

OK, I'll add some in a respin. This is related to stack setting. I can add a
new UNSPEC for that (UNSPEC_SET_SOFTSTACK).

> >
> >  (define_expand "restore_stack_block"
> >    [(match_operand 0 "register_operand" "")
> >    (match_operand 1 "register_operand" "")]
> >    ""
> > {
> > +  if (TARGET_SOFT_STACK)
> > +    {
> > +      emit_move_insn (operands[0], operands[1]);
> > +      emit_insn (gen_set_softstack_insn (operands[0]));
> 
> Is it necessary to store the soft stack here?

As a QoI matter, yes -- otherwise calls following a stack restore will see an
unrestored stack pointer, causing stack space waste.

> Only restore_stack_block seems defined, and not save_stack_block.  Why the
> asymmetry?

Because no special handling is needed on saving. We need to update the pointer
in shared memory when stack pointer is moved (thus, in allocate_stack and
restore_stack_block), but when the stack pointer is copied to a save area we
don't need to do anything special.

> This test is ok, but probably wise to check varadic generates the expected
> code?

Perhaps; I can add a printf to the test.

> Are changes needed in nvptx's  crt0.s (or an auxilary file in libc or libgcc)
> to make this work for testing?

Yes.

> Have you tested an nvptx-elf target with this flag on?

(assuming you mean nvptx-none via nvptx-none-run) Yes, with

make -k check-c DEJAGNU=...  RUNTESTFLAGS=--target_board=nvptx-none-run/-msoft-stack

I saw some changes due to tests passing when they otherwise wouldn't (e.g.
changing an auto variable with an atomic op). There are also changes in either
direction due to PTX translation bugs in the CUDA driver.

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-05-25  2:32       ` Alexander Monakov
@ 2016-05-25 13:33         ` Nathan Sidwell
  2016-06-02 21:22           ` Alexander Monakov
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-05-25 13:33 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Sandra Loosemore

On 05/24/16 17:29, Alexander Monakov wrote:
> On Fri, 20 May 2016, Nathan Sidwell wrote:
>> ah,  that's much more understandable,  thanks.  Presumably this doesn't
>> support worker-single mode (in OpenACC parlance, I don't know what the OpenMP
>> version of that is?)
>
> I don't see why you have concerns.

Because I want to understand it.

>> It seems like we should reject the combination of -msoft-stack -fopenacc?
>
> Possibly; the doc text makes it explicit that the option is exposed only for
> the purpose of testing the compiler, anyway.

It is always best to prevent the user doing something you don't recommend, 
rather than presume they'll be sensible.

>>> +  else if (need_frameptr || cfun->machine->has_varadic ||
>>> cfun->calls_alloca)
>>> +    {
>>> +      /* Maintain 64-bit stack alignment.  */
>>
>> This block needs a more descriptive comment -- it appears to be doing a great
>> deal more than maintaining 64-bit stack alignment!
>
> The comment is just for the line that follows, not the whole block.

It's the only comment in the block and right at the start.   Fix it.

>>> +      int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
>>> +      sz = ROUND_UP (sz, keep_align);
>>> +      int bits = POINTER_SIZE;
>>> +      fprintf (file, "\t.reg.u%d %%frame;\n", bits);
>>> +      fprintf (file, "\t.reg.u32 %%fstmp0;\n");
>>> +      fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
>>> +      fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);
>>
>> Some of these register names appear to be long lived -- and referenced in
>> other functions.  It would be better to give those more descriptive names, or
>> even give them hard-regs.
>
> That's just %fstmp2 (pointer into __nvptx_stacks) and %fstmp1 (previous stack
> pointer that we need to restore). I can rename them to %ssloc and %ssold
> (better names welcome), but I don't see a value in making them hard-regs --
> there's no interface with the middle-end that would be interested in those.

Making them hard regs would elide the need to keep naming them in multiple 
places.  Sure, you'll have to use some #define symbol to index the array, but 
then you'll get a build time error if there's a mismatch.  You can also put a 
comment on that #define saying what it is in an obvious place.

The lack of comments means it's not clear to me what ssold is -- is it the 
caller's value of the slot?
%sspslot for the address of the __soft_stack array element.
%sspprev for the caller'svalue (if that is indeed what you mean by ssold)

>> You should  certainly  do so for those that are already hard regs (%frame &
>> %stack)
>
> Sorry, do what? They are already hard regs, and have descriptive names.

Use the hard reg name array, not duplicated string constants.

>> -- is it more feasible to augment init_frame to initialize them?
>
> I don't think so. The whole block could be moved to a separate function though.

That would be acceptable too.

>>> +  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */
>>
>> Comment should describe what the attribute is, not what it implies.  In this
>> case I think it's /*  Current function has   a soft stack frame.  */
 >
 > Yes; note it's false when current function is leaf, so the description should
 > be more like "Current function has a soft stack frame that needs restoring".

Ok.  so the name doesn't match the semantics.  I suggest renaming it to 
'restore_softstack', with a similarly adjusted comment.


>> Is this alloca related (UNSPEC_ALLOCA) or restore related (invoked in
>> restore_stack_block), or stack setting (as insn name suggests).  Things seem
>> inconsistently named.  Comments would be good.
>
> OK, I'll add some in a respin. This is related to stack setting. I can add a
> new UNSPEC for that (UNSPEC_SET_SOFTSTACK).

Thanks -- although integers are bounded in our computery world, it's not like 
there's a shortage here!

>> Only restore_stack_block seems defined, and not save_stack_block.  Why the
>> asymmetry?
>
> Because no special handling is needed on saving. We need to update the pointer
> in shared memory when stack pointer is moved (thus, in allocate_stack and
> restore_stack_block), but when the stack pointer is copied to a save area we
> don't need to do anything special.

Thanks -- lack of comments here really impedes understanding.

>> Are changes needed in nvptx's  crt0.s (or an auxilary file in libc or libgcc)
>> to make this work for testing?
>
> Yes.

Please include  that chunk in the next revision of this patch.  But see the 
commit I just made.  I intend to move the malloc stuff to newlib next, but that 
shouldn't affect the changes needed for this.


> I saw some changes due to tests passing when they otherwise wouldn't (e.g.
> changing an auto variable with an atomic op). There are also changes in either
> direction due to PTX translation bugs in the CUDA driver.

It'd be good to know what that delta is.

nathan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-05-25 13:33         ` Nathan Sidwell
@ 2016-06-02 21:22           ` Alexander Monakov
  2016-06-09 19:49             ` Nathan Sidwell
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Monakov @ 2016-06-02 21:22 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc-patches

On Wed, 25 May 2016, Nathan Sidwell wrote:
> > > It seems like we should reject the combination of -msoft-stack -fopenacc?
> >
> > Possibly; the doc text makes it explicit that the option is exposed only for
> > the purpose of testing the compiler, anyway.
> 
> It is always best to prevent the user doing something you don't recommend,
> rather than presume they'll be sensible.

No change for that yet in the respin; do you want something like

  if (flag_openacc && TARGET_SOFT_STACK)
    sorry ("-fopenacc and -msoft-stack are mutually exclusive");

in nvptx_override_options?

> > > > +  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */
> > >
> > > Comment should describe what the attribute is, not what it implies.  In
> > > this
> > > case I think it's /*  Current function has   a soft stack frame.  */
> >
> > Yes; note it's false when current function is leaf, so the description
> > should
> > be more like "Current function has a soft stack frame that needs restoring".
> 
> Ok.  so the name doesn't match the semantics.  I suggest renaming it to
> 'restore_softstack', with a similarly adjusted comment.

I went with 'has_softstack' (and the comment you've suggested previously) and
made the code check leafness separately.

> > > Are changes needed in nvptx's  crt0.s (or an auxilary file in libc or
> > > libgcc)
> > > to make this work for testing?
> >
> > Yes.
> 
> Please include  that chunk in the next revision of this patch.  But see the
> commit I just made.  I intend to move the malloc stuff to newlib next, but
> that shouldn't affect the changes needed for this.

Done.

Updated patch below.

Alexander

gcc/:
	* config/nvptx/nvptx-protos.h (nvptx_output_set_softstack): Declare.
	* config/nvptx/nvptx.c: (need_softstack_decl): New variable.
	(init_softstack_frame): New.
	(nvptx_declare_function_name): Handle TARGET_SOFT_STACK.
	(nvptx_output_set_softstack): New.
	(nvptx_output_return): Emit stack restore if needed.
	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
	(nvptx_file_end): Handle need_softstack_decl.
	* config/nvptx/nvptx.h: (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_softstack__ when -msoft-stack is active.
	(STACK_SIZE_MODE): Define.
	(FIXED_REGISTERS): Adjust.
	(SOFTSTACK_SLOT_REGNUM): New.
	(SOFTSTACK_PREV_REGNUM): New.
	(REGISTER_NAMES): Adjust.
	(struct machine_function): New bool field has_softstack.
	* config/nvptx/nvptx.md (UNSPEC_SET_SOFTSTACK): New.
	(allocate_stack): Implement for TARGET_SOFT_STACK.  Remove unused code.
	(allocate_stack_<mode>): Remove unused pattern.
	(set_softstack_insn): New pattern.
	(restore_stack_block): Handle for TARGET_SOFT_STACK.
	* config/nvptx/nvptx.opt: (msoft-stack): New option.
	* doc/invoke.texi (msoft-stack): Document.

gcc/testsuite/:
	* gcc.target/nvptx/softstack.c: New test.
	* lib/target-supports.exp (check_effective_target_alloca): Use a
	compile test.

libgcc/:
	* config/nvptx/crt0.c (__main) [__nvptx_softstack__]: Setup
	__nvptx_stacks.
	* config/nvptx/stacks.c: New file.
	* config/nvptx/t-nvptx (LIB2ADD): Add stacks.c.


diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index ec4588e..331ec0a 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -36,5 +36,6 @@ extern void nvptx_register_pragmas (void);
 extern const char *nvptx_output_mov_insn (rtx, rtx);
 extern const char *nvptx_output_call_insn (rtx_insn *, rtx, rtx);
 extern const char *nvptx_output_return (void);
+extern const char *nvptx_output_set_softstack (unsigned);
 #endif
 #endif
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2d4dad1..05cc6dd 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -139,6 +129,9 @@ static GTY(()) rtx worker_red_sym;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function references __nvptx_stacks.  */
+static bool need_softstack_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -931,6 +932,60 @@ init_frame (FILE  *file, int regno, unsigned align, unsigned size)
 	   POINTER_SIZE, reg_names[regno], reg_names[regno]);
 }
 
+/* Emit soft stack frame setup sequence.  */
+
+static void
+init_softstack_frame (FILE *file, unsigned alignment, unsigned size)
+{
+  /* Maintain 64-bit stack alignment.  */
+  int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
+  size = ROUND_UP (size, keep_align);
+  int bits = POINTER_SIZE;
+  const char *reg_stack = reg_names[STACK_POINTER_REGNUM];
+  const char *reg_frame = reg_names[FRAME_POINTER_REGNUM];
+  const char *reg_sspslot = reg_names[SOFTSTACK_SLOT_REGNUM];
+  const char *reg_sspprev = reg_names[SOFTSTACK_PREV_REGNUM];
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_stack);
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_frame);
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspslot);
+  fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspprev);
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32 %%fstmp0;\n");
+  fprintf (file, "\t\t.reg.u%d %%fstmp1;\n", bits);
+  fprintf (file, "\t\t.reg.u%d %%fstmp2;\n", bits);
+  fprintf (file, "\t\tmov.u32 %%fstmp0, %%tid.y;\n");
+  fprintf (file, "\t\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
+	   bits == 64 ? ".wide" : ".lo", bits / 8);
+  fprintf (file, "\t\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits);
+  /* Initialize %sspslot = &__nvptx_stacks[tid.y].  */
+  fprintf (file, "\t\tadd.u%d %s, %%fstmp2, %%fstmp1;\n", bits, reg_sspslot);
+  /* Initialize %sspprev = __nvptx_stacks[tid.y].  */
+  fprintf (file, "\t\tld.shared.u%d %s, [%s];\n",
+	   bits, reg_sspprev, reg_sspslot);
+  /* Initialize %frame = %sspprev - size.  */
+  fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n",
+	   bits, reg_frame, reg_sspprev, size);
+  /* Apply alignment, if larger than 64.  */
+  if (alignment > keep_align)
+    fprintf (file, "\t\tand.b%d %s, %s, %d;\n",
+	     bits, reg_frame, reg_frame, -alignment);
+  size = crtl->outgoing_args_size;
+  gcc_assert (size % keep_align == 0);
+  /* Initialize %stack.  */
+  fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n",
+	   bits, reg_stack, reg_frame, size);
+  /* Usually 'crtl->is_leaf' is computed during register allocator
+     initialization, which is not done on NVPTX.  Compute it now.  */
+  gcc_assert (!crtl->is_leaf);
+  crtl->is_leaf = leaf_function_p ();
+  if (!crtl->is_leaf)
+    fprintf (file, "\t\tst.shared.u%d [%s], %s;\n",
+	     bits, reg_sspslot, reg_stack);
+  fprintf (file, "\t}\n");
+  cfun->machine->has_softstack = true;
+  need_softstack_decl = true;
+}
+
 /* Emit code to initialize the REGNO predicate register to indicate
    whether we are not lane zero on the NAME axis.  */
 
@@ -992,16 +1145,22 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 
   fprintf (file, "%s", s.str().c_str());
 
-  /* Declare a local var for outgoing varargs.  */
-  if (cfun->machine->has_varadic)
-    init_frame (file, STACK_POINTER_REGNUM,
-		UNITS_PER_WORD, crtl->outgoing_args_size);
-
-  /* Declare a local variable for the frame.  */
   HOST_WIDE_INT sz = get_frame_size ();
-  if (sz || cfun->machine->has_chain)
-    init_frame (file, FRAME_POINTER_REGNUM,
-		crtl->stack_alignment_needed / BITS_PER_UNIT, sz);
+  bool need_frameptr = sz || cfun->machine->has_chain;
+  int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
+  if (!TARGET_SOFT_STACK)
+    {
+      /* Declare a local var for outgoing varargs.  */
+      if (cfun->machine->has_varadic)
+	init_frame (file, STACK_POINTER_REGNUM,
+		    UNITS_PER_WORD, crtl->outgoing_args_size);
+
+      /* Declare a local variable for the frame.  */
+      if (need_frameptr)
+	init_frame (file, FRAME_POINTER_REGNUM, alignment, sz);
+    }
+  else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca)
+    init_softstack_frame (file, alignment, sz);
 
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
@@ -1027,6 +1186,21 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 			       REGNO (cfun->machine->axis_predicate[1]), "x");
 }
 
+/* Output instruction that sets soft stack pointer in shared memory to the
+   value in register given by SRC_REGNO.  */
+
+const char *
+nvptx_output_set_softstack (unsigned src_regno)
+{
+  if (cfun->machine->has_softstack && !crtl->is_leaf)
+    {
+      fprintf (asm_out_file, "\tst.shared.u%d\t[%s], ",
+	       POINTER_SIZE, reg_names[SOFTSTACK_SLOT_REGNUM]);
+      output_reg (asm_out_file, src_regno, VOIDmode);
+      fprintf (asm_out_file, ";\n");
+    }
+  return "";
+}
 /* Output a return instruction.  Also copy the return value to its outgoing
    location.  */
 
@@ -1037,6 +1213,9 @@ nvptx_output_return (void)
 {
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
+  if (TARGET_SOFT_STACK)
+    nvptx_output_set_softstack (SOFTSTACK_PREV_REGNUM);
+
   if (mode != VOIDmode)
     fprintf (asm_out_file, "\tst.param%s\t[%s_out], %s;\n",
 	     nvptx_ptx_type_from_mode (mode, false),
@@ -1068,6 +1247,8 @@ nvptx_function_ok_for_sibcall (tree, tree)
 static rtx
 nvptx_get_drap_rtx (void)
 {
+  if (TARGET_SOFT_STACK && stack_realign_drap)
+    return arg_pointer_rtx;
   return NULL_RTX;
 }
 
@@ -3939,6 +4223,13 @@ nvptx_file_end (void)
   if (worker_red_size)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
+
+  if (need_softstack_decl)
+    {
+      write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
+      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[];\n",
+	       POINTER_SIZE);
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 381269e..aceaa40 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -31,6 +31,8 @@
       builtin_assert ("machine=nvptx");		\
       builtin_assert ("cpu=nvptx");		\
       builtin_define ("__nvptx__");		\
+      if (TARGET_SOFT_STACK)			\
+        builtin_define ("__nvptx_softstack__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -79,13 +83,14 @@
 
 #define POINTER_SIZE (TARGET_ABI64 ? 64 : 32)
 #define Pmode (TARGET_ABI64 ? DImode : SImode)
+#define STACK_SIZE_MODE Pmode
 
 /* Registers.  Since ptx is a virtual target, we just define a few
    hard registers for special purposes and leave pseudos unallocated.
    We have to have some available hard registers, to keep gcc setup
    happy.  */
 #define FIRST_PSEUDO_REGISTER 16
-#define FIXED_REGISTERS	    { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
+#define FIXED_REGISTERS	    { 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
 #define CALL_USED_REGISTERS { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }
 
 #define HARD_REGNO_NREGS(REG, MODE)		\
@@ -133,10 +138,17 @@ enum reg_class             {  NO_REGS,    ALL_REGS,	LIM_REG_CLASSES };
 #define FRAME_POINTER_REGNUM 2
 #define ARG_POINTER_REGNUM 3
 #define STATIC_CHAIN_REGNUM 4
+/* This register points to the shared memory location with the current warp's
+   soft stack pointer (__nvptx_stacks[tid.y]).  */
+#define SOFTSTACK_SLOT_REGNUM 5
+/* This register is used to save the previous value of the soft stack pointer
+   in the prologue and restore it when returning.  */
+#define SOFTSTACK_PREV_REGNUM 6
 
 #define REGISTER_NAMES							\
   {									\
-    "%value", "%stack", "%frame", "%args", "%chain", "%hr5", "%hr6", "%hr7", \
+    "%value", "%stack", "%frame", "%args",                              \
+    "%chain", "%sspslot", "%sspprev", "%hr7",                           \
     "%hr8", "%hr9", "%hr10", "%hr11", "%hr12", "%hr13", "%hr14", "%hr15" \
   }
 
@@ -200,6 +212,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool has_softstack; /* Current function has a soft stack frame.  */
   int num_args;	/* Number of args of current call.  */
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 33a4862..8e08178 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -36,6 +36,8 @@ (define_c_enum "unspec" [
 
    UNSPEC_ALLOCA
 
+   UNSPEC_SET_SOFTSTACK
+
    UNSPEC_DIM_SIZE
 
    UNSPEC_BIT_CONV
@@ -961,31 +988,40 @@ (define_expand "allocate_stack"
    (match_operand 1 "nvptx_register_operand")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (stack_pointer_rtx,
+		      gen_rtx_MINUS (Pmode, stack_pointer_rtx, operands[1]));
+      emit_insn (gen_set_softstack_insn (stack_pointer_rtx));
+      emit_move_insn (operands[0], virtual_stack_dynamic_rtx);
+      DONE;
+    }
   /* The ptx documentation specifies an alloca intrinsic (for 32 bit
      only)  but notes it is not implemented.  The assembler emits a
      confused error message.  Issue a blunt one now instead.  */
   sorry ("target cannot support alloca.");
   emit_insn (gen_nop ());
   DONE;
-  if (TARGET_ABI64)
-    emit_insn (gen_allocate_stack_di (operands[0], operands[1]));
-  else
-    emit_insn (gen_allocate_stack_si (operands[0], operands[1]));
-  DONE;
 })
 
-(define_insn "allocate_stack_<mode>"
-  [(set (match_operand:P 0 "nvptx_register_operand" "=R")
-        (unspec:P [(match_operand:P 1 "nvptx_register_operand" "R")]
-                   UNSPEC_ALLOCA))]
-  ""
-  "%.\\tcall (%0), %%alloca, (%1);")
+(define_insn "set_softstack_insn"
+  [(unspec [(match_operand 0 "nvptx_register_operand" "R")]
+	   UNSPEC_SET_SOFTSTACK)]
+  "TARGET_SOFT_STACK"
+{
+  return nvptx_output_set_softstack (REGNO (operands[0]));
+})
 
 (define_expand "restore_stack_block"
   [(match_operand 0 "register_operand" "")
    (match_operand 1 "register_operand" "")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (operands[0], operands[1]);
+      emit_insn (gen_set_softstack_insn (operands[0]));
+    }
   DONE;
 })
 
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 056b9b2..1a7608b 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@ Link in code for a __main kernel.
 moptimize
 Target Report Var(nvptx_optimize) Init(-1)
 Optimize partition neutering
+
+msoft-stack
+Target Report Mask(SOFT_STACK)
+Use custom stacks instead of local memory for automatic storage.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d281975..44d75c0 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -19341,6 +19341,18 @@ offloading execution.
 Apply partitioned execution optimizations.  This is the default when any
 level of optimization is selected.
 
+@item -msoft-stack
+@opindex msoft-stack
+Generate code that does not use @code{.local} memory
+directly for stack storage. Instead, a per-warp stack pointer is
+maintained explicitly. This enables variable-length stack allocation (with
+variable-length arrays or @code{alloca}), and when global memory is used for
+underlying storage, makes it possible to access automatic variables from other
+threads, or with atomic instructions. This code generation variant is used
+for OpenMP offloading, but the option is exposed on its own for the purpose
+of testing the compiler; to generate code suitable for linking into programs
+using OpenMP offloading, use option @option{-mgomp}.
+
 @end table
 
 @node PDP-11 Options
diff --git a/gcc/testsuite/gcc.target/nvptx/softstack.c b/gcc/testsuite/gcc.target/nvptx/softstack.c
new file mode 100644
index 0000000..73e60f2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/softstack.c
@@ -0,0 +1,23 @@
+/* { dg-options "-O2 -msoft-stack" } */
+/* { dg-do run } */
+
+static __attribute__((noinline,noclone)) int f(int *p)
+{
+  return __sync_lock_test_and_set(p, 1);
+}
+
+static __attribute__((noinline,noclone)) int g(int n)
+{
+  /* Check that variable-length stack allocation works.  */
+  int v[n];
+  v[0] = 0;
+  /* Check that atomic operations can be applied to auto data.  */
+  return f(v) == 0 && v[0] == 1;
+}
+
+int main()
+{
+  if (!g(1))
+    __builtin_abort();
+  return 0;
+}
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index dceabb5..878fdab 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -729,7 +729,10 @@ proc check_effective_target_untyped_assembly {} {
 
 proc check_effective_target_alloca {} {
     if { [istarget nvptx-*-*] } {
-	return 0
+	return [check_no_compiler_messages alloca assembly {
+	    void f (void*);
+	    void g (int n) { f (__builtin_alloca (n)); }
+	}]
     }
     return 1
 }
diff --git a/libgcc/config/nvptx/crt0.c b/libgcc/config/nvptx/crt0.c
index 3b7382d..a147c7e 100644
--- a/libgcc/config/nvptx/crt0.c
+++ b/libgcc/config/nvptx/crt0.c
@@ -33,5 +33,11 @@ __main (int *rval_ptr, int argc, void **argv)
   if (rval_ptr)
     *rval_ptr = 255;
 
+#ifdef __nvptx_softstack__
+  extern char *__nvptx_stacks[] __attribute__((shared));
+  static char stack[131072] __attribute__((aligned(8)));
+  __nvptx_stacks[0] = stack + sizeof stack;
+#endif
+
   exit (main (argc, argv));
 }
diff --git a/libgcc/config/nvptx/stacks.c b/libgcc/config/nvptx/stacks.c
new file mode 100644
index 0000000..4640fc9
--- /dev/null
+++ b/libgcc/config/nvptx/stacks.c
@@ -0,0 +1,25 @@
+/* Define shared memory array for -msoft-stack.
+
+   Copyright (C) 2015-2016 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+char *__nvptx_stacks[32] __attribute__((shared,nocommon));
diff --git a/libgcc/config/nvptx/t-nvptx b/libgcc/config/nvptx/t-nvptx
index daf252f..23a8252 100644
--- a/libgcc/config/nvptx/t-nvptx
+++ b/libgcc/config/nvptx/t-nvptx
@@ -1,4 +1,5 @@
-LIB2ADD=$(srcdir)/config/nvptx/reduction.c
+LIB2ADD=$(srcdir)/config/nvptx/reduction.c \
+	$(srcdir)/config/nvptx/stacks.c
 
 LIB2ADDEH=
 LIB2FUNCS_EXCLUDE=__main

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-06-02 21:22           ` Alexander Monakov
@ 2016-06-09 19:49             ` Nathan Sidwell
  2016-06-09 20:11               ` Alexander Monakov
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Sidwell @ 2016-06-09 19:49 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches

On 06/02/16 17:22, Alexander Monakov wrote:
> On Wed, 25 May 2016, Nathan Sidwell wrote:
>>>> It seems like we should reject the combination of -msoft-stack -fopenacc?
>>>
>>> Possibly; the doc text makes it explicit that the option is exposed only for
>>> the purpose of testing the compiler, anyway.
>>
>> It is always best to prevent the user doing something you don't recommend,
>> rather than presume they'll be sensible.
>
> No change for that yet in the respin; do you want something like
>
>   if (flag_openacc && TARGET_SOFT_STACK)
>     sorry ("-fopenacc and -msoft-stack are mutually exclusive");
>
> in nvptx_override_options?

Yes, but it's not a sorry, just a regular error.


> +static void
> +init_softstack_frame (FILE *file, unsigned alignment, unsigned size)
> +{
> +  /* Maintain 64-bit stack alignment.  */

This still needs blank lines to aid readibility.

> @@ -1027,6 +1186,21 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
>  			       REGNO (cfun->machine->axis_predicate[1]), "x");
>  }
>
> +/* Output instruction that sets soft stack pointer in shared memory to the
> +   value in register given by SRC_REGNO.  */
> +
> +const char *
> +nvptx_output_set_softstack (unsigned src_regno)
> +{
> +  if (cfun->machine->has_softstack && !crtl->is_leaf)
> +    {
> +      fprintf (asm_out_file, "\tst.shared.u%d\t[%s], ",
> +	       POINTER_SIZE, reg_names[SOFTSTACK_SLOT_REGNUM]);
> +      output_reg (asm_out_file, src_regno, VOIDmode);
> +      fprintf (asm_out_file, ";\n");
> +    }
> +  return "";

I think this would  be clearer by having the softstack insn's pattern mention 
both the SRC and softstack reg and then simply return either "" or 
"st.shared%t\\t%0, %1"  (and do that directly in the insn's emitter section 
rather than an fn call?)  Then also have an epilogue  expander emit the insn, 
rather than embed it in the return emitter.


> --- a/gcc/config/nvptx/nvptx.md
> +++ b/gcc/config/nvptx/nvptx.md



>  (define_expand "restore_stack_block"
>    [(match_operand 0 "register_operand" "")
>     (match_operand 1 "register_operand" "")]

you've not addressed my previous comments about this.


> @@ -0,0 +1,25 @@

> +char *__nvptx_stacks[32] __attribute__((shared,nocommon));

Is there a reason this can't be in crt0?
It should be 'void *'  Also, why '32' when only slot zero is initialized?  ISTM 
that this should be size 1.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)
  2016-06-09 19:49             ` Nathan Sidwell
@ 2016-06-09 20:11               ` Alexander Monakov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexander Monakov @ 2016-06-09 20:11 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc-patches

On Thu, 9 Jun 2016, Nathan Sidwell wrote:
> >  (define_expand "restore_stack_block"
> >    [(match_operand 0 "register_operand" "")
> >     (match_operand 1 "register_operand" "")]
> 
> you've not addressed my previous comments about this.

To be clear -- do you mean that "restore_stack_block" should have a comment
mentioning why "save_stack_block" is not overridden?

> > @@ -0,0 +1,25 @@
> 
> > +char *__nvptx_stacks[32] __attribute__((shared,nocommon));
> 
> Is there a reason this can't be in crt0?

It's also needed for offloading compilation, which doesn't link crt0.

> It should be 'void *'  Also, why '32' when only slot zero is initialized?
> ISTM that this should be size 1.

Offloading will need the full width.

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-06-09 20:11 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-20 17:00 [PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack) Alexander Monakov
2016-04-21 14:08 ` Nathan Sidwell
2016-05-20 15:09   ` Alexander Monakov
2016-05-20 16:36     ` Sandra Loosemore
2016-05-20 19:42     ` Nathan Sidwell
2016-05-25  2:32       ` Alexander Monakov
2016-05-25 13:33         ` Nathan Sidwell
2016-06-02 21:22           ` Alexander Monakov
2016-06-09 19:49             ` Nathan Sidwell
2016-06-09 20:11               ` Alexander Monakov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).