public inbox for gcc-patches@gcc.gnu.org
* nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
       [not found]     ` <CAFiYyc0oAd+r97MfpcS8obsLeBmh4Q+qfeyZbszMzhKuR4wQiA@mail.gmail.com>
@ 2022-12-23 14:08       ` Thomas Schwinge
  2022-12-23 21:23         ` Jerry D
  2023-01-11 12:06         ` [PING] " Thomas Schwinge
  0 siblings, 2 replies; 8+ messages in thread
From: Thomas Schwinge @ 2022-12-23 14:08 UTC
  To: Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 7524 bytes --]

Hi!

On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>> For example, for Fortran code like:
>>
>>     write (*,*) "Hello world"
>>
>> ..., 'gfortran' creates:
>>
>>     struct __st_parameter_dt dt_parm.0;
>>
>>     try
>>       {
>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>         dt_parm.0.common.line = 29;
>>         dt_parm.0.common.flags = 128;
>>         dt_parm.0.common.unit = 6;
>>         _gfortran_st_write (&dt_parm.0);
>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>         _gfortran_st_write_done (&dt_parm.0);
>>       }
>>     finally
>>       {
>>         dt_parm.0 = {CLOBBER(eol)};
>>       }
>>
>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>> "Use custom stacks instead of local memory for automatic storage".)
>>
>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>> and dynamically increases the per-thread stack as necessary (thereby
>> potentially reducing parallelism) -- if it manages to understand the call
>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>> to disprove existence of recursion is the common problem, as I've read.
>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>
>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>
>> That's still not an actual problem: if the GPU kernel's stack usage still
>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>> stack then overflows; device-side SIGSEGV.
>>
>> (There is, by the way, some similar analysis by Tom de Vries in
>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>> Recursive tests may fail due to thread stack limit".)
>>
>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>> do like their occasional "'printf' debugging", so we ought to make that
>> work (... without pessimizing any "normal" code).
>>
>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>> scope.
>>
>> There is a way to manually set a per-thread stack size, but it's not
>> obvious which size to set: that size needs to work for the whole GPU
>> kernel, and should be as low as possible (to maximize parallelism).
>> I assume that even if GCC did an accurate call graph analysis of the GPU
>> kernel's maximum stack usage, that still wouldn't help: that's before the
>> PTX JIT does its own code transformations, including stack spilling.
>>
>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
>> (-dlto) for device code".  This might help, assuming that it manages to
>> simplify the libgfortran I/O code such that the PTX JIT then understands
>> the call graph.  But: that's available only starting with recent
>> CUDA 11.4, so not a general solution -- if it works at all, which I've
>> not tested.
>>
>> Similarly, we could enable GCC's LTO for device code generation -- but
>> that's a big project, out of scope at this time.  And again, we don't
>> know if that at all helps this case.
>>
>> I see a few options:
>>
>> (a) Figure out what it is in the libgfortran I/O implementation that
>> causes "Stack size [...] cannot be statically determined", and re-work
>> that code to avoid that, or even disable certain things for nvptx, if
>> feasible.

> Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
> bloat is from things that are unused for simpler I/O cases (so some
> "inheritance" could help), and lots of the bloat is from using
> string/length pairs using char * + size_t for what looks like could be
> encoded a lot more efficiently.
>
> There's probably not much low-hanging fruit.

(Similar comments in Janne's email.)


Well, as had to be expected, libgfortran I/O is really just one example,
but the underlying problem may also be triggered in other ways (via other
newlib/libc functions, for example).

So, really a generic solution seems to be called for.

>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>> I don't really want to do that however: it does introduce a bit of
>> complexity in all the generated device code and run-time overhead that we
>> generally would like to avoid.

Directly using '-msoft-stack' isn't actually possible: it implements
"one stack per 32-thread warp", but for OpenACC we need "one stack per
thread of a warp" (that is, each OpenACC 'vector' independently), and
pre-allocating all those stacks from device memory (which may be a lot!)
would, I expect, really hurt overall performance.
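
As a rough illustration of why per-thread pre-allocation is a concern, the
sketch below compares the device memory needed for one stack per warp with
one stack per thread.  The only number taken from the discussion is the warp
size of 32; the function names and any warp counts or stack sizes passed in
are hypothetical.

```c
/* Warp size as stated in the discussion ("32-thread warp").  */
enum { WARP_SIZE = 32 };

/* Bytes to pre-allocate when each warp gets one stack.
   'warps' and 'stack_bytes' are caller-chosen, hypothetical values.  */
unsigned long long
per_warp_bytes (unsigned long long warps, unsigned long long stack_bytes)
{
  return warps * stack_bytes;
}

/* Bytes to pre-allocate when each thread of each warp gets its own
   stack, as OpenACC 'vector' parallelism would require.  */
unsigned long long
per_thread_bytes (unsigned long long warps, unsigned long long stack_bytes)
{
  return warps * WARP_SIZE * stack_bytes;
}
```

For a made-up launch with 4096 warps and 8192-byte stacks, the per-warp
scheme needs 32 MiB and the per-thread scheme needs 1 GiB: the factor of 32
applies regardless of the numbers chosen.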

>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>> stack objects into heap allocation (during nvptx offloading compilation).
>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>> code paths this is to affect.  (Might also add some compile-time
>> diagnostic, of course.)  Could maybe even limit this to only be used
>> during libgfortran compilation?  This is then conceptually a bit similar
>> to (b), but localized to relevant parts only.  Has such a thing been done
>> before in GCC, that I could build upon?
>>
>> Any other clever ideas?

> Converting to heap allocation is difficult outside of the frontend and you
> have to be very careful with memleaks.

Heh, in fact it seems to be pretty simple!  (Famous last words?)  See
"[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
attached.  What do people think about such a thing?
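
As a C-level analogy only (the attached patch does this while emitting PTX,
in 'init_frame' and 'nvptx_output_return', not by rewriting source code), the
effect on a function with a large frame resembles the sketch below.  The
names 'sum_local' and 'sum_heap' are made up for illustration.  The sketch
checks the 'malloc' result, which the PTX sequence emitted by the WIP patch
does not do.

```c
#include <stdlib.h>
#include <string.h>

enum { N = 1234 };

/* Ordinary version: the array is part of the function's frame, which
   nvptx places in '.local' memory.  */
int
sum_local (void)
{
  unsigned char a[N];
  int sum = 0;

  memset (a, 1, sizeof a);
  for (int i = 0; i < N; i++)
    sum += a[i];
  return sum;
}

/* Analogy for a frame above the threshold: storage is obtained with
   'malloc' on entry and released with 'free' right before returning.  */
int
sum_heap (void)
{
  unsigned char *a = malloc (N);
  int sum = 0;

  if (a == NULL)
    return -1;  /* Allocation failure; handled here only in this sketch.  */
  memset (a, 1, N);
  for (int i = 0; i < N; i++)
    sum += a[i];
  free (a);
  return sum;
}
```

Both functions compute the same result; only the storage of the buffer
differs, which is why the transformation can stay local to frame setup and
function return.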

Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
'-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
default value for '-mframe-malloc-threshold=[...]' (potentially different
for GCC/nvptx target libraries build vs. user-compiled code?), etc.


> The library is written in C and
> I see heap allocated temporaries there but in at least one
> place a stack one is used:
>
> void
> st_endfile (st_parameter_filepos *fpp)
> {
> ...
>       if (u->current_record)
>         {
>           st_parameter_dt dtp;
>           dtp.common = fpp->common;
>           memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>           dtp.u.p.current_unit = u;
>           next_record (&dtp, 1);
>
> that might be a mistake though - maybe it's enough to change that
> to a heap allocation?  It might be also totally superfluous since
> only 'u' should matter here ... (not sure if the above is the case
> you are running into).

(Have not yet looked into that; won't solve the general issue.)
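
For reference, the heap-allocation variant suggested in the quoted text might
look roughly like the sketch below.  All types here are invented stand-ins
for the real libgfortran ones, 'next_record_stub' replaces 'next_record', and
the error return is an assumption; this is not the libgfortran code.

```c
#include <stdlib.h>
#include <string.h>

/* Invented stand-ins for the libgfortran types; layouts are hypothetical.  */
typedef struct { int unit; int flags; } common_t;
typedef struct { void *current_unit; char state[480]; } private_t;
typedef struct { common_t common; union { private_t p; } u; } dt_t;
typedef struct { common_t common; } filepos_t;

/* Counts calls that saw a unit, so the effect is observable.  */
static int records_finished;

/* Stand-in for 'next_record'.  */
static void
next_record_stub (dt_t *dtp, int done)
{
  if (dtp->u.p.current_unit != NULL && done)
    records_finished++;
}

/* Same steps as the quoted 'st_endfile' fragment, but with the large
   temporary on the heap instead of in the frame.
   Returns 0 on success, -1 if the allocation failed.  */
int
endfile_heap (filepos_t *fpp, void *u)
{
  dt_t *dtp = malloc (sizeof *dtp);

  if (dtp == NULL)
    return -1;
  dtp->common = fpp->common;
  memset (&dtp->u.p, 0, sizeof (dtp->u.p));
  dtp->u.p.current_unit = u;
  next_record_stub (dtp, 1);
  free (dtp);
  return 0;
}
```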


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: 0001-WIP-nvptx-mframe-malloc-threshold-Wframe-malloc-thre.patch --]
[-- Type: text/x-diff, Size: 16585 bytes --]

From 3f5524adacff23710cf1cab393a56bf23853cafa Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Dec 2022 21:25:19 +0100
Subject: [PATCH] [WIP] nvptx: '-mframe-malloc-threshold',
 '-Wframe-malloc-threshold'

---
 gcc/config/nvptx/nvptx.cc                     | 102 ++++++++++++++++--
 gcc/config/nvptx/nvptx.h                      |   3 +
 gcc/config/nvptx/nvptx.opt                    |  12 +++
 gcc/doc/invoke.texi                           |  16 ++-
 .../nvptx/frame-malloc-threshold-1.c          |  29 +++++
 .../nvptx/frame-malloc-threshold-2.c          |  13 +++
 .../nvptx/frame-malloc-threshold-3.c          |  14 +++
 .../nvptx/frame-malloc-threshold-4.c          |  16 +++
 .../nvptx/frame-malloc-threshold-5.c          |  15 +++
 .../nvptx/frame-malloc-threshold-6.c          |  15 +++
 .../nvptx/frame-malloc-threshold-7.c          |  15 +++
 11 files changed, 240 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index b93a253ab318..2efd70595991 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -178,6 +178,16 @@ static hash_map<tree_decl_hash, unsigned int> gang_private_shared_hmap;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function 'has_malloc_frame'.
+   Because of 'nvptx_name_replacement', we can't just:
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_FREE));
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_MALLOC));
+   ..., but instead have to track them individually.
+*/
+static bool need_free_malloc_decl;
+static bool have_free_decl;
+static bool have_malloc_decl;
+
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 static bool have_softstack_decl;
@@ -976,6 +986,11 @@ write_fn_marker (std::stringstream &s, bool is_defn, bool globalize,
     s << " GLOBAL";
   s << " FUNCTION " << (is_defn ? "DEF: " : "DECL: ");
   s << name << "\n";
+
+  if (strcmp (name, "free") == 0)
+    have_free_decl = true;
+  else if (strcmp (name, "malloc") == 0)
+    have_malloc_decl = true;
 }
 
 /* Emit a linker marker for a variable decl or defn.  */
@@ -1231,22 +1246,66 @@ nvptx_maybe_record_fnsym (rtx sym)
     nvptx_record_needed_fndecl (decl);
 }
 
+//TODO
 /* Emit a local array to hold some part of a conventional stack frame
    and initialize REGNO to point to it.  If the size is zero, it'll
    never be valid to dereference, so we can simply initialize to
    zero.  */
 
 static void
-init_frame (FILE  *file, int regno, unsigned align, unsigned size)
+init_frame (FILE *file, int regno, int align, HOST_WIDE_INT size)
 {
-  if (size)
-    fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n",
-	     align, reg_names[regno], size);
   fprintf (file, "\t.reg.u%d %s;\n",
 	   POINTER_SIZE, reg_names[regno]);
-  fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
-		  :  "\tmov.u%d %s, 0;\n"),
-	   POINTER_SIZE, reg_names[regno], reg_names[regno]);
+
+  if (regno == FRAME_POINTER_REGNUM
+      && ((unsigned HOST_WIDE_INT) size
+	  >= (unsigned HOST_WIDE_INT) nvptx_frame_malloc_threshold))
+    {
+      warning_at (DECL_SOURCE_LOCATION (current_function_decl),
+		  OPT_Wframe_malloc_threshold,
+		  "using %<malloc%> for frame with size of %wu bytes", size);
+
+      /* <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) states that in addition to the "in-kernel
+	 'malloc()' function" there also exists an "in-kernel
+	 '__nv_aligned_device_malloc()' function", where "the address of the
+	 allocated memory will be a multiple of 'align'".  However that's not
+	 documented on
+	 <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0), so we shall not use that function.  */
+      /* <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0) does not, but
+	 <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) does state that the pointer returned by
+	 "in-kernel 'malloc()' [...] is guaranteed to be aligned to a
+	 16-byte boundary".  */
+      if (align > 16)
+	sorry ("unfulfilled %d bytes alignment for frame", align);
+
+      /* We don't need to support 'realloc', so instead of newlib 'malloc'
+	 directly use the PTX 'malloc'.  */
+      fprintf (file,
+	       "\t{\n"
+	       "\t  .param .u64 %%ptr;\n"
+	       "\t  .param .u64 %%size;\n"
+	       "\t  st.param.u64 [%%size], " HOST_WIDE_INT_PRINT_DEC ";\n"
+	       "\t  call (%%ptr), malloc, (%%size);\n"
+	       "\t  ld.param.u64 %s, [%%ptr];\n"
+	       "\t}\n",
+	       size, reg_names[regno]);
+      cfun->machine->has_malloc_frame = true;
+      need_free_malloc_decl = true;
+    }
+  else
+    {
+      if (size)
+	fprintf (file, "\t.local .align %d .b8 %s_ar[" HOST_WIDE_INT_PRINT_DEC "];\n",
+		 align, reg_names[regno], size);
+      fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
+		      :  "\tmov.u%d %s, 0;\n"),
+	       POINTER_SIZE, reg_names[regno], reg_names[regno]);
+    }
 }
 
 /* Emit soft stack frame setup sequence.  */
@@ -1744,12 +1803,22 @@ nvptx_output_set_softstack (unsigned src_regno)
     }
   return "";
 }
+
 /* Output a return instruction.  Also copy the return value to its outgoing
    location.  */
 
 const char *
 nvptx_output_return (void)
 {
+  if (cfun->machine->has_malloc_frame)
+    fprintf (asm_out_file,
+	     "\t{\n"
+	     "\t  .param .u64 %%ptr;\n"
+	     "\t  st.param.u64 [%%ptr], %s;\n"
+	     "\t  call free, (%%ptr);\n"
+	     "\t}\n",
+	     reg_names[FRAME_POINTER_REGNUM]);
+
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
   if (mode != VOIDmode)
@@ -4470,8 +4539,8 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
       rtx_code_label *label = NULL;
 
       empty = false;
-      /* The frame size might not be DImode compatible, but the frame
-	 array's declaration will be.  So it's ok to round up here.  */
+      /* The frame size might not be DImode-compatible, but the actual frame
+	 allocated by 'init_frame' will be.  So it's ok to round up here.  */
       fs = (fs + GET_MODE_SIZE (DImode) - 1) / GET_MODE_SIZE (DImode);
       /* Detect single iteration loop. */
       if (fs == 1)
@@ -5989,6 +6058,21 @@ write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 static void
 nvptx_file_end (void)
 {
+  if (need_free_malloc_decl)
+    {
+      if (!have_free_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "free");
+	  func_decls << ".extern .func free (.param .b64 %ptr);\n";
+	}
+      if (!have_malloc_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "malloc");
+	  func_decls
+	    << ".extern .func (.param .b64 %ptr) malloc (.param .b64 %size);\n";
+	}
+    }
+
   hash_table<tree_hasher>::iterator iter;
   tree decl;
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index bc1021a80317..82d695551090 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -214,6 +214,8 @@ struct nvptx_args {
 
 #define TRAMPOLINE_SIZE 32
 #define TRAMPOLINE_ALIGNMENT 256
+
+#define NVPTX_FRAME_MALLOC_THRESHOLD_INIT 257
 \f
 /* We don't run reload, so this isn't actually used, but it still needs to be
    defined.  Showing an argp->fp elimination also stops
@@ -244,6 +246,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool has_malloc_frame;
   bool has_softstack; /* Current function has a soft stack frame.  */
   bool has_simtreg; /* Current function has an OpenMP SIMD region.  */
   int num_args;	/* Number of args of current call.  */
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 71d3b68510bd..6ccd3defc776 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -28,6 +28,18 @@ Target RejectNegative Mask(ABI64)
 Ignored, but preserved for backward compatibility.  Only 64-bit ABI is
 supported.
 
+mframe-malloc-threshold=
+Target Joined RejectNegative Host_Wide_Int ByteSize Var(nvptx_frame_malloc_threshold) Init(NVPTX_FRAME_MALLOC_THRESHOLD_INIT)
+-mframe-malloc-threshold=<byte-size>	When the frame size reaches <byte-size>, frame allocation switches from '.local' memory to 'malloc'.
+
+mno-frame-malloc-threshold
+Target Alias(mframe-malloc-threshold=,18446744073709551615EiB,none)
+Always use '.local' memory for frame allocation.  Equivalent to -mframe-malloc-threshold=<SIZE_MAX> or larger.
+
+Wframe-malloc-threshold
+Target Warning
+Warn when the threshold is reached where frame allocation switches from '.local' memory to 'malloc'.
+
 mmainkernel
 Target RejectNegative
 Link in code for a __main kernel.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 471309dfacfe..e3b6ea0fe4b8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1179,7 +1179,9 @@ Objective-C and Objective-C++ Dialects}.
 -march=@var{arch}  -mbmx  -mno-bmx  -mcdx  -mno-cdx}
 
 @emph{Nvidia PTX Options}
-@gccoptlist{-m64  -mmainkernel  -moptimize}
+@gccoptlist{-m64 @gol
+-mframe-malloc-threshold=@var{byte-size} @gol
+-mmainkernel  -moptimize}
 
 @emph{OpenRISC Options}
 @gccoptlist{-mboard=@var{name}  -mnewlib  -mhard-mul  -mhard-div @gol
@@ -28367,6 +28369,18 @@ This option sets the values of the preprocessor macros
 for instance, for @samp{3.1} the macros have the values @samp{3} and
 @samp{1}, respectively.
 
+@item -mframe-malloc-threshold=@var{byte-size}
+@opindex mframe-malloc-threshold=
+@opindex mno-frame-malloc-threshold
+TODO
+
+This is not relevant if @code{-msoft-stack} is enabled.
+
+@option{-mframe-malloc-threshold=TODO} is enabled by default.
+This may be disabled either by specifying
+@var{byte-size} of @samp{SIZE_MAX} or more or by
+@option{-mno-frame-malloc-threshold}.
+
 @item -mmainkernel
 @opindex mmainkernel
 Link in code for a __main kernel.  This is for stand-alone instead of
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
new file mode 100644
index 000000000000..b16c17bfdf99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
@@ -0,0 +1,29 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+/* PTX-provided 'free', 'malloc'; cf. 'nvptx_name_replacement'.  */
+void ptx_free (void *) __asm__ ("free");
+void *ptx_malloc (__SIZE_TYPE__) __asm__ ("malloc");
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[1234];
+
+  ptx_malloc (5);
+
+  ptx_free (ptx_malloc (1));
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 2 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 3 } }
+*/
+
+/* Of the implicit (via 'need_free_malloc_decl') and explicit declarations of
+   'free', 'malloc', only one is emitted each:
+   { dg-final { scan-assembler-times {(?n)\.extern .* free .*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)\.extern .* malloc .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
new file mode 100644
index 000000000000..2f6a919eb1f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
new file mode 100644
index 000000000000..7434132b2ad5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[256];
+}
+
+/* We don't exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
new file mode 100644
index 000000000000..c4068ab7ad23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=32 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[32];
+}
+
+/* We exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
new file mode 100644
index 000000000000..cc262427b03c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=1249 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
new file mode 100644
index 000000000000..72017ca2f439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=2KiB } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
new file mode 100644
index 000000000000..b2f85a55f050
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mno-frame-malloc-threshold } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We'll never exceed the specified unlimited '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
-- 
2.35.1



* Re: nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
  2022-12-23 14:08       ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge
@ 2022-12-23 21:23         ` Jerry D
  2023-01-11 12:06         ` [PING] " Thomas Schwinge
  1 sibling, 0 replies; 8+ messages in thread
From: Jerry D @ 2022-12-23 21:23 UTC
  To: Thomas Schwinge, Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

On 12/23/22 6:08 AM, Thomas Schwinge wrote:
> Hi!
> 
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>>> For example, for Fortran code like:
>>>
>>>      write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>      struct __st_parameter_dt dt_parm.0;
>>>
>>>      try
>>>        {
>>>          dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>          dt_parm.0.common.line = 29;
>>>          dt_parm.0.common.flags = 128;
>>>          dt_parm.0.common.unit = 6;
>>>          _gfortran_st_write (&dt_parm.0);
>>>          _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>          _gfortran_st_write_done (&dt_parm.0);
>>>        }
>>>      finally
>>>        {
>>>          dt_parm.0 = {CLOBBER(eol)};
>>>        }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>      warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem: if the GPU kernel's stack usage still
>>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.

I/O has so many twists, turns, and nightmarish corner cases that I would
advise against trying to reduce dt_parm, although it could probably be
done.

For debugging on the GPU, would it not be better to have a way to signal
back to a main thread and print from there, like some sort of callback
in the user's code under test?

Putting this another way: recommend that users debug with a method other
than embedded print statements, rather than doing a ton of work to enable
something that is not really a legitimate use case.

FWIW,

Jerry


* [PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
  2022-12-23 14:08       ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge
  2022-12-23 21:23         ` Jerry D
@ 2023-01-11 12:06         ` Thomas Schwinge
  2023-01-12  2:46           ` Jerry D
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Schwinge @ 2023-01-11 12:06 UTC
  To: Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 8377 bytes --]

Hi!

Ping -- the '-mframe-malloc-threshold' idea, at least.

Note that while this issue originally did pop up for Fortran I/O, it's
likewise relevant for other functions that maintain big frames, for
example in newlib:

    libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
    libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];

Therefore a generic solution (or, workaround if you'd like) does seem
appropriate.
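
To make the effect on the frames listed above concrete, here is a minimal
sketch of the size comparison from the patched 'init_frame', using the WIP
default of 257 ('NVPTX_FRAME_MALLOC_THRESHOLD_INIT').  The function name
'frame_uses_malloc' is made up; in the patch the comparison additionally
applies only to the frame pointer register.

```c
#include <stdbool.h>

/* Default threshold from the WIP patch
   ('NVPTX_FRAME_MALLOC_THRESHOLD_INIT').  */
enum { FRAME_MALLOC_THRESHOLD_DEFAULT = 257 };

/* Mirrors the unsigned 'size >= threshold' test in the patched
   'init_frame'; the 'regno == FRAME_POINTER_REGNUM' part is omitted.  */
bool
frame_uses_malloc (unsigned long long size, unsigned long long threshold)
{
  return size >= threshold;
}
```

With that default, the 2064-byte frames of 'memmem', 'strcasestr', and
'strstr' as well as the 560-byte frame of 'k_rem_pio2' would all switch to
'malloc', while a 256-byte frame would stay in '.local' memory.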


Grüße
 Thomas


On 2022-12-23T15:08:06+0100, I wrote:
> Hi!
>
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>>> For example, for Fortran code like:
>>>
>>>     write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>     struct __st_parameter_dt dt_parm.0;
>>>
>>>     try
>>>       {
>>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>         dt_parm.0.common.line = 29;
>>>         dt_parm.0.common.flags = 128;
>>>         dt_parm.0.common.unit = 6;
>>>         _gfortran_st_write (&dt_parm.0);
>>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>         _gfortran_st_write_done (&dt_parm.0);
>>>       }
>>>     finally
>>>       {
>>>         dt_parm.0 = {CLOBBER(eol)};
>>>       }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem: if the GPU kernel's stack usage still
>>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.
>>>
>>> There is a way to manually set a per-thread stack size, but it's not
>>> obvious which size to set: that size needs to work for the whole GPU
>>> kernel, and should be as low as possible (to maximize parallelism).
>>> I assume that even if GCC did an accurate call graph analysis of the GPU
>>> kernel's maximum stack usage, that still wouldn't help: that's before the
>>> PTX JIT does its own code transformations, including stack spilling.
>>>
>>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
>>> (-dlto) for device code".  This might help, assuming that it manages to
>>> simplify the libgfortran I/O code such that the PTX JIT then understands
>>> the call graph.  But: that's available only starting with recent
>>> CUDA 11.4, so not a general solution -- if it works at all, which I've
>>> not tested.
>>>
>>> Similarly, we could enable GCC's LTO for device code generation -- but
>>> that's a big project, out of scope at this time.  And again, we don't
>>> know if that at all helps this case.
>>>
>>> I see a few options:
>>>
>>> (a) Figure out what it is in the libgfortran I/O implementation that
>>> causes "Stack size [...] cannot be statically determined", and re-work
>>> that code to avoid that, or even disable certain things for nvptx, if
>>> feasible.
>
>> Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
>> bloat is from things that are unused for simpler I/O cases (so some
>> "inheritance" could help), and lots of the bloat is from using
>> string/length pairs using char * + size_t for what looks like could be
>> encoded a lot more efficiently.
>>
>> There's probably not much low-hanging fruit.
>
> (Similar comments in Janne's email.)
>
>
> Well, as had to be expected, libgfortran I/O is really just one example,
> but the underlying problem may also be triggered in other ways (via other
> newlib/libc functions, for example).
>
> So, really a generic solution seems to be called for.
>
>>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>>> I don't really want to do that however: it does introduce a bit of
>>> complexity in all the generated device code and run-time overhead that we
>>> generally would like to avoid.
>
> Directly using '-msoft-stack' isn't actually possible: it does implement
> "one stack per 32-threads warp", but for OpenACC we need "one stack per
> thread of a warp" (that is, each OpenACC 'vector' independently), and
> pre-allocating from device memory all those stacks (which may be a lot!)
> I expect would have a real negative impact on overall performance.
>
>>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>>> stack objects into heap allocation (during nvptx offloading compilation).
>>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>>> code paths this is to affect.  (Might also add some compile-time
>>> diagnostic, of course.)  Could maybe even limit this to only be used
>>> during libgfortran compilation?  This is then conceptually a bit similar
>>> to (b), but localized to relevant parts only.  Has such a thing been done
>>> before in GCC, that I could build upon?
>>>
>>> Any other clever ideas?
>
>> Converting to heap allocation is difficult outside of the frontend and you
>> have to be very careful with memleaks.
>
> Heh, in fact it seems to be pretty simple!  (Famous last words?)  See
> "[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> attached.  What do people think about such a thing?
>
> Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
> '-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
> default value for '-mframe-malloc-threshold=[...]' (potentially different
> for GCC/nvptx target libraries build vs. user-compiled code?), etc.
>
>
>> The library is written in C and
>> I see heap allocated temporaries there but in at least one
>> place a stack one is used:
>>
>> void
>> st_endfile (st_parameter_filepos *fpp)
>> {
>> ...
>>       if (u->current_record)
>>         {
>>           st_parameter_dt dtp;
>>           dtp.common = fpp->common;
>>           memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>>           dtp.u.p.current_unit = u;
>>           next_record (&dtp, 1);
>>
>> that might be a mistake though - maybe it's enough to change that
>> to a heap allocation?  It might be also totally superfluous since
>> only 'u' should matter here ... (not sure if the above is the case
>> you are running into).
>
> (Have not yet looked into that; won't solve the general issue.)
>
>
> Grüße
>  Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: 0001-WIP-nvptx-mframe-malloc-threshold-Wframe-malloc-thre.patch --]
[-- Type: text/x-diff, Size: 16585 bytes --]

From 3f5524adacff23710cf1cab393a56bf23853cafa Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Dec 2022 21:25:19 +0100
Subject: [PATCH] [WIP] nvptx: '-mframe-malloc-threshold',
 '-Wframe-malloc-threshold'

---
 gcc/config/nvptx/nvptx.cc                     | 102 ++++++++++++++++--
 gcc/config/nvptx/nvptx.h                      |   3 +
 gcc/config/nvptx/nvptx.opt                    |  12 +++
 gcc/doc/invoke.texi                           |  16 ++-
 .../nvptx/frame-malloc-threshold-1.c          |  29 +++++
 .../nvptx/frame-malloc-threshold-2.c          |  13 +++
 .../nvptx/frame-malloc-threshold-3.c          |  14 +++
 .../nvptx/frame-malloc-threshold-4.c          |  16 +++
 .../nvptx/frame-malloc-threshold-5.c          |  15 +++
 .../nvptx/frame-malloc-threshold-6.c          |  15 +++
 .../nvptx/frame-malloc-threshold-7.c          |  15 +++
 11 files changed, 240 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index b93a253ab318..2efd70595991 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -178,6 +178,16 @@ static hash_map<tree_decl_hash, unsigned int> gang_private_shared_hmap;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function 'has_malloc_frame'.
+   Because of 'nvptx_name_replacement', we can't just:
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_FREE));
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_MALLOC));
+   ..., but instead have to track them individually.
+*/
+static bool need_free_malloc_decl;
+static bool have_free_decl;
+static bool have_malloc_decl;
+
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 static bool have_softstack_decl;
@@ -976,6 +986,11 @@ write_fn_marker (std::stringstream &s, bool is_defn, bool globalize,
     s << " GLOBAL";
   s << " FUNCTION " << (is_defn ? "DEF: " : "DECL: ");
   s << name << "\n";
+
+  if (strcmp (name, "free") == 0)
+    have_free_decl = true;
+  else if (strcmp (name, "malloc") == 0)
+    have_malloc_decl = true;
 }
 
 /* Emit a linker marker for a variable decl or defn.  */
@@ -1231,22 +1246,66 @@ nvptx_maybe_record_fnsym (rtx sym)
     nvptx_record_needed_fndecl (decl);
 }
 
+//TODO
 /* Emit a local array to hold some part of a conventional stack frame
    and initialize REGNO to point to it.  If the size is zero, it'll
    never be valid to dereference, so we can simply initialize to
    zero.  */
 
 static void
-init_frame (FILE  *file, int regno, unsigned align, unsigned size)
+init_frame (FILE *file, int regno, int align, HOST_WIDE_INT size)
 {
-  if (size)
-    fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n",
-	     align, reg_names[regno], size);
   fprintf (file, "\t.reg.u%d %s;\n",
 	   POINTER_SIZE, reg_names[regno]);
-  fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
-		  :  "\tmov.u%d %s, 0;\n"),
-	   POINTER_SIZE, reg_names[regno], reg_names[regno]);
+
+  if (regno == FRAME_POINTER_REGNUM
+      && ((unsigned HOST_WIDE_INT) size
+	  >= (unsigned HOST_WIDE_INT) nvptx_frame_malloc_threshold))
+    {
+      warning_at (DECL_SOURCE_LOCATION (current_function_decl),
+		  OPT_Wframe_malloc_threshold,
+		  "using %<malloc%> for frame with size of %wu bytes", size);
+
+      /* <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) states that in addition to the "in-kernel
+	 'malloc()' function" there also exists an "in-kernel
+	 '__nv_aligned_device_malloc()' function", where "the address of the
+	 allocated memory will be a multiple of 'align'".  However that's not
+	 documented on
+	 <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0), so we shall not use that function.  */
+      /* <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0) does not, but
+	 <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) does state that the pointer returned by
+	 "in-kernel 'malloc()' [...] is guaranteed to be aligned to a
+	 16-byte boundary".  */
+      if (align > 16)
+	sorry ("unfulfilled %d bytes alignment for frame", align);
+
+      /* We don't need to support 'realloc', so instead of newlib 'malloc'
+	 directly use the PTX 'malloc'.  */
+      fprintf (file,
+	       "\t{\n"
+	       "\t  .param .u64 %%ptr;\n"
+	       "\t  .param .u64 %%size;\n"
+	       "\t  st.param.u64 [%%size], " HOST_WIDE_INT_PRINT_DEC ";\n"
+	       "\t  call (%%ptr), malloc, (%%size);\n"
+	       "\t  ld.param.u64 %s, [%%ptr];\n"
+	       "\t}\n",
+	       size, reg_names[regno]);
+      cfun->machine->has_malloc_frame = true;
+      need_free_malloc_decl = true;
+    }
+  else
+    {
+      if (size)
+	fprintf (file, "\t.local .align %d .b8 %s_ar[" HOST_WIDE_INT_PRINT_DEC "];\n",
+		 align, reg_names[regno], size);
+      fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
+		      :  "\tmov.u%d %s, 0;\n"),
+	       POINTER_SIZE, reg_names[regno], reg_names[regno]);
+    }
 }
 
 /* Emit soft stack frame setup sequence.  */
@@ -1744,12 +1803,22 @@ nvptx_output_set_softstack (unsigned src_regno)
     }
   return "";
 }
+
 /* Output a return instruction.  Also copy the return value to its outgoing
    location.  */
 
 const char *
 nvptx_output_return (void)
 {
+  if (cfun->machine->has_malloc_frame)
+    fprintf (asm_out_file,
+	     "\t{\n"
+	     "\t  .param .u64 %%ptr;\n"
+	     "\t  st.param.u64 [%%ptr], %s;\n"
+	     "\t  call free, (%%ptr);\n"
+	     "\t}\n",
+	     reg_names[FRAME_POINTER_REGNUM]);
+
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
   if (mode != VOIDmode)
@@ -4470,8 +4539,8 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
       rtx_code_label *label = NULL;
 
       empty = false;
-      /* The frame size might not be DImode compatible, but the frame
-	 array's declaration will be.  So it's ok to round up here.  */
+      /* The frame size might not be DImode-compatible, but the actual frame
+	 allocated by 'init_frame' will be.  So it's ok to round up here.  */
       fs = (fs + GET_MODE_SIZE (DImode) - 1) / GET_MODE_SIZE (DImode);
       /* Detect single iteration loop. */
       if (fs == 1)
@@ -5989,6 +6058,21 @@ write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 static void
 nvptx_file_end (void)
 {
+  if (need_free_malloc_decl)
+    {
+      if (!have_free_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "free");
+	  func_decls << ".extern .func free (.param .b64 %ptr);\n";
+	}
+      if (!have_malloc_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "malloc");
+	  func_decls
+	    << ".extern .func (.param .b64 %ptr) malloc (.param .b64 %size);\n";
+	}
+    }
+
   hash_table<tree_hasher>::iterator iter;
   tree decl;
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index bc1021a80317..82d695551090 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -214,6 +214,8 @@ struct nvptx_args {
 
 #define TRAMPOLINE_SIZE 32
 #define TRAMPOLINE_ALIGNMENT 256
+
+#define NVPTX_FRAME_MALLOC_THRESHOLD_INIT 257
 \f
 /* We don't run reload, so this isn't actually used, but it still needs to be
    defined.  Showing an argp->fp elimination also stops
@@ -244,6 +246,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool has_malloc_frame;
   bool has_softstack; /* Current function has a soft stack frame.  */
   bool has_simtreg; /* Current function has an OpenMP SIMD region.  */
   int num_args;	/* Number of args of current call.  */
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 71d3b68510bd..6ccd3defc776 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -28,6 +28,18 @@ Target RejectNegative Mask(ABI64)
 Ignored, but preserved for backward compatibility.  Only 64-bit ABI is
 supported.
 
+mframe-malloc-threshold=
+Target Joined RejectNegative Host_Wide_Int ByteSize Var(nvptx_frame_malloc_threshold) Init(NVPTX_FRAME_MALLOC_THRESHOLD_INIT)
+-mframe-malloc-threshold=<byte-size>	When the frame size is at least <byte-size>, frame allocation switches from '.local' memory to 'malloc'.
+
+mno-frame-malloc-threshold
+Target Alias(mframe-malloc-threshold=,18446744073709551615EiB,none)
+Always use '.local' memory for frame allocation.  Equivalent to -mframe-malloc-threshold=<SIZE_MAX> or larger.
+
+Wframe-malloc-threshold
+Target Warning
+Warn when the threshold is reached where frame allocation switches from '.local' memory to 'malloc'.
+
 mmainkernel
 Target RejectNegative
 Link in code for a __main kernel.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 471309dfacfe..e3b6ea0fe4b8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1179,7 +1179,9 @@ Objective-C and Objective-C++ Dialects}.
 -march=@var{arch}  -mbmx  -mno-bmx  -mcdx  -mno-cdx}
 
 @emph{Nvidia PTX Options}
-@gccoptlist{-m64  -mmainkernel  -moptimize}
+@gccoptlist{-m64 @gol
+-mframe-malloc-threshold=@var{byte-size} @gol
+-mmainkernel  -moptimize}
 
 @emph{OpenRISC Options}
 @gccoptlist{-mboard=@var{name}  -mnewlib  -mhard-mul  -mhard-div @gol
@@ -28367,6 +28369,18 @@ This option sets the values of the preprocessor macros
 for instance, for @samp{3.1} the macros have the values @samp{3} and
 @samp{1}, respectively.
 
+@item -mframe-malloc-threshold=@var{byte-size}
+@opindex mframe-malloc-threshold=
+@opindex mno-frame-malloc-threshold
+TODO
+
+This is not relevant if @code{-msoft-stack} is enabled.
+
+@option{-mframe-malloc-threshold=TODO} is enabled by default.
+This may be disabled either by specifying
+@var{byte-size} of @samp{SIZE_MAX} or more or by
+@option{-mno-frame-malloc-threshold}.
+
 @item -mmainkernel
 @opindex mmainkernel
 Link in code for a __main kernel.  This is for stand-alone instead of
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
new file mode 100644
index 000000000000..b16c17bfdf99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
@@ -0,0 +1,29 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+/* PTX-provided 'free', 'malloc'; cf. 'nvptx_name_replacement'.  */
+void ptx_free (void *) __asm__ ("free");
+void *ptx_malloc (__SIZE_TYPE__) __asm__ ("malloc");
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[1234];
+
+  ptx_malloc (5);
+
+  ptx_free (ptx_malloc (1));
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 2 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 3 } }
+*/
+
+/* Of the implicit (via 'need_free_malloc_decl') and explicit declarations of
+   'free', 'malloc', only one is emitted each:
+   { dg-final { scan-assembler-times {(?n)\.extern .* free .*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)\.extern .* malloc .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
new file mode 100644
index 000000000000..2f6a919eb1f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
new file mode 100644
index 000000000000..7434132b2ad5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[256];
+}
+
+/* We don't exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
new file mode 100644
index 000000000000..c4068ab7ad23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=32 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[32];
+}
+
+/* We exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
new file mode 100644
index 000000000000..cc262427b03c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=1249 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
new file mode 100644
index 000000000000..72017ca2f439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=2KiB } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
new file mode 100644
index 000000000000..b2f85a55f050
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mno-frame-malloc-threshold } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We'll never exceed the specified unlimited '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
-- 
2.35.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
  2023-01-11 12:06         ` [PING] " Thomas Schwinge
@ 2023-01-12  2:46           ` Jerry D
  0 siblings, 0 replies; 8+ messages in thread
From: Jerry D @ 2023-01-12  2:46 UTC (permalink / raw)
  To: Thomas Schwinge, Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

On 1/11/23 4:06 AM, Thomas Schwinge wrote:
> Hi!
> 
> Ping -- the '-mframe-malloc-threshold' idea, at least.
> 
> Note that while this issue originally did pop up for Fortran I/O, it's
> likewise relevant for other functions that maintain big frames, for
> example in newlib:
> 
>      libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
>      libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
>      libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
>      libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];
> 
> Therefore a generic solution (or, workaround if you'd like) does seem
> appropriate.
> 
---snip ---

As a gfortranner I have to at least say anyone doing Fortran I/O on a 
GPU is nuts.

With that said, a configurable option to address the broader issue makes 
sense. Perhaps the default threshold should be whatever it is now and if 
someone has a real situation where it is needed, they can adjust.

Regards,

Jerry


^ permalink raw reply	[flat|nested] 8+ messages in thread

* nvptx, libgcc: Stub unwinding implementation
       [not found] <ae825c453f484ffd99c9be34af726089@mentor.com>
       [not found] ` <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
@ 2023-01-20 21:04 ` Thomas Schwinge
  2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Schwinge @ 2023-01-20 21:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs

[-- Attachment #1: Type: text/plain, Size: 796 bytes --]

Hi!

We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
configuration of libgfortran.  One prerequisite patch, based on WIP work
by Andrew Stubbs, is: "nvptx, libgcc: Stub unwinding implementation", see
attached.  This I've just pushed to devel/omp/gcc-12 branch in
commit 26d3146736218ccaaaafdaba4da1edf969bc190d, and would like to push
to master branch once other pending GCC patches have been accepted.


Grüße
 Thomas



[-- Attachment #2: 0001-nvptx-libgcc-Stub-unwinding-implementation.patch --]
[-- Type: text/x-diff, Size: 3312 bytes --]

From 26d3146736218ccaaaafdaba4da1edf969bc190d Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Sep 2022 18:58:34 +0200
Subject: [PATCH] nvptx, libgcc: Stub unwinding implementation

Adding stub '_Unwind_Backtrace', '_Unwind_GetIPInfo' functions is necessary
for linking libbacktrace, as a normal (non-'LIBGFOR_MINIMAL') configuration
of libgfortran wants to do, for example.

The file 'libgcc/config/nvptx/unwind-nvptx.c' is copied from
'libgcc/config/gcn/unwind-gcn.c'.

libgcc/ChangeLog:

	* config/nvptx/t-nvptx: Add unwind-nvptx.c.
	* config/nvptx/unwind-nvptx.c: New file.

Co-authored-by: Andrew Stubbs <ams@codesourcery.com>
---
 libgcc/ChangeLog.omp               |  6 +++++
 libgcc/config/nvptx/t-nvptx        |  3 ++-
 libgcc/config/nvptx/unwind-nvptx.c | 36 ++++++++++++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)
 create mode 100644 libgcc/config/nvptx/unwind-nvptx.c

diff --git a/libgcc/ChangeLog.omp b/libgcc/ChangeLog.omp
index 2e7bf5cc029..c46f49bf5b7 100644
--- a/libgcc/ChangeLog.omp
+++ b/libgcc/ChangeLog.omp
@@ -1,3 +1,9 @@
+2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
+	    Andrew Stubbs  <ams@codesourcery.com>
+
+	* config/nvptx/t-nvptx: Add unwind-nvptx.c.
+	* config/nvptx/unwind-nvptx.c: New file.
+
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
 	* config/nvptx/crtstuff.c ["mgomp"]
diff --git a/libgcc/config/nvptx/t-nvptx b/libgcc/config/nvptx/t-nvptx
index 9a0454c3a4d..1845a38a35e 100644
--- a/libgcc/config/nvptx/t-nvptx
+++ b/libgcc/config/nvptx/t-nvptx
@@ -1,6 +1,7 @@
 LIB2ADD=$(srcdir)/config/nvptx/reduction.c \
 	$(srcdir)/config/nvptx/mgomp.c \
-	$(srcdir)/config/nvptx/atomic.c
+	$(srcdir)/config/nvptx/atomic.c \
+	$(srcdir)/config/nvptx/unwind-nvptx.c
 
 LIB2ADDEH=
 LIB2FUNCS_EXCLUDE=
diff --git a/libgcc/config/nvptx/unwind-nvptx.c b/libgcc/config/nvptx/unwind-nvptx.c
new file mode 100644
index 00000000000..c657b2af6f3
--- /dev/null
+++ b/libgcc/config/nvptx/unwind-nvptx.c
@@ -0,0 +1,36 @@
+/* Stub unwinding implementation.
+
+   Copyright (C) 2019-2023 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "unwind.h"
+
+_Unwind_Reason_Code
+_Unwind_Backtrace(_Unwind_Trace_Fn trace, void * trace_argument)
+{
+  return 0;
+}
+
+_Unwind_Ptr
+_Unwind_GetIPInfo (struct _Unwind_Context *c, int *ip_before_insn)
+{
+  return 0;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* nvptx, libgfortran: Switch out of "minimal" mode
  2023-01-20 21:04 ` nvptx, libgcc: Stub unwinding implementation Thomas Schwinge
@ 2023-01-20 21:16   ` Thomas Schwinge
  2023-01-20 22:10     ` Thomas Koenig
  2023-01-24  9:37     ` Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode) Thomas Schwinge
  0 siblings, 2 replies; 8+ messages in thread
From: Thomas Schwinge @ 2023-01-20 21:16 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

Hi!

On 2023-01-20T22:04:02+0100, I wrote:
> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
> configuration of libgfortran.

This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
see attached, again based on WIP work by Andrew Stubbs.  This I've just
pushed to devel/omp/gcc-12 branch in
commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
to master branch once other pending GCC patches have been accepted.


The OpenACC XFAILs: "[...] overflows the stack for nvptx offloading"
are unresolved at this point; see the discussion around
"Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?",
and my "nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
experimenting.  (The latter works to some extent, but also has other
issues that I shall detail at some later point in time.)


Grüße
 Thomas



[-- Attachment #2: 0001-nvptx-libgfortran-Switch-out-of-minimal-mode.patch --]
[-- Type: text/x-diff, Size: 9644 bytes --]

From c7734c6fbb5513b4da6306de7bc85de9b8547988 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Sep 2022 18:58:34 +0200
Subject: [PATCH] nvptx, libgfortran: Switch out of "minimal" mode

..., in order to enable (portions of) Fortran I/O, for example.

libgfortran/ChangeLog:

	* configure: Regenerate.
	* configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.

libgomp/ChangeLog:

	* testsuite/libgomp.fortran/target-print-1.f90: Adjust.
	* testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
	* testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
	* testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
	* testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
	* testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.

Co-authored-by: Andrew Stubbs <ams@codesourcery.com>
---
 libgfortran/ChangeLog.omp                       |  6 ++++++
 libgfortran/configure                           | 17 ++++++-----------
 libgfortran/configure.ac                        | 17 ++++++-----------
 libgomp/ChangeLog.omp                           |  7 +++++++
 .../libgomp.fortran/target-print-1-nvptx.f90    | 11 -----------
 .../libgomp.fortran/target-print-1.f90          |  3 ---
 .../libgomp.oacc-fortran/error_stop-2.f         |  4 +++-
 .../libgomp.oacc-fortran/print-1-nvptx.f90      | 11 -----------
 .../testsuite/libgomp.oacc-fortran/print-1.f90  |  5 ++---
 libgomp/testsuite/libgomp.oacc-fortran/stop-2.f |  4 +++-
 10 files changed, 33 insertions(+), 52 deletions(-)
 delete mode 100644 libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
 delete mode 100644 libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90

diff --git a/libgfortran/ChangeLog.omp b/libgfortran/ChangeLog.omp
index b08c264daf9..925575e65fa 100644
--- a/libgfortran/ChangeLog.omp
+++ b/libgfortran/ChangeLog.omp
@@ -1,3 +1,9 @@
+2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
+	    Andrew Stubbs  <ams@codesourcery.com>
+
+	* configure: Regenerate.
+	* configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.
+
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
 	PR target/85463
diff --git a/libgfortran/configure b/libgfortran/configure
index ae64dca3114..3e5c931d4ad 100755
--- a/libgfortran/configure
+++ b/libgfortran/configure
@@ -6230,17 +6230,12 @@ else
 fi
 
 
-# For GPU offloading, not everything in libfortran can be supported.
-# Currently, the only target that has this problem is nvptx.  The
-# following is a (partial) list of features that are unsupportable on
-# this particular target:
-# * Constructors
-# * alloca
-# * C library support for I/O, with printf as the one notable exception
-# * C library support for other features such as signal, environment
-#   variables, time functions
-
- if test "x${target_cpu}" = xnvptx; then
+# "Minimal" mode is for targets that cannot (yet) support all features of
+# libgfortran.  It avoids the need for working constructors, alloca, and C
+# library support for I/O, signals, environment variables, time functions, etc.
+# At present there are no targets that require this mode.
+
+ if false; then
   LIBGFOR_MINIMAL_TRUE=
   LIBGFOR_MINIMAL_FALSE='#'
 else
diff --git a/libgfortran/configure.ac b/libgfortran/configure.ac
index 97cc490cb5e..e5552949cc6 100644
--- a/libgfortran/configure.ac
+++ b/libgfortran/configure.ac
@@ -222,17 +222,12 @@ AM_CONDITIONAL(LIBGFOR_USE_SYMVER, [test "x$gfortran_use_symver" != xno])
 AM_CONDITIONAL(LIBGFOR_USE_SYMVER_GNU, [test "x$gfortran_use_symver" = xgnu])
 AM_CONDITIONAL(LIBGFOR_USE_SYMVER_SUN, [test "x$gfortran_use_symver" = xsun])
 
-# For GPU offloading, not everything in libfortran can be supported.
-# Currently, the only target that has this problem is nvptx.  The
-# following is a (partial) list of features that are unsupportable on
-# this particular target:
-# * Constructors
-# * alloca
-# * C library support for I/O, with printf as the one notable exception
-# * C library support for other features such as signal, environment
-#   variables, time functions
-
-AM_CONDITIONAL(LIBGFOR_MINIMAL, [test "x${target_cpu}" = xnvptx])
+# "Minimal" mode is for targets that cannot (yet) support all features of
+# libgfortran.  It avoids the need for working constructors, alloca, and C
+# library support for I/O, signals, environment variables, time functions, etc.
+# At present there are no targets that require this mode.
+
+AM_CONDITIONAL(LIBGFOR_MINIMAL, false)
 
 # Some compiler target support may have limited support for integer
 # or floating point numbers – or may want to reduce the libgfortran size
diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 32aa9705296..30b1e558ea3 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,12 @@
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* testsuite/libgomp.fortran/target-print-1.f90: Adjust.
+	* testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
+	* testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
+	* testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
+	* testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
+	* testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.
+
 	* plugin/plugin-nvptx.c (nvptx_do_global_cdtors): New.
 	(nvptx_close_device, GOMP_OFFLOAD_load_image)
 	(GOMP_OFFLOAD_unload_image): Call it.
diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
deleted file mode 100644
index a89c9c33484..00000000000
--- a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
+++ /dev/null
@@ -1,11 +0,0 @@
-! Ensure that write on the offload device works, nvptx offloading variant.
-
-! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
-! configuration.
-! { dg-do link } ! ..., but still apply 'dg-do run' options.
-! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
-
-! Skip duplicated testing.
-! { dg-skip-if "separate file" { ! offload_target_nvptx } }
-
-include 'target-print-1.f90'
diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
index 327bb22cb6d..9ac70e5a85f 100644
--- a/libgomp/testsuite/libgomp.fortran/target-print-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
@@ -3,9 +3,6 @@
 ! { dg-do run }
 ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
 
-! Separate file 'target-print-1-nvptx.f90' for nvptx offloading.
-! { dg-skip-if "separate file" { offload_target_nvptx } }
-
 program main
   implicit none
   integer :: var = 42
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
index 5951e8cbe64..bbb4b55ef2c 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
@@ -17,7 +17,9 @@
 
 ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
 
-! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" }
+! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
+! overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
 !
 ! In gfortran's main program, libfortran's set_options is called - which sets
 ! compiler_options.backtrace = 1 by default.  For an offload libgfortran, this
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
deleted file mode 100644
index 866c8654355..00000000000
--- a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
+++ /dev/null
@@ -1,11 +0,0 @@
-! Ensure that write on the offload device works, nvptx offloading variant.
-
-! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
-! configuration.
-! { dg-do link } ! ..., but still apply 'dg-do run' options.
-! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
-
-! Skip duplicated testing.
-! { dg-skip-if "separate file" { ! offload_target_nvptx } }
-
-include 'print-1.f90'
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
index d2f89d915f8..d04503a0249 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
@@ -2,9 +2,8 @@
 
 ! { dg-do run }
 ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
-
-! Separate file 'print-1-nvptx.f90' for nvptx offloading.
-! { dg-skip-if "separate file" { offload_target_nvptx } }
+! The 'write' overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-xfail-run-if TODO { openacc_nvidia_accel_selected } }
 
 ! { dg-additional-options "-fopt-info-note-omp" }
 ! { dg-additional-options "-foffload=-fopt-info-note-omp" }
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
index fe7ee37813a..394de034b1f 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
@@ -17,7 +17,9 @@
 
 ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
 
-! { dg-output "STOP 35(\n|\r\n|\r)+" }
+! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
+! overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-output "STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
 !
 ! PR85463.  The 'exit' implementation used with nvptx
 ! offloading is a little bit different.
-- 
2.25.1



* Re: nvptx, libgfortran: Switch out of "minimal" mode
  2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
@ 2023-01-20 22:10     ` Thomas Koenig
  2023-01-24  9:37     ` Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode) Thomas Schwinge
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Koenig @ 2023-01-20 22:10 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs

Hi Thomas,

> On 2023-01-20T22:04:02+0100, I wrote:
>> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
>> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
>> configuration of libgfortran.
> 
> This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
> see attached, again based on WIP work by Andrew Stubbs.  This I've just
> pushed to devel/omp/gcc-12 branch in
> commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
> to master branch once other pending GCC patches have been accepted.

Looks good to me.

Regards

	Thomas



* Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode)
  2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
  2023-01-20 22:10     ` Thomas Koenig
@ 2023-01-24  9:37     ` Thomas Schwinge
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Schwinge @ 2023-01-24  9:37 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 11871 bytes --]

Hi!

On 2023-01-20T22:16:00+0100, I wrote:
> On 2023-01-20T22:04:02+0100, I wrote:
>> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
>> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
>> configuration of libgfortran.
>
> This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
> see attached, again based on WIP work by Andrew Stubbs.  This I've just
> pushed to devel/omp/gcc-12 branch in
> commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
> to master branch once other pending GCC patches have been accepted.
>
>
> The OpenACC XFAILs: "[...] overflows the stack for nvptx offloading"
> are unresolved at this point; see the discussion around
> "Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?",
> and my "nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> experimenting.  (The latter works to some extent, but also has other
> issues that I shall detail at some later point in time.)

I had a note from Tobias to "update the last but one bullet point at
https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html".  Thus pushed to
devel/omp/gcc-12 branch commit 8c29332e98ca4669a059ebc0d90903b409ae049f
"Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode'",
see attached.  Please consider that one 'fixup'ed into the GCC master
branch submission.


Regards
 Thomas



[-- Attachment #2: 0001-Update-libgomp-libgomp.texi-for-nvptx-libgfortran-Sw.patch --]
[-- Type: text/x-diff, Size: 1777 bytes --]

From 8c29332e98ca4669a059ebc0d90903b409ae049f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 24 Jan 2023 10:29:01 +0100
Subject: [PATCH] Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch
 out of "minimal" mode'

	libgomp/
	* libgomp.texi (nvptx): Update for
	'nvptx, libgfortran: Switch out of "minimal" mode'.
---
 libgomp/libgomp.texi | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 896d187f1ff..17f1509343f 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -4448,7 +4448,7 @@ The used sizes are
 
 The implementation remark:
 @itemize
-@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+@item I/O within OpenMP target regions and OpenACC compute regions is supported
       using the C library @code{printf} functions and the Fortran
       @code{print}/@code{write} statements.
 @end itemize
@@ -4496,9 +4496,11 @@ CUDA version and hardware.
 
 The implementation remark:
 @itemize
-@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
-      using the C library @code{printf} functions. Note that the Fortran
-      @code{print}/@code{write} statements are not supported, yet.
+@item I/O within OpenMP target regions and OpenACC compute regions is supported
+      using the C library @code{printf} functions.
+      Additionally, the Fortran @code{print}/@code{write} statements are
+      supported within OpenMP target regions, but not yet OpenACC compute
+      regions.
 @item Compilation OpenMP code that contains @code{requires reverse_offload}
       requires at least @code{-march=sm_35}, compiling for @code{-march=sm_30}
       is not supported.
-- 
2.25.1



end of thread, other threads:[~2023-01-24  9:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <ae825c453f484ffd99c9be34af726089@mentor.com>
     [not found] ` <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
     [not found]   ` <87zgcxoa05.fsf@euler.schwinge.homeip.net>
     [not found]     ` <CAFiYyc0oAd+r97MfpcS8obsLeBmh4Q+qfeyZbszMzhKuR4wQiA@mail.gmail.com>
2022-12-23 14:08       ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge
2022-12-23 21:23         ` Jerry D
2023-01-11 12:06         ` [PING] " Thomas Schwinge
2023-01-12  2:46           ` Jerry D
2023-01-20 21:04 ` nvptx, libgcc: Stub unwinding implementation Thomas Schwinge
2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
2023-01-20 22:10     ` Thomas Koenig
2023-01-24  9:37     ` Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode) Thomas Schwinge
