Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?

public inbox for fortran@gcc.gnu.org

* Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
       [not found] ` <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
@ 2022-11-11 14:12   ` Thomas Schwinge
  2022-11-11 14:35     ` Richard Biener
  2022-11-11 14:38     ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Janne Blomqvist
  0 siblings, 2 replies; 11+ messages in thread
From: Thomas Schwinge @ 2022-11-11 14:12 UTC (permalink / raw)
  To: fortran, gcc; +Cc: Tom de Vries, Alexander Monakov

Hi!

For example, for Fortran code like:

    write (*,*) "Hello world"

..., 'gfortran' creates:

    struct __st_parameter_dt dt_parm.0;

    try
      {
        dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
        dt_parm.0.common.line = 29;
        dt_parm.0.common.flags = 128;
        dt_parm.0.common.unit = 6;
        _gfortran_st_write (&dt_parm.0);
        _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
        _gfortran_st_write_done (&dt_parm.0);
      }
    finally
      {
        dt_parm.0 = {CLOBBER(eol)};
      }

The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
really! -- there's a lot of state in Fortran I/O apparently).  That's a
problem for GPU execution -- here: OpenACC/nvptx -- where typically you
have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
"Use custom stacks instead of local memory for automatic storage".)

Now, the Nvidia Driver tries to accommodate such largish stack usage,
and dynamically increases the per-thread stack as necessary (thereby
potentially reducing parallelism) -- if it manages to understand the call
graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
to disprove the existence of recursion is the common problem, as I've read.
At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:

    warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined

That's still not an actual problem as long as the GPU kernel's stack usage
fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
I/O handling, there is another such 'dt_parm' put onto the stack, the
stack then overflows; device-side SIGSEGV.
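
To make the arithmetic concrete, here is a minimal C sketch of the stack
budget described above.  The exact figures are assumptions: 1024 bytes for
the per-thread stack and 512 bytes for one 'dt_parm' ("a half-KiB"); the
names 'STACK_LIMIT_BYTES', 'DT_PARM_BYTES' and 'fits_in_stack' are made up
for illustration and exist nowhere in GCC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed figures, taken from the prose: 1 KiB of stack per thread,
   roughly half a KiB per 'dt_parm' object.  */
enum { STACK_LIMIT_BYTES = 1024, DT_PARM_BYTES = 512 };

/* Two 'dt_parm' objects alone already use up the whole budget.  */
static_assert (2 * DT_PARM_BYTES == STACK_LIMIT_BYTES,
               "two dt_parm objects fill the per-thread stack");

/* Return true if NESTED_DT_PARMS simultaneously live 'dt_parm' objects plus
   OTHER_BYTES of additional frame data fit into the per-thread stack.  */
static bool
fits_in_stack (size_t nested_dt_parms, size_t other_bytes)
{
  return nested_dt_parms * DT_PARM_BYTES + other_bytes <= STACK_LIMIT_BYTES;
}
```

With these assumed numbers, 'fits_in_stack (1, 100)' holds, while a second
live 'dt_parm' plus even a single extra byte, 'fits_in_stack (2, 1)', does
not.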

(There is, by the way, some similar analysis by Tom de Vries in
<https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
Recursive tests may fail due to thread stack limit".)

Of course, you shouldn't really be doing I/O in GPU kernels, but people
do like their occasional "'printf' debugging", so we ought to make that
work (... without pessimizing any "normal" code).

I assume that generally reducing the size of 'dt_parm' etc. is out of
scope.

There is a way to manually set a per-thread stack size, but it's not
obvious which size to set: that size needs to work for the whole GPU
kernel, and should be as low as possible (to maximize parallelism).
I assume that even if GCC did an accurate call graph analysis of the GPU
kernel's maximum stack usage, that still wouldn't help: that's before the
PTX JIT does its own code transformations, including stack spilling.

There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
(-dlto) for device code".  This might help, assuming that it manages to
simplify the libgfortran I/O code such that the PTX JIT then understands
the call graph.  But: that's available only starting with recent
CUDA 11.4, so not a general solution -- if it works at all, which I've
not tested.

Similarly, we could enable GCC's LTO for device code generation -- but
that's a big project, out of scope at this time.  And again, we don't
know whether that helps this case at all.

I see a few options:

(a) Figure out what it is in the libgfortran I/O implementation that
causes "Stack size [...] cannot be statically determined", and re-work
that code to avoid that, or even disable certain things for nvptx, if
feasible.

(b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
I don't really want to do that however: it does introduce a bit of
complexity in all the generated device code and run-time overhead that we
generally would like to avoid.

(c) I'm contemplating a tweak/compiler pass for transforming such large
stack objects into heap allocation (during nvptx offloading compilation).
'malloc'/'free' do exist; they're slow, but that's not a problem for the
code paths this is to affect.  (Might also add some compile-time
diagnostic, of course.)  Could maybe even limit this to only be used
during libgfortran compilation?  This is then conceptually a bit similar
to (b), but localized to relevant parts only.  Has such a thing been done
before in GCC, that I could build upon?
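
As a source-level illustration of what option (c) would amount to, here
is a hedged C sketch: the same computation once with a large automatic
array and once with the array moved to the heap.  This only mirrors the
idea; a real implementation would operate inside the compiler, not on
source code, and the names 'BIG', 'sum_on_stack' and 'sum_on_heap' are
hypothetical.

```c
#include <stdlib.h>
#include <string.h>

enum { BIG = 1234 };   /* Hypothetical "large" object size in bytes.  */

/* Before: BIG bytes live in the function's stack frame.  */
static int
sum_on_stack (void)
{
  unsigned char buf[BIG];
  memset (buf, 1, sizeof buf);
  int sum = 0;
  for (size_t i = 0; i < sizeof buf; i++)
    sum += buf[i];
  return sum;
}

/* After: the same object lives on the heap.  Every return path has to
   release it, which is where the memory-leak risk comes from.  */
static int
sum_on_heap (void)
{
  unsigned char *buf = malloc (BIG);
  if (buf == NULL)
    return -1;          /* Allocation may fail; the stack version cannot.  */
  memset (buf, 1, BIG);
  int sum = 0;
  for (size_t i = 0; i < BIG; i++)
    sum += buf[i];
  free (buf);
  return sum;
}
```

Both functions compute the same result; the heap variant trades stack
space for an allocation call and an extra failure mode.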

Any other clever ideas?

Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955


* Re: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
  2022-11-11 14:12   ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Thomas Schwinge
@ 2022-11-11 14:35     ` Richard Biener
  2022-12-23 14:08       ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge
  2022-11-11 14:38     ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Janne Blomqvist
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Biener @ 2022-11-11 14:35 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: fortran, gcc, Tom de Vries, Alexander Monakov

On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
> [...]
>
> Any other clever ideas?

Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
bloat is from things that are unused for simpler I/O cases (so some
"inheritance" could help), and lots of the bloat is from string/length
pairs (char * + size_t) for data that looks like it could be encoded a
lot more efficiently.

There's probably not much low-hanging fruit.

Converting to heap allocation is difficult outside of the frontend and you
have to be very careful with memleaks.  The library is written in C and
I see heap allocated temporaries there but in at least one
place a stack one is used:

void
st_endfile (st_parameter_filepos *fpp)
{
...
      if (u->current_record)
        {
          st_parameter_dt dtp;
          dtp.common = fpp->common;
          memset (&dtp.u.p, 0, sizeof (dtp.u.p));
          dtp.u.p.current_unit = u;
          next_record (&dtp, 1);

that might be a mistake though - maybe it's enough to change that
to a heap allocation?  It might also be totally superfluous, since
only 'u' should matter here ... (not sure if the above is the case
you are running into).

Richard.



* Re: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
  2022-11-11 14:12   ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Thomas Schwinge
  2022-11-11 14:35     ` Richard Biener
@ 2022-11-11 14:38     ` Janne Blomqvist
  1 sibling, 0 replies; 11+ messages in thread
From: Janne Blomqvist @ 2022-11-11 14:38 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: fortran, gcc, Tom de Vries, Alexander Monakov

On Fri, Nov 11, 2022 at 4:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
> For example, for Fortran code like:
>
>     write (*,*) "Hello world"
>
> ..., 'gfortran' creates:

> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
> really! -- there's a lot of state in Fortran I/O apparently).

> Any other clever ideas?

There are a lot of potential options to set during Fortran I/O, but in
the vast majority of cases only a few are used. So a better library
interface would be to transfer only those options that are used, and
then let the full set of options live in heap memory managed by
libgfortran. Say some kind of simple byte-code format, with an
'opcode' saying which option it is, followed by the value.
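
A rough C sketch of what such an opcode/value encoding could look like.
Everything here is hypothetical: the opcodes, the 'io_options' struct and
the 'io_emit'/'io_decode' helpers are invented for illustration and are not
part of libgfortran.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical option codes; not libgfortran's.  */
enum io_opcode { IO_OP_END = 0, IO_OP_UNIT = 1, IO_OP_LINE = 2, IO_OP_FLAGS = 3 };

/* Full option set, as it might live inside the library.  */
struct io_options { int32_t unit, line, flags; };

/* Caller side: append one (opcode, 32-bit value) pair to BUF at offset LEN
   and return the new length.  BUF must have room for 5 more bytes.  */
static size_t
io_emit (uint8_t *buf, size_t len, enum io_opcode op, int32_t value)
{
  buf[len++] = (uint8_t) op;
  memcpy (buf + len, &value, sizeof value);
  return len + sizeof value;
}

/* Library side: apply all pairs up to IO_OP_END to OPTS.  Options that are
   not mentioned keep whatever default OPTS already holds.  Returns 0 on
   success, -1 on an unknown opcode.  */
static int
io_decode (const uint8_t *buf, struct io_options *opts)
{
  assert (buf != NULL && opts != NULL);
  size_t pos = 0;
  for (;;)
    {
      uint8_t op = buf[pos++];
      if (op == IO_OP_END)
        return 0;
      int32_t value;
      memcpy (&value, buf + pos, sizeof value);
      pos += sizeof value;
      switch (op)
        {
        case IO_OP_UNIT:  opts->unit = value;  break;
        case IO_OP_LINE:  opts->line = value;  break;
        case IO_OP_FLAGS: opts->flags = value; break;
        default:          return -1;
        }
    }
}
```

In this sketch, a 'write (*,*)' that sets only unit and line would pass
11 bytes instead of a half-KiB structure.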

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48419 for some
rough ideas in this direction, although I'm not personally working on
GFortran at this time so somebody else would have to pick it up.

-- 
Janne Blomqvist


* nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
  2022-11-11 14:35     ` Richard Biener
@ 2022-12-23 14:08       ` Thomas Schwinge
  2022-12-23 21:23         ` Jerry D
  2023-01-11 12:06         ` [PING] " Thomas Schwinge
  0 siblings, 2 replies; 11+ messages in thread
From: Thomas Schwinge @ 2022-12-23 14:08 UTC (permalink / raw)
  To: Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 7524 bytes --]

Hi!

On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>> [...]
>>
>> I see a few options:
>>
>> (a) Figure out what it is in the libgfortran I/O implementation that
>> causes "Stack size [...] cannot be statically determined", and re-work
>> that code to avoid that, or even disable certain things for nvptx, if
>> feasible.

> Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
> bloat is from things that are unused for simpler I/O cases (so some
> "inheritance" could help), and lots of the bloat is from using
> string/length pairs using char * + size_t for what looks like could be
> encoded a lot more efficiently.
>
> There's probably not much low-hanging fruit.

(Similar comments in Janne's email.)


Well, as was to be expected, libgfortran I/O is really just one example;
the underlying problem may also be triggered in other ways (via other
newlib/libc functions, for example).

So, really a generic solution seems to be called for.

>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>> I don't really want to do that however: it does introduce a bit of
>> complexity in all the generated device code and run-time overhead that we
>> generally would like to avoid.

Directly using '-msoft-stack' isn't actually possible: it implements
"one stack per 32-thread warp", but for OpenACC we need "one stack per
thread of a warp" (that is, each OpenACC 'vector' independently), and
pre-allocating all those stacks (which may be a lot!) from device memory
would, I suspect, really hurt overall performance.
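
For a feel of the difference in pre-allocated memory, a small C sketch
comparing the two layouts.  The helper names are hypothetical, and the
thread count and stack size used below are example inputs only, not
measurements.

```c
#include <stdint.h>

enum { WARP_SIZE = 32 };   /* Threads per warp, as stated above.  */

/* '-msoft-stack' style: one stack per warp (partial warps rounded up).  */
static uint64_t
per_warp_stack_bytes (uint64_t threads, uint64_t stack_bytes)
{
  return (threads + WARP_SIZE - 1) / WARP_SIZE * stack_bytes;
}

/* What OpenACC 'vector' execution would need: one stack per thread.  */
static uint64_t
per_thread_stack_bytes (uint64_t threads, uint64_t stack_bytes)
{
  return threads * stack_bytes;
}
```

For example, with 1024 threads and 1 KiB per stack, the per-warp layout
needs 32 KiB, whereas the per-thread layout needs 1 MiB: a factor of 32.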

>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>> stack objects into heap allocation (during nvptx offloading compilation).
>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>> code paths this is to affect.  (Might also add some compile-time
>> diagnostic, of course.)  Could maybe even limit this to only be used
>> during libgfortran compilation?  This is then conceptually a bit similar
>> to (b), but localized to relevant parts only.  Has such a thing been done
>> before in GCC, that I could build upon?
>>
>> Any other clever ideas?

> Converting to heap allocation is difficult outside of the frontend and you
> have to be very careful with memleaks.

Heh, in fact it seems to be pretty simple!  (Famous last words?)  See
"[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
attached.  What do people think about such a thing?

Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
'-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
default value for '-mframe-malloc-threshold=[...]' (potentially different
for GCC/nvptx target libraries build vs. user-compiled code?), etc.
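
For discussing the default value, the decision the attached WIP patch makes
in 'init_frame' boils down to the following C sketch.  The function name
'frame_uses_malloc' is made up; the default of 257 is the patch's
'NVPTX_FRAME_MALLOC_THRESHOLD_INIT'.  The sketch leaves out the patch's
additional check that the register is the frame pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default threshold in the WIP patch (NVPTX_FRAME_MALLOC_THRESHOLD_INIT).  */
#define FRAME_MALLOC_THRESHOLD_DEFAULT 257

/* Return true if a frame of FRAME_SIZE bytes would be allocated via
   'malloc' instead of '.local' memory.  Note the '>=': with the default,
   frames of up to 256 bytes stay in '.local' memory.  */
static bool
frame_uses_malloc (uint64_t frame_size, uint64_t threshold)
{
  return frame_size >= threshold;
}
```

So a 1234-byte frame switches to 'malloc' under the default, and a
threshold of 'UINT64_MAX' effectively disables the switch for any
realistic frame size.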


> The library is written in C and
> I see heap allocated temporaries there but in at least one
> place a stack one is used:
>
> void
> st_endfile (st_parameter_filepos *fpp)
> {
> ...
>       if (u->current_record)
>         {
>           st_parameter_dt dtp;
>           dtp.common = fpp->common;
>           memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>           dtp.u.p.current_unit = u;
>           next_record (&dtp, 1);
>
> that might be a mistake though - maybe it's enough to change that
> to a heap allocation?  It might be also totally superfluous since
> only 'u' should matter here ... (not sure if the above is the case
> you are running into).

(Have not yet looked into that; won't solve the general issue.)


Grüße
 Thomas



[-- Attachment #2: 0001-WIP-nvptx-mframe-malloc-threshold-Wframe-malloc-thre.patch --]
[-- Type: text/x-diff, Size: 16585 bytes --]

From 3f5524adacff23710cf1cab393a56bf23853cafa Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Dec 2022 21:25:19 +0100
Subject: [PATCH] [WIP] nvptx: '-mframe-malloc-threshold',
 '-Wframe-malloc-threshold'

---
 gcc/config/nvptx/nvptx.cc                     | 102 ++++++++++++++++--
 gcc/config/nvptx/nvptx.h                      |   3 +
 gcc/config/nvptx/nvptx.opt                    |  12 +++
 gcc/doc/invoke.texi                           |  16 ++-
 .../nvptx/frame-malloc-threshold-1.c          |  29 +++++
 .../nvptx/frame-malloc-threshold-2.c          |  13 +++
 .../nvptx/frame-malloc-threshold-3.c          |  14 +++
 .../nvptx/frame-malloc-threshold-4.c          |  16 +++
 .../nvptx/frame-malloc-threshold-5.c          |  15 +++
 .../nvptx/frame-malloc-threshold-6.c          |  15 +++
 .../nvptx/frame-malloc-threshold-7.c          |  15 +++
 11 files changed, 240 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index b93a253ab318..2efd70595991 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -178,6 +178,16 @@ static hash_map<tree_decl_hash, unsigned int> gang_private_shared_hmap;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function 'has_malloc_frame'.
+   Because of 'nvptx_name_replacement', we can't just:
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_FREE));
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_MALLOC));
+   ..., but instead have to track them individually.
+*/
+static bool need_free_malloc_decl;
+static bool have_free_decl;
+static bool have_malloc_decl;
+
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 static bool have_softstack_decl;
@@ -976,6 +986,11 @@ write_fn_marker (std::stringstream &s, bool is_defn, bool globalize,
     s << " GLOBAL";
   s << " FUNCTION " << (is_defn ? "DEF: " : "DECL: ");
   s << name << "\n";
+
+  if (strcmp (name, "free") == 0)
+    have_free_decl = true;
+  else if (strcmp (name, "malloc") == 0)
+    have_malloc_decl = true;
 }
 
 /* Emit a linker marker for a variable decl or defn.  */
@@ -1231,22 +1246,66 @@ nvptx_maybe_record_fnsym (rtx sym)
     nvptx_record_needed_fndecl (decl);
 }
 
+//TODO
 /* Emit a local array to hold some part of a conventional stack frame
    and initialize REGNO to point to it.  If the size is zero, it'll
    never be valid to dereference, so we can simply initialize to
    zero.  */
 
 static void
-init_frame (FILE  *file, int regno, unsigned align, unsigned size)
+init_frame (FILE *file, int regno, int align, HOST_WIDE_INT size)
 {
-  if (size)
-    fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n",
-	     align, reg_names[regno], size);
   fprintf (file, "\t.reg.u%d %s;\n",
 	   POINTER_SIZE, reg_names[regno]);
-  fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
-		  :  "\tmov.u%d %s, 0;\n"),
-	   POINTER_SIZE, reg_names[regno], reg_names[regno]);
+
+  if (regno == FRAME_POINTER_REGNUM
+      && ((unsigned HOST_WIDE_INT) size
+	  >= (unsigned HOST_WIDE_INT) nvptx_frame_malloc_threshold))
+    {
+      warning_at (DECL_SOURCE_LOCATION (current_function_decl),
+		  OPT_Wframe_malloc_threshold,
+		  "using %<malloc%> for frame with size of %wu bytes", size);
+
+      /* <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) states that in addition to the "in-kernel
+	 'malloc()' function" there also exists an "in-kernel
+	 '__nv_aligned_device_malloc()' function", where "the address of the
+	 allocated memory will be a multiple of 'align'".  However that's not
+	 documented on
+	 <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0), so we shall not use that function.  */
+      /* <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0) does not, but
+	 <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) does state that the pointer returned by
+	 "in-kernel 'malloc()' [...] is guaranteed to be aligned to a
+	 16-byte boundary".  */
+      if (align > 16)
+	sorry ("unfulfilled %d bytes alignment for frame", align);
+
+      /* We don't need to support 'realloc', so instead of newlib 'malloc'
+	 directly use the PTX 'malloc'.  */
+      fprintf (file,
+	       "\t{\n"
+	       "\t  .param .u64 %%ptr;\n"
+	       "\t  .param .u64 %%size;\n"
+	       "\t  st.param.u64 [%%size], " HOST_WIDE_INT_PRINT_DEC ";\n"
+	       "\t  call (%%ptr), malloc, (%%size);\n"
+	       "\t  ld.param.u64 %s, [%%ptr];\n"
+	       "\t}\n",
+	       size, reg_names[regno]);
+      cfun->machine->has_malloc_frame = true;
+      need_free_malloc_decl = true;
+    }
+  else
+    {
+      if (size)
+	fprintf (file, "\t.local .align %d .b8 %s_ar[" HOST_WIDE_INT_PRINT_DEC "];\n",
+		 align, reg_names[regno], size);
+      fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
+		      :  "\tmov.u%d %s, 0;\n"),
+	       POINTER_SIZE, reg_names[regno], reg_names[regno]);
+    }
 }
 
 /* Emit soft stack frame setup sequence.  */
@@ -1744,12 +1803,22 @@ nvptx_output_set_softstack (unsigned src_regno)
     }
   return "";
 }
+
 /* Output a return instruction.  Also copy the return value to its outgoing
    location.  */
 
 const char *
 nvptx_output_return (void)
 {
+  if (cfun->machine->has_malloc_frame)
+    fprintf (asm_out_file,
+	     "\t{\n"
+	     "\t  .param .u64 %%ptr;\n"
+	     "\t  st.param.u64 [%%ptr], %s;\n"
+	     "\t  call free, (%%ptr);\n"
+	     "\t}\n",
+	     reg_names[FRAME_POINTER_REGNUM]);
+
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
   if (mode != VOIDmode)
@@ -4470,8 +4539,8 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
       rtx_code_label *label = NULL;
 
       empty = false;
-      /* The frame size might not be DImode compatible, but the frame
-	 array's declaration will be.  So it's ok to round up here.  */
+      /* The frame size might not be DImode-compatible, but the actual frame
+	 allocated by 'init_frame' will be.  So it's ok to round up here.  */
       fs = (fs + GET_MODE_SIZE (DImode) - 1) / GET_MODE_SIZE (DImode);
       /* Detect single iteration loop. */
       if (fs == 1)
@@ -5989,6 +6058,21 @@ write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 static void
 nvptx_file_end (void)
 {
+  if (need_free_malloc_decl)
+    {
+      if (!have_free_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "free");
+	  func_decls << ".extern .func free (.param .b64 %ptr);\n";
+	}
+      if (!have_malloc_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "malloc");
+	  func_decls
+	    << ".extern .func (.param .b64 %ptr) malloc (.param .b64 %size);\n";
+	}
+    }
+
   hash_table<tree_hasher>::iterator iter;
   tree decl;
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index bc1021a80317..82d695551090 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -214,6 +214,8 @@ struct nvptx_args {
 
 #define TRAMPOLINE_SIZE 32
 #define TRAMPOLINE_ALIGNMENT 256
+
+#define NVPTX_FRAME_MALLOC_THRESHOLD_INIT 257
 \f
 /* We don't run reload, so this isn't actually used, but it still needs to be
    defined.  Showing an argp->fp elimination also stops
@@ -244,6 +246,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool has_malloc_frame;
   bool has_softstack; /* Current function has a soft stack frame.  */
   bool has_simtreg; /* Current function has an OpenMP SIMD region.  */
   int num_args;	/* Number of args of current call.  */
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 71d3b68510bd..6ccd3defc776 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -28,6 +28,18 @@ Target RejectNegative Mask(ABI64)
 Ignored, but preserved for backward compatibility.  Only 64-bit ABI is
 supported.
 
+mframe-malloc-threshold=
+Target Joined RejectNegative Host_Wide_Int ByteSize Var(nvptx_frame_malloc_threshold) Init(NVPTX_FRAME_MALLOC_THRESHOLD_INIT)
+-mframe-malloc-threshold=<byte-size>	When the frame size exceeds <byte-size>, frame allocation switches from '.local' memory to 'malloc'.
+
+mno-frame-malloc-threshold
+Target Alias(mframe-malloc-threshold=,18446744073709551615EiB,none)
+Always use '.local' memory for frame allocation.  Equivalent to -mframe-malloc-threshold=<SIZE_MAX> or larger.
+
+Wframe-malloc-threshold
+Target Warning
+Warn when the threshold is reached where frame allocation switches from '.local' memory to 'malloc'.
+
 mmainkernel
 Target RejectNegative
 Link in code for a __main kernel.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 471309dfacfe..e3b6ea0fe4b8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1179,7 +1179,9 @@ Objective-C and Objective-C++ Dialects}.
 -march=@var{arch}  -mbmx  -mno-bmx  -mcdx  -mno-cdx}
 
 @emph{Nvidia PTX Options}
-@gccoptlist{-m64  -mmainkernel  -moptimize}
+@gccoptlist{-m64 @gol
+-mframe-malloc-threshold=@var{byte-size} @gol
+-mmainkernel  -moptimize}
 
 @emph{OpenRISC Options}
 @gccoptlist{-mboard=@var{name}  -mnewlib  -mhard-mul  -mhard-div @gol
@@ -28367,6 +28369,18 @@ This option sets the values of the preprocessor macros
 for instance, for @samp{3.1} the macros have the values @samp{3} and
 @samp{1}, respectively.
 
+@item -mframe-malloc-threshold=@var{byte-size}
+@opindex mframe-malloc-threshold=
+@opindex mno-frame-malloc-threshold
+TODO
+
+This is not relevant if @code{-msoft-stack} is enabled.
+
+@option{-mframe-malloc-threshold=TODO} is enabled by default.
+This may be disabled either by specifying
+@var{byte-size} of @samp{SIZE_MAX} or more or by
+@option{-mno-frame-malloc-threshold}.
+
 @item -mmainkernel
 @opindex mmainkernel
 Link in code for a __main kernel.  This is for stand-alone instead of
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
new file mode 100644
index 000000000000..b16c17bfdf99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
@@ -0,0 +1,29 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+/* PTX-provided 'free', 'malloc'; cf. 'nvptx_name_replacement'.  */
+void ptx_free (void *) __asm__ ("free");
+void *ptx_malloc (__SIZE_TYPE__) __asm__ ("malloc");
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[1234];
+
+  ptx_malloc (5);
+
+  ptx_free (ptx_malloc (1));
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 2 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 3 } }
+*/
+
+/* Of the implicit (via 'need_free_malloc_decl') and explicit declarations of
+   'free', 'malloc', only one is emitted each:
+   { dg-final { scan-assembler-times {(?n)\.extern .* free .*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)\.extern .* malloc .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
new file mode 100644
index 000000000000..2f6a919eb1f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
new file mode 100644
index 000000000000..7434132b2ad5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[256];
+}
+
+/* We don't exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
new file mode 100644
index 000000000000..c4068ab7ad23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=32 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[32];
+}
+
+/* We exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
new file mode 100644
index 000000000000..cc262427b03c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=1249 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
new file mode 100644
index 000000000000..72017ca2f439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=2KiB } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
new file mode 100644
index 000000000000..b2f85a55f050
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mno-frame-malloc-threshold } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We'll never exceed the specified unlimited '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
-- 
2.35.1



* Re: nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
  2022-12-23 14:08       ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge
@ 2022-12-23 21:23         ` Jerry D
  2023-01-11 12:06         ` [PING] " Thomas Schwinge
  1 sibling, 0 replies; 11+ messages in thread
From: Jerry D @ 2022-12-23 21:23 UTC (permalink / raw)
  To: Thomas Schwinge, Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

On 12/23/22 6:08 AM, Thomas Schwinge wrote:
> Hi!
> 
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>>> For example, for Fortran code like:
>>>
>>>      write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>      struct __st_parameter_dt dt_parm.0;
>>>
>>>      try
>>>        {
>>>          dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>          dt_parm.0.common.line = 29;
>>>          dt_parm.0.common.flags = 128;
>>>          dt_parm.0.common.unit = 6;
>>>          _gfortran_st_write (&dt_parm.0);
>>>          _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>          _gfortran_st_write_done (&dt_parm.0);
>>>        }
>>>      finally
>>>        {
>>>          dt_parm.0 = {CLOBBER(eol)};
>>>        }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>      warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem: if the GPU kernel's stack usage still
>>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.

There are so many twists, turns, corner cases, and similar nightmares in
I/O that I would advise against trying to reduce the dt_parm.  It could
probably be done, though.

For debugging on the GPU, would it not be better to have a way to signal
back to a main thread and do the print from there, like some sort of
callback in the user's code under test?

Putting this another way: recommend that users debug with a method other
than embedded print statements, rather than doing a ton of work to enable
something that is not really a legitimate use case.
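
A minimal sketch of such a callback-style scheme, in plain host-side C
rather than real offloading code.  The names 'debug_log',
'debug_log_record' and 'debug_log_drain' are made up for illustration,
and a real device-side version would need an atomic update of the
counter plus a buffer that the host can read back:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define DEBUG_LOG_CAPACITY 64

/* One fixed-size record: no strings and no formatting state, so the
   code under test needs almost no stack to report a value.  */
struct debug_record
{
  int tag;     /* Which probe in the code under test fired.  */
  long value;  /* The value to report.  */
};

struct debug_log
{
  size_t count;
  struct debug_record records[DEBUG_LOG_CAPACITY];
};

/* "Kernel" side (hypothetical): store one record; returns false if the
   log is full.  Not thread-safe as written.  */
static bool
debug_log_record (struct debug_log *dlog, int tag, long value)
{
  if (dlog->count >= DEBUG_LOG_CAPACITY)
    return false;
  dlog->records[dlog->count].tag = tag;
  dlog->records[dlog->count].value = value;
  dlog->count++;
  return true;
}

/* Main-thread side (hypothetical): print and clear all records; returns
   how many records were printed.  */
static size_t
debug_log_drain (struct debug_log *dlog, FILE *out)
{
  size_t n = dlog->count;
  for (size_t i = 0; i < n; i++)
    fprintf (out, "probe %d: %ld\n",
             dlog->records[i].tag, dlog->records[i].value);
  dlog->count = 0;
  return n;
}
```

The point of the design is that all formatting happens on the host, so
no large I/O state object ever lands on a per-thread GPU stack.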

FWIW,

Jerry


* [PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
  2022-12-23 14:08       ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge
  2022-12-23 21:23         ` Jerry D
@ 2023-01-11 12:06         ` Thomas Schwinge
  2023-01-12  2:46           ` Jerry D
  1 sibling, 1 reply; 11+ messages in thread
From: Thomas Schwinge @ 2023-01-11 12:06 UTC (permalink / raw)
  To: Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 8377 bytes --]

Hi!

Ping -- the '-mframe-malloc-threshold' idea, at least.

Note that while this issue originally did pop up for Fortran I/O, it's
likewise relevant for other functions that maintain big frames, for
example in newlib:

    libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
    libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];

Therefore a generic solution (or workaround, if you'd like) does seem
appropriate.
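
To illustrate how the newlib frames listed above relate to the proposed
threshold, here is a small sketch in plain C.  It only mirrors the size
comparison done in the WIP patch's 'init_frame' ('size >= threshold',
default 257) and ignores the frame-pointer register check; the helper
names 'frame_uses_malloc' and 'count_malloc_frames' are hypothetical and
not part of the patch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Default threshold in the WIP patch (NVPTX_FRAME_MALLOC_THRESHOLD_INIT).  */
#define FRAME_MALLOC_THRESHOLD_DEFAULT 257

/* Same comparison as in the patch's 'init_frame': a frame whose size
   reaches the threshold is allocated via 'malloc' instead of '.local'.  */
static bool
frame_uses_malloc (uint64_t frame_size, uint64_t threshold)
{
  return frame_size >= threshold;
}

/* Count how many of the given frames would be switched to 'malloc'.  */
static size_t
count_malloc_frames (const uint64_t *frame_sizes, size_t n,
                     uint64_t threshold)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (frame_uses_malloc (frame_sizes[i], threshold))
      count++;
  return count;
}
```

With the default threshold, all four frames above (2064, 2064, 2064 and
560 bytes) would be switched to 'malloc'; with a threshold of 1024, only
the three 2064-byte ones would.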


Regards
 Thomas


On 2022-12-23T15:08:06+0100, I wrote:
> Hi!
>
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>>> For example, for Fortran code like:
>>>
>>>     write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>     struct __st_parameter_dt dt_parm.0;
>>>
>>>     try
>>>       {
>>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>         dt_parm.0.common.line = 29;
>>>         dt_parm.0.common.flags = 128;
>>>         dt_parm.0.common.unit = 6;
>>>         _gfortran_st_write (&dt_parm.0);
>>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>         _gfortran_st_write_done (&dt_parm.0);
>>>       }
>>>     finally
>>>       {
>>>         dt_parm.0 = {CLOBBER(eol)};
>>>       }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem: if the GPU kernel's stack usage still
>>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.
>>>
>>> There is a way to manually set a per-thread stack size, but it's not
>>> obvious which size to set: that size needs to work for the whole GPU
>>> kernel, and should be as low as possible (to maximize parallelism).
>>> I assume that even if GCC did an accurate call graph analysis of the GPU
>>> kernel's maximum stack usage, that still wouldn't help: that's before the
>>> PTX JIT does its own code transformations, including stack spilling.
>>>
>>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
>>> (-dlto) for device code".  This might help, assuming that it manages to
>>> simplify the libgfortran I/O code such that the PTX JIT then understands
>>> the call graph.  But: that's available only starting with recent
>>> CUDA 11.4, so not a general solution -- if it works at all, which I've
>>> not tested.
>>>
>>> Similarly, we could enable GCC's LTO for device code generation -- but
>>> that's a big project, out of scope at this time.  And again, we don't
>>> know if that at all helps this case.
>>>
>>> I see a few options:
>>>
>>> (a) Figure out what it is in the libgfortran I/O implementation that
>>> causes "Stack size [...] cannot be statically determined", and re-work
>>> that code to avoid that, or even disable certain things for nvptx, if
>>> feasible.
>
>> Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
>> bloat is from things that are unused for simpler I/O cases (so some
>> "inheritance" could help), and lots of the bloat is from using
>> string/length pairs (char * + size_t) for what looks like it could be
>> encoded a lot more efficiently.
>>
>> There's probably not much low-hanging fruit.
>
> (Similar comments in Janne's email.)
>
>
> Well, as had to be expected, libgfortran I/O is really just one example,
> but the underlying problem may also be triggered in other ways (via other
> newlib/libc functions, for example).
>
> So, really a generic solution seems to be called for.
>
>>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>>> I don't really want to do that however: it does introduce a bit of
>>> complexity in all the generated device code and run-time overhead that we
>>> generally would like to avoid.
>
> Directly using '-msoft-stack' isn't actually possible: it implements
> "one stack per 32-thread warp", but for OpenACC we need "one stack per
> thread of a warp" (that is, each OpenACC 'vector' independently), and I
> expect that pre-allocating all those stacks (which may be a lot!) from
> device memory would really hurt overall performance.
>
>>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>>> stack objects into heap allocation (during nvptx offloading compilation).
>>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>>> code paths this is to affect.  (Might also add some compile-time
>>> diagnostic, of course.)  Could maybe even limit this to only be used
>>> during libgfortran compilation?  This is then conceptually a bit similar
>>> to (b), but localized to relevant parts only.  Has such a thing been done
>>> before in GCC, that I could build upon?
>>>
>>> Any other clever ideas?
>
>> Converting to heap allocation is difficult outside of the frontend and you
>> have to be very careful with memleaks.
>
> Heh, in fact it seems to be pretty simple!  (Famous last words?)  See
> "[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> attached.  What do people think about such a thing?
>
> Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
> '-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
> default value for '-mframe-malloc-threshold=[...]' (potentially different
> for GCC/nvptx target libraries build vs. user-compiled code?), etc.
>
>
>> The library is written in C and
>> I see heap allocated temporaries there but in at least one
>> place a stack one is used:
>>
>> void
>> st_endfile (st_parameter_filepos *fpp)
>> {
>> ...
>>       if (u->current_record)
>>         {
>>           st_parameter_dt dtp;
>>           dtp.common = fpp->common;
>>           memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>>           dtp.u.p.current_unit = u;
>>           next_record (&dtp, 1);
>>
>> that might be a mistake though - maybe it's enough to change that
>> to a heap allocation?  It might be also totally superfluous since
>> only 'u' should matter here ... (not sure if the above is the case
>> you are running into).
>
> (Have not yet looked into that; won't solve the general issue.)
>
>
> Regards
>  Thomas



[-- Attachment #2: 0001-WIP-nvptx-mframe-malloc-threshold-Wframe-malloc-thre.patch --]
[-- Type: text/x-diff, Size: 16585 bytes --]

From 3f5524adacff23710cf1cab393a56bf23853cafa Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Dec 2022 21:25:19 +0100
Subject: [PATCH] [WIP] nvptx: '-mframe-malloc-threshold',
 '-Wframe-malloc-threshold'

---
 gcc/config/nvptx/nvptx.cc                     | 102 ++++++++++++++++--
 gcc/config/nvptx/nvptx.h                      |   3 +
 gcc/config/nvptx/nvptx.opt                    |  12 +++
 gcc/doc/invoke.texi                           |  16 ++-
 .../nvptx/frame-malloc-threshold-1.c          |  29 +++++
 .../nvptx/frame-malloc-threshold-2.c          |  13 +++
 .../nvptx/frame-malloc-threshold-3.c          |  14 +++
 .../nvptx/frame-malloc-threshold-4.c          |  16 +++
 .../nvptx/frame-malloc-threshold-5.c          |  15 +++
 .../nvptx/frame-malloc-threshold-6.c          |  15 +++
 .../nvptx/frame-malloc-threshold-7.c          |  15 +++
 11 files changed, 240 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index b93a253ab318..2efd70595991 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -178,6 +178,16 @@ static hash_map<tree_decl_hash, unsigned int> gang_private_shared_hmap;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function 'has_malloc_frame'.
+   Because of 'nvptx_name_replacement', we can't just:
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_FREE));
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_MALLOC));
+   ..., but instead have to track them individually.
+*/
+static bool need_free_malloc_decl;
+static bool have_free_decl;
+static bool have_malloc_decl;
+
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 static bool have_softstack_decl;
@@ -976,6 +986,11 @@ write_fn_marker (std::stringstream &s, bool is_defn, bool globalize,
     s << " GLOBAL";
   s << " FUNCTION " << (is_defn ? "DEF: " : "DECL: ");
   s << name << "\n";
+
+  if (strcmp (name, "free") == 0)
+    have_free_decl = true;
+  else if (strcmp (name, "malloc") == 0)
+    have_malloc_decl = true;
 }
 
 /* Emit a linker marker for a variable decl or defn.  */
@@ -1231,22 +1246,66 @@ nvptx_maybe_record_fnsym (rtx sym)
     nvptx_record_needed_fndecl (decl);
 }
 
+//TODO
 /* Emit a local array to hold some part of a conventional stack frame
    and initialize REGNO to point to it.  If the size is zero, it'll
    never be valid to dereference, so we can simply initialize to
    zero.  */
 
 static void
-init_frame (FILE  *file, int regno, unsigned align, unsigned size)
+init_frame (FILE *file, int regno, int align, HOST_WIDE_INT size)
 {
-  if (size)
-    fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n",
-	     align, reg_names[regno], size);
   fprintf (file, "\t.reg.u%d %s;\n",
 	   POINTER_SIZE, reg_names[regno]);
-  fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
-		  :  "\tmov.u%d %s, 0;\n"),
-	   POINTER_SIZE, reg_names[regno], reg_names[regno]);
+
+  if (regno == FRAME_POINTER_REGNUM
+      && ((unsigned HOST_WIDE_INT) size
+	  >= (unsigned HOST_WIDE_INT) nvptx_frame_malloc_threshold))
+    {
+      warning_at (DECL_SOURCE_LOCATION (current_function_decl),
+		  OPT_Wframe_malloc_threshold,
+		  "using %<malloc%> for frame with size of %wu bytes", size);
+
+      /* <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) states that in addition to the "in-kernel
+	 'malloc()' function" there also exists an "in-kernel
+	 '__nv_aligned_device_malloc()' function", where "the address of the
+	 allocated memory will be a multiple of 'align'".  However that's not
+	 documented on
+	 <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0), so we shall not use that function.  */
+      /* <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls>
+	 (2022-12-21, v12.0) does not, but
+	 <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations>
+	 (2022-12-21, v12.0) does state that the pointer returned by
+	 "in-kernel 'malloc()' [...] is guaranteed to be aligned to a
+	 16-byte boundary".  */
+      if (align > 16)
+	sorry ("unfulfilled %d bytes alignment for frame", align);
+
+      /* We don't need to support 'realloc', so instead of newlib 'malloc'
+	 directly use the PTX 'malloc'.  */
+      fprintf (file,
+	       "\t{\n"
+	       "\t  .param .u64 %%ptr;\n"
+	       "\t  .param .u64 %%size;\n"
+	       "\t  st.param.u64 [%%size], " HOST_WIDE_INT_PRINT_DEC ";\n"
+	       "\t  call (%%ptr), malloc, (%%size);\n"
+	       "\t  ld.param.u64 %s, [%%ptr];\n"
+	       "\t}\n",
+	       size, reg_names[regno]);
+      cfun->machine->has_malloc_frame = true;
+      need_free_malloc_decl = true;
+    }
+  else
+    {
+      if (size)
+	fprintf (file, "\t.local .align %d .b8 %s_ar[" HOST_WIDE_INT_PRINT_DEC "];\n",
+		 align, reg_names[regno], size);
+      fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
+		      :  "\tmov.u%d %s, 0;\n"),
+	       POINTER_SIZE, reg_names[regno], reg_names[regno]);
+    }
 }
 
 /* Emit soft stack frame setup sequence.  */
@@ -1744,12 +1803,22 @@ nvptx_output_set_softstack (unsigned src_regno)
     }
   return "";
 }
+
 /* Output a return instruction.  Also copy the return value to its outgoing
    location.  */
 
 const char *
 nvptx_output_return (void)
 {
+  if (cfun->machine->has_malloc_frame)
+    fprintf (asm_out_file,
+	     "\t{\n"
+	     "\t  .param .u64 %%ptr;\n"
+	     "\t  st.param.u64 [%%ptr], %s;\n"
+	     "\t  call free, (%%ptr);\n"
+	     "\t}\n",
+	     reg_names[FRAME_POINTER_REGNUM]);
+
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
   if (mode != VOIDmode)
@@ -4470,8 +4539,8 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
       rtx_code_label *label = NULL;
 
       empty = false;
-      /* The frame size might not be DImode compatible, but the frame
-	 array's declaration will be.  So it's ok to round up here.  */
+      /* The frame size might not be DImode-compatible, but the actual frame
+	 allocated by 'init_frame' will be.  So it's ok to round up here.  */
       fs = (fs + GET_MODE_SIZE (DImode) - 1) / GET_MODE_SIZE (DImode);
       /* Detect single iteration loop. */
       if (fs == 1)
@@ -5989,6 +6058,21 @@ write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 static void
 nvptx_file_end (void)
 {
+  if (need_free_malloc_decl)
+    {
+      if (!have_free_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "free");
+	  func_decls << ".extern .func free (.param .b64 %ptr);\n";
+	}
+      if (!have_malloc_decl)
+	{
+	  write_fn_marker (func_decls, false, true, "malloc");
+	  func_decls
+	    << ".extern .func (.param .b64 %ptr) malloc (.param .b64 %size);\n";
+	}
+    }
+
   hash_table<tree_hasher>::iterator iter;
   tree decl;
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index bc1021a80317..82d695551090 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -214,6 +214,8 @@ struct nvptx_args {
 
 #define TRAMPOLINE_SIZE 32
 #define TRAMPOLINE_ALIGNMENT 256
+
+#define NVPTX_FRAME_MALLOC_THRESHOLD_INIT 257
 \f
 /* We don't run reload, so this isn't actually used, but it still needs to be
    defined.  Showing an argp->fp elimination also stops
@@ -244,6 +246,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool has_malloc_frame;
   bool has_softstack; /* Current function has a soft stack frame.  */
   bool has_simtreg; /* Current function has an OpenMP SIMD region.  */
   int num_args;	/* Number of args of current call.  */
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 71d3b68510bd..6ccd3defc776 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -28,6 +28,18 @@ Target RejectNegative Mask(ABI64)
 Ignored, but preserved for backward compatibility.  Only 64-bit ABI is
 supported.
 
+mframe-malloc-threshold=
+Target Joined RejectNegative Host_Wide_Int ByteSize Var(nvptx_frame_malloc_threshold) Init(NVPTX_FRAME_MALLOC_THRESHOLD_INIT)
+-mframe-malloc-threshold=<byte-size>	When the frame size reaches <byte-size>, frame allocation switches from '.local' memory to 'malloc'.
+
+mno-frame-malloc-threshold
+Target Alias(mframe-malloc-threshold=,18446744073709551615EiB,none)
+Always use '.local' memory for frame allocation.  Equivalent to -mframe-malloc-threshold=<SIZE_MAX> or larger.
+
+Wframe-malloc-threshold
+Target Warning
+Warn when the threshold is reached where frame allocation switches from '.local' memory to 'malloc'.
+
 mmainkernel
 Target RejectNegative
 Link in code for a __main kernel.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 471309dfacfe..e3b6ea0fe4b8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1179,7 +1179,9 @@ Objective-C and Objective-C++ Dialects}.
 -march=@var{arch}  -mbmx  -mno-bmx  -mcdx  -mno-cdx}
 
 @emph{Nvidia PTX Options}
-@gccoptlist{-m64  -mmainkernel  -moptimize}
+@gccoptlist{-m64 @gol
+-mframe-malloc-threshold=@var{byte-size} @gol
+-mmainkernel  -moptimize}
 
 @emph{OpenRISC Options}
 @gccoptlist{-mboard=@var{name}  -mnewlib  -mhard-mul  -mhard-div @gol
@@ -28367,6 +28369,18 @@ This option sets the values of the preprocessor macros
 for instance, for @samp{3.1} the macros have the values @samp{3} and
 @samp{1}, respectively.
 
+@item -mframe-malloc-threshold=@var{byte-size}
+@opindex mframe-malloc-threshold=
+@opindex mno-frame-malloc-threshold
+TODO
+
+This is not relevant if @code{-msoft-stack} is enabled.
+
+@option{-mframe-malloc-threshold=TODO} is enabled by default.
+This may be disabled either by specifying
+@var{byte-size} of @samp{SIZE_MAX} or more or by
+@option{-mno-frame-malloc-threshold}.
+
 @item -mmainkernel
 @opindex mmainkernel
 Link in code for a __main kernel.  This is for stand-alone instead of
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
new file mode 100644
index 000000000000..b16c17bfdf99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
@@ -0,0 +1,29 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+/* PTX-provided 'free', 'malloc'; cf. 'nvptx_name_replacement'.  */
+void ptx_free (void *) __asm__ ("free");
+void *ptx_malloc (__SIZE_TYPE__) __asm__ ("malloc");
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[1234];
+
+  ptx_malloc (5);
+
+  ptx_free (ptx_malloc (1));
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 2 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 3 } }
+*/
+
+/* Of the implicit (via 'need_free_malloc_decl') and explicit declarations of
+   'free', 'malloc', only one is emitted each:
+   { dg-final { scan-assembler-times {(?n)\.extern .* free .*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)\.extern .* malloc .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
new file mode 100644
index 000000000000..2f6a919eb1f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
new file mode 100644
index 000000000000..7434132b2ad5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[256];
+}
+
+/* We don't exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
new file mode 100644
index 000000000000..c4068ab7ad23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=32 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[32];
+}
+
+/* We exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
new file mode 100644
index 000000000000..cc262427b03c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=1249 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
new file mode 100644
index 000000000000..72017ca2f439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=2KiB } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
new file mode 100644
index 000000000000..b2f85a55f050
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mno-frame-malloc-threshold } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We'll never exceed the specified unlimited '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
-- 
2.35.1



* Re: [PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
  2023-01-11 12:06         ` [PING] " Thomas Schwinge
@ 2023-01-12  2:46           ` Jerry D
  0 siblings, 0 replies; 11+ messages in thread
From: Jerry D @ 2023-01-12  2:46 UTC (permalink / raw)
  To: Thomas Schwinge, Richard Biener, Tom de Vries, gcc-patches
  Cc: Janne Blomqvist, fortran, Alexander Monakov

On 1/11/23 4:06 AM, Thomas Schwinge wrote:
> Hi!
> 
> Ping -- the '-mframe-malloc-threshold' idea, at least.
> 
> Note that while this issue originally did pop up for Fortran I/O, it's
> likewise relevant for other functions that maintain big frames, for
> example in newlib:
> 
>      libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
>      libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
>      libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
>      libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];
> 
> Therefore a generic solution (or, workaround if you'd like) does seem
> appropriate.
> 
---snip ---

As a gfortranner, I have to at least say that anyone doing Fortran I/O on a
GPU is nuts.

With that said, a configurable option to address the broader issue makes 
sense. Perhaps the default threshold should be whatever it is now and if 
someone has a real situation where it is needed, they can adjust.
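
To make the threshold idea concrete, here is a minimal sketch in C of the decision the test cases in the patch appear to encode.  It is an inference, not the patch's actual code: the 16-byte rounding of the frame size and the ">=" comparison are assumptions read off 'frame-malloc-threshold-4.c' ('char a[32]' with a threshold of 32 uses 'malloc') and 'frame-malloc-threshold-5.c' ('char a[1234]' with a threshold of 1249 does not).  The names 'round_up_16' and 'frame_uses_malloc' are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the frame size is the local storage rounded up to a
   multiple of 16 bytes (the quoted newlib frames are '.align 16').  */
static size_t
round_up_16 (size_t n)
{
  return (n + 15) & ~(size_t) 15;
}

/* Hypothetical predicate: true if a frame holding LOCAL_BYTES of
   automatic storage would be placed on the heap for THRESHOLD.
   Assumption: the comparison is "frame size >= threshold".  */
bool
frame_uses_malloc (size_t local_bytes, size_t threshold)
{
  return round_up_16 (local_bytes) >= threshold;
}
```

Under these assumptions, a threshold of 'SIZE_MAX' can never be reached by a real frame, which matches the documented way of disabling the feature.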

Regards,

Jerry



* nvptx, libgcc: Stub unwinding implementation
       [not found] <ae825c453f484ffd99c9be34af726089@mentor.com>
       [not found] ` <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
@ 2023-01-20 21:04 ` Thomas Schwinge
  2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
  1 sibling, 1 reply; 11+ messages in thread
From: Thomas Schwinge @ 2023-01-20 21:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs

[-- Attachment #1: Type: text/plain, Size: 796 bytes --]

Hi!

We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
configuration of libgfortran.  One prerequisite patch, based on WIP work
by Andrew Stubbs, is: "nvptx, libgcc: Stub unwinding implementation", see
attached.  This I've just pushed to devel/omp/gcc-12 branch in
commit 26d3146736218ccaaaafdaba4da1edf969bc190d, and would like to push
to master branch once other pending GCC patches have been accepted.
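
For illustration, here is a self-contained sketch of what a backtrace consumer observes when it calls such a stub.  The types and the names 'stub_backtrace', 'count_frame', and 'frames_seen' are hypothetical stand-ins, so that the example does not depend on "unwind.h" or clash with libgcc; only the shape (ignore the callback, return 0) follows the attached patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the declarations in "unwind.h".  */
typedef int reason_code;            /* for _Unwind_Reason_Code */
struct context;                     /* for struct _Unwind_Context */
typedef reason_code (*trace_fn) (struct context *, void *);

/* Stub in the style of the patch: visits no frames, reports no error.  */
static reason_code
stub_backtrace (trace_fn trace, void *trace_argument)
{
  (void) trace;
  (void) trace_argument;
  return 0;
}

/* Callback a consumer might pass: counts the frames it is shown.  */
static reason_code
count_frame (struct context *ctx, void *data)
{
  (void) ctx;
  ++*(int *) data;
  return 0;
}

/* Number of frames reported through the stub.  */
int
frames_seen (void)
{
  int n = 0;
  stub_backtrace (count_frame, &n);
  return n;
}
```

The caller links and runs, but is never shown a frame, so any backtrace produced this way is presumably empty.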


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-nvptx-libgcc-Stub-unwinding-implementation.patch --]
[-- Type: text/x-diff, Size: 3312 bytes --]

From 26d3146736218ccaaaafdaba4da1edf969bc190d Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Sep 2022 18:58:34 +0200
Subject: [PATCH] nvptx, libgcc: Stub unwinding implementation

Adding stub '_Unwind_Backtrace', '_Unwind_GetIPInfo' functions is necessary
for linking libbacktrace, as a normal (non-'LIBGFOR_MINIMAL') configuration
of libgfortran wants to do, for example.

The file 'libgcc/config/nvptx/unwind-nvptx.c' is copied from
'libgcc/config/gcn/unwind-gcn.c'.

libgcc/ChangeLog:

	* config/nvptx/t-nvptx: Add unwind-nvptx.c.
	* config/nvptx/unwind-nvptx.c: New file.

Co-authored-by: Andrew Stubbs <ams@codesourcery.com>
---
 libgcc/ChangeLog.omp               |  6 +++++
 libgcc/config/nvptx/t-nvptx        |  3 ++-
 libgcc/config/nvptx/unwind-nvptx.c | 36 ++++++++++++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)
 create mode 100644 libgcc/config/nvptx/unwind-nvptx.c

diff --git a/libgcc/ChangeLog.omp b/libgcc/ChangeLog.omp
index 2e7bf5cc029..c46f49bf5b7 100644
--- a/libgcc/ChangeLog.omp
+++ b/libgcc/ChangeLog.omp
@@ -1,3 +1,9 @@
+2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
+	    Andrew Stubbs  <ams@codesourcery.com>
+
+	* config/nvptx/t-nvptx: Add unwind-nvptx.c.
+	* config/nvptx/unwind-nvptx.c: New file.
+
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
 	* config/nvptx/crtstuff.c ["mgomp"]
diff --git a/libgcc/config/nvptx/t-nvptx b/libgcc/config/nvptx/t-nvptx
index 9a0454c3a4d..1845a38a35e 100644
--- a/libgcc/config/nvptx/t-nvptx
+++ b/libgcc/config/nvptx/t-nvptx
@@ -1,6 +1,7 @@
 LIB2ADD=$(srcdir)/config/nvptx/reduction.c \
 	$(srcdir)/config/nvptx/mgomp.c \
-	$(srcdir)/config/nvptx/atomic.c
+	$(srcdir)/config/nvptx/atomic.c \
+	$(srcdir)/config/nvptx/unwind-nvptx.c
 
 LIB2ADDEH=
 LIB2FUNCS_EXCLUDE=
diff --git a/libgcc/config/nvptx/unwind-nvptx.c b/libgcc/config/nvptx/unwind-nvptx.c
new file mode 100644
index 00000000000..c657b2af6f3
--- /dev/null
+++ b/libgcc/config/nvptx/unwind-nvptx.c
@@ -0,0 +1,36 @@
+/* Stub unwinding implementation.
+
+   Copyright (C) 2019-2023 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "unwind.h"
+
+_Unwind_Reason_Code
+_Unwind_Backtrace (_Unwind_Trace_Fn trace, void *trace_argument)
+{
+  return 0;
+}
+
+_Unwind_Ptr
+_Unwind_GetIPInfo (struct _Unwind_Context *c, int *ip_before_insn)
+{
+  return 0;
+}
-- 
2.25.1



* nvptx, libgfortran: Switch out of "minimal" mode
  2023-01-20 21:04 ` nvptx, libgcc: Stub unwinding implementation Thomas Schwinge
@ 2023-01-20 21:16   ` Thomas Schwinge
  2023-01-20 22:10     ` Thomas Koenig
  2023-01-24  9:37     ` Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode) Thomas Schwinge
  0 siblings, 2 replies; 11+ messages in thread
From: Thomas Schwinge @ 2023-01-20 21:16 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

Hi!

On 2023-01-20T22:04:02+0100, I wrote:
> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
> configuration of libgfortran.

This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
see attached, again based on WIP work by Andrew Stubbs.  This I've just
pushed to devel/omp/gcc-12 branch in
commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
to master branch once other pending GCC patches have been accepted.


The OpenACC XFAILs: "[...] overflows the stack for nvptx offloading"
are unresolved at this point; see the discussion around
"Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?",
and my "nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
experimenting.  (The latter works to some extent, but also has other
issues that I shall detail at some later point in time.)
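
As a rough source-level analogy for the heap-allocation idea (the actual experiment lives in the nvptx back end, judging by the tests that scan for 'malloc'/'free' calls and the absence of '%frame_ar'), the following C sketch contrasts a large automatic buffer with a hand-written heap equivalent.  The function names and the handling of allocation failure are assumptions made for this example only; the thread does not say how a failing 'malloc' is treated.

```c
#include <stdlib.h>
#include <string.h>

enum { BUF_SIZE = 1234 };

/* Original shape: the buffer occupies the function's own frame.  */
size_t
fill_on_stack (char c)
{
  char a[BUF_SIZE];
  size_t n = 0;

  memset (a, c, sizeof a);
  for (size_t i = 0; i < sizeof a; i++)
    n += (a[i] == c);
  return n;
}

/* Hand-written equivalent: storage is obtained on entry and released
   before returning, so the frame itself stays small.  */
size_t
fill_on_heap (char c)
{
  char *a = malloc (BUF_SIZE);
  size_t n = 0;

  if (a == NULL)
    return 0;  /* Assumption: failure handling is not specified.  */
  memset (a, c, BUF_SIZE);
  for (size_t i = 0; i < BUF_SIZE; i++)
    n += (a[i] == c);
  free (a);
  return n;
}
```

Both functions compute the same result; the second trades per-thread stack usage for an allocation on every call.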


Grüße
 Thomas



[-- Attachment #2: 0001-nvptx-libgfortran-Switch-out-of-minimal-mode.patch --]
[-- Type: text/x-diff, Size: 9644 bytes --]

From c7734c6fbb5513b4da6306de7bc85de9b8547988 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Sep 2022 18:58:34 +0200
Subject: [PATCH] nvptx, libgfortran: Switch out of "minimal" mode

..., in order to enable (portions of) Fortran I/O, for example.

libgfortran/ChangeLog:

	* configure: Regenerate.
	* configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.

libgomp/ChangeLog:

	* testsuite/libgomp.fortran/target-print-1.f90: Adjust.
	* testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
	* testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
	* testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
	* testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
	* testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.

Co-authored-by: Andrew Stubbs <ams@codesourcery.com>
---
 libgfortran/ChangeLog.omp                       |  6 ++++++
 libgfortran/configure                           | 17 ++++++-----------
 libgfortran/configure.ac                        | 17 ++++++-----------
 libgomp/ChangeLog.omp                           |  7 +++++++
 .../libgomp.fortran/target-print-1-nvptx.f90    | 11 -----------
 .../libgomp.fortran/target-print-1.f90          |  3 ---
 .../libgomp.oacc-fortran/error_stop-2.f         |  4 +++-
 .../libgomp.oacc-fortran/print-1-nvptx.f90      | 11 -----------
 .../testsuite/libgomp.oacc-fortran/print-1.f90  |  5 ++---
 libgomp/testsuite/libgomp.oacc-fortran/stop-2.f |  4 +++-
 10 files changed, 33 insertions(+), 52 deletions(-)
 delete mode 100644 libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
 delete mode 100644 libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90

diff --git a/libgfortran/ChangeLog.omp b/libgfortran/ChangeLog.omp
index b08c264daf9..925575e65fa 100644
--- a/libgfortran/ChangeLog.omp
+++ b/libgfortran/ChangeLog.omp
@@ -1,3 +1,9 @@
+2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
+	    Andrew Stubbs  <ams@codesourcery.com>
+
+	* configure: Regenerate.
+	* configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.
+
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
 	PR target/85463
diff --git a/libgfortran/configure b/libgfortran/configure
index ae64dca3114..3e5c931d4ad 100755
--- a/libgfortran/configure
+++ b/libgfortran/configure
@@ -6230,17 +6230,12 @@ else
 fi
 
 
-# For GPU offloading, not everything in libfortran can be supported.
-# Currently, the only target that has this problem is nvptx.  The
-# following is a (partial) list of features that are unsupportable on
-# this particular target:
-# * Constructors
-# * alloca
-# * C library support for I/O, with printf as the one notable exception
-# * C library support for other features such as signal, environment
-#   variables, time functions
-
- if test "x${target_cpu}" = xnvptx; then
+# "Minimal" mode is for targets that cannot (yet) support all features of
+# libgfortran.  It avoids the need for working constructors, alloca, and C
+# library support for I/O, signals, environment variables, time functions, etc.
+# At present there are no targets that require this mode.
+
+ if false; then
   LIBGFOR_MINIMAL_TRUE=
   LIBGFOR_MINIMAL_FALSE='#'
 else
diff --git a/libgfortran/configure.ac b/libgfortran/configure.ac
index 97cc490cb5e..e5552949cc6 100644
--- a/libgfortran/configure.ac
+++ b/libgfortran/configure.ac
@@ -222,17 +222,12 @@ AM_CONDITIONAL(LIBGFOR_USE_SYMVER, [test "x$gfortran_use_symver" != xno])
 AM_CONDITIONAL(LIBGFOR_USE_SYMVER_GNU, [test "x$gfortran_use_symver" = xgnu])
 AM_CONDITIONAL(LIBGFOR_USE_SYMVER_SUN, [test "x$gfortran_use_symver" = xsun])
 
-# For GPU offloading, not everything in libfortran can be supported.
-# Currently, the only target that has this problem is nvptx.  The
-# following is a (partial) list of features that are unsupportable on
-# this particular target:
-# * Constructors
-# * alloca
-# * C library support for I/O, with printf as the one notable exception
-# * C library support for other features such as signal, environment
-#   variables, time functions
-
-AM_CONDITIONAL(LIBGFOR_MINIMAL, [test "x${target_cpu}" = xnvptx])
+# "Minimal" mode is for targets that cannot (yet) support all features of
+# libgfortran.  It avoids the need for working constructors, alloca, and C
+# library support for I/O, signals, environment variables, time functions, etc.
+# At present there are no targets that require this mode.
+
+AM_CONDITIONAL(LIBGFOR_MINIMAL, false)
 
 # Some compiler target support may have limited support for integer
 # or floating point numbers – or may want to reduce the libgfortran size
diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 32aa9705296..30b1e558ea3 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,12 @@
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* testsuite/libgomp.fortran/target-print-1.f90: Adjust.
+	* testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
+	* testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
+	* testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
+	* testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
+	* testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.
+
 	* plugin/plugin-nvptx.c (nvptx_do_global_cdtors): New.
 	(nvptx_close_device, GOMP_OFFLOAD_load_image)
 	(GOMP_OFFLOAD_unload_image): Call it.
diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
deleted file mode 100644
index a89c9c33484..00000000000
--- a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
+++ /dev/null
@@ -1,11 +0,0 @@
-! Ensure that write on the offload device works, nvptx offloading variant.
-
-! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
-! configuration.
-! { dg-do link } ! ..., but still apply 'dg-do run' options.
-! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
-
-! Skip duplicated testing.
-! { dg-skip-if "separate file" { ! offload_target_nvptx } }
-
-include 'target-print-1.f90'
diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
index 327bb22cb6d..9ac70e5a85f 100644
--- a/libgomp/testsuite/libgomp.fortran/target-print-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
@@ -3,9 +3,6 @@
 ! { dg-do run }
 ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
 
-! Separate file 'target-print-1-nvptx.f90' for nvptx offloading.
-! { dg-skip-if "separate file" { offload_target_nvptx } }
-
 program main
   implicit none
   integer :: var = 42
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
index 5951e8cbe64..bbb4b55ef2c 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
@@ -17,7 +17,9 @@
 
 ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
 
-! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" }
+! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
+! overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
 !
 ! In gfortran's main program, libfortran's set_options is called - which sets
 ! compiler_options.backtrace = 1 by default.  For an offload libgfortran, this
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
deleted file mode 100644
index 866c8654355..00000000000
--- a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
+++ /dev/null
@@ -1,11 +0,0 @@
-! Ensure that write on the offload device works, nvptx offloading variant.
-
-! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
-! configuration.
-! { dg-do link } ! ..., but still apply 'dg-do run' options.
-! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
-
-! Skip duplicated testing.
-! { dg-skip-if "separate file" { ! offload_target_nvptx } }
-
-include 'print-1.f90'
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
index d2f89d915f8..d04503a0249 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
@@ -2,9 +2,8 @@
 
 ! { dg-do run }
 ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
-
-! Separate file 'print-1-nvptx.f90' for nvptx offloading.
-! { dg-skip-if "separate file" { offload_target_nvptx } }
+! The 'write' overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-xfail-run-if TODO { openacc_nvidia_accel_selected } }
 
 ! { dg-additional-options "-fopt-info-note-omp" }
 ! { dg-additional-options "-foffload=-fopt-info-note-omp" }
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
index fe7ee37813a..394de034b1f 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
@@ -17,7 +17,9 @@
 
 ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
 
-! { dg-output "STOP 35(\n|\r\n|\r)+" }
+! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
+! overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-output "STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
 !
 ! PR85463.  The 'exit' implementation used with nvptx
 ! offloading is a little bit different.
-- 
2.25.1



* Re: nvptx, libgfortran: Switch out of "minimal" mode
  2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
@ 2023-01-20 22:10     ` Thomas Koenig
  2023-01-24  9:37     ` Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode) Thomas Schwinge
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Koenig @ 2023-01-20 22:10 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs

Hi Thomas,

> On 2023-01-20T22:04:02+0100, I wrote:
>> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
>> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
>> configuration of libgfortran.
> 
> This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
> see attached, again based on WIP work by Andrew Stubbs.  This I've just
> pushed to devel/omp/gcc-12 branch in
> commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
> to master branch once other pending GCC patches have been accepted.

Looks good to me.

Regards

	Thomas



* Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode)
  2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
  2023-01-20 22:10     ` Thomas Koenig
@ 2023-01-24  9:37     ` Thomas Schwinge
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Schwinge @ 2023-01-24  9:37 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 11871 bytes --]

Hi!

On 2023-01-20T22:16:00+0100, I wrote:
> On 2023-01-20T22:04:02+0100, I wrote:
>> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
>> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
>> configuration of libgfortran.
>
> This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
> see attached, again based on WIP work by Andrew Stubbs.  This I've just
> pushed to devel/omp/gcc-12 branch in
> commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
> to master branch once other pending GCC patches have been accepted.
>
>
> The OpenACC XFAILs: "[...] overflows the stack for nvptx offloading"
> are unresolved at this point; see the discussion around
> "Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?",
> and my "nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> experimenting.  (The latter works to some extent, but also has other
> issues that I shall detail at some later point in time.)

I had a note from Tobias to "update the last but one bullet point at
https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html".  Thus pushed to
devel/omp/gcc-12 branch commit 8c29332e98ca4669a059ebc0d90903b409ae049f
"Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode'",
see attached.  Please consider that one 'fixup'ed into the GCC master
branch submission.


Grüße
 Thomas


> From c7734c6fbb5513b4da6306de7bc85de9b8547988 Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge <thomas@codesourcery.com>
> Date: Wed, 21 Sep 2022 18:58:34 +0200
> Subject: [PATCH] nvptx, libgfortran: Switch out of "minimal" mode
>
> ..., in order to enable (portions of) Fortran I/O, for example.
>
> libgfortran/ChangeLog:
>
>       * configure: Regenerate.
>       * configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.
>
> libgomp/ChangeLog:
>
>       * testsuite/libgomp.fortran/target-print-1.f90: Adjust.
>       * testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
>       * testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
>       * testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
>       * testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
>       * testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.
>
> Co-authored-by: Andrew Stubbs <ams@codesourcery.com>
> ---
>  libgfortran/ChangeLog.omp                       |  6 ++++++
>  libgfortran/configure                           | 17 ++++++-----------
>  libgfortran/configure.ac                        | 17 ++++++-----------
>  libgomp/ChangeLog.omp                           |  7 +++++++
>  .../libgomp.fortran/target-print-1-nvptx.f90    | 11 -----------
>  .../libgomp.fortran/target-print-1.f90          |  3 ---
>  .../libgomp.oacc-fortran/error_stop-2.f         |  4 +++-
>  .../libgomp.oacc-fortran/print-1-nvptx.f90      | 11 -----------
>  .../testsuite/libgomp.oacc-fortran/print-1.f90  |  5 ++---
>  libgomp/testsuite/libgomp.oacc-fortran/stop-2.f |  4 +++-
>  10 files changed, 33 insertions(+), 52 deletions(-)
>  delete mode 100644 libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
>  delete mode 100644 libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
>
> diff --git a/libgfortran/ChangeLog.omp b/libgfortran/ChangeLog.omp
> index b08c264daf9..925575e65fa 100644
> --- a/libgfortran/ChangeLog.omp
> +++ b/libgfortran/ChangeLog.omp
> @@ -1,3 +1,9 @@
> +2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
> +         Andrew Stubbs  <ams@codesourcery.com>
> +
> +     * configure: Regenerate.
> +     * configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.
> +
>  2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
>
>       PR target/85463
> diff --git a/libgfortran/configure b/libgfortran/configure
> index ae64dca3114..3e5c931d4ad 100755
> --- a/libgfortran/configure
> +++ b/libgfortran/configure
> @@ -6230,17 +6230,12 @@ else
>  fi
>
>
> -# For GPU offloading, not everything in libfortran can be supported.
> -# Currently, the only target that has this problem is nvptx.  The
> -# following is a (partial) list of features that are unsupportable on
> -# this particular target:
> -# * Constructors
> -# * alloca
> -# * C library support for I/O, with printf as the one notable exception
> -# * C library support for other features such as signal, environment
> -#   variables, time functions
> -
> - if test "x${target_cpu}" = xnvptx; then
> +# "Minimal" mode is for targets that cannot (yet) support all features of
> +# libgfortran.  It avoids the need for working constructors, alloca, and C
> +# library support for I/O, signals, environment variables, time functions, etc.
> +# At present there are no targets that require this mode.
> +
> + if false; then
>    LIBGFOR_MINIMAL_TRUE=
>    LIBGFOR_MINIMAL_FALSE='#'
>  else
> diff --git a/libgfortran/configure.ac b/libgfortran/configure.ac
> index 97cc490cb5e..e5552949cc6 100644
> --- a/libgfortran/configure.ac
> +++ b/libgfortran/configure.ac
> @@ -222,17 +222,12 @@ AM_CONDITIONAL(LIBGFOR_USE_SYMVER, [test "x$gfortran_use_symver" != xno])
>  AM_CONDITIONAL(LIBGFOR_USE_SYMVER_GNU, [test "x$gfortran_use_symver" = xgnu])
>  AM_CONDITIONAL(LIBGFOR_USE_SYMVER_SUN, [test "x$gfortran_use_symver" = xsun])
>
> -# For GPU offloading, not everything in libfortran can be supported.
> -# Currently, the only target that has this problem is nvptx.  The
> -# following is a (partial) list of features that are unsupportable on
> -# this particular target:
> -# * Constructors
> -# * alloca
> -# * C library support for I/O, with printf as the one notable exception
> -# * C library support for other features such as signal, environment
> -#   variables, time functions
> -
> -AM_CONDITIONAL(LIBGFOR_MINIMAL, [test "x${target_cpu}" = xnvptx])
> +# "Minimal" mode is for targets that cannot (yet) support all features of
> +# libgfortran.  It avoids the need for working constructors, alloca, and C
> +# library support for I/O, signals, environment variables, time functions, etc.
> +# At present there are no targets that require this mode.
> +
> +AM_CONDITIONAL(LIBGFOR_MINIMAL, false)
>
>  # Some compiler target support may have limited support for integer
>  # or floating point numbers – or may want to reduce the libgfortran size
> diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
> index 32aa9705296..30b1e558ea3 100644
> --- a/libgomp/ChangeLog.omp
> +++ b/libgomp/ChangeLog.omp
> @@ -1,5 +1,12 @@
>  2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
>
> +     * testsuite/libgomp.fortran/target-print-1.f90: Adjust.
> +     * testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
> +     * testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
> +     * testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
> +     * testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
> +     * testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.
> +
>       * plugin/plugin-nvptx.c (nvptx_do_global_cdtors): New.
>       (nvptx_close_device, GOMP_OFFLOAD_load_image)
>       (GOMP_OFFLOAD_unload_image): Call it.
> diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
> deleted file mode 100644
> index a89c9c33484..00000000000
> --- a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
> +++ /dev/null
> @@ -1,11 +0,0 @@
> -! Ensure that write on the offload device works, nvptx offloading variant.
> -
> -! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
> -! configuration.
> -! { dg-do link } ! ..., but still apply 'dg-do run' options.
> -! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
> -
> -! Skip duplicated testing.
> -! { dg-skip-if "separate file" { ! offload_target_nvptx } }
> -
> -include 'target-print-1.f90'
> diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
> index 327bb22cb6d..9ac70e5a85f 100644
> --- a/libgomp/testsuite/libgomp.fortran/target-print-1.f90
> +++ b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
> @@ -3,9 +3,6 @@
>  ! { dg-do run }
>  ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
>
> -! Separate file 'target-print-1-nvptx.f90' for nvptx offloading.
> -! { dg-skip-if "separate file" { offload_target_nvptx } }
> -
>  program main
>    implicit none
>    integer :: var = 42
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
> index 5951e8cbe64..bbb4b55ef2c 100644
> --- a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
> @@ -17,7 +17,9 @@
>
>  ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
>
> -! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" }
> +! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
> +! overflows the stack for nvptx offloading, thus XFAILed.
> +! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
>  !
>  ! In gfortran's main program, libfortran's set_options is called - which sets
>  ! compiler_options.backtrace = 1 by default.  For an offload libgfortran, this
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
> deleted file mode 100644
> index 866c8654355..00000000000
> --- a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
> +++ /dev/null
> @@ -1,11 +0,0 @@
> -! Ensure that write on the offload device works, nvptx offloading variant.
> -
> -! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
> -! configuration.
> -! { dg-do link } ! ..., but still apply 'dg-do run' options.
> -! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
> -
> -! Skip duplicated testing.
> -! { dg-skip-if "separate file" { ! offload_target_nvptx } }
> -
> -include 'print-1.f90'
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
> index d2f89d915f8..d04503a0249 100644
> --- a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
> @@ -2,9 +2,8 @@
>
>  ! { dg-do run }
>  ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
> -
> -! Separate file 'print-1-nvptx.f90' for nvptx offloading.
> -! { dg-skip-if "separate file" { offload_target_nvptx } }
> +! The 'write' overflows the stack for nvptx offloading, thus XFAILed.
> +! { dg-xfail-run-if TODO { openacc_nvidia_accel_selected } }
>
>  ! { dg-additional-options "-fopt-info-note-omp" }
>  ! { dg-additional-options "-foffload=-fopt-info-note-omp" }
> diff --git a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
> index fe7ee37813a..394de034b1f 100644
> --- a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
> @@ -17,7 +17,9 @@
>
>  ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
>
> -! { dg-output "STOP 35(\n|\r\n|\r)+" }
> +! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
> +! overflows the stack for nvptx offloading, thus XFAILed.
> +! { dg-output "STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
>  !
>  ! PR85463.  The 'exit' implementation used with nvptx
>  ! offloading is a little bit different.


[-- Attachment #2: 0001-Update-libgomp-libgomp.texi-for-nvptx-libgfortran-Sw.patch --]
[-- Type: text/x-diff, Size: 1777 bytes --]

From 8c29332e98ca4669a059ebc0d90903b409ae049f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 24 Jan 2023 10:29:01 +0100
Subject: [PATCH] Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch
 out of "minimal" mode'

	libgomp/
	* libgomp.texi (nvptx): Update for
	'nvptx, libgfortran: Switch out of "minimal" mode'.
---
 libgomp/libgomp.texi | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 896d187f1ff..17f1509343f 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -4448,7 +4448,7 @@ The used sizes are
 
 The implementation remark:
 @itemize
-@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+@item I/O within OpenMP target regions and OpenACC compute regions is supported
       using the C library @code{printf} functions and the Fortran
       @code{print}/@code{write} statements.
 @end itemize
@@ -4496,9 +4496,11 @@ CUDA version and hardware.
 
 The implementation remark:
 @itemize
-@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
-      using the C library @code{printf} functions. Note that the Fortran
-      @code{print}/@code{write} statements are not supported, yet.
+@item I/O within OpenMP target regions and OpenACC compute regions is supported
+      using the C library @code{printf} functions.
+      Additionally, the Fortran @code{print}/@code{write} statements are
+      supported within OpenMP target regions, but not yet OpenACC compute
+      regions.
 @item Compilation OpenMP code that contains @code{requires reverse_offload}
       requires at least @code{-march=sm_35}, compiling for @code{-march=sm_30}
       is not supported.
-- 
2.25.1
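
For illustration only, here is a minimal sketch of what the updated documentation item describes: I/O from inside an OpenMP target region using the C library 'printf'. The function name 'answer_on_device' is hypothetical and not part of the patch or the testsuite. The message text mirrors the 'dg-output' pattern of 'target-print-1.f90'. Built without '-fopenmp' (or without an offload device) the pragma is ignored and the region simply runs on the host, so this shows only the shape of the code, not nvptx behaviour.

```c
#include <stdio.h>

/* Hypothetical helper (not from the patch): sets a value inside an
   OpenMP target region and prints it from within that region.  */
int
answer_on_device (void)
{
  int var = 0;

  /* 'map(tofrom: var)' copies 'var' to the device and back again.
     Without '-fopenmp' this pragma is ignored and the block below
     executes on the host.  */
#pragma omp target map(tofrom: var)
  {
    var = 42;
    /* I/O within the target region via the C library 'printf'.  */
    printf ("The answer is %d\n", var);
  }

  return var;
}
```

Calling 'answer_on_device ()' from 'main' and compiling with '-fopenmp' plus a configured nvptx offload compiler would exercise the device path. The Fortran equivalent would use a 'print' or 'write' statement in the region, which per the patch above is documented as supported for OpenMP target regions but not yet for OpenACC compute regions.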



end of thread, other threads:[~2023-01-24  9:37 UTC | newest]

Thread overview: 11+ messages
     [not found] <ae825c453f484ffd99c9be34af726089@mentor.com>
     [not found] ` <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
2022-11-11 14:12   ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Thomas Schwinge
2022-11-11 14:35     ` Richard Biener
2022-12-23 14:08       ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge
2022-12-23 21:23         ` Jerry D
2023-01-11 12:06         ` [PING] " Thomas Schwinge
2023-01-12  2:46           ` Jerry D
2022-11-11 14:38     ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Janne Blomqvist
2023-01-20 21:04 ` nvptx, libgcc: Stub unwinding implementation Thomas Schwinge
2023-01-20 21:16   ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
2023-01-20 22:10     ` Thomas Koenig
2023-01-24  9:37     ` Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode) Thomas Schwinge
