public inbox for gcc@gcc.gnu.org
* Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
       [not found] ` <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
@ 2022-11-11 14:12   ` Thomas Schwinge
  2022-11-11 14:35     ` Richard Biener
  2022-11-11 14:38     ` Janne Blomqvist
  0 siblings, 2 replies; 3+ messages in thread
From: Thomas Schwinge @ 2022-11-11 14:12 UTC
  To: fortran, gcc; +Cc: Tom de Vries, Alexander Monakov

Hi!

For example, for Fortran code like:

    write (*,*) "Hello world"

..., 'gfortran' creates:

    struct __st_parameter_dt dt_parm.0;

    try
      {
        dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
        dt_parm.0.common.line = 29;
        dt_parm.0.common.flags = 128;
        dt_parm.0.common.unit = 6;
        _gfortran_st_write (&dt_parm.0);
        _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
        _gfortran_st_write_done (&dt_parm.0);
      }
    finally
      {
        dt_parm.0 = {CLOBBER(eol)};
      }

The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
really! -- there's a lot of state in Fortran I/O apparently).  That's a
problem for GPU execution -- here: OpenACC/nvptx -- where typically you
have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
"Use custom stacks instead of local memory for automatic storage".)

Now, the Nvidia Driver tries to accommodate such largish stack usage,
and dynamically increases the per-thread stack as necessary (thereby
potentially reducing parallelism) -- if it manages to understand the call
graph.  In the case of libgfortran I/O, it evidently doesn't.  Not being
able to disprove the existence of recursion is the common problem, as
I've read.  At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for
example:

    warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
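
For reference, a minimal sketch of how such a log is obtained via the
CUDA Driver API ('ptx_code' here stands in for the actual PTX image):

    #include <cuda.h>
    #include <stdio.h>

    /* Load a PTX image, capturing the PTX JIT's diagnostics.  */
    static CUmodule
    load_with_info_log (const char *ptx_code)
    {
      static char info_log[16384];
      CUjit_option opts[] = { CU_JIT_INFO_LOG_BUFFER,
                              CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES };
      void *vals[] = { info_log, (void *) sizeof info_log };
      CUmodule module;
      if (cuModuleLoadDataEx (&module, ptx_code, 2, opts, vals)
          != CUDA_SUCCESS)
        module = NULL;
      /* 'info_log' now holds diagnostics such as the warning above.  */
      fprintf (stderr, "%s\n", info_log);
      return module;
    }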

That's still not an actual problem as long as the GPU kernel's stack
usage fits into 1 KiB.  Very often it does, but if, as happens in
libgfortran I/O handling, another such 'dt_parm' is put onto the stack,
the stack overflows: device-side SIGSEGV.

(There is, by the way, some similar analysis by Tom de Vries in
<https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
Recursive tests may fail due to thread stack limit".)

Of course, you shouldn't really be doing I/O in GPU kernels, but people
do like their occasional "'printf' debugging", so we ought to make that
work (... without pessimizing any "normal" code).

I assume that generally reducing the size of 'dt_parm' etc. is out of
scope.

There is a way to manually set a per-thread stack size, but it's not
obvious which size to set: that size needs to work for the whole GPU
kernel, and should be as low as possible (to maximize parallelism).
I assume that even if GCC did an accurate call graph analysis of the GPU
kernel's maximum stack usage, that still wouldn't help: that's before the
PTX JIT does its own code transformations, including stack spilling.
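
For reference, with the CUDA Driver API that knob is 'cuCtxSetLimit'; a
minimal sketch:

    #include <cuda.h>

    /* Raise the per-thread GPU stack limit; applies to subsequent
       kernel launches in the current context.  */
    static void
    set_gpu_stack_limit (size_t bytes)
    {
      if (cuCtxSetLimit (CU_LIMIT_STACK_SIZE, bytes) != CUDA_SUCCESS)
        {
          /* Handle the error.  */
        }
    }

    /* For example: 'set_gpu_stack_limit (2 * 1024);' -- but picking
       the right value is exactly the problem described above.  */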

There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
(-dlto) for device code".  This might help, assuming that it manages to
simplify the libgfortran I/O code such that the PTX JIT then understands
the call graph.  But that's only available starting with the recent
CUDA 11.4, so it's not a general solution -- if it works at all, which
I've not tested.

Similarly, we could enable GCC's LTO for device code generation -- but
that's a big project, out of scope at this time.  And again, we don't
know if that at all helps this case.

I see a few options:

(a) Figure out what it is in the libgfortran I/O implementation that
causes "Stack size [...] cannot be statically determined", and re-work
that code to avoid that, or even disable certain things for nvptx, if
feasible.

(b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
I don't really want to do that, however: it introduces a bit of
complexity into all the generated device code, and run-time overhead,
both of which we generally would like to avoid.

(c) I'm contemplating a tweak/compiler pass for transforming such large
stack objects into heap allocation (during nvptx offloading compilation).
'malloc'/'free' do exist; they're slow, but that's not a problem for the
code paths this is meant to affect.  (Might also add some compile-time
diagnostic, of course.)  Could maybe even limit this to only be used
during libgfortran compilation?  This is then conceptually a bit similar
to (b), but localized to relevant parts only.  Has such a thing been done
before in GCC, that I could build upon?
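
For illustration, a C-level sketch of the intended effect of (c); the
real thing would of course operate on the GIMPLE level during nvptx
offloading compilation, and 'struct st_parm' merely stands in for
'struct __st_parameter_dt':

    #include <stdlib.h>

    /* Stand-in for libgfortran's half-KiB 'struct __st_parameter_dt'.  */
    struct st_parm { char state[512]; };

    void use_parm (struct st_parm *);

    void
    before (void)
    {
      struct st_parm dt_parm;       /* large automatic (stack) object */
      use_parm (&dt_parm);
    }

    void
    after (void)
    {
      /* The pass would rewrite the object into a heap allocation,
         pairing it with a 'free' on every exit path to avoid leaks.  */
      struct st_parm *dt_parm = malloc (sizeof *dt_parm);
      use_parm (dt_parm);
      free (dt_parm);
    }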

Any other clever ideas?


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

* Re: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
  2022-11-11 14:12   ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Thomas Schwinge
@ 2022-11-11 14:35     ` Richard Biener
  2022-11-11 14:38     ` Janne Blomqvist
  1 sibling, 0 replies; 3+ messages in thread
From: Richard Biener @ 2022-11-11 14:35 UTC
  To: Thomas Schwinge; +Cc: fortran, gcc, Tom de Vries, Alexander Monakov

On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>
> Hi!
>
> [...]
>
> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
> really! -- there's a lot of state in Fortran I/O apparently).  That's a
> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
> have small stacks.
>
> [...]
>
> I assume that generally reducing the size of 'dt_parm' etc. is out of
> scope.
>
> [...]
>
> (c) I'm contemplating a tweak/compiler pass for transforming such large
> stack objects into heap allocation (during nvptx offloading compilation).
> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
> code paths this is meant to affect.  (Might also add some compile-time
> diagnostic, of course.)  Could maybe even limit this to only be used
> during libgfortran compilation?  This is then conceptually a bit similar
> to (b), but localized to relevant parts only.  Has such a thing been done
> before in GCC, that I could build upon?
>
> Any other clever ideas?

Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of
the bloat is from things that are unused for simpler I/O cases (so some
"inheritance" could help), and lots of it is from string/length pairs
encoded as char * + size_t for what looks like it could be encoded a
lot more efficiently.

There's probably not much low-hanging fruit.

Converting to heap allocation is difficult outside of the frontend, and
you have to be very careful about memory leaks.  The library is written
in C, and I see heap-allocated temporaries there, but in at least one
place a stack one is used:

void
st_endfile (st_parameter_filepos *fpp)
{
...
      if (u->current_record)
        {
          st_parameter_dt dtp;
          dtp.common = fpp->common;
          memset (&dtp.u.p, 0, sizeof (dtp.u.p));
          dtp.u.p.current_unit = u;
          next_record (&dtp, 1);

that might be a mistake though -- maybe it's enough to change that to a
heap allocation?  It might also be totally superfluous, since only 'u'
should matter here ... (not sure if the above is the case you are
running into).
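
Such a change might look like this (untested; assuming libgfortran's
'xmalloc' helper, and with everything after 'next_record' elided as
above):

      if (u->current_record)
        {
          st_parameter_dt *dtp = xmalloc (sizeof *dtp);
          dtp->common = fpp->common;
          memset (&dtp->u.p, 0, sizeof (dtp->u.p));
          dtp->u.p.current_unit = u;
          next_record (dtp, 1);
          free (dtp);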

Richard.

* Re: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
  2022-11-11 14:12   ` Handling of large stack objects in GPU code generation -- maybe transform into heap allocation? Thomas Schwinge
  2022-11-11 14:35     ` Richard Biener
@ 2022-11-11 14:38     ` Janne Blomqvist
  1 sibling, 0 replies; 3+ messages in thread
From: Janne Blomqvist @ 2022-11-11 14:38 UTC
  To: Thomas Schwinge; +Cc: fortran, gcc, Tom de Vries, Alexander Monakov

On Fri, Nov 11, 2022 at 4:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
> For example, for Fortran code like:
>
>     write (*,*) "Hello world"
>
> ..., 'gfortran' creates:

> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
> really! -- there's a lot of state in Fortran I/O apparently).

> Any other clever ideas?

There are a lot of potential options to set during Fortran I/O, but in
the vast majority of cases only a few are used.  So a better library
interface would be to transfer only those options that are actually
used, and then let the full set of options live in heap memory managed
by libgfortran.  Say, some kind of simple byte-code format, with an
'opcode' saying which option it is, followed by the value.
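
For illustration, such an encoding might look like this (all names
invented):

    /* Hypothetical encoding: each option is a one-byte opcode followed
       by its payload; unused options are simply absent, and libgfortran
       keeps the full option state in heap memory of its own.  */
    enum gfc_io_opcode
    {
      GFC_IO_END = 0,   /* end of option list */
      GFC_IO_UNIT,      /* payload: int32 unit number */
      GFC_IO_FLAGS,     /* payload: uint32 flag mask */
      GFC_IO_FILENAME   /* payload: uint32 length, then that many bytes */
    };

    /* The "Hello world" example would then pass only a handful of
       bytes, say { GFC_IO_UNIT, 6,0,0,0, GFC_IO_FLAGS, 128,0,0,0,
       GFC_IO_END }, instead of a half-KiB struct.  */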

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48419 for some
rough ideas in this direction, although I'm not personally working on
GFortran at this time, so somebody else would have to pick it up.


-- 
Janne Blomqvist
