Re: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?

From: Richard Biener <richard.guenther@gmail.com>
To: Thomas Schwinge <thomas@codesourcery.com>
Cc: fortran@gcc.gnu.org, gcc@gcc.gnu.org,
	Tom de Vries <tdevries@suse.de>,
	 Alexander Monakov <amonakov@ispras.ru>
Subject: Re: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
Date: Fri, 11 Nov 2022 15:35:44 +0100
Message-ID: <CAFiYyc0oAd+r97MfpcS8obsLeBmh4Q+qfeyZbszMzhKuR4wQiA@mail.gmail.com>
In-Reply-To: <87zgcxoa05.fsf@euler.schwinge.homeip.net>

On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>
> Hi!
>
> For example, for Fortran code like:
>
>     write (*,*) "Hello world"
>
> ..., 'gfortran' creates:
>
>     struct __st_parameter_dt dt_parm.0;
>
>     try
>       {
>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>         dt_parm.0.common.line = 29;
>         dt_parm.0.common.flags = 128;
>         dt_parm.0.common.unit = 6;
>         _gfortran_st_write (&dt_parm.0);
>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>         _gfortran_st_write_done (&dt_parm.0);
>       }
>     finally
>       {
>         dt_parm.0 = {CLOBBER(eol)};
>       }
>
> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
> really! -- there's a lot of state in Fortran I/O apparently).  That's a
> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
> "Use custom stacks instead of local memory for automatic storage".)
>
> Now, the Nvidia Driver tries to accommodate such largish stack usage,
> and dynamically increases the per-thread stack as necessary (thereby
> potentially reducing parallelism) -- if it manages to understand the call
> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
> to disprove existence of recursion is the common problem, as I've read.
> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>
>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>
> That's still not an actual problem: if the GPU kernel's stack usage still
> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
> I/O handling, there is another such 'dt_parm' put onto the stack, the
> stack then overflows; device-side SIGSEGV.
>
> (There is, by the way, some similar analysis by Tom de Vries in
> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
> Recursive tests may fail due to thread stack limit".)
>
> Of course, you shouldn't really be doing I/O in GPU kernels, but people
> do like their occasional "'printf' debugging", so we ought to make that
> work (... without pessimizing any "normal" code).
>
> I assume that generally reducing the size of 'dt_parm' etc. is out of
> scope.
>
> There is a way to manually set a per-thread stack size, but it's not
> obvious which size to set: that size needs to work for the whole GPU
> kernel, and should be as low as possible (to maximize parallelism).
> I assume that even if GCC did an accurate call graph analysis of the GPU
> kernel's maximum stack usage, that still wouldn't help: that's before the
> PTX JIT does its own code transformations, including stack spilling.
>
> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
> (-dlto) for device code".  This might help, assuming that it manages to
> simplify the libgfortran I/O code such that the PTX JIT then understands
> the call graph.  But: that's available only starting with recent
> CUDA 11.4, so not a general solution -- if it works at all, which I've
> not tested.
>
> Similarly, we could enable GCC's LTO for device code generation -- but
> that's a big project, out of scope at this time.  And again, we don't
> know if that at all helps this case.
>
> I see a few options:
>
> (a) Figure out what it is in the libgfortran I/O implementation that
> causes "Stack size [...] cannot be statically determined", and re-work
> that code to avoid that, or even disable certain things for nvptx, if
> feasible.
>
> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
> I don't really want to do that however: it does introduce a bit of
> complexity in all the generated device code and run-time overhead that we
> generally would like to avoid.
>
> (c) I'm contemplating a tweak/compiler pass for transforming such large
> stack objects into heap allocation (during nvptx offloading compilation).
> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
> code paths this is to affect.  (Might also add some compile-time
> diagnostic, of course.)  Could maybe even limit this to only be used
> during libgfortran compilation?  This is then conceptually a bit similar
> to (b), but localized to relevant parts only.  Has such a thing been done
> before in GCC, that I could build upon?
>
> Any other clever ideas?

Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
bloat is from things that are unused for simpler I/O cases (so some
"inheritance" could help), and lots of the bloat is from string/length
pairs stored as char * + size_t, which look like they could be
encoded a lot more efficiently.

There's probably not much low-hanging fruit.

Converting to heap allocation is difficult outside of the frontend and you
have to be very careful with memleaks.  The library is written in C and
I see heap allocated temporaries there but in at least one
place a stack one is used:

void
st_endfile (st_parameter_filepos *fpp)
{
...
      if (u->current_record)
        {
          st_parameter_dt dtp;
          dtp.common = fpp->common;
          memset (&dtp.u.p, 0, sizeof (dtp.u.p));
          dtp.u.p.current_unit = u;
          next_record (&dtp, 1);

that might be a mistake though - maybe it's enough to change that
to a heap allocation?  It might also be totally superfluous, since
only 'u' should matter here ... (not sure if the above is the case
you are running into).
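
As a rough illustration of what "change that to a heap allocation" could
look like, here is a minimal, self-contained sketch.  It is not the actual
libgfortran code: the type definitions are small stand-ins (the 512-byte
pad only models the roughly half-KiB size mentioned above), 'next_record'
is a dummy, and 'endfile_finish_record' is a hypothetical helper name for
the quoted fragment of st_endfile.  The point is only the pattern: allocate
the temporary, check for failure, and free it on the single path that
follows the allocation, so no leak is possible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the libgfortran types (assumptions, not the real layout).  */
typedef struct { int flags; int unit; } st_parameter_common;
typedef struct { int current_record; int records_written; } gfc_unit;

typedef struct
{
  st_parameter_common common;
  union
  {
    struct { gfc_unit *current_unit; } p;
    char pad[512];              /* models the large private I/O state */
  } u;
} st_parameter_dt;

typedef struct { st_parameter_common common; } st_parameter_filepos;

/* Dummy replacement for libgfortran's next_record: finish the record.  */
static void
next_record (st_parameter_dt *dtp, int done)
{
  if (done)
    dtp->u.p.current_unit->current_record = 0;
  dtp->u.p.current_unit->records_written++;
}

/* Hypothetical helper mirroring the quoted st_endfile fragment.
   Returns 0 on success, -1 if the temporary could not be allocated.  */
static int
endfile_finish_record (st_parameter_filepos *fpp, gfc_unit *u)
{
  assert (fpp != NULL && u != NULL);

  if (!u->current_record)
    return 0;

  /* Heap temporary instead of 'st_parameter_dt dtp;' on the stack.  */
  st_parameter_dt *dtp = malloc (sizeof *dtp);
  if (dtp == NULL)
    return -1;

  dtp->common = fpp->common;
  memset (&dtp->u.p, 0, sizeof (dtp->u.p));
  dtp->u.p.current_unit = u;
  next_record (dtp, 1);

  free (dtp);                   /* only exit after the allocation */
  return 0;
}
```

In the real function an allocation failure would presumably have to go
through the library's own error reporting rather than a return code; that
part is left out here.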

Richard.

>
>
> Grüße
>  Thomas
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
