From: Thomas Schwinge <thomas@codesourcery.com>
To: <fortran@gcc.gnu.org>, <gcc@gcc.gnu.org>
Cc: Tom de Vries <tdevries@suse.de>, Alexander Monakov <amonakov@ispras.ru>
Subject: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
Date: Fri, 11 Nov 2022 15:12:42 +0100 [thread overview]
Message-ID: <87zgcxoa05.fsf@euler.schwinge.homeip.net> (raw)
In-Reply-To: <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
Hi!
For example, for Fortran code like:
write (*,*) "Hello world"
..., 'gfortran' creates:
    struct __st_parameter_dt dt_parm.0;

    try
      {
        dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
        dt_parm.0.common.line = 29;
        dt_parm.0.common.flags = 128;
        dt_parm.0.common.unit = 6;
        _gfortran_st_write (&dt_parm.0);
        _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
        _gfortran_st_write_done (&dt_parm.0);
      }
    finally
      {
        dt_parm.0 = {CLOBBER(eol)};
      }
The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
really! -- there's a lot of state in Fortran I/O apparently). That's a
problem for GPU execution -- here: OpenACC/nvptx -- where typically you
have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
"Use custom stacks instead of local memory for automatic storage".)
Now, the Nvidia Driver tries to accommodate such largish stack usage,
and dynamically increases the per-thread stack as necessary (thereby
potentially reducing parallelism) -- if it manages to understand the call
graph. In case of libgfortran I/O, it evidently doesn't. Not being able
to disprove the existence of recursion is the common problem, as I've read.
At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
That's still not an actual problem, as long as the GPU kernel's stack
usage still fits into 1 KiB. Very often it does, but if, as happens in
libgfortran I/O handling, another such 'dt_parm' is put onto the stack,
the stack overflows; device-side SIGSEGV.
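(For reference, that JIT warning can be captured at module load time.
A hedged sketch, assuming 'ptx_image' holds the PTX text and error
handling is abbreviated; the driver-API calls and option names are the
real ones, the rest is illustrative:

    #include <stdint.h>
    #include <stdio.h>
    #include <cuda.h>

    /* Request the PTX JIT's info log while loading a module; warnings
       such as "Stack size ... cannot be statically determined" then
       appear in 'info_log'.  */
    static CUresult
    load_with_jit_log (CUmodule *module, const void *ptx_image)
    {
      static char info_log[16384];
      CUjit_option opts[] = {
        CU_JIT_INFO_LOG_BUFFER,
        CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,
      };
      void *vals[] = {
        info_log,
        (void *) (uintptr_t) sizeof info_log,
      };
      CUresult r = cuModuleLoadDataEx (module, ptx_image, 2, opts, vals);
      fprintf (stderr, "PTX JIT: %s\n", info_log);
      return r;
    }

...; this is essentially what a run-time library has to do to surface
the warning at all.)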
(There is, by the way, some similar analysis by Tom de Vries in
<https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
Recursive tests may fail due to thread stack limit".)
Of course, you shouldn't really be doing I/O in GPU kernels, but people
do like their occasional "'printf' debugging", so we ought to make that
work (... without pessimizing any "normal" code).
I assume that generally reducing the size of 'dt_parm' etc. is out of
scope.
There is a way to manually set a per-thread stack size, but it's not
obvious which size to set: that size needs to work for the whole GPU
kernel, and should be as low as possible (to maximize parallelism).
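(The manual knob is the CUDA Driver API's 'cuCtxSetLimit'. A minimal
sketch; the 2 KiB figure is an arbitrary example, not a recommendation:

    #include <cuda.h>

    /* Raise the per-thread stack to 2 KiB for the current context.
       The value is a guess: it must cover the whole GPU kernel's
       worst-case usage, yet too large a value costs parallelism.  */
    static CUresult
    set_gpu_stack_size (void)
    {
      return cuCtxSetLimit (CU_LIMIT_STACK_SIZE, 2048);
    }

...which illustrates the dilemma: the right value isn't known up front.)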
I assume that even if GCC did an accurate call graph analysis of the GPU
kernel's maximum stack usage, that still wouldn't help: that's before the
PTX JIT does its own code transformations, including stack spilling.
There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
(-dlto) for device code". This might help, assuming that it manages to
simplify the libgfortran I/O code such that the PTX JIT then understands
the call graph. But: that's available only starting with recent
CUDA 11.4, so not a general solution -- if it works at all, which I've
not tested.
Similarly, we could enable GCC's LTO for device code generation -- but
that's a big project, out of scope at this time. And again, we don't
know if that at all helps this case.
I see a few options:
(a) Figure out what it is in the libgfortran I/O implementation that
causes "Stack size [...] cannot be statically determined", and re-work
that code to avoid that, or even disable certain things for nvptx, if
feasible.
(b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
I don't really want to do that however: it does introduce a bit of
complexity in all the generated device code and run-time overhead that we
generally would like to avoid.
(c) I'm contemplating a tweak/compiler pass for transforming such large
stack objects into heap allocation (during nvptx offloading compilation).
'malloc'/'free' do exist; they're slow, but that's not a problem for the
code paths this is to affect. (Might also add some compile-time
diagnostic, of course.) Could maybe even limit this to only be used
during libgfortran compilation? This is then conceptually a bit similar
to (b), but localized to relevant parts only. Has such a thing been done
before in GCC, that I could build upon?
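To make (c) concrete, here is a sketch in plain C of the transformation
I have in mind, applied by hand to the 'dt_parm' example from above.
The struct is a hypothetical 512-byte stand-in (the real
'struct __st_parameter_dt' lives in libgfortran), and the libgfortran
calls are elided to comments:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for the half-KiB libgfortran I/O control
       block; only the size matters for this sketch.  */
    struct st_parameter_dt { char state[512]; };

    /* Before the pass: 'struct st_parameter_dt dt_parm;' would be a
       512-byte automatic object -- fatal on a 1 KiB per-thread stack
       once such calls nest.  After the pass: the object lives on the
       heap, and 'free' is placed where the original's end-of-life
       clobber ('dt_parm = {CLOBBER(eol)};') was, i.e. on every exit
       path.  */
    static int
    write_hello (void)
    {
      struct st_parameter_dt *dt_parm = malloc (sizeof *dt_parm);
      if (dt_parm == NULL)
        return -1;
      memset (dt_parm, 0, sizeof *dt_parm);
      /* ... _gfortran_st_write (dt_parm); etc. ...  */
      puts ("Hello world");
      free (dt_parm);  /* Replaces the storage clobber.  */
      return 0;
    }

    int
    main (void)
    {
      return write_hello () == 0 ? 0 : 1;
    }

The pass would have to thread the 'free' through all exit paths
(including exceptional ones, per the 'try'/'finally' above), which is
presumably the fiddly part.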
Any other clever ideas?
Regards
Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
[not found] <ae825c453f484ffd99c9be34af726089@mentor.com>
[not found] ` <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
2022-11-11 14:12 ` Thomas Schwinge [this message]
2022-11-11 14:35 ` Richard Biener
2022-11-11 14:38 ` Janne Blomqvist