Hi!

Ping -- the '-mframe-malloc-threshold' idea, at least.

Note that while this issue originally did pop up for Fortran I/O, it's
likewise relevant for other functions that maintain big frames, for
example in newlib:

    libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
    libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];

Therefore a generic solution (or workaround, if you'd like) does seem
appropriate.


Regards
 Thomas


On 2022-12-23T15:08:06+0100, I wrote:
> Hi!
>
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge wrote:
>>> For example, for Fortran code like:
>>>
>>>     write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>     struct __st_parameter_dt dt_parm.0;
>>>
>>>     try
>>>       {
>>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>         dt_parm.0.common.line = 29;
>>>         dt_parm.0.common.flags = 128;
>>>         dt_parm.0.common.unit = 6;
>>>         _gfortran_st_write (&dt_parm.0);
>>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>         _gfortran_st_write_done (&dt_parm.0);
>>>       }
>>>     finally
>>>       {
>>>         dt_parm.0 = {CLOBBER(eol)};
>>>       }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently). That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph. In case of libgfortran I/O, it evidently doesn't. Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem if the GPU kernel's stack usage still
>>> fits into 1 KiB. Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.
>>>
>>> There is a way to manually set a per-thread stack size, but it's not
>>> obvious which size to set: that size needs to work for the whole GPU
>>> kernel, and should be as low as possible (to maximize parallelism).
>>> I assume that even if GCC did an accurate call graph analysis of the GPU
>>> kernel's maximum stack usage, that still wouldn't help: that's before the
>>> PTX JIT does its own code transformations, including stack spilling.
>>>
>>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
>>> (-dlto) for device code".
>>> This might help, assuming that it manages to
>>> simplify the libgfortran I/O code such that the PTX JIT then understands
>>> the call graph. But: that's available only starting with recent
>>> CUDA 11.4, so not a general solution -- if it works at all, which I've
>>> not tested.
>>>
>>> Similarly, we could enable GCC's LTO for device code generation -- but
>>> that's a big project, out of scope at this time. And again, we don't
>>> know whether that helps this case at all.
>>>
>>> I see a few options:
>>>
>>> (a) Figure out what it is in the libgfortran I/O implementation that
>>> causes "Stack size [...] cannot be statically determined", and re-work
>>> that code to avoid that, or even disable certain things for nvptx, if
>>> feasible.
>
>> Shrink st_parameter_dt (it's part of the ABI though, kind of). Lots of the
>> bloat is from things that are unused for simpler I/O cases (so some
>> "inheritance" could help), and lots of the bloat is from using
>> string/length pairs using char * + size_t for what looks like it could be
>> encoded a lot more efficiently.
>>
>> There's probably not much low-hanging fruit.
>
> (Similar comments in Janne's email.)
>
>
> Well, as was to be expected, libgfortran I/O is really just one example,
> but the underlying problem may also be triggered in other ways (via other
> newlib/libc functions, for example).
>
> So, really a generic solution seems to be called for.
>
>>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>>> I don't really want to do that however: it does introduce a bit of
>>> complexity in all the generated device code and run-time overhead that we
>>> generally would like to avoid.
>
> Directly using '-msoft-stack' isn't actually possible: it does implement
> "one stack per 32-threads warp", but for OpenACC we need "one stack per
> thread of a warp" (that is, each OpenACC 'vector' independently), and
> pre-allocating from device memory all those stacks (which may be a lot!)
> I foresee to really negatively impact overall performance?
>
>>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>>> stack objects into heap allocation (during nvptx offloading compilation).
>>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>>> code paths this is to affect. (Might also add some compile-time
>>> diagnostic, of course.) Could maybe even limit this to only be used
>>> during libgfortran compilation? This is then conceptually a bit similar
>>> to (b), but localized to relevant parts only. Has such a thing been done
>>> before in GCC, that I could build upon?
>>>
>>> Any other clever ideas?
>
>> Converting to heap allocation is difficult outside of the frontend and you
>> have to be very careful with memleaks.
>
> Heh, in fact it seems to be pretty simple! (Famous last words?) See
> "[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> attached. What do people think about such a thing?
>
> Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
> '-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
> default value for '-mframe-malloc-threshold=[...]' (potentially different
> for GCC/nvptx target libraries build vs. user-compiled code?), etc.
>
>
>> The library is written in C and
>> I see heap allocated temporaries there but in at least one
>> place a stack one is used:
>>
>> void
>> st_endfile (st_parameter_filepos *fpp)
>> {
>> ...
>>   if (u->current_record)
>>     {
>>       st_parameter_dt dtp;
>>       dtp.common = fpp->common;
>>       memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>>       dtp.u.p.current_unit = u;
>>       next_record (&dtp, 1);
>>
>> that might be a mistake though - maybe it's enough to change that
>> to a heap allocation? It might be also totally superfluous since
>> only 'u' should matter here ... (not sure if the above is the case
>> you are running into).
>
> (Have not yet looked into that; won't solve the general issue.)
>
>
> Regards
>  Thomas
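
For illustration only, here is a minimal source-level sketch of the idea in (c): a function whose frame holds a large object, next to a variant where that object is heap-allocated and freed before returning. This is not the attached patch, which would do the rewrite inside the compiler during nvptx code generation. 'FRAME_MALLOC_THRESHOLD', 'struct big_state', and both function names are hypothetical, and the value 256 is an arbitrary assumption, not a proposed default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a '-mframe-malloc-threshold=N' value.  */
#define FRAME_MALLOC_THRESHOLD 256

/* Scratch state; 258 * sizeof (size_t), that is 2064 bytes where
   size_t is 8 bytes wide.  */
struct big_state
{
  size_t seen[256];
  size_t extra[2];
};

/* The object is larger than the assumed threshold, so it is a
   candidate for the stack-to-heap rewrite.  */
_Static_assert (sizeof (struct big_state) > FRAME_MALLOC_THRESHOLD,
                "big_state should exceed the assumed threshold");

/* Original form: the large object lives in the stack frame.  */
size_t
count_distinct_stack (const unsigned char *s, size_t n)
{
  assert (s != NULL || n == 0);
  struct big_state st;
  memset (&st, 0, sizeof st);
  size_t distinct = 0;
  for (size_t i = 0; i < n; i++)
    if (st.seen[s[i]]++ == 0)
      distinct++;
  return distinct;
}

/* Rewritten form: same logic, but the large object comes from
   'malloc' and is released on the single exit path.  */
size_t
count_distinct_heap (const unsigned char *s, size_t n)
{
  assert (s != NULL || n == 0);
  struct big_state *st = malloc (sizeof *st);
  if (st == NULL)
    abort ();  /* No fallback in this sketch.  */
  memset (st, 0, sizeof *st);
  size_t distinct = 0;
  for (size_t i = 0; i < n; i++)
    if (st->seen[s[i]]++ == 0)
      distinct++;
  free (st);
  return distinct;
}
```

Both variants return the same result; only the storage of the large object differs. A function with several return paths would need the 'free' on each of them, which is the memory-leak concern raised in the quoted reply.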