From: Thomas Schwinge
To: ,
CC: Tom de Vries , Alexander Monakov
Subject: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
In-Reply-To: <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
Date: Fri, 11 Nov 2022 15:12:42 +0100
Message-ID: <87zgcxoa05.fsf@euler.schwinge.homeip.net>

Hi!

For example, for Fortran code like:

    write (*,*) "Hello world"

..., 'gfortran' creates:

    struct __st_parameter_dt dt_parm.0;

    try
      {
        dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
        dt_parm.0.common.line = 29;
        dt_parm.0.common.flags = 128;
        dt_parm.0.common.unit = 6;
        _gfortran_st_write (&dt_parm.0);
        _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
        _gfortran_st_write_done (&dt_parm.0);
      }
    finally
      {
        dt_parm.0 = {CLOBBER(eol)};
      }

The issue: the stack object 'dt_parm.0' is half a KiB in size (yes,
really! -- there's a lot of state in Fortran I/O apparently).  That's a
problem for GPU execution -- here: OpenACC/nvptx -- where you typically
have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
"Use custom stacks instead of local memory for automatic storage".)

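To put numbers on the paragraph above, here is a minimal C sketch of the
stack budget arithmetic.  'struct dt_parm_standin' is a hypothetical
stand-in, not libgfortran's real 'struct __st_parameter_dt'; it is sized
to exactly 512 bytes as an assumption matching "half a KiB", and the
1 KiB budget is the GCC/OpenACC/nvptx per-thread figure quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for 'struct __st_parameter_dt'.  Only the rough
   size (assumed: 512 bytes) matters here; the real layout differs.  */
struct dt_parm_standin
{
  const char *filename;
  int line;
  int flags;
  int unit;
  /* Opaque I/O state, padding the object to 512 bytes.  */
  char state[512 - sizeof (const char *) - 3 * sizeof (int)];
};

/* Per-thread stack size quoted above for GCC/OpenACC/nvptx.  */
enum { STACK_BUDGET = 1024 };

/* Bytes of the per-thread stack left once LIVE_OBJECTS such objects are
   live at the same time; a negative result means the budget is exceeded.
   Everything else on the stack (spills, other locals) is ignored.  */
long
remaining_stack (int live_objects)
{
  assert (live_objects >= 0);
  return (long) STACK_BUDGET
         - (long) live_objects * (long) sizeof (struct dt_parm_standin);
}
```

With these assumptions, a single such object already takes half of the
per-thread stack, and two of them live at once leave nothing for any
other frame data.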
Now, the Nvidia Driver tries to accommodate such largish stack usage, and
dynamically increases the per-thread stack as necessary (thereby
potentially reducing parallelism) -- if it manages to understand the call
graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
to disprove the existence of recursion is the common problem, as I've
read.  At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for
example:

    warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined

That's still not an actual problem if the GPU kernel's stack usage still
fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
I/O handling, there is another such 'dt_parm' put onto the stack, the
stack then overflows; device-side SIGSEGV.

(There is, by the way, some similar analysis by Tom de Vries in
"[nvptx, openacc, openmp, testsuite] Recursive tests may fail due to
thread stack limit".)

Of course, you shouldn't really be doing I/O in GPU kernels, but people
do like their occasional "'printf' debugging", so we ought to make that
work (... without pessimizing any "normal" code).

I assume that generally reducing the size of 'dt_parm' etc. is out of
scope.

There is a way to manually set a per-thread stack size, but it's not
obvious which size to set: that size needs to work for the whole GPU
kernel, and should be as low as possible (to maximize parallelism).

I assume that even if GCC did an accurate call graph analysis of the GPU
kernel's maximum stack usage, that still wouldn't help: that's before the
PTX JIT does its own code transformations, including stack spilling.

There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
(-dlto) for device code".  This might help, assuming that it manages to
simplify the libgfortran I/O code such that the PTX JIT then understands
the call graph.  But: that's available only starting with recent
CUDA 11.4, so not a general solution -- if it works at all, which I've
not tested.

Similarly, we could enable GCC's LTO for device code generation -- but
that's a big project, out of scope at this time.  And again, we don't
know if that helps this case at all.

I see a few options:

(a) Figure out what it is in the libgfortran I/O implementation that
causes "Stack size [...] cannot be statically determined", and re-work
that code to avoid that, or even disable certain things for nvptx, if
feasible.

(b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
I don't really want to do that, however: it introduces a bit of
complexity in all the generated device code, and run-time overhead that
we generally would like to avoid.

(c) I'm contemplating a tweak/compiler pass for transforming such large
stack objects into heap allocation (during nvptx offloading compilation).
'malloc'/'free' do exist; they're slow, but that's not a problem for the
code paths this is to affect.  (Might also add some compile-time
diagnostic, of course.)  Could maybe even limit this to only be used
during libgfortran compilation?  This is then conceptually a bit similar
to (b), but localized to the relevant parts only.  Has such a thing been
done before in GCC, that I could build upon?

Any other clever ideas?


Regards
 Thomas
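
For illustration of option (c), a minimal sketch in C of what such a
transformation would conceptually do to one function.  This is not an
existing GCC pass; 'struct big_state', 'consume', 'write_on_stack' and
'write_on_heap' are hypothetical names, and the 512-byte size is an
assumption standing in for 'dt_parm'.  The 'free' is placed where the
object's lifetime ends, which is the point the 'CLOBBER(eol)' marks in
the dump at the top of this message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical large object, standing in for 'dt_parm'.  */
struct big_state
{
  char bytes[512];
};

/* Hypothetical consumer, standing in for the run-time library calls
   that receive the object's address.  */
static int
consume (struct big_state *s)
{
  s->bytes[0] = 1;
  return s->bytes[0];
}

/* Before: the object lives in automatic storage, using 512 bytes of the
   per-thread stack for the duration of the call.  */
int
write_on_stack (void)
{
  struct big_state st;
  memset (&st, 0, sizeof st);
  return consume (&st);
}

/* After: the same logic, with the object moved to the heap.  Only a
   pointer remains on the stack.  Returns -1 if allocation fails.  */
int
write_on_heap (void)
{
  struct big_state *st = malloc (sizeof *st);
  if (st == NULL)
    return -1;
  memset (st, 0, sizeof *st);
  int ret = consume (st);
  free (st); /* End of the object's lifetime.  */
  return ret;
}
```

A real pass would additionally have to release the allocation on every
exit path of the scope, including exceptional ones, and decide what to do
when 'malloc' fails on the device.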