From: Richard Biener
Date: Fri, 11 Nov 2022 15:35:44 +0100
Subject: Re: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?
To: Thomas Schwinge
Cc: fortran@gcc.gnu.org, gcc@gcc.gnu.org, Tom de Vries, Alexander Monakov

On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge wrote:
>
> Hi!
>
> For example, for Fortran code like:
>
>     write (*,*) "Hello world"
>
> ..., 'gfortran' creates:
>
>     struct __st_parameter_dt dt_parm.0;
>
>     try
>       {
>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>         dt_parm.0.common.line = 29;
>         dt_parm.0.common.flags = 128;
>         dt_parm.0.common.unit = 6;
>         _gfortran_st_write (&dt_parm.0);
>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>         _gfortran_st_write_done (&dt_parm.0);
>       }
>     finally
>       {
>         dt_parm.0 = {CLOBBER(eol)};
>       }
>
> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
> really! -- there's a lot of state in Fortran I/O, apparently). That's a
> problem for GPU execution -- here: OpenACC/nvptx -- where you typically
> have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
> GCC/OpenMP/nvptx is an exception because of its use of '-msoft-stack',
> "Use custom stacks instead of local memory for automatic storage".)
>
> Now, the Nvidia Driver tries to accommodate such largish stack usage,
> and dynamically increases the per-thread stack as necessary (thereby
> potentially reducing parallelism) -- if it manages to understand the
> call graph. In the case of libgfortran I/O, it evidently doesn't. Not
> being able to disprove the existence of recursion is the common
> problem, as I've read. At run time, via 'CU_JIT_INFO_LOG_BUFFER' you
> then get, for example:
>
>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>
> That's still not an actual problem if the GPU kernel's stack usage
> still fits into 1 KiB. Very often it does, but if, as happens in
> libgfortran I/O handling, another such 'dt_parm' is put onto the stack,
> the stack then overflows; device-side SIGSEGV.
>
> (There is, by the way, some similar analysis by Tom de Vries in
> "[nvptx, openacc, openmp, testsuite] Recursive tests may fail due to
> thread stack limit".)
>
> Of course, you shouldn't really be doing I/O in GPU kernels, but people
> do like their occasional "'printf' debugging", so we ought to make that
> work (... without pessimizing any "normal" code).
>
> I assume that generally reducing the size of 'dt_parm' etc. is out of
> scope.
>
> There is a way to manually set a per-thread stack size, but it's not
> obvious which size to set: that size needs to work for the whole GPU
> kernel, and should be as low as possible (to maximize parallelism).
> I assume that even if GCC did an accurate call-graph analysis of the
> GPU kernel's maximum stack usage, that still wouldn't help: that's
> before the PTX JIT does its own code transformations, including stack
> spilling.
>
> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
> (-dlto) for device code". This might help, assuming that it manages to
> simplify the libgfortran I/O code such that the PTX JIT then
> understands the call graph. But: that's available only starting with
> recent CUDA 11.4, so not a general solution -- if it works at all,
> which I've not tested.
>
> Similarly, we could enable GCC's LTO for device code generation -- but
> that's a big project, out of scope at this time. And again, we don't
> know whether that helps this case at all.
>
> I see a few options:
>
> (a) Figure out what it is in the libgfortran I/O implementation that
> causes "Stack size [...] cannot be statically determined", and re-work
> that code to avoid that, or even disable certain things for nvptx, if
> feasible.
>
> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
> I don't really want to do that, however: it does introduce a bit of
> complexity in all the generated device code and run-time overhead that
> we generally would like to avoid.
>
> (c) I'm contemplating a tweak/compiler pass for transforming such large
> stack objects into heap allocation (during nvptx offloading
> compilation). 'malloc'/'free' do exist; they're slow, but that's not a
> problem for the code paths this is to affect. (Might also add some
> compile-time diagnostic, of course.) Could maybe even limit this to
> only be used during libgfortran compilation? This is then conceptually
> a bit similar to (b), but localized to relevant parts only. Has such a
> thing been done before in GCC, that I could build upon?
>
> Any other clever ideas?

Shrink st_parameter_dt (it's part of the ABI though, kind of). Lots of
the bloat is from things that are unused for the simpler I/O cases (so
some "inheritance" could help), and lots of the bloat is from
string/length pairs using char * + size_t for what looks like it could
be encoded a lot more efficiently. There's probably not much low-hanging
fruit.

Converting to heap allocation is difficult outside of the frontend, and
you have to be very careful about memory leaks. The library is written
in C and I see heap-allocated temporaries there, but in at least one
place a stack one is used:

void
st_endfile (st_parameter_filepos *fpp)
{
  ...
      if (u->current_record)
        {
          st_parameter_dt dtp;

          dtp.common = fpp->common;
          memset (&dtp.u.p, 0, sizeof (dtp.u.p));
          dtp.u.p.current_unit = u;
          next_record (&dtp, 1);

That might be a mistake though - maybe it's enough to change that to a
heap allocation? It might also be totally superfluous, since only 'u'
should matter here ... (not sure whether the above is the case you are
running into).

Richard.

>
>
> Grüße
>  Thomas
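
As a concrete illustration of that last suggestion, here is a minimal
sketch (untested) of heap-allocating the temporary in 'st_endfile'. It
assumes libgfortran's 'xmalloc'/'free' helpers are usable at this point
(otherwise plain 'malloc' plus an error check would do) and that
'next_record' does not retain a pointer to the temporary:

      if (u->current_record)
        {
          /* Sketch: heap-allocate the temporary st_parameter_dt instead
             of putting it on the (small, on nvptx) stack.  */
          st_parameter_dt *dtp = xmalloc (sizeof (*dtp));

          dtp->common = fpp->common;
          memset (&dtp->u.p, 0, sizeof (dtp->u.p));
          dtp->u.p.current_unit = u;
          next_record (dtp, 1);

          /* Release it right away to avoid the memory-leak concern
             noted above.  */
          free (dtp);
        }

Whether the temporary can be dropped entirely, as suspected above, would
need checking against what 'next_record' actually reads from it.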