From: Jerry D
Date: Fri, 23 Dec 2022 13:23:40 -0800
To: Thomas Schwinge, Richard Biener, Tom de Vries, gcc-patches@gcc.gnu.org
Cc: Janne Blomqvist, fortran@gcc.gnu.org, Alexander Monakov
Subject: Re: nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
Message-ID: <758e50d1-5d28-cac3-ff4e-dc632cda1455@gmail.com>
In-Reply-To: <87ili2p60p.fsf@euler.schwinge.homeip.net>
References: <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com> <87zgcxoa05.fsf@euler.schwinge.homeip.net> <87ili2p60p.fsf@euler.schwinge.homeip.net>

On 12/23/22 6:08 AM, Thomas Schwinge wrote:
> Hi!
>
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge wrote:
>>> For example, for Fortran code like:
>>>
>>>     write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>     struct __st_parameter_dt dt_parm.0;
>>>
>>>     try
>>>       {
>>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>         dt_parm.0.common.line = 29;
>>>         dt_parm.0.common.flags = 128;
>>>         dt_parm.0.common.unit = 6;
>>>         _gfortran_st_write (&dt_parm.0);
>>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>         _gfortran_st_write_done (&dt_parm.0);
>>>       }
>>>     finally
>>>       {
>>>         dt_parm.0 = {CLOBBER(eol)};
>>>       }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate for such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem: if the GPU kernel's stack usage still
>>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.

There are so many wiggles, turns, corner cases, and similar nightmares
in I/O that I would advise against trying to reduce the size of dt_parm.
It could probably be done.

For debugging on the GPU, would it not be better to have a way to signal
back to a main thread and do the print from there, like some sort of
callback in the user's code under test?

Putting this another way: recommend that users debug with a method other
than embedded print statements, rather than doing a ton of work to
enable something that is not really a legitimate use case.

FWIW,

Jerry