Message-ID: <758e50d1-5d28-cac3-ff4e-dc632cda1455@gmail.com>
Date: Fri, 23 Dec 2022 13:23:40 -0800
To: Thomas Schwinge <thomas@codesourcery.com>,
 Richard Biener <richard.guenther@gmail.com>, Tom de Vries
 <tdevries@suse.de>, gcc-patches@gcc.gnu.org
Cc: Janne Blomqvist <blomqvist.janne@gmail.com>, fortran@gcc.gnu.org,
 Alexander Monakov <amonakov@ispras.ru>
References: <ae825c453f484ffd99c9be34af726089@mentor.com>
 <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
 <87zgcxoa05.fsf@euler.schwinge.homeip.net>
 <CAFiYyc0oAd+r97MfpcS8obsLeBmh4Q+qfeyZbszMzhKuR4wQiA@mail.gmail.com>
 <87ili2p60p.fsf@euler.schwinge.homeip.net>
From: Jerry D <jvdelisle2@gmail.com>
Subject: Re: nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'
 (was: Handling of large stack objects in GPU code generation -- maybe
 transform into heap allocation?)
In-Reply-To: <87ili2p60p.fsf@euler.schwinge.homeip.net>
List-Id: <fortran.gcc.gnu.org>

On 12/23/22 6:08 AM, Thomas Schwinge wrote:
> Hi!
> 
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>>> For example, for Fortran code like:
>>>
>>>      write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>      struct __st_parameter_dt dt_parm.0;
>>>
>>>      try
>>>        {
>>>          dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>          dt_parm.0.common.line = 29;
>>>          dt_parm.0.common.flags = 128;
>>>          dt_parm.0.common.unit = 6;
>>>          _gfortran_st_write (&dt_parm.0);
>>>          _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>          _gfortran_st_write_done (&dt_parm.0);
>>>        }
>>>      finally
>>>        {
>>>          dt_parm.0 = {CLOBBER(eol)};
>>>        }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>      warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem: if the GPU kernel's stack usage still
>>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.

Fortran I/O has so many twists, turns, and nightmarish corner cases that
I would advise against trying to reduce the size of dt_parm, even though
it could probably be done.

For debugging on the GPU, would it not be better to have a way to signal
back to a main thread and do the print from there, for example through
some sort of callback in the user's code under test?

Put another way: I recommend that users debug with some method other
than embedded print statements, rather than us doing a ton of work to
enable something that is not really a legitimate use case.
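
To make the idea concrete, here is a rough sketch in C of what I mean.
It is only an illustration under assumptions, not anything that exists in
libgfortran or libgomp: the names debug_record, compute_with_log and
print_log are made up for this example. The offloaded loop stores plain
data records instead of printing, and the host prints them afterwards, so
no I/O state has to live on the device stack. The OpenACC directive is
guarded by _OPENACC, so without OpenACC the same code simply runs on the
host.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debug record: plain data only, filled in by the compute
   loop in place of a print statement. */
struct debug_record {
    int iteration;
    double value;
    int flagged;   /* nonzero if the host should report this entry */
};

/* Compute phase.  With OpenACC enabled this loop is offloaded; otherwise
   it runs on the host.  It records what a debug print would have shown
   instead of doing any I/O itself. */
void compute_with_log(double *data, struct debug_record *records, int n)
{
#ifdef _OPENACC
#pragma acc parallel loop copy(data[0:n]) copyout(records[0:n])
#endif
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2.0;
        records[i].iteration = i;
        records[i].value = data[i];
        records[i].flagged = data[i] > 10.0;
    }
}

/* Host phase: print the flagged records after the compute region has
   finished.  Returns the number of records printed. */
int print_log(const struct debug_record *records, int n)
{
    assert(n >= 0);
    int printed = 0;
    for (int i = 0; i < n; i++) {
        if (records[i].flagged) {
            printf("iteration %d: value %g\n",
                   records[i].iteration, records[i].value);
            printed++;
        }
    }
    return printed;
}
```

A caller would allocate one record per iteration, run compute_with_log,
and then call print_log on the host. The cost is a fixed-size buffer and
output that appears only after the region completes, which seems a fair
trade for occasional debugging.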

FWIW,

Jerry