* nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) [not found] ` <CAFiYyc0oAd+r97MfpcS8obsLeBmh4Q+qfeyZbszMzhKuR4wQiA@mail.gmail.com> @ 2022-12-23 14:08 ` Thomas Schwinge 2022-12-23 21:23 ` Jerry D 2023-01-11 12:06 ` [PING] " Thomas Schwinge 0 siblings, 2 replies; 8+ messages in thread From: Thomas Schwinge @ 2022-12-23 14:08 UTC (permalink / raw) To: Richard Biener, Tom de Vries, gcc-patches Cc: Janne Blomqvist, fortran, Alexander Monakov [-- Attachment #1: Type: text/plain, Size: 7524 bytes --] Hi! On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote: > On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote: >> For example, for Fortran code like: >> >> write (*,*) "Hello world" >> >> ..., 'gfortran' creates: >> >> struct __st_parameter_dt dt_parm.0; >> >> try >> { >> dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1}; >> dt_parm.0.common.line = 29; >> dt_parm.0.common.flags = 128; >> dt_parm.0.common.unit = 6; >> _gfortran_st_write (&dt_parm.0); >> _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11); >> _gfortran_st_write_done (&dt_parm.0); >> } >> finally >> { >> dt_parm.0 = {CLOBBER(eol)}; >> } >> >> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes, >> really! -- there's a lot of state in Fortran I/O apparently). That's a >> problem for GPU execution -- here: OpenACC/nvptx -- where typically you >> have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread; >> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack' >> "Use custom stacks instead of local memory for automatic storage".) 
>> >> Now, the Nvidia Driver tries to accommodate such largish stack usage, >> and dynamically increases the per-thread stack as necessary (thereby >> potentially reducing parallelism) -- if it manages to understand the call >> graph. In case of libgfortran I/O, it evidently doesn't. Not being able >> to disprove existence of recursion is the common problem, as I've read. >> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example: >> >> warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined >> >> That's still not an actual problem: if the GPU kernel's stack usage still >> fits into 1 KiB. Very often it does, but if, as happens in libgfortran >> I/O handling, there is another such 'dt_parm' put onto the stack, the >> stack then overflows; device-side SIGSEGV. >> >> (There is, by the way, some similar analysis by Tom de Vries in >> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite] >> Recursive tests may fail due to thread stack limit".) >> >> Of course, you shouldn't really be doing I/O in GPU kernels, but people >> do like their occasional "'printf' debugging", so we ought to make that >> work (... without pessimizing any "normal" code). >> >> I assume that generally reducing the size of 'dt_parm' etc. is out of >> scope. >> >> There is a way to manually set a per-thread stack size, but it's not >> obvious which size to set: that size needs to work for the whole GPU >> kernel, and should be as low as possible (to maximize parallelism). >> I assume that even if GCC did an accurate call graph analysis of the GPU >> kernel's maximum stack usage, that still wouldn't help: that's before the >> PTX JIT does its own code transformations, including stack spilling. >> >> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization >> (-dlto) for device code". This might help, assuming that it manages to >> simplify the libgfortran I/O code such that the PTX JIT then understands >> the call graph. 
But: that's available only starting with recent >> CUDA 11.4, so not a general solution -- if it works at all, which I've >> not tested. >> >> Similarly, we could enable GCC's LTO for device code generation -- but >> that's a big project, out of scope at this time. And again, we don't >> know whether that helps this case at all. >> >> I see a few options: >> >> (a) Figure out what it is in the libgfortran I/O implementation that >> causes "Stack size [...] cannot be statically determined", and re-work >> that code to avoid that, or even disable certain things for nvptx, if >> feasible. > Shrink st_parameter_dt (it's part of the ABI though, kind of). Lots of the > bloat is from things that are unused for simpler I/O cases (so some > "inheritance" could help), and lots of the bloat is from using > string/length pairs using char * + size_t for what looks like it could be > encoded a lot more efficiently. > > There's probably not much low-hanging fruit. (Similar comments in Janne's email.) Well, as was to be expected, libgfortran I/O is really just one example, but the underlying problem may also be triggered in other ways (via other newlib/libc functions, for example). So, really a generic solution seems to be called for. >> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'. >> I don't really want to do that however: it does introduce a bit of >> complexity in all the generated device code and run-time overhead that we >> generally would like to avoid. Directly using '-msoft-stack' isn't actually possible: it does implement "one stack per 32-threads warp", but for OpenACC we need "one stack per thread of a warp" (that is, each OpenACC 'vector' independently), and pre-allocating all those stacks from device memory (which may be a lot!) would, I expect, really hurt overall performance. >> (c) I'm contemplating a tweak/compiler pass for transforming such large >> stack objects into heap allocation (during nvptx offloading compilation). 
>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the >> code paths this is to affect. (Might also add some compile-time >> diagnostic, of course.) Could maybe even limit this to only be used >> during libgfortran compilation? This is then conceptually a bit similar >> to (b), but localized to relevant parts only. Has such a thing been done >> before in GCC, that I could build upon? >> >> Any other clever ideas? > Converting to heap allocation is difficult outside of the frontend and you > have to be very careful with memleaks. Heh, in fact it seems to be pretty simple! (Famous last words?) See "[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'" attached. What do people think about such a thing? Still to be discussed are '-Wframe-malloc-threshold' (default-on vs. '-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?), default value for '-mframe-malloc-threshold=[...]' (potentially different for GCC/nvptx target libraries build vs. user-compiled code?), etc. > The library is written in C and > I see heap allocated temporaries there but in at least one > place a stack one is used: > > void > st_endfile (st_parameter_filepos *fpp) > { > ... > if (u->current_record) > { > st_parameter_dt dtp; > dtp.common = fpp->common; > memset (&dtp.u.p, 0, sizeof (dtp.u.p)); > dtp.u.p.current_unit = u; > next_record (&dtp, 1); > > that might be a mistake though - maybe it's enough to change that > to a heap allocation? It might be also totally superfluous since > only 'u' should matter here ... (not sure if the above is the case > you are running into). (Have not yet looked into that; won't solve the general issue.) 
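To make option (c) concrete, here is a minimal source-level sketch of the idea, assuming a hypothetical function whose only large local is a 1234-byte buffer. It is an analogy only: the attached WIP patch does not rewrite source code, it makes the nvptx back end call the PTX 'malloc' where the frame is set up and 'free' in front of each return instruction. The function names, the 'FRAME_BYTES' macro and the NULL check are illustrative additions, not taken from the patch.

```c
#include <stdlib.h>
#include <string.h>

#define FRAME_BYTES 1234

/* Original shape: the buffer lives in the function's stack frame
   (on nvptx: in per-thread '.local' memory).  */
static int
checksum_stack_frame (void)
{
  char a[FRAME_BYTES];
  int sum = 0;

  memset (a, 1, sizeof a);
  for (size_t i = 0; i < sizeof a; i++)
    sum += a[i];
  return sum;
}

/* Heap-backed shape: allocate on entry, release on every exit path.  */
static int
checksum_heap_frame (void)
{
  char *a = malloc (FRAME_BYTES);
  int sum = 0;

  /* Illustrative check; the WIP patch as posted does not test the
     pointer returned by 'malloc'.  */
  if (a == NULL)
    return -1;

  memset (a, 1, FRAME_BYTES);
  for (size_t i = 0; i < FRAME_BYTES; i++)
    sum += a[i];

  /* This has to happen before each 'return', otherwise the frame leaks.  */
  free (a);
  return sum;
}
```

Both functions compute the same checksum; only the storage of the buffer differs. The memory-leak concern raised in the quoted reply corresponds to the 'free' that must precede every return.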
Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-WIP-nvptx-mframe-malloc-threshold-Wframe-malloc-thre.patch --] [-- Type: text/x-diff, Size: 16585 bytes --] From 3f5524adacff23710cf1cab393a56bf23853cafa Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <thomas@codesourcery.com> Date: Wed, 21 Dec 2022 21:25:19 +0100 Subject: [PATCH] [WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' --- gcc/config/nvptx/nvptx.cc | 102 ++++++++++++++++-- gcc/config/nvptx/nvptx.h | 3 + gcc/config/nvptx/nvptx.opt | 12 +++ gcc/doc/invoke.texi | 16 ++- .../nvptx/frame-malloc-threshold-1.c | 29 +++++ .../nvptx/frame-malloc-threshold-2.c | 13 +++ .../nvptx/frame-malloc-threshold-3.c | 14 +++ .../nvptx/frame-malloc-threshold-4.c | 16 +++ .../nvptx/frame-malloc-threshold-5.c | 15 +++ .../nvptx/frame-malloc-threshold-6.c | 15 +++ .../nvptx/frame-malloc-threshold-7.c | 15 +++ 11 files changed, 240 insertions(+), 10 deletions(-) create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc index b93a253ab318..2efd70595991 100644 --- a/gcc/config/nvptx/nvptx.cc +++ b/gcc/config/nvptx/nvptx.cc @@ -178,6 +178,16 @@ static 
hash_map<tree_decl_hash, unsigned int> gang_private_shared_hmap; /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; +/* True if any function 'has_malloc_frame'. + Because of 'nvptx_name_replacement', we can't just: + nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_FREE)); + nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_MALLOC)); + ..., but instead have to track them individually. +*/ +static bool need_free_malloc_decl; +static bool have_free_decl; +static bool have_malloc_decl; + /* True if any function references __nvptx_stacks. */ static bool need_softstack_decl; static bool have_softstack_decl; @@ -976,6 +986,11 @@ write_fn_marker (std::stringstream &s, bool is_defn, bool globalize, s << " GLOBAL"; s << " FUNCTION " << (is_defn ? "DEF: " : "DECL: "); s << name << "\n"; + + if (strcmp (name, "free") == 0) + have_free_decl = true; + else if (strcmp (name, "malloc") == 0) + have_malloc_decl = true; } /* Emit a linker marker for a variable decl or defn. */ @@ -1231,22 +1246,66 @@ nvptx_maybe_record_fnsym (rtx sym) nvptx_record_needed_fndecl (decl); } +//TODO /* Emit a local array to hold some part of a conventional stack frame and initialize REGNO to point to it. If the size is zero, it'll never be valid to dereference, so we can simply initialize to zero. */ static void -init_frame (FILE *file, int regno, unsigned align, unsigned size) +init_frame (FILE *file, int regno, int align, HOST_WIDE_INT size) { - if (size) - fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n", - align, reg_names[regno], size); fprintf (file, "\t.reg.u%d %s;\n", POINTER_SIZE, reg_names[regno]); - fprintf (file, (size ? 
"\tcvta.local.u%d %s, %s_ar;\n" - : "\tmov.u%d %s, 0;\n"), - POINTER_SIZE, reg_names[regno], reg_names[regno]); + + if (regno == FRAME_POINTER_REGNUM + && ((unsigned HOST_WIDE_INT) size + >= (unsigned HOST_WIDE_INT) nvptx_frame_malloc_threshold)) + { + warning_at (DECL_SOURCE_LOCATION (current_function_decl), + OPT_Wframe_malloc_threshold, + "using %<malloc%> for frame with size of %wu bytes", size); + + /* <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations> + (2022-12-21, v12.0) states that in addition to the "in-kernel + 'malloc()' function" there also exists an "in-kernel + '__nv_aligned_device_malloc()' function", where "the address of the + allocated memory will be a multiple of 'align'". However that's not + documented on + <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls> + (2022-12-21, v12.0), so we shall not use that function. */ + /* <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls> + (2022-12-21, v12.0) does not, but + <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations> + (2022-12-21, v12.0) does state that the pointer returned by + "in-kernel 'malloc()' [...] is guaranteed to be aligned to a + 16-byte boundary". */ + if (align > 16) + sorry ("unfulfilled %d bytes alignment for frame", align); + + /* We don't need to support 'realloc', so instead of newlib 'malloc' + directly use the PTX 'malloc'. 
*/ + fprintf (file, + "\t{\n" + "\t .param .u64 %%ptr;\n" + "\t .param .u64 %%size;\n" + "\t st.param.u64 [%%size], " HOST_WIDE_INT_PRINT_DEC ";\n" + "\t call (%%ptr), malloc, (%%size);\n" + "\t ld.param.u64 %s, [%%ptr];\n" + "\t}\n", + size, reg_names[regno]); + cfun->machine->has_malloc_frame = true; + need_free_malloc_decl = true; + } + else + { + if (size) + fprintf (file, "\t.local .align %d .b8 %s_ar[" HOST_WIDE_INT_PRINT_DEC "];\n", + align, reg_names[regno], size); + fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n" + : "\tmov.u%d %s, 0;\n"), + POINTER_SIZE, reg_names[regno], reg_names[regno]); + } } /* Emit soft stack frame setup sequence. */ @@ -1744,12 +1803,22 @@ nvptx_output_set_softstack (unsigned src_regno) } return ""; } + /* Output a return instruction. Also copy the return value to its outgoing location. */ const char * nvptx_output_return (void) { + if (cfun->machine->has_malloc_frame) + fprintf (asm_out_file, + "\t{\n" + "\t .param .u64 %%ptr;\n" + "\t st.param.u64 [%%ptr], %s;\n" + "\t call free, (%%ptr);\n" + "\t}\n", + reg_names[FRAME_POINTER_REGNUM]); + machine_mode mode = (machine_mode)cfun->machine->return_mode; if (mode != VOIDmode) @@ -4470,8 +4539,8 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn, rtx_code_label *label = NULL; empty = false; - /* The frame size might not be DImode compatible, but the frame - array's declaration will be. So it's ok to round up here. */ + /* The frame size might not be DImode-compatible, but the actual frame + allocated by 'init_frame' will be. So it's ok to round up here. */ fs = (fs + GET_MODE_SIZE (DImode) - 1) / GET_MODE_SIZE (DImode); /* Detect single iteration loop. 
*/ if (fs == 1) @@ -5989,6 +6058,21 @@ write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size) static void nvptx_file_end (void) { + if (need_free_malloc_decl) + { + if (!have_free_decl) + { + write_fn_marker (func_decls, false, true, "free"); + func_decls << ".extern .func free (.param .b64 %ptr);\n"; + } + if (!have_malloc_decl) + { + write_fn_marker (func_decls, false, true, "malloc"); + func_decls + << ".extern .func (.param .b64 %ptr) malloc (.param .b64 %size);\n"; + } + } + hash_table<tree_hasher>::iterator iter; tree decl; FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter) diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index bc1021a80317..82d695551090 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -214,6 +214,8 @@ struct nvptx_args { #define TRAMPOLINE_SIZE 32 #define TRAMPOLINE_ALIGNMENT 256 + +#define NVPTX_FRAME_MALLOC_THRESHOLD_INIT 257 \f /* We don't run reload, so this isn't actually used, but it still needs to be defined. Showing an argp->fp elimination also stops @@ -244,6 +246,7 @@ struct GTY(()) machine_function bool is_varadic; /* This call is varadic */ bool has_varadic; /* Current function has a varadic call. */ bool has_chain; /* Current function has outgoing static chain. */ + bool has_malloc_frame; bool has_softstack; /* Current function has a soft stack frame. */ bool has_simtreg; /* Current function has an OpenMP SIMD region. */ int num_args; /* Number of args of current call. */ diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt index 71d3b68510bd..6ccd3defc776 100644 --- a/gcc/config/nvptx/nvptx.opt +++ b/gcc/config/nvptx/nvptx.opt @@ -28,6 +28,18 @@ Target RejectNegative Mask(ABI64) Ignored, but preserved for backward compatibility. Only 64-bit ABI is supported. 
+mframe-malloc-threshold= +Target Joined RejectNegative Host_Wide_Int ByteSize Var(nvptx_frame_malloc_threshold) Init(NVPTX_FRAME_MALLOC_THRESHOLD_INIT) +-mframe-malloc-threshold=<byte-size> When the frame size exceeds <byte-size>, frame allocation switches from '.local' memory to 'malloc'. + +mno-frame-malloc-threshold +Target Alias(mframe-malloc-threshold=,18446744073709551615EiB,none) +Always use '.local' memory for frame allocation. Equivalent to -mframe-malloc-threshold=<SIZE_MAX> or larger. + +Wframe-malloc-threshold +Target Warning +Warn when the threshold is reached where frame allocation switches from '.local' memory to 'malloc'. + mmainkernel Target RejectNegative Link in code for a __main kernel. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 471309dfacfe..e3b6ea0fe4b8 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -1179,7 +1179,9 @@ Objective-C and Objective-C++ Dialects}. -march=@var{arch} -mbmx -mno-bmx -mcdx -mno-cdx} @emph{Nvidia PTX Options} -@gccoptlist{-m64 -mmainkernel -moptimize} +@gccoptlist{-m64 @gol +-mframe-malloc-threshold=@var{byte-size} @gol +-mmainkernel -moptimize} @emph{OpenRISC Options} @gccoptlist{-mboard=@var{name} -mnewlib -mhard-mul -mhard-div @gol @@ -28367,6 +28369,18 @@ This option sets the values of the preprocessor macros for instance, for @samp{3.1} the macros have the values @samp{3} and @samp{1}, respectively. +@item -mframe-malloc-threshold=@var{byte-size} +@opindex mframe-malloc-threshold= +@opindex mno-frame-malloc-threshold +TODO + +This is not relevant if @code{-msoft-stack} is enabled. + +@option{-mframe-malloc-threshold=TODO} is enabled by default. +This may be disabled either by specifying +@var{byte-size} of @samp{SIZE_MAX} or more or by +@option{-mno-frame-malloc-threshold}. + @item -mmainkernel @opindex mmainkernel Link in code for a __main kernel. 
This is for stand-alone instead of diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c new file mode 100644 index 000000000000..b16c17bfdf99 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c @@ -0,0 +1,29 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +/* PTX-provided 'free', 'malloc'; cf. 'nvptx_name_replacement'. */ +void ptx_free (void *) __asm__ ("free"); +void *ptx_malloc (__SIZE_TYPE__) __asm__ ("malloc"); + +int f (void) +/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */ +{ + char a[1234]; + + ptx_malloc (5); + + ptx_free (ptx_malloc (1)); +} + +/* We exceed the default '-mframe-malloc-threshold=[...]'. + { dg-final { scan-assembler-not {%frame_ar} } } + { dg-final { scan-assembler-times {(?n)call free,.*;} 2 } } + { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 3 } } +*/ + +/* Of the implicit (via 'need_free_malloc_decl') and explicit declarations of + 'free', 'malloc', only one is emitted each: + { dg-final { scan-assembler-times {(?n)\.extern .* free .*;} 1 } } + { dg-final { scan-assembler-times {(?n)\.extern .* malloc .*;} 1 } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c new file mode 100644 index 000000000000..2f6a919eb1f1 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c @@ -0,0 +1,13 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ + +int f (void) +{ + char a[1234]; +} + +/* We exceed the default '-mframe-malloc-threshold=[...]'. 
+ { dg-final { scan-assembler-not {%frame_ar} } } + { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } } + { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c new file mode 100644 index 000000000000..7434132b2ad5 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c @@ -0,0 +1,14 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[256]; +} + +/* We don't exceed the default '-mframe-malloc-threshold=[...]'. + { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c new file mode 100644 index 000000000000..c4068ab7ad23 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c @@ -0,0 +1,16 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mframe-malloc-threshold=32 } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */ +{ + char a[32]; +} + +/* We exceed the specified '-mframe-malloc-threshold=[...]'. 
+ { dg-final { scan-assembler-not {%frame_ar} } } + { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } } + { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c new file mode 100644 index 000000000000..cc262427b03c --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c @@ -0,0 +1,15 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mframe-malloc-threshold=1249 } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[1234]; +} + +/* We don't exceed the specified '-mframe-malloc-threshold=[...]'. +/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c new file mode 100644 index 000000000000..72017ca2f439 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c @@ -0,0 +1,15 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mframe-malloc-threshold=2KiB } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[1234]; +} + +/* We don't exceed the specified '-mframe-malloc-threshold=[...]'. 
+/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c new file mode 100644 index 000000000000..b2f85a55f050 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c @@ -0,0 +1,15 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mno-frame-malloc-threshold } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[1234]; +} + +/* We'll never exceed the specified unlimited '-mframe-malloc-threshold=[...]'. +/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ -- 2.35.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
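The decision that the patched 'init_frame' makes can be summarized in a small standalone sketch. This is a simplified model under stated assumptions, not the back end code: 'choose_frame_storage', the 'frame_storage' enum and the two macros are hypothetical names, the check that the register is the frame pointer is left out, and 'FRAME_UNSUPPORTED' stands in for the 'sorry' diagnostic (after which the real code carries on). Note that the sketch follows the comparison in the patch, which is "size >= threshold", whereas the option's help text says the switch happens when the frame size "exceeds" the threshold.

```c
#include <stdint.h>

/* Default taken from NVPTX_FRAME_MALLOC_THRESHOLD_INIT in the patch.  */
#define FRAME_MALLOC_THRESHOLD_DEFAULT 257u

/* Alignment that the in-kernel 'malloc' guarantees, according to the
   CUDA programming guide passage cited in the patch.  */
#define PTX_MALLOC_ALIGN 16

enum frame_storage
{
  FRAME_LOCAL,       /* '.local' array, as before the patch.  */
  FRAME_MALLOC,      /* Frame obtained from PTX 'malloc'.  */
  FRAME_UNSUPPORTED  /* Would trigger the patch's 'sorry' diagnostic.  */
};

/* Simplified model of the frame-pointer case in 'init_frame'.  */
static enum frame_storage
choose_frame_storage (uint64_t size, uint64_t threshold, int align)
{
  if (size < threshold)
    return FRAME_LOCAL;
  if (align > PTX_MALLOC_ALIGN)
    return FRAME_UNSUPPORTED;
  return FRAME_MALLOC;
}
```

With the default threshold, a 256-byte frame stays in '.local' memory and a 1234-byte frame goes to 'malloc', which matches what test cases 3 and 2 scan for. A threshold of 32 with a 32-byte frame already selects 'malloc', as in test case 4.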
* Re: nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) 2022-12-23 14:08 ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge @ 2022-12-23 21:23 ` Jerry D 2023-01-11 12:06 ` [PING] " Thomas Schwinge 1 sibling, 0 replies; 8+ messages in thread From: Jerry D @ 2022-12-23 21:23 UTC (permalink / raw) To: Thomas Schwinge, Richard Biener, Tom de Vries, gcc-patches Cc: Janne Blomqvist, fortran, Alexander Monakov On 12/23/22 6:08 AM, Thomas Schwinge wrote: > Hi! > > On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote: >> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote: >>> For example, for Fortran code like: >>> >>> write (*,*) "Hello world" >>> >>> ..., 'gfortran' creates: >>> >>> struct __st_parameter_dt dt_parm.0; >>> >>> try >>> { >>> dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1}; >>> dt_parm.0.common.line = 29; >>> dt_parm.0.common.flags = 128; >>> dt_parm.0.common.unit = 6; >>> _gfortran_st_write (&dt_parm.0); >>> _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11); >>> _gfortran_st_write_done (&dt_parm.0); >>> } >>> finally >>> { >>> dt_parm.0 = {CLOBBER(eol)}; >>> } >>> >>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes, >>> really! -- there's a lot of state in Fortran I/O apparently). That's a >>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you >>> have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread; >>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack' >>> "Use custom stacks instead of local memory for automatic storage".) 
>>> >>> Now, the Nvidia Driver tries to accommodate such largish stack usage, >>> and dynamically increases the per-thread stack as necessary (thereby >>> potentially reducing parallelism) -- if it manages to understand the call >>> graph. In case of libgfortran I/O, it evidently doesn't. Not being able >>> to disprove existence of recursion is the common problem, as I've read. >>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example: >>> >>> warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined >>> >>> That's still not an actual problem: if the GPU kernel's stack usage still >>> fits into 1 KiB. Very often it does, but if, as happens in libgfortran >>> I/O handling, there is another such 'dt_parm' put onto the stack, the >>> stack then overflows; device-side SIGSEGV. >>> >>> (There is, by the way, some similar analysis by Tom de Vries in >>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite] >>> Recursive tests may fail due to thread stack limit".) >>> >>> Of course, you shouldn't really be doing I/O in GPU kernels, but people >>> do like their occasional "'printf' debugging", so we ought to make that >>> work (... without pessimizing any "normal" code). >>> >>> I assume that generally reducing the size of 'dt_parm' etc. is out of >>> scope. There are so many wiggles, turns, corner cases, and similar nightmares in I/O that I would advise against trying to reduce 'dt_parm'. It could probably be done. For debugging GPU code, would it not be better to have a way to signal back to a main thread and do the print from there, like some sort of callback in the user's code under test? Put another way, I would recommend that users debug with some method other than embedding print statements, rather than our doing a ton of work to enable something that is not really a legitimate use case. FWIW, Jerry
* [PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) 2022-12-23 14:08 ` nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?) Thomas Schwinge 2022-12-23 21:23 ` Jerry D @ 2023-01-11 12:06 ` Thomas Schwinge 2023-01-12 2:46 ` Jerry D 1 sibling, 1 reply; 8+ messages in thread From: Thomas Schwinge @ 2023-01-11 12:06 UTC (permalink / raw) To: Richard Biener, Tom de Vries, gcc-patches Cc: Janne Blomqvist, fortran, Alexander Monakov [-- Attachment #1: Type: text/plain, Size: 8377 bytes --] Hi! Ping -- the '-mframe-malloc-threshold' idea, at least. Note that while this issue originally did pop up for Fortran I/O, it's likewise relevant for other functions that maintain big frames, for example in newlib: libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064]; libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064]; libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064]; libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560]; Therefore a generic solution (or, workaround if you'd like) does seem appropriate. Grüße Thomas On 2022-12-23T15:08:06+0100, I wrote: > Hi! 
> > On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote: >> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote: >>> For example, for Fortran code like: >>> >>> write (*,*) "Hello world" >>> >>> ..., 'gfortran' creates: >>> >>> struct __st_parameter_dt dt_parm.0; >>> >>> try >>> { >>> dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1}; >>> dt_parm.0.common.line = 29; >>> dt_parm.0.common.flags = 128; >>> dt_parm.0.common.unit = 6; >>> _gfortran_st_write (&dt_parm.0); >>> _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11); >>> _gfortran_st_write_done (&dt_parm.0); >>> } >>> finally >>> { >>> dt_parm.0 = {CLOBBER(eol)}; >>> } >>> >>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes, >>> really! -- there's a lot of state in Fortran I/O apparently). That's a >>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you >>> have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread; >>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack' >>> "Use custom stacks instead of local memory for automatic storage".) >>> >>> Now, the Nvidia Driver tries to accomodate for such largish stack usage, >>> and dynamically increases the per-thread stack as necessary (thereby >>> potentially reducing parallelism) -- if it manages to understand the call >>> graph. In case of libgfortran I/O, it evidently doesn't. Not being able >>> to disprove existance of recursion is the common problem, as I've read. >>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example: >>> >>> warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined >>> >>> That's still not an actual problem: if the GPU kernel's stack usage still >>> fits into 1 KiB. 
Very often it does, but if, as happens in libgfortran >>> I/O handling, there is another such 'dt_parm' put onto the stack, the >>> stack then overflows; device-side SIGSEGV. >>> >>> (There is, by the way, some similar analysis by Tom de Vries in >>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite] >>> Recursive tests may fail due to thread stack limit".) >>> >>> Of course, you shouldn't really be doing I/O in GPU kernels, but people >>> do like their occasional "'printf' debugging", so we ought to make that >>> work (... without pessimizing any "normal" code). >>> >>> I assume that generally reducing the size of 'dt_parm' etc. is out of >>> scope. >>> >>> There is a way to manually set a per-thread stack size, but it's not >>> obvious which size to set: that sizes needs to work for the whole GPU >>> kernel, and should be as low as possible (to maximize parallelism). >>> I assume that even if GCC did an accurate call graph analysis of the GPU >>> kernel's maximum stack usage, that still wouldn't help: that's before the >>> PTX JIT does its own code transformations, including stack spilling. >>> >>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization >>> (-dlto) for device code". This might help, assuming that it manages to >>> simplify the libgfortran I/O code such that the PTX JIT then understands >>> the call graph. But: that's available only starting with recent >>> CUDA 11.4, so not a general solution -- if it works at all, which I've >>> not tested. >>> >>> Similarly, we could enable GCC's LTO for device code generation -- but >>> that's a big project, out of scope at this time. And again, we don't >>> know if that at all helps this case. >>> >>> I see a few options: >>> >>> (a) Figure out what it is in the libgfortran I/O implementation that >>> causes "Stack size [...] cannot be statically determined", and re-work >>> that code to avoid that, or even disable certain things for nvptx, if >>> feasible. 
>
>> Shrink st_parameter_dt (it's part of the ABI though, kind of). Lots of the
>> bloat is from things that are unused for simpler I/O cases (so some
>> "inheritance" could help), and lots of the bloat is from using
>> string/length pairs using char * + size_t for what looks like it could be
>> encoded a lot more efficiently.
>>
>> There's probably not much low-hanging fruit.
>
> (Similar comments in Janne's email.)
>
>
> Well, as was to be expected, libgfortran I/O is really just one example,
> but the underlying problem may also be triggered in other ways (via other
> newlib/libc functions, for example).
>
> So, really a generic solution seems to be called for.
>
>>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>>> I don't really want to do that however: it does introduce a bit of
>>> complexity in all the generated device code and run-time overhead that we
>>> generally would like to avoid.
>
> Directly using '-msoft-stack' isn't actually possible: it does implement
> "one stack per 32-threads warp", but for OpenACC we need "one stack per
> thread of a warp" (that is, each OpenACC 'vector' independently), and
> pre-allocating all those stacks from device memory (which may be a lot!)
> would, I expect, really hurt overall performance.
>
>>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>>> stack objects into heap allocation (during nvptx offloading compilation).
>>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>>> code paths this is to affect. (Might also add some compile-time
>>> diagnostic, of course.) Could maybe even limit this to only be used
>>> during libgfortran compilation? This is then conceptually a bit similar
>>> to (b), but localized to relevant parts only. Has such a thing been done
>>> before in GCC, that I could build upon?
>>>
>>> Any other clever ideas?
>
>> Converting to heap allocation is difficult outside of the frontend and you
>> have to be very careful with memleaks.
>
> Heh, in fact it seems to be pretty simple! (Famous last words?) See
> "[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> attached. What do people think about such a thing?
>
> Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
> '-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
> default value for '-mframe-malloc-threshold=[...]' (potentially different
> for GCC/nvptx target libraries build vs. user-compiled code?), etc.
>
>
>> The library is written in C and
>> I see heap-allocated temporaries there but in at least one
>> place a stack one is used:
>>
>> void
>> st_endfile (st_parameter_filepos *fpp)
>> {
>> ...
>>   if (u->current_record)
>>     {
>>       st_parameter_dt dtp;
>>       dtp.common = fpp->common;
>>       memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>>       dtp.u.p.current_unit = u;
>>       next_record (&dtp, 1);
>>
>> that might be a mistake though - maybe it's enough to change that
>> to a heap allocation? It might also be totally superfluous since
>> only 'u' should matter here ... (not sure if the above is the case
>> you are running into).
>
> (Have not yet looked into that; won't solve the general issue.)
> > > Grüße > Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-WIP-nvptx-mframe-malloc-threshold-Wframe-malloc-thre.patch --] [-- Type: text/x-diff, Size: 16585 bytes --] From 3f5524adacff23710cf1cab393a56bf23853cafa Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <thomas@codesourcery.com> Date: Wed, 21 Dec 2022 21:25:19 +0100 Subject: [PATCH] [WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' --- gcc/config/nvptx/nvptx.cc | 102 ++++++++++++++++-- gcc/config/nvptx/nvptx.h | 3 + gcc/config/nvptx/nvptx.opt | 12 +++ gcc/doc/invoke.texi | 16 ++- .../nvptx/frame-malloc-threshold-1.c | 29 +++++ .../nvptx/frame-malloc-threshold-2.c | 13 +++ .../nvptx/frame-malloc-threshold-3.c | 14 +++ .../nvptx/frame-malloc-threshold-4.c | 16 +++ .../nvptx/frame-malloc-threshold-5.c | 15 +++ .../nvptx/frame-malloc-threshold-6.c | 15 +++ .../nvptx/frame-malloc-threshold-7.c | 15 +++ 11 files changed, 240 insertions(+), 10 deletions(-) create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc index b93a253ab318..2efd70595991 100644 --- a/gcc/config/nvptx/nvptx.cc +++ b/gcc/config/nvptx/nvptx.cc @@ -178,6 +178,16 @@ static 
hash_map<tree_decl_hash, unsigned int> gang_private_shared_hmap; /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; +/* True if any function 'has_malloc_frame'. + Because of 'nvptx_name_replacement', we can't just: + nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_FREE)); + nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_MALLOC)); + ..., but instead have to track them individually. +*/ +static bool need_free_malloc_decl; +static bool have_free_decl; +static bool have_malloc_decl; + /* True if any function references __nvptx_stacks. */ static bool need_softstack_decl; static bool have_softstack_decl; @@ -976,6 +986,11 @@ write_fn_marker (std::stringstream &s, bool is_defn, bool globalize, s << " GLOBAL"; s << " FUNCTION " << (is_defn ? "DEF: " : "DECL: "); s << name << "\n"; + + if (strcmp (name, "free") == 0) + have_free_decl = true; + else if (strcmp (name, "malloc") == 0) + have_malloc_decl = true; } /* Emit a linker marker for a variable decl or defn. */ @@ -1231,22 +1246,66 @@ nvptx_maybe_record_fnsym (rtx sym) nvptx_record_needed_fndecl (decl); } +//TODO /* Emit a local array to hold some part of a conventional stack frame and initialize REGNO to point to it. If the size is zero, it'll never be valid to dereference, so we can simply initialize to zero. */ static void -init_frame (FILE *file, int regno, unsigned align, unsigned size) +init_frame (FILE *file, int regno, int align, HOST_WIDE_INT size) { - if (size) - fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n", - align, reg_names[regno], size); fprintf (file, "\t.reg.u%d %s;\n", POINTER_SIZE, reg_names[regno]); - fprintf (file, (size ? 
"\tcvta.local.u%d %s, %s_ar;\n" - : "\tmov.u%d %s, 0;\n"), - POINTER_SIZE, reg_names[regno], reg_names[regno]); + + if (regno == FRAME_POINTER_REGNUM + && ((unsigned HOST_WIDE_INT) size + >= (unsigned HOST_WIDE_INT) nvptx_frame_malloc_threshold)) + { + warning_at (DECL_SOURCE_LOCATION (current_function_decl), + OPT_Wframe_malloc_threshold, + "using %<malloc%> for frame with size of %wu bytes", size); + + /* <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations> + (2022-12-21, v12.0) states that in addition to the "in-kernel + 'malloc()' function" there also exists an "in-kernel + '__nv_aligned_device_malloc()' function", where "the address of the + allocated memory will be a multiple of 'align'". However that's not + documented on + <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls> + (2022-12-21, v12.0), so we shall not use that function. */ + /* <https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/#system-calls> + (2022-12-21, v12.0) does not, but + <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#dynamic-global-memory-allocation-and-operations> + (2022-12-21, v12.0) does state that the pointer returned by + "in-kernel 'malloc()' [...] is guaranteed to be aligned to a + 16-byte boundary". */ + if (align > 16) + sorry ("unfulfilled %d bytes alignment for frame", align); + + /* We don't need to support 'realloc', so instead of newlib 'malloc' + directly use the PTX 'malloc'. 
*/ + fprintf (file, + "\t{\n" + "\t .param .u64 %%ptr;\n" + "\t .param .u64 %%size;\n" + "\t st.param.u64 [%%size], " HOST_WIDE_INT_PRINT_DEC ";\n" + "\t call (%%ptr), malloc, (%%size);\n" + "\t ld.param.u64 %s, [%%ptr];\n" + "\t}\n", + size, reg_names[regno]); + cfun->machine->has_malloc_frame = true; + need_free_malloc_decl = true; + } + else + { + if (size) + fprintf (file, "\t.local .align %d .b8 %s_ar[" HOST_WIDE_INT_PRINT_DEC "];\n", + align, reg_names[regno], size); + fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n" + : "\tmov.u%d %s, 0;\n"), + POINTER_SIZE, reg_names[regno], reg_names[regno]); + } } /* Emit soft stack frame setup sequence. */ @@ -1744,12 +1803,22 @@ nvptx_output_set_softstack (unsigned src_regno) } return ""; } + /* Output a return instruction. Also copy the return value to its outgoing location. */ const char * nvptx_output_return (void) { + if (cfun->machine->has_malloc_frame) + fprintf (asm_out_file, + "\t{\n" + "\t .param .u64 %%ptr;\n" + "\t st.param.u64 [%%ptr], %s;\n" + "\t call free, (%%ptr);\n" + "\t}\n", + reg_names[FRAME_POINTER_REGNUM]); + machine_mode mode = (machine_mode)cfun->machine->return_mode; if (mode != VOIDmode) @@ -4470,8 +4539,8 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn, rtx_code_label *label = NULL; empty = false; - /* The frame size might not be DImode compatible, but the frame - array's declaration will be. So it's ok to round up here. */ + /* The frame size might not be DImode-compatible, but the actual frame + allocated by 'init_frame' will be. So it's ok to round up here. */ fs = (fs + GET_MODE_SIZE (DImode) - 1) / GET_MODE_SIZE (DImode); /* Detect single iteration loop. 
*/ if (fs == 1) @@ -5989,6 +6058,21 @@ write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size) static void nvptx_file_end (void) { + if (need_free_malloc_decl) + { + if (!have_free_decl) + { + write_fn_marker (func_decls, false, true, "free"); + func_decls << ".extern .func free (.param .b64 %ptr);\n"; + } + if (!have_malloc_decl) + { + write_fn_marker (func_decls, false, true, "malloc"); + func_decls + << ".extern .func (.param .b64 %ptr) malloc (.param .b64 %size);\n"; + } + } + hash_table<tree_hasher>::iterator iter; tree decl; FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter) diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index bc1021a80317..82d695551090 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -214,6 +214,8 @@ struct nvptx_args { #define TRAMPOLINE_SIZE 32 #define TRAMPOLINE_ALIGNMENT 256 + +#define NVPTX_FRAME_MALLOC_THRESHOLD_INIT 257 \f /* We don't run reload, so this isn't actually used, but it still needs to be defined. Showing an argp->fp elimination also stops @@ -244,6 +246,7 @@ struct GTY(()) machine_function bool is_varadic; /* This call is varadic */ bool has_varadic; /* Current function has a varadic call. */ bool has_chain; /* Current function has outgoing static chain. */ + bool has_malloc_frame; bool has_softstack; /* Current function has a soft stack frame. */ bool has_simtreg; /* Current function has an OpenMP SIMD region. */ int num_args; /* Number of args of current call. */ diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt index 71d3b68510bd..6ccd3defc776 100644 --- a/gcc/config/nvptx/nvptx.opt +++ b/gcc/config/nvptx/nvptx.opt @@ -28,6 +28,18 @@ Target RejectNegative Mask(ABI64) Ignored, but preserved for backward compatibility. Only 64-bit ABI is supported. 
+mframe-malloc-threshold= +Target Joined RejectNegative Host_Wide_Int ByteSize Var(nvptx_frame_malloc_threshold) Init(NVPTX_FRAME_MALLOC_THRESHOLD_INIT) +-mframe-malloc-threshold=<byte-size> When the frame size exceeds <byte-size>, frame allocation switches from '.local' memory to 'malloc'. + +mno-frame-malloc-threshold +Target Alias(mframe-malloc-threshold=,18446744073709551615EiB,none) +Always use '.local' memory for frame allocation. Equivalent to -mframe-malloc-threshold=<SIZE_MAX> or larger. + +Wframe-malloc-threshold +Target Warning +Warn when the threshold is reached where frame allocation switches from '.local' memory to 'malloc'. + mmainkernel Target RejectNegative Link in code for a __main kernel. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 471309dfacfe..e3b6ea0fe4b8 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -1179,7 +1179,9 @@ Objective-C and Objective-C++ Dialects}. -march=@var{arch} -mbmx -mno-bmx -mcdx -mno-cdx} @emph{Nvidia PTX Options} -@gccoptlist{-m64 -mmainkernel -moptimize} +@gccoptlist{-m64 @gol +-mframe-malloc-threshold=@var{byte-size} @gol +-mmainkernel -moptimize} @emph{OpenRISC Options} @gccoptlist{-mboard=@var{name} -mnewlib -mhard-mul -mhard-div @gol @@ -28367,6 +28369,18 @@ This option sets the values of the preprocessor macros for instance, for @samp{3.1} the macros have the values @samp{3} and @samp{1}, respectively. +@item -mframe-malloc-threshold=@var{byte-size} +@opindex mframe-malloc-threshold= +@opindex mno-frame-malloc-threshold +TODO + +This is not relevant if @code{-msoft-stack} is enabled. + +@option{-mframe-malloc-threshold=TODO} is enabled by default. +This may be disabled either by specifying +@var{byte-size} of @samp{SIZE_MAX} or more or by +@option{-mno-frame-malloc-threshold}. + @item -mmainkernel @opindex mmainkernel Link in code for a __main kernel. 
This is for stand-alone instead of diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c new file mode 100644 index 000000000000..b16c17bfdf99 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c @@ -0,0 +1,29 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +/* PTX-provided 'free', 'malloc'; cf. 'nvptx_name_replacement'. */ +void ptx_free (void *) __asm__ ("free"); +void *ptx_malloc (__SIZE_TYPE__) __asm__ ("malloc"); + +int f (void) +/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */ +{ + char a[1234]; + + ptx_malloc (5); + + ptx_free (ptx_malloc (1)); +} + +/* We exceed the default '-mframe-malloc-threshold=[...]'. + { dg-final { scan-assembler-not {%frame_ar} } } + { dg-final { scan-assembler-times {(?n)call free,.*;} 2 } } + { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 3 } } +*/ + +/* Of the implicit (via 'need_free_malloc_decl') and explicit declarations of + 'free', 'malloc', only one is emitted each: + { dg-final { scan-assembler-times {(?n)\.extern .* free .*;} 1 } } + { dg-final { scan-assembler-times {(?n)\.extern .* malloc .*;} 1 } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c new file mode 100644 index 000000000000..2f6a919eb1f1 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c @@ -0,0 +1,13 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ + +int f (void) +{ + char a[1234]; +} + +/* We exceed the default '-mframe-malloc-threshold=[...]'. 
+ { dg-final { scan-assembler-not {%frame_ar} } } + { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } } + { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c new file mode 100644 index 000000000000..7434132b2ad5 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c @@ -0,0 +1,14 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[256]; +} + +/* We don't exceed the default '-mframe-malloc-threshold=[...]'. + { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c new file mode 100644 index 000000000000..c4068ab7ad23 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c @@ -0,0 +1,16 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mframe-malloc-threshold=32 } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */ +{ + char a[32]; +} + +/* We exceed the specified '-mframe-malloc-threshold=[...]'. 
+ { dg-final { scan-assembler-not {%frame_ar} } } + { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } } + { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c new file mode 100644 index 000000000000..cc262427b03c --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c @@ -0,0 +1,15 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mframe-malloc-threshold=1249 } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[1234]; +} + +/* We don't exceed the specified '-mframe-malloc-threshold=[...]'. +/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c new file mode 100644 index 000000000000..72017ca2f439 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c @@ -0,0 +1,15 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mframe-malloc-threshold=2KiB } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[1234]; +} + +/* We don't exceed the specified '-mframe-malloc-threshold=[...]'. 
+/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c new file mode 100644 index 000000000000..b2f85a55f050 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c @@ -0,0 +1,15 @@ +/* { dg-do assemble } */ +/* { dg-options {-save-temps -O0} } */ +/* { dg-additional-options -mno-frame-malloc-threshold } */ +/* { dg-additional-options -Wframe-malloc-threshold } */ + +int f (void) +{ + char a[1234]; +} + +/* We'll never exceed the specified unlimited '-mframe-malloc-threshold=[...]'. +/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } } + { dg-final { scan-assembler-not {free} } } + { dg-final { scan-assembler-not {malloc} } } +*/ -- 2.35.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
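For readers skimming the patch, the decision it makes in 'init_frame' can be sketched in plain C. This is only an illustration under assumptions, not code from the patch: the names 'frame_storage', 'choose_frame_storage', and 'DEFAULT_THRESHOLD' are made up here. The '>=' comparison against the threshold, the default of 257, and the 16-byte alignment guarantee of the in-kernel 'malloc' are taken from the patch above. The patch reports 'sorry' for stricter alignments and still emits the 'malloc' call; the sketch simplifies that into a separate result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of the patch.  */
enum frame_storage
{
  FRAME_LOCAL,       /* '.local' array, as before the patch.  */
  FRAME_MALLOC,      /* 'malloc' at function entry, 'free' before return.  */
  FRAME_UNSUPPORTED  /* Would need 'malloc', but alignment exceeds 16.  */
};

/* Mirrors 'NVPTX_FRAME_MALLOC_THRESHOLD_INIT' from the patch.  */
#define DEFAULT_THRESHOLD 257

enum frame_storage
choose_frame_storage (size_t frame_size, size_t align, size_t threshold)
{
  assert (align > 0);

  /* The patch compares with '>=', so a frame of exactly THRESHOLD bytes
     already goes to 'malloc'.  */
  if (frame_size < threshold)
    return FRAME_LOCAL;

  /* The in-kernel 'malloc' only guarantees 16-byte alignment; the patch
     reports 'sorry' for anything stricter (simplified here).  */
  if (align > 16)
    return FRAME_UNSUPPORTED;

  return FRAME_MALLOC;
}
```

With the default threshold, a 256-byte frame as in 'frame-malloc-threshold-3.c' stays in '.local' memory, while the 'char a[1234]' frame of 'frame-malloc-threshold-2.c' is moved to 'malloc'.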
* Re: [PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
From: Jerry D @ 2023-01-12 2:46 UTC
To: Thomas Schwinge, Richard Biener, Tom de Vries, gcc-patches
Cc: Janne Blomqvist, fortran, Alexander Monakov

On 1/11/23 4:06 AM, Thomas Schwinge wrote:
> Hi!
>
> Ping -- the '-mframe-malloc-threshold' idea, at least.
>
> Note that while this issue originally did pop up for Fortran I/O, it's
> likewise relevant for other functions that maintain big frames, for
> example in newlib:
>
>     libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
>     libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
>     libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
>     libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];
>
> Therefore a generic solution (or, workaround if you'd like) does seem
> appropriate.
>

--- snip ---

As a gfortranner I have to at least say anyone doing Fortran I/O on a GPU
is nuts.

With that said, a configurable option to address the broader issue makes
sense.

Perhaps the default threshold should be whatever it is now and if someone
has a real situation where it is needed, they can adjust.

Regards,

Jerry
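The newlib listing quoted in the message above can be checked mechanically. The following sketch is not part of any patch in this thread: it parses one '.local ... %frame_ar[N];' line as printed there and returns N. The function name 'parse_frame_size' is hypothetical, and the format string assumes exactly the spelling shown in the listing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only.  Returns the frame size in
   bytes, or -1 if LINE does not contain a complete '%frame_ar'
   declaration.  */
long
parse_frame_size (const char *line)
{
  int align = 0;
  long size = 0;
  int consumed = 0;
  const char *decl;

  assert (line != NULL);

  /* Skip a leading 'file.o:' prefix as printed by 'grep'.  */
  decl = strstr (line, ".local");
  if (decl == NULL)
    return -1;

  /* '%%' matches a literal '%'.  '%n' is only reached if the closing "];"
     matched, so CONSUMED stays 0 for a truncated declaration.  */
  if (sscanf (decl, ".local .align %d .b8 %%frame_ar[%ld];%n",
              &align, &size, &consumed) != 2
      || consumed == 0)
    return -1;

  return size;
}
```

Applied to the four lines of the listing, this yields 2064, 2064, 2064, and 560 bytes, all of which are above the default threshold of 257 proposed in the WIP patch.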
* nvptx, libgcc: Stub unwinding implementation
From: Thomas Schwinge @ 2023-01-20 21:04 UTC
To: gcc-patches
Cc: fortran, Tom de Vries, Andrew Stubbs

Hi!

We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
configuration of libgfortran.

One prerequisite patch, based on WIP work by Andrew Stubbs, is:
"nvptx, libgcc: Stub unwinding implementation", see attached. I've just
pushed this to the devel/omp/gcc-12 branch in commit
26d3146736218ccaaaafdaba4da1edf969bc190d, and would like to push it to the
master branch once other pending GCC patches have been accepted.


Grüße
 Thomas

[-- Attachment #2: 0001-nvptx-libgcc-Stub-unwinding-implementation.patch --]
[-- Type: text/x-diff, Size: 3312 bytes --]

From 26d3146736218ccaaaafdaba4da1edf969bc190d Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Sep 2022 18:58:34 +0200
Subject: [PATCH] nvptx, libgcc: Stub unwinding implementation

Adding stub '_Unwind_Backtrace', '_Unwind_GetIPInfo' functions is necessary
for linking libbacktrace, as a normal (non-'LIBGFOR_MINIMAL') configuration of
libgfortran wants to do, for example.

The file 'libgcc/config/nvptx/unwind-nvptx.c' is copied from
'libgcc/config/gcn/unwind-gcn.c'.

libgcc/ChangeLog: * config/nvptx/t-nvptx: Add unwind-nvptx.c. * config/nvptx/unwind-nvptx.c: New file. Co-authored-by: Andrew Stubbs <ams@codesourcery.com> --- libgcc/ChangeLog.omp | 6 +++++ libgcc/config/nvptx/t-nvptx | 3 ++- libgcc/config/nvptx/unwind-nvptx.c | 36 ++++++++++++++++++++++++++++++ 3 files changed, 44 insertions(+), 1 deletion(-) create mode 100644 libgcc/config/nvptx/unwind-nvptx.c diff --git a/libgcc/ChangeLog.omp b/libgcc/ChangeLog.omp index 2e7bf5cc029..c46f49bf5b7 100644 --- a/libgcc/ChangeLog.omp +++ b/libgcc/ChangeLog.omp @@ -1,3 +1,9 @@ +2023-01-20 Thomas Schwinge <thomas@codesourcery.com> + Andrew Stubbs <ams@codesourcery.com> + + * config/nvptx/t-nvptx: Add unwind-nvptx.c. + * config/nvptx/unwind-nvptx.c: New file. + 2023-01-20 Thomas Schwinge <thomas@codesourcery.com> * config/nvptx/crtstuff.c ["mgomp"] diff --git a/libgcc/config/nvptx/t-nvptx b/libgcc/config/nvptx/t-nvptx index 9a0454c3a4d..1845a38a35e 100644 --- a/libgcc/config/nvptx/t-nvptx +++ b/libgcc/config/nvptx/t-nvptx @@ -1,6 +1,7 @@ LIB2ADD=$(srcdir)/config/nvptx/reduction.c \ $(srcdir)/config/nvptx/mgomp.c \ - $(srcdir)/config/nvptx/atomic.c + $(srcdir)/config/nvptx/atomic.c \ + $(srcdir)/config/nvptx/unwind-nvptx.c LIB2ADDEH= LIB2FUNCS_EXCLUDE= diff --git a/libgcc/config/nvptx/unwind-nvptx.c b/libgcc/config/nvptx/unwind-nvptx.c new file mode 100644 index 00000000000..c657b2af6f3 --- /dev/null +++ b/libgcc/config/nvptx/unwind-nvptx.c @@ -0,0 +1,36 @@ +/* Stub unwinding implementation. + + Copyright (C) 2019-2023 Free Software Foundation, Inc. + + This file is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by the + Free Software Foundation; either version 3, or (at your option) any + later version. + + This file is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + General Public License for more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + <http://www.gnu.org/licenses/>. */ + +#include "unwind.h" + +_Unwind_Reason_Code +_Unwind_Backtrace(_Unwind_Trace_Fn trace, void * trace_argument) +{ + return 0; +} + +_Unwind_Ptr +_Unwind_GetIPInfo (struct _Unwind_Context *c, int *ip_before_insn) +{ + return 0; +} -- 2.25.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
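What the stub means for a caller can be sketched without the real headers. The following is a stand-alone illustration, not the libgcc code: the 'demo_' types and functions are stand-ins defined here (the real declarations live in 'unwind.h'), and only the behavior of returning 0 without ever invoking the callback is taken from the patch above. In 'unwind.h', 0 is the value of '_URC_NO_REASON'.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the 'unwind.h' declarations; hypothetical names.  */
typedef int demo_reason_code;   /* 0 plays the role of '_URC_NO_REASON'.  */
struct demo_context;            /* Opaque, never instantiated here.  */
typedef demo_reason_code (*demo_trace_fn) (struct demo_context *, void *);

/* Same shape as the stub '_Unwind_Backtrace': report success, visit no
   frames.  */
demo_reason_code
demo_backtrace (demo_trace_fn trace, void *trace_argument)
{
  (void) trace;
  (void) trace_argument;
  return 0;
}

/* Callback that counts the frames it is shown.  */
demo_reason_code
count_frame (struct demo_context *context, void *data)
{
  (void) context;
  ++*(size_t *) data;
  return 0;
}

/* A backtrace consumer in miniature: returns how many frames were
   visited.  */
size_t
count_frames (void)
{
  size_t n = 0;
  demo_reason_code rc = demo_backtrace (count_frame, &n);

  assert (rc == 0);
  return n;
}
```

A consumer written like 'count_frames' therefore links and runs against such a stub, but sees an empty backtrace.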
* nvptx, libgfortran: Switch out of "minimal" mode
From: Thomas Schwinge @ 2023-01-20 21:16 UTC
To: gcc-patches
Cc: fortran, Tom de Vries, Andrew Stubbs

Hi!

On 2023-01-20T22:04:02+0100, I wrote:
> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
> configuration of libgfortran.

This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
see attached, again based on WIP work by Andrew Stubbs. I've just pushed
this to the devel/omp/gcc-12 branch in commit
c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push it to the
master branch once other pending GCC patches have been accepted.

The OpenACC XFAILs: "[...] overflows the stack for nvptx offloading" are
unresolved at this point; see the discussion around "Handling of large
stack objects in GPU code generation -- maybe transform into heap
allocation?", and my "nvptx: '-mframe-malloc-threshold',
'-Wframe-malloc-threshold'" experiments. (The latter works to some
extent, but also has other issues that I shall detail at some later point
in time.)


Grüße
 Thomas

[-- Attachment #2: 0001-nvptx-libgfortran-Switch-out-of-minimal-mode.patch --]
[-- Type: text/x-diff, Size: 9644 bytes --]

From c7734c6fbb5513b4da6306de7bc85de9b8547988 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 21 Sep 2022 18:58:34 +0200
Subject: [PATCH] nvptx, libgfortran: Switch out of "minimal" mode

..., in order to enable (portions of) Fortran I/O, for example.

libgfortran/ChangeLog:

	* configure: Regenerate.
	* configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.

libgomp/ChangeLog:

	* testsuite/libgomp.fortran/target-print-1.f90: Adjust.
	* testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
	* testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
	* testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
	* testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
	* testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.

Co-authored-by: Andrew Stubbs <ams@codesourcery.com>
---
 libgfortran/ChangeLog.omp                       |  6 ++++++
 libgfortran/configure                           | 17 ++++++-----------
 libgfortran/configure.ac                        | 17 ++++++-----------
 libgomp/ChangeLog.omp                           |  7 +++++++
 .../libgomp.fortran/target-print-1-nvptx.f90    | 11 -----------
 .../libgomp.fortran/target-print-1.f90          |  3 ---
 .../libgomp.oacc-fortran/error_stop-2.f         |  4 +++-
 .../libgomp.oacc-fortran/print-1-nvptx.f90      | 11 -----------
 .../testsuite/libgomp.oacc-fortran/print-1.f90  |  5 ++---
 libgomp/testsuite/libgomp.oacc-fortran/stop-2.f |  4 +++-
 10 files changed, 33 insertions(+), 52 deletions(-)
 delete mode 100644 libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
 delete mode 100644 libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90

diff --git a/libgfortran/ChangeLog.omp b/libgfortran/ChangeLog.omp
index b08c264daf9..925575e65fa 100644
--- a/libgfortran/ChangeLog.omp
+++ b/libgfortran/ChangeLog.omp
@@ -1,3 +1,9 @@
+2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
+	    Andrew Stubbs  <ams@codesourcery.com>
+
+	* configure: Regenerate.
+	* configure.ac: No longer set LIBGFOR_MINIMAL for nvptx.
+
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
 	PR target/85463
diff --git a/libgfortran/configure b/libgfortran/configure
index ae64dca3114..3e5c931d4ad 100755
--- a/libgfortran/configure
+++ b/libgfortran/configure
@@ -6230,17 +6230,12 @@ else
 fi
 
 
-# For GPU offloading, not everything in libfortran can be supported.
-# Currently, the only target that has this problem is nvptx.  The
-# following is a (partial) list of features that are unsupportable on
-# this particular target:
-# * Constructors
-# * alloca
-# * C library support for I/O, with printf as the one notable exception
-# * C library support for other features such as signal, environment
-#   variables, time functions
-
- if test "x${target_cpu}" = xnvptx; then
+# "Minimal" mode is for targets that cannot (yet) support all features of
+# libgfortran.  It avoids the need for working constructors, alloca, and C
+# library support for I/O, signals, environment variables, time functions, etc.
+# At present there are no targets that require this mode.
+
+ if false; then
   LIBGFOR_MINIMAL_TRUE=
   LIBGFOR_MINIMAL_FALSE='#'
 else
diff --git a/libgfortran/configure.ac b/libgfortran/configure.ac
index 97cc490cb5e..e5552949cc6 100644
--- a/libgfortran/configure.ac
+++ b/libgfortran/configure.ac
@@ -222,17 +222,12 @@ AM_CONDITIONAL(LIBGFOR_USE_SYMVER, [test "x$gfortran_use_symver" != xno])
 AM_CONDITIONAL(LIBGFOR_USE_SYMVER_GNU, [test "x$gfortran_use_symver" = xgnu])
 AM_CONDITIONAL(LIBGFOR_USE_SYMVER_SUN, [test "x$gfortran_use_symver" = xsun])
 
-# For GPU offloading, not everything in libfortran can be supported.
-# Currently, the only target that has this problem is nvptx.  The
-# following is a (partial) list of features that are unsupportable on
-# this particular target:
-# * Constructors
-# * alloca
-# * C library support for I/O, with printf as the one notable exception
-# * C library support for other features such as signal, environment
-#   variables, time functions
-
-AM_CONDITIONAL(LIBGFOR_MINIMAL, [test "x${target_cpu}" = xnvptx])
+# "Minimal" mode is for targets that cannot (yet) support all features of
+# libgfortran.  It avoids the need for working constructors, alloca, and C
+# library support for I/O, signals, environment variables, time functions, etc.
+# At present there are no targets that require this mode.
+
+AM_CONDITIONAL(LIBGFOR_MINIMAL, false)
 
 # Some compiler target support may have limited support for integer
 # or floating point numbers – or may want to reduce the libgfortran size
diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 32aa9705296..30b1e558ea3 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,12 @@
 2023-01-20  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* testsuite/libgomp.fortran/target-print-1.f90: Adjust.
+	* testsuite/libgomp.fortran/target-print-1-nvptx.f90: Remove.
+	* testsuite/libgomp.oacc-fortran/print-1.f90: Adjust.
+	* testsuite/libgomp.oacc-fortran/print-1-nvptx.f90: Remove.
+	* testsuite/libgomp.oacc-fortran/error_stop-2.f: Adjust.
+	* testsuite/libgomp.oacc-fortran/stop-2.f: Likewise.
+
 	* plugin/plugin-nvptx.c (nvptx_do_global_cdtors): New.
 	(nvptx_close_device, GOMP_OFFLOAD_load_image)
 	(GOMP_OFFLOAD_unload_image): Call it.
diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
deleted file mode 100644
index a89c9c33484..00000000000
--- a/libgomp/testsuite/libgomp.fortran/target-print-1-nvptx.f90
+++ /dev/null
@@ -1,11 +0,0 @@
-! Ensure that write on the offload device works, nvptx offloading variant.
-
-! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
-! configuration.
-! { dg-do link } ! ..., but still apply 'dg-do run' options.
-! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
-
-! Skip duplicated testing.
-! { dg-skip-if "separate file" { ! offload_target_nvptx } }
-
-include 'target-print-1.f90'
diff --git a/libgomp/testsuite/libgomp.fortran/target-print-1.f90 b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
index 327bb22cb6d..9ac70e5a85f 100644
--- a/libgomp/testsuite/libgomp.fortran/target-print-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/target-print-1.f90
@@ -3,9 +3,6 @@
 ! { dg-do run }
 ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
 
-! Separate file 'target-print-1-nvptx.f90' for nvptx offloading.
-! { dg-skip-if "separate file" { offload_target_nvptx } }
-
 program main
   implicit none
   integer :: var = 42
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
index 5951e8cbe64..bbb4b55ef2c 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/error_stop-2.f
@@ -17,7 +17,9 @@
 
 ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
 
-! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" }
+! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
+! overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-output "ERROR STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
 !
 ! In gfortran's main program, libfortran's set_options is called - which sets
 ! compiler_options.backtrace = 1 by default.  For an offload libgfortran, this
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
deleted file mode 100644
index 866c8654355..00000000000
--- a/libgomp/testsuite/libgomp.oacc-fortran/print-1-nvptx.f90
+++ /dev/null
@@ -1,11 +0,0 @@
-! Ensure that write on the offload device works, nvptx offloading variant.
-
-! This doesn't compile: for nvptx offloading we're using a minimal libgfortran
-! configuration.
-! { dg-do link } ! ..., but still apply 'dg-do run' options.
-! { dg-xfail-if "minimal libgfortran" { offload_target_nvptx } }
-
-! Skip duplicated testing.
-! { dg-skip-if "separate file" { ! offload_target_nvptx } }
-
-include 'print-1.f90'
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
index d2f89d915f8..d04503a0249 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/print-1.f90
@@ -2,9 +2,8 @@
 
 ! { dg-do run }
 ! { dg-output "The answer is 42(\n|\r\n|\r)+" }
-
-! Separate file 'print-1-nvptx.f90' for nvptx offloading.
-! { dg-skip-if "separate file" { offload_target_nvptx } }
+! The 'write' overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-xfail-run-if TODO { openacc_nvidia_accel_selected } }
 
 ! { dg-additional-options "-fopt-info-note-omp" }
 ! { dg-additional-options "-foffload=-fopt-info-note-omp" }
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
index fe7ee37813a..394de034b1f 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/stop-2.f
@@ -17,7 +17,9 @@
 
 ! { dg-output "CheCKpOInT(\n|\r\n|\r)+" }
 
-! { dg-output "STOP 35(\n|\r\n|\r)+" }
+! '_gfortran_error_stop_numeric' -> '_gfortrani_st_printf' -> [...]
+! overflows the stack for nvptx offloading, thus XFAILed.
+! { dg-output "STOP 35(\n|\r\n|\r)+" { xfail openacc_nvidia_accel_selected } }
 !
 ! PR85463.  The 'exit' implementation used with nvptx
 ! offloading is a little bit different.
-- 
2.25.1

^ permalink raw reply	[flat|nested] 8+ messages in thread
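For readers less familiar with Automake conditionals, the effect of the 'configure' hunk in the patch above can be sketched as follows. This is only an illustration, not part of the patch: the helper name 'minimal_cond' is hypothetical, and the 'else' branch (which the hunk cuts off) is assumed to be the usual AM_CONDITIONAL expansion.

```shell
#!/bin/sh
# Sketch of the shell fragment that AM_CONDITIONAL(LIBGFOR_MINIMAL, COND)
# leaves in 'configure'.  'minimal_cond' is a hypothetical stand-in for COND.
minimal_cond() {
  # Before the patch, COND was:  test "x${target_cpu}" = xnvptx
  # After the patch, COND is simply 'false'.
  false
}

if minimal_cond; then
  LIBGFOR_MINIMAL_TRUE=
  LIBGFOR_MINIMAL_FALSE='#'
else
  # Assumed standard expansion; this branch is not visible in the hunk.
  LIBGFOR_MINIMAL_TRUE='#'
  LIBGFOR_MINIMAL_FALSE=
fi

# Automake prefixes Makefile lines inside 'if LIBGFOR_MINIMAL' with
# @LIBGFOR_MINIMAL_TRUE@ and lines in the 'else' part with
# @LIBGFOR_MINIMAL_FALSE@, so a '#' value comments that part out.
echo "prefix for minimal-only lines: '${LIBGFOR_MINIMAL_TRUE}'"
echo "prefix for full-library lines: '${LIBGFOR_MINIMAL_FALSE}'"
```

With the condition hard-wired to 'false', the minimal-only Makefile lines are always commented out, which is how nvptx ends up building the normal libgfortran configuration.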
* Re: nvptx, libgfortran: Switch out of "minimal" mode
  2023-01-20 21:16 ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
@ 2023-01-20 22:10   ` Thomas Koenig
  2023-01-24  9:37   ` Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode) Thomas Schwinge
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Koenig @ 2023-01-20 22:10 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs

Hi Thomas,

> On 2023-01-20T22:04:02+0100, I wrote:
>> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
>> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
>> configuration of libgfortran.
>
> This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
> see attached, again based on WIP work by Andrew Stubbs.  This I've just
> pushed to devel/omp/gcc-12 branch in
> commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
> to master branch once other pending GCC patches have been accepted.

Looks good to me.

Regards

	Thomas

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode' (was: nvptx, libgfortran: Switch out of "minimal" mode)
  2023-01-20 21:16 ` nvptx, libgfortran: Switch out of "minimal" mode Thomas Schwinge
  2023-01-20 22:10   ` Thomas Koenig
@ 2023-01-24  9:37   ` Thomas Schwinge
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Schwinge @ 2023-01-24  9:37 UTC (permalink / raw)
  To: gcc-patches; +Cc: fortran, Tom de Vries, Andrew Stubbs, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 11871 bytes --]

Hi!

On 2023-01-20T22:16:00+0100, I wrote:
> On 2023-01-20T22:04:02+0100, I wrote:
>> We've been (t)asked to enable (portions of) GCC/Fortran I/O for nvptx
>> offloading, which means building a normal (non-'LIBGFOR_MINIMAL')
>> configuration of libgfortran.
>
> This is achieved by 'nvptx, libgfortran: Switch out of "minimal" mode',
> see attached, again based on WIP work by Andrew Stubbs.  This I've just
> pushed to devel/omp/gcc-12 branch in
> commit c7734c6fbb5513b4da6306de7bc85de9b8547988, and would like to push
> to master branch once other pending GCC patches have been accepted.
>
>
> The OpenACC XFAILs: "[...] overflows the stack for nvptx offloading"
> are unresolved at this point; see the discussion around
> "Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?",
> and my "nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> experimenting.  (The latter works to some extent, but also has other
> issues that I shall detail at some later point in time.)

I had a note from Tobias to "update the last but one bullet point at
https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html".

Thus pushed to devel/omp/gcc-12 branch
commit 8c29332e98ca4669a059ebc0d90903b409ae049f
"Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode'",
see attached.

Please consider that one 'fixup'ed into the GCC master branch submission.

Regards
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; limited liability company (Gesellschaft mit beschränkter Haftung); Managing Directors: Thomas Heurung, Frank Thürauf; Registered office: München; Commercial register: Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Update-libgomp-libgomp.texi-for-nvptx-libgfortran-Sw.patch --]
[-- Type: text/x-diff, Size: 1777 bytes --]

From 8c29332e98ca4669a059ebc0d90903b409ae049f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 24 Jan 2023 10:29:01 +0100
Subject: [PATCH] Update 'libgomp/libgomp.texi' for 'nvptx, libgfortran: Switch out of "minimal" mode'

	libgomp/
	* libgomp.texi (nvptx): Update for 'nvptx, libgfortran: Switch out of "minimal" mode'.
---
 libgomp/libgomp.texi | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 896d187f1ff..17f1509343f 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -4448,7 +4448,7 @@ The used sizes are
 The implementation remark:
 
 @itemize
-@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+@item I/O within OpenMP target regions and OpenACC compute regions is supported
       using the C library @code{printf} functions and the Fortran
       @code{print}/@code{write} statements.
 @end itemize
@@ -4496,9 +4496,11 @@ CUDA version and hardware.
 The implementation remark:
 
 @itemize
-@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
-      using the C library @code{printf} functions. Note that the Fortran
-      @code{print}/@code{write} statements are not supported, yet.
+@item I/O within OpenMP target regions and OpenACC compute regions is supported
+      using the C library @code{printf} functions.
+      Additionally, the Fortran @code{print}/@code{write} statements are
+      supported within OpenMP target regions, but not yet OpenACC compute
+      regions.
 @item Compilation OpenMP code that contains @code{requires reverse_offload}
       requires at least @code{-march=sm_35}, compiling for @code{-march=sm_30}
       is not supported.
-- 
2.25.1

^ permalink raw reply	[flat|nested] 8+ messages in thread