* [PATCH 0/5] openmp: Handle pinned and unified shared memory. @ 2022-03-08 11:30 Hafiz Abid Qadeer 2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer ` (6 more replies) 0 siblings, 7 replies; 18+ messages in thread From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw) To: gcc-patches, fortran; +Cc: jakub, ams, joseph This patch series add support for unified shared memory (USM) and pinned memory. The support in libgomp is for nvptx offloading only. A new command line option -foffload-memory allows user to choose either USM or pinned memory. The USM can also be enabled using requires construct. When USM us in use, calls to memory allocation function like malloc are changed to omp_alloc with appropriate allocator. No transformations are required for the pinned memory which is implemented using mlockall so is only available on Linux. Andrew Stubbs (4): openmp: Add -foffload-memory openmp: allow requires unified_shared_memory openmp, nvptx: ompx_unified_shared_mem_alloc openmp: -foffload-memory=pinned Hafiz Abid Qadeer (1): openmp: Use libgomp memory allocation functions with unified shared memory. gcc/c/c-parser.cc | 13 +- gcc/common.opt | 16 ++ gcc/coretypes.h | 7 + gcc/cp/parser.cc | 13 +- gcc/doc/invoke.texi | 16 +- gcc/fortran/openmp.cc | 10 +- gcc/omp-low.cc | 220 ++++++++++++++++++ gcc/passes.def | 1 + .../c-c++-common/gomp/alloc-pinned-1.c | 28 +++ gcc/testsuite/c-c++-common/gomp/usm-1.c | 4 + gcc/testsuite/c-c++-common/gomp/usm-2.c | 34 +++ gcc/testsuite/c-c++-common/gomp/usm-3.c | 32 +++ gcc/testsuite/g++.dg/gomp/usm-1.C | 32 +++ gcc/testsuite/g++.dg/gomp/usm-2.C | 30 +++ gcc/testsuite/g++.dg/gomp/usm-3.C | 38 +++ gcc/testsuite/gfortran.dg/gomp/usm-1.f90 | 6 + gcc/testsuite/gfortran.dg/gomp/usm-2.f90 | 16 ++ gcc/testsuite/gfortran.dg/gomp/usm-3.f90 | 13 ++ gcc/tree-pass.h | 1 + libgomp/allocator.c | 13 +- libgomp/config/linux/allocator.c | 70 ++++-- libgomp/config/nvptx/allocator.c | 6 + libgomp/libgomp-plugin.h | 3 + libgomp/libgomp.h | 6 + libgomp/libgomp.map | 5 + libgomp/omp.h.in | 4 + libgomp/omp_lib.f90.in | 8 + libgomp/plugin/plugin-nvptx.c | 45 +++- libgomp/target.c | 70 ++++++ libgomp/testsuite/libgomp.c++/usm-1.C | 54 +++++ libgomp/testsuite/libgomp.c/alloc-pinned-7.c | 66 ++++++ libgomp/testsuite/libgomp.c/usm-1.c | 24 ++ libgomp/testsuite/libgomp.c/usm-2.c | 32 +++ libgomp/testsuite/libgomp.c/usm-3.c | 35 +++ libgomp/testsuite/libgomp.c/usm-4.c | 36 +++ libgomp/testsuite/libgomp.c/usm-5.c | 28 +++ libgomp/testsuite/libgomp.c/usm-6.c | 70 ++++++ 37 files changed, 1075 insertions(+), 30 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c create mode 100644 libgomp/testsuite/libgomp.c/usm-1.c create mode 100644 libgomp/testsuite/libgomp.c/usm-2.c create mode 100644 libgomp/testsuite/libgomp.c/usm-3.c create mode 100644 libgomp/testsuite/libgomp.c/usm-4.c create mode 100644 libgomp/testsuite/libgomp.c/usm-5.c create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 1/5] openmp: Add -foffload-memory 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer @ 2022-03-08 11:30 ` Hafiz Abid Qadeer 2023-02-13 14:38 ` -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) Thomas Schwinge 2022-03-08 11:30 ` [PATCH 2/5] openmp: allow requires unified_shared_memory Hafiz Abid Qadeer ` (5 subsequent siblings) 6 siblings, 1 reply; 18+ messages in thread From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw) To: gcc-patches, fortran; +Cc: jakub, ams, joseph From: Andrew Stubbs <ams@codesourcery.com> Add a new option. It will be used in follow-up patches. gcc/ChangeLog: * common.opt: Add -foffload-memory and its enum values. * coretypes.h (enum offload_memory): New. * doc/invoke.texi: Document -foffload-memory. --- gcc/common.opt | 16 ++++++++++++++++ gcc/coretypes.h | 7 +++++++ gcc/doc/invoke.texi | 16 +++++++++++++++- 3 files changed, 38 insertions(+), 1 deletion(-) diff --git a/gcc/common.opt b/gcc/common.opt index 8b6513de47c..17426523e23 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -2182,6 +2182,22 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32) EnumValue Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64) +foffload-memory= +Common Joined RejectNegative Enum(offload_memory) Var(flag_offload_memory) Init(OFFLOAD_MEMORY_NONE) +-foffload-memory=[none|unified|pinned] Use an offload memory optimization. + +Enum +Name(offload_memory) Type(enum offload_memory) UnknownError(Unknown offload memory option %qs) + +EnumValue +Enum(offload_memory) String(none) Value(OFFLOAD_MEMORY_NONE) + +EnumValue +Enum(offload_memory) String(unified) Value(OFFLOAD_MEMORY_UNIFIED) + +EnumValue +Enum(offload_memory) String(pinned) Value(OFFLOAD_MEMORY_PINNED) + fomit-frame-pointer Common Var(flag_omit_frame_pointer) Optimization When possible do not generate stack frames. diff --git a/gcc/coretypes.h b/gcc/coretypes.h index 08b9ac9094c..dd52d5bb113 100644 --- a/gcc/coretypes.h +++ b/gcc/coretypes.h @@ -206,6 +206,13 @@ enum offload_abi { OFFLOAD_ABI_ILP32 }; +/* Types of memory optimization for an offload device. */ +enum offload_memory { + OFFLOAD_MEMORY_NONE, + OFFLOAD_MEMORY_UNIFIED, + OFFLOAD_MEMORY_PINNED +}; + /* Types of profile update methods. */ enum profile_update { PROFILE_UPDATE_SINGLE, diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 248ed534aee..d16019fc8c3 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -202,7 +202,7 @@ in the following sections. -fno-builtin -fno-builtin-@var{function} -fcond-mismatch @gol -ffreestanding -fgimple -fgnu-tm -fgnu89-inline -fhosted @gol -flax-vector-conversions -fms-extensions @gol --foffload=@var{arg} -foffload-options=@var{arg} @gol +-foffload=@var{arg} -foffload-options=@var{arg} -foffload-memory=@var{arg} @gol -fopenacc -fopenacc-dim=@var{geom} @gol -fopenmp -fopenmp-simd @gol -fpermitted-flt-eval-methods=@var{standard} @gol @@ -2694,6 +2694,20 @@ Typical command lines are -foffload-options=amdgcn-amdhsa=-march=gfx906 -foffload-options=-lm @end smallexample +@item -foffload-memory=none +@itemx -foffload-memory=unified +@itemx -foffload-memory=pinned +@opindex foffload-memory +@cindex OpenMP offloading memory modes +Enable a memory optimization mode to use with OpenMP. The default behavior, +@option{-foffload-memory=none}, is to do nothing special (unless enabled via +a requires directive in the code). @option{-foffload-memory=unified} is +equivalent to @code{#pragma omp requires unified_shared_memory}. +@option{-foffload-memory=pinned} forces all host memory to be pinned (this +mode may require the user to increase the ulimit setting for locked memory). +All translation units must select the same setting to avoid undefined +behavior. + @item -fopenacc @opindex fopenacc @cindex OpenACC accelerator programming -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) 2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer @ 2023-02-13 14:38 ` Thomas Schwinge 2023-02-13 15:20 ` Andrew Stubbs 0 siblings, 1 reply; 18+ messages in thread From: Thomas Schwinge @ 2023-02-13 14:38 UTC (permalink / raw) To: Andrew Stubbs, Hafiz Abid Qadeer, Jakub Jelinek, Tobias Burnus Cc: gcc-patches Hi! On 2022-03-08T11:30:55+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote: > From: Andrew Stubbs <ams@codesourcery.com> > > Add a new option. It will be used in follow-up patches. > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > +@option{-foffload-memory=pinned} forces all host memory to be pinned (this > +mode may require the user to increase the ulimit setting for locked memory). So, this is currently implemented via 'mlockall', which, as discussed, (a) has issues ('ulimit -l'), and (b) doesn't actually achieve what it meant to achieve (because it doesn't register the page-locked memory with the GPU driver). So one idea was to re-purpose the unified shared memory 'gcc/omp-low.cc:pass_usm_transform' (compiler pass that "changes calls to malloc/free/calloc/realloc and operator new to memory allocation functions in libgomp with allocator=ompx_unified_shared_mem_alloc"), <https://inbox.sourceware.org/gcc-patches/20220308113059.688551-5-abidh@codesourcery.com>. (I have not yet looked into that in detail.) Here's now a different idea. As '-foffload-memory=pinned', per the name of the option, concerns itself with memory used in offloading but not host execution generally, why are we actually attempting to "[force] all host memory to be pinned" -- why not just the memory that's being used with offloading? That is, if '-foffload-memory=pinned' is set, register as page-locked with the GPU driver all memory that appears in OMP offloading data regions, such as OpenMP 'target' 'map' clauses etc. That way, this is directed at the offloading data transfers, as itended, but at the same time we don't "waste" page-locked memory for generic host memory allocations. What do you think -- you, who've spent a lot more time on this topic than I have, so it's likely possible that I fail to realize some "details"? Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) 2023-02-13 14:38 ` -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) Thomas Schwinge @ 2023-02-13 15:20 ` Andrew Stubbs 2023-04-03 14:56 ` [og12] '-foffload-memory=pinned' using offloading device interfaces (was: -foffload-memory=pinned) Thomas Schwinge 0 siblings, 1 reply; 18+ messages in thread From: Andrew Stubbs @ 2023-02-13 15:20 UTC (permalink / raw) To: Thomas Schwinge, Hafiz Abid Qadeer, Jakub Jelinek, Tobias Burnus Cc: gcc-patches On 13/02/2023 14:38, Thomas Schwinge wrote: > Hi! > > On 2022-03-08T11:30:55+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote: >> From: Andrew Stubbs <ams@codesourcery.com> >> >> Add a new option. It will be used in follow-up patches. > >> --- a/gcc/doc/invoke.texi >> +++ b/gcc/doc/invoke.texi > >> +@option{-foffload-memory=pinned} forces all host memory to be pinned (this >> +mode may require the user to increase the ulimit setting for locked memory). > > So, this is currently implemented via 'mlockall', which, as discussed, > (a) has issues ('ulimit -l'), and (b) doesn't actually achieve what it > meant to achieve (because it doesn't register the page-locked memory with > the GPU driver). > > So one idea was to re-purpose the unified shared memory > 'gcc/omp-low.cc:pass_usm_transform' (compiler pass that "changes calls to > malloc/free/calloc/realloc and operator new to memory allocation > functions in libgomp with allocator=ompx_unified_shared_mem_alloc"), > <https://inbox.sourceware.org/gcc-patches/20220308113059.688551-5-abidh@codesourcery.com>> (I have not yet looked into that in detail.) > > Here's now a different idea. As '-foffload-memory=pinned', per the name > of the option, concerns itself with memory used in offloading but not > host execution generally, why are we actually attempting to "[force] all > host memory to be pinned" -- why not just the memory that's being used > with offloading? That is, if '-foffload-memory=pinned' is set, register > as page-locked with the GPU driver all memory that appears in OMP > offloading data regions, such as OpenMP 'target' 'map' clauses etc. That > way, this is directed at the offloading data transfers, as itended, but > at the same time we don't "waste" page-locked memory for generic host > memory allocations. What do you think -- you, who've spent a lot more > time on this topic than I have, so it's likely possible that I fail to > realize some "details"? The main reason it is the way it is is because in general it's not possible to know what memory is going to be offloaded at the time it is allocated (and stack/static memory is never allocated that way). If there's a way to pin it after the fact then maybe that's not a terrible idea? The downside is that the memory might already have been paged out at that point, and we'd have to track what we'd previously pinned, or else re-pin it every time we launch a kernel. We'd also have no way to unpin previously pinned memory (not that that's relevant to the "lock all" case). My original plan was to use omp_alloc for both the standard OpenMP support and the -foffload-memory option (to get the benefit of pinning without modifying any source), but then I decided that the mlockall option was much less invasive. This is still the best way to implement target-independent pinning, when there's no driver registration option. Andrew ^ permalink raw reply [flat|nested] 18+ messages in thread
* [og12] '-foffload-memory=pinned' using offloading device interfaces (was: -foffload-memory=pinned) 2023-02-13 15:20 ` Andrew Stubbs @ 2023-04-03 14:56 ` Thomas Schwinge 0 siblings, 0 replies; 18+ messages in thread From: Thomas Schwinge @ 2023-04-03 14:56 UTC (permalink / raw) To: Andrew Stubbs, gcc-patches Cc: Hafiz Abid Qadeer, Jakub Jelinek, Tobias Burnus [-- Attachment #1: Type: text/plain, Size: 2727 bytes --] Hi! On 2023-02-13T15:20:07+0000, Andrew Stubbs <ams@codesourcery.com> wrote: > On 13/02/2023 14:38, Thomas Schwinge wrote: >> On 2022-03-08T11:30:55+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote: >>> From: Andrew Stubbs <ams@codesourcery.com> >>> >>> Add a new option. It will be used in follow-up patches. >> >>> --- a/gcc/doc/invoke.texi >>> +++ b/gcc/doc/invoke.texi >> >>> +@option{-foffload-memory=pinned} forces all host memory to be pinned (this >>> +mode may require the user to increase the ulimit setting for locked memory). >> >> So, this is currently implemented via 'mlockall', which, as discussed, >> (a) has issues ('ulimit -l'), and (b) doesn't actually achieve what it >> meant to achieve (because it doesn't register the page-locked memory with >> the GPU driver). >> [...] >> As '-foffload-memory=pinned', per the name >> of the option, concerns itself with memory used in offloading but not >> host execution generally, why are we actually attempting to "[force] all >> host memory to be pinned" -- why not just the memory that's being used >> with offloading? That is, if '-foffload-memory=pinned' is set, register >> as page-locked with the GPU driver all memory that appears in OMP >> offloading data regions, such as OpenMP 'target' 'map' clauses etc. That >> way, this is directed at the offloading data transfers, as itended, but >> at the same time we don't "waste" page-locked memory for generic host >> memory allocations. What do you think -- you, who've spent a lot more >> time on this topic than I have, so it's likely possible that I fail to >> realize some "details"? > > The main reason it is the way it is is because in general it's not > possible to know what memory is going to be offloaded at the time it is > allocated (and stack/static memory is never allocated that way). > > If there's a way to pin it after the fact then maybe that's not a > terrible idea? [...] I've now pushed to devel/omp/gcc-12 branch my take on this in commit 43095690ea519205bf56fc148b346edaa43e0f0f "'-foffload-memory=pinned' using offloading device interfaces", and for changes related to og12 commit 15d0f61a7fecdc8fd12857c40879ea3730f6d99f "Merge non-contiguous array support patches": commit 694bbd399c1323975b4a6735646e46c6914de63d "'-foffload-memory=pinned' using offloading device interfaces for non-contiguous array support", see attached. Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-foffload-memory-pinned-using-offloading-device-inter.patch --] [-- Type: text/x-diff, Size: 77424 bytes --] From 43095690ea519205bf56fc148b346edaa43e0f0f Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <thomas@codesourcery.com> Date: Thu, 30 Mar 2023 10:08:12 +0200 Subject: [PATCH 1/2] '-foffload-memory=pinned' using offloading device interfaces Implemented for nvptx offloading via 'cuMemHostAlloc', 'cuMemHostRegister'. gcc/ * doc/invoke.texi (-foffload-memory=pinned): Document. include/ * cuda/cuda.h (CUresult): Add 'CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED'. (CUdevice_attribute): Add 'CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED'. (CU_MEMHOSTREGISTER_READ_ONLY): Add. (cuMemHostGetFlags, cuMemHostRegister, cuMemHostUnregister): Add. libgomp/ * libgomp-plugin.h (GOMP_OFFLOAD_page_locked_host_free): Add 'struct goacc_asyncqueue *' formal parameter. (GOMP_OFFLOAD_page_locked_host_register) (GOMP_OFFLOAD_page_locked_host_unregister) (GOMP_OFFLOAD_page_locked_host_p): Add. * libgomp.h (always_pinned_mode) (gomp_page_locked_host_register_dev) (gomp_page_locked_host_unregister_dev): Add. (struct splay_tree_key_s): Add 'page_locked_host_p'. (struct gomp_device_descr): Add 'GOMP_OFFLOAD_page_locked_host_register', 'GOMP_OFFLOAD_page_locked_host_unregister', 'GOMP_OFFLOAD_page_locked_host_p'. * libgomp.texi (-foffload-memory=pinned): Document. * plugin/cuda-lib.def (cuMemHostGetFlags, cuMemHostRegister_v2) (cuMemHostRegister, cuMemHostUnregister): Add. * plugin/plugin-nvptx.c (struct ptx_device): Add 'read_only_host_register_supported'. (nvptx_open_device): Initialize it. (free_host_blocks, free_host_blocks_lock) (nvptx_run_deferred_page_locked_host_free) (nvptx_page_locked_host_free_callback, nvptx_page_locked_host_p) (GOMP_OFFLOAD_page_locked_host_register) (nvptx_page_locked_host_unregister_callback) (GOMP_OFFLOAD_page_locked_host_unregister) (GOMP_OFFLOAD_page_locked_host_p) (nvptx_run_deferred_page_locked_host_unregister) (nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback): Add. (GOMP_OFFLOAD_fini_device, GOMP_OFFLOAD_page_locked_host_alloc) (GOMP_OFFLOAD_run): Call 'nvptx_run_deferred_page_locked_host_free'. (struct goacc_asyncqueue): Add 'page_locked_host_unregister_blocks_lock', 'page_locked_host_unregister_blocks'. (nvptx_goacc_asyncqueue_construct) (nvptx_goacc_asyncqueue_destruct): Handle those. (GOMP_OFFLOAD_page_locked_host_free): Handle 'struct goacc_asyncqueue *' formal parameter. (GOMP_OFFLOAD_openacc_async_test) (nvptx_goacc_asyncqueue_synchronize): Call 'nvptx_run_deferred_page_locked_host_unregister'. (GOMP_OFFLOAD_openacc_async_serialize): Call 'nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback'. * config/linux/allocator.c (linux_memspace_alloc) (linux_memspace_calloc, linux_memspace_free) (linux_memspace_realloc): Remove 'always_pinned_mode' handling. (GOMP_enable_pinned_mode): Move... * target.c: ... here. (always_pinned_mode, verify_always_pinned_mode) (gomp_verify_always_pinned_mode, gomp_page_locked_host_alloc_dev) (gomp_page_locked_host_free_dev) (gomp_page_locked_host_aligned_alloc_dev) (gomp_page_locked_host_aligned_free_dev) (gomp_page_locked_host_register_dev) (gomp_page_locked_host_unregister_dev): Add. (gomp_copy_host2dev, gomp_map_vars_internal) (gomp_remove_var_internal, gomp_unmap_vars_internal) (get_gomp_offload_icvs, gomp_load_image_to_device) (gomp_target_rev, omp_target_memcpy_copy) (omp_target_memcpy_rect_worker): Handle 'always_pinned_mode'. (gomp_copy_host2dev, gomp_copy_dev2host): Handle 'verify_always_pinned_mode'. (GOMP_target_ext): Add 'assert'. (gomp_page_locked_host_alloc): Use 'gomp_page_locked_host_alloc_dev'. (gomp_page_locked_host_free): Use 'gomp_page_locked_host_free_dev'. (omp_target_associate_ptr): Adjust. (gomp_load_plugin_for_device): Handle 'page_locked_host_register', 'page_locked_host_unregister', 'page_locked_host_p'. * oacc-mem.c (memcpy_tofrom_device): Handle 'always_pinned_mode'. * libgomp_g.h (GOMP_enable_pinned_mode): Adjust. * testsuite/libgomp.c/alloc-pinned-7.c: Remove. --- gcc/ChangeLog.omp | 4 + gcc/doc/invoke.texi | 19 +- include/ChangeLog.omp | 9 + include/cuda/cuda.h | 11 +- libgomp/ChangeLog.omp | 75 ++ libgomp/config/linux/allocator.c | 26 - libgomp/libgomp-plugin.h | 7 +- libgomp/libgomp.h | 15 + libgomp/libgomp.texi | 35 + libgomp/libgomp_g.h | 2 +- libgomp/oacc-mem.c | 16 + libgomp/plugin/cuda-lib.def | 4 + libgomp/plugin/plugin-nvptx.c | 435 ++++++++++- libgomp/target.c | 771 +++++++++++++++++-- libgomp/testsuite/libgomp.c/alloc-pinned-7.c | 63 -- 15 files changed, 1339 insertions(+), 153 deletions(-) delete mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c diff --git a/gcc/ChangeLog.omp b/gcc/ChangeLog.omp index 5e76158db06..d8aa0ab51bf 100644 --- a/gcc/ChangeLog.omp +++ b/gcc/ChangeLog.omp @@ -1,3 +1,7 @@ +2023-04-03 Thomas Schwinge <thomas@codesourcery.com> + + * doc/invoke.texi (-foffload-memory=pinned): Document. + 2023-03-31 Frederik Harwath <frederik@codesourcery.com> * omp-transform-loops.cc (walk_omp_for_loops): Handle diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 1fe047042ae..070b63030f8 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -2711,13 +2711,28 @@ Typical command lines are @itemx -foffload-memory=unified @itemx -foffload-memory=pinned @opindex foffload-memory +@cindex Offloading memory modes @cindex OpenMP offloading memory modes + Enable a memory optimization mode to use with OpenMP. The default behavior, @option{-foffload-memory=none}, is to do nothing special (unless enabled via a requires directive in the code). @option{-foffload-memory=unified} is equivalent to @code{#pragma omp requires unified_shared_memory}. -@option{-foffload-memory=pinned} forces all host memory to be pinned (this -mode may require the user to increase the ulimit setting for locked memory). + +@c The following paragraph is duplicated in +@c '../../libgomp/libgomp.texi', '-foffload-memory=pinned'. +If supported by the active offloading device, +@option{-foffload-memory=pinned} enables automatic use of page-locked +host memory for memory objects participating in host <-> device memory +transfers, for both OpenACC and OpenMP offloading. +Such memory is allocated or registered using the respective offloading +device interfaces, which potentially helps optimization of host <-> +device data transfers. +This option is experimental. +Beware that use of a lot of pinned memory may degrade overall system +performance, as it does reduce the amount of host memory available for +paging. + All translation units must select the same setting to avoid undefined behavior. diff --git a/include/ChangeLog.omp b/include/ChangeLog.omp index 244d67e6608..655377a6d0d 100644 --- a/include/ChangeLog.omp +++ b/include/ChangeLog.omp @@ -1,3 +1,12 @@ +2023-04-03 Thomas Schwinge <thomas@codesourcery.com> + + * cuda/cuda.h (CUresult): Add + 'CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED'. + (CUdevice_attribute): Add + 'CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED'. + (CU_MEMHOSTREGISTER_READ_ONLY): Add. + (cuMemHostGetFlags, cuMemHostRegister, cuMemHostUnregister): Add. + 2023-02-20 Thomas Schwinge <thomas@codesourcery.com> * cuda/cuda.h (cuMemHostRegister, cuMemHostUnregister): Remove. diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h index 062d394b95f..f8f464607db 100644 --- a/include/cuda/cuda.h +++ b/include/cuda/cuda.h @@ -57,6 +57,7 @@ typedef enum { CUDA_ERROR_INVALID_CONTEXT = 201, CUDA_ERROR_NOT_FOUND = 500, CUDA_ERROR_NOT_READY = 600, + CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED = 712, CUDA_ERROR_LAUNCH_FAILED = 719, CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720, CUDA_ERROR_NOT_PERMITTED = 800, @@ -80,7 +81,8 @@ typedef enum { CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76, - CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82 + CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82, + CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED = 113 } CUdevice_attribute; typedef enum { @@ -124,8 +126,11 @@ enum { #define CU_LAUNCH_PARAM_END ((void *) 0) #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1) #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2) + #define CU_MEMHOSTALLOC_DEVICEMAP 0x02U +#define CU_MEMHOSTREGISTER_READ_ONLY 0x08 + enum { CU_STREAM_DEFAULT = 0, CU_STREAM_NON_BLOCKING = 1 @@ -183,6 +188,10 @@ CUresult cuMemAlloc (CUdeviceptr *, size_t); CUresult cuMemAllocHost (void **, size_t); CUresult cuMemAllocManaged(CUdeviceptr *, size_t, unsigned int); CUresult cuMemHostAlloc (void **, size_t, unsigned int); +CUresult cuMemHostGetFlags (unsigned int *, void *); +#define cuMemHostRegister cuMemHostRegister_v2 +CUresult cuMemHostRegister(void *, size_t, unsigned int); +CUresult cuMemHostUnregister(void *); CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t); #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream); diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp index 7afb5f43c04..1b02c057562 100644 --- a/libgomp/ChangeLog.omp +++ b/libgomp/ChangeLog.omp @@ -1,5 +1,80 @@ 2023-04-03 Thomas Schwinge <thomas@codesourcery.com> + * libgomp-plugin.h (GOMP_OFFLOAD_page_locked_host_free): Add + 'struct goacc_asyncqueue *' formal parameter. + (GOMP_OFFLOAD_page_locked_host_register) + (GOMP_OFFLOAD_page_locked_host_unregister) + (GOMP_OFFLOAD_page_locked_host_p): Add. + * libgomp.h (always_pinned_mode) + (gomp_page_locked_host_register_dev) + (gomp_page_locked_host_unregister_dev): Add. + (struct splay_tree_key_s): Add 'page_locked_host_p'. + (struct gomp_device_descr): Add + 'GOMP_OFFLOAD_page_locked_host_register', + 'GOMP_OFFLOAD_page_locked_host_unregister', + 'GOMP_OFFLOAD_page_locked_host_p'. + * libgomp.texi (-foffload-memory=pinned): Document. + * plugin/cuda-lib.def (cuMemHostGetFlags, cuMemHostRegister_v2) + (cuMemHostRegister, cuMemHostUnregister): Add. + * plugin/plugin-nvptx.c (struct ptx_device): Add + 'read_only_host_register_supported'. + (nvptx_open_device): Initialize it. + (free_host_blocks, free_host_blocks_lock) + (nvptx_run_deferred_page_locked_host_free) + (nvptx_page_locked_host_free_callback, nvptx_page_locked_host_p) + (GOMP_OFFLOAD_page_locked_host_register) + (nvptx_page_locked_host_unregister_callback) + (GOMP_OFFLOAD_page_locked_host_unregister) + (GOMP_OFFLOAD_page_locked_host_p) + (nvptx_run_deferred_page_locked_host_unregister) + (nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback): + Add. + (GOMP_OFFLOAD_fini_device, GOMP_OFFLOAD_page_locked_host_alloc) + (GOMP_OFFLOAD_run): Call + 'nvptx_run_deferred_page_locked_host_free'. + (struct goacc_asyncqueue): Add + 'page_locked_host_unregister_blocks_lock', + 'page_locked_host_unregister_blocks'. + (nvptx_goacc_asyncqueue_construct) + (nvptx_goacc_asyncqueue_destruct): Handle those. + (GOMP_OFFLOAD_page_locked_host_free): Handle + 'struct goacc_asyncqueue *' formal parameter. + (GOMP_OFFLOAD_openacc_async_test) + (nvptx_goacc_asyncqueue_synchronize): Call + 'nvptx_run_deferred_page_locked_host_unregister'. + (GOMP_OFFLOAD_openacc_async_serialize): Call + 'nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback'. + * config/linux/allocator.c (linux_memspace_alloc) + (linux_memspace_calloc, linux_memspace_free) + (linux_memspace_realloc): Remove 'always_pinned_mode' handling. + (GOMP_enable_pinned_mode): Move... + * target.c: ... here. + (always_pinned_mode, verify_always_pinned_mode) + (gomp_verify_always_pinned_mode, gomp_page_locked_host_alloc_dev) + (gomp_page_locked_host_free_dev) + (gomp_page_locked_host_aligned_alloc_dev) + (gomp_page_locked_host_aligned_free_dev) + (gomp_page_locked_host_register_dev) + (gomp_page_locked_host_unregister_dev): Add. + (gomp_copy_host2dev, gomp_map_vars_internal) + (gomp_remove_var_internal, gomp_unmap_vars_internal) + (get_gomp_offload_icvs, gomp_load_image_to_device) + (gomp_target_rev, omp_target_memcpy_copy) + (omp_target_memcpy_rect_worker): Handle 'always_pinned_mode'. + (gomp_copy_host2dev, gomp_copy_dev2host): Handle + 'verify_always_pinned_mode'. + (GOMP_target_ext): Add 'assert'. + (gomp_page_locked_host_alloc): Use + 'gomp_page_locked_host_alloc_dev'. + (gomp_page_locked_host_free): Use + 'gomp_page_locked_host_free_dev'. + (omp_target_associate_ptr): Adjust. + (gomp_load_plugin_for_device): Handle 'page_locked_host_register', + 'page_locked_host_unregister', 'page_locked_host_p'. + * oacc-mem.c (memcpy_tofrom_device): Handle 'always_pinned_mode'. + * libgomp_g.h (GOMP_enable_pinned_mode): Adjust. + * testsuite/libgomp.c/alloc-pinned-7.c: Remove. + PR other/76739 * target.c (gomp_map_vars_internal): Pass pre-allocated 'ptrblock' to 'goacc_noncontig_array_create_ptrblock'. diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c index 3e1bd5a1285..62649f64221 100644 --- a/libgomp/config/linux/allocator.c +++ b/libgomp/config/linux/allocator.c @@ -45,20 +45,6 @@ #include <assert.h> #include "libgomp.h" -static bool always_pinned_mode = false; - -/* This function is called by the compiler when -foffload-memory=pinned - is used. */ - -void -GOMP_enable_pinned_mode () -{ - if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0) - gomp_error ("failed to pin all memory (ulimit too low?)"); - else - always_pinned_mode = true; -} - static int using_device_for_page_locked = /* uninitialized */ -1; @@ -70,9 +56,6 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin, __FUNCTION__, (unsigned long long) memspace, (unsigned long long) size, pin, init0); - /* Explicit pinning may not be required. */ - pin = pin && !always_pinned_mode; - void *addr; if (memspace == ompx_unified_shared_mem_space) @@ -137,9 +120,6 @@ linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin) gomp_debug (0, "%s: memspace=%llu, size=%llu, pin=%d\n", __FUNCTION__, (unsigned long long) memspace, (unsigned long long) size, pin); - /* Explicit pinning may not be required. */ - pin = pin && !always_pinned_mode; - if (memspace == ompx_unified_shared_mem_space) { void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV); @@ -159,9 +139,6 @@ linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size, gomp_debug (0, "%s: memspace=%llu, addr=%p, size=%llu, pin=%d\n", __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) size, pin); - /* Explicit pinning may not be required. */ - pin = pin && !always_pinned_mode; - if (memspace == ompx_unified_shared_mem_space) gomp_usm_free (addr, GOMP_DEVICE_ICV); else if (pin) @@ -188,9 +165,6 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, gomp_debug (0, "%s: memspace=%llu, addr=%p, oldsize=%llu, size=%llu, oldpin=%d, pin=%d\n", __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) oldsize, (unsigned long long) size, oldpin, pin); - /* Explicit pinning may not be required. */ - pin = pin && !always_pinned_mode; - if (memspace == ompx_unified_shared_mem_space) goto manual_realloc; else if (oldpin && pin) diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h index ca557a79380..7456b7d1026 100644 --- a/libgomp/libgomp-plugin.h +++ b/libgomp/libgomp-plugin.h @@ -141,7 +141,12 @@ extern void *GOMP_OFFLOAD_usm_alloc (int, size_t); extern bool GOMP_OFFLOAD_usm_free (int, void *); extern bool GOMP_OFFLOAD_is_usm_ptr (void *); extern bool GOMP_OFFLOAD_page_locked_host_alloc (void **, size_t); -extern bool GOMP_OFFLOAD_page_locked_host_free (void *); +extern bool GOMP_OFFLOAD_page_locked_host_free (void *, + struct goacc_asyncqueue *); +extern int GOMP_OFFLOAD_page_locked_host_register (int, void *, size_t, int); +extern bool GOMP_OFFLOAD_page_locked_host_unregister (void *, size_t, + struct goacc_asyncqueue *); +extern int GOMP_OFFLOAD_page_locked_host_p (int, const void *, size_t); extern bool GOMP_OFFLOAD_dev2host (int, void *, const void *, size_t); extern bool GOMP_OFFLOAD_host2dev (int, void *, const void *, size_t); extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t); diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index 3b2b4aa9534..b7ac9d3da5b 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -1123,6 +1123,8 @@ extern int gomp_pause_host (void); /* target.c */ +extern bool always_pinned_mode; + extern void gomp_init_targets_once (void); extern int gomp_get_num_devices (void); extern bool gomp_target_task_fn (void *); @@ -1130,6 +1132,11 @@ extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t, int, struct goacc_asyncqueue *); extern void * gomp_usm_alloc (size_t size, int device_num); extern void gomp_usm_free (void *device_ptr, int device_num); +extern int gomp_page_locked_host_register_dev (struct gomp_device_descr *, + void *, size_t, int); +extern bool gomp_page_locked_host_unregister_dev (struct gomp_device_descr *, + void *, size_t, + struct goacc_asyncqueue *); extern bool gomp_page_locked_host_alloc (void **, size_t); extern void gomp_page_locked_host_free (void *); @@ -1232,6 +1239,9 @@ struct splay_tree_key_s { uintptr_t *structelem_refcount_ptr; }; struct splay_tree_aux *aux; + /* Whether we have registered page-locked host memory for + '[host_start, host_end)'. */ + bool page_locked_host_p; }; /* The comparison function. */ @@ -1393,6 +1403,11 @@ struct gomp_device_descr __typeof (GOMP_OFFLOAD_is_usm_ptr) *is_usm_ptr_func; __typeof (GOMP_OFFLOAD_page_locked_host_alloc) *page_locked_host_alloc_func; __typeof (GOMP_OFFLOAD_page_locked_host_free) *page_locked_host_free_func; + __typeof (GOMP_OFFLOAD_page_locked_host_register) + *page_locked_host_register_func; + __typeof (GOMP_OFFLOAD_page_locked_host_unregister) + *page_locked_host_unregister_func; + __typeof (GOMP_OFFLOAD_page_locked_host_p) *page_locked_host_p_func; __typeof (GOMP_OFFLOAD_dev2host) *dev2host_func; __typeof (GOMP_OFFLOAD_host2dev) *host2dev_func; __typeof (GOMP_OFFLOAD_dev2dev) *dev2dev_func; diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index 6355ce2a37b..df52fd3039c 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -4402,10 +4402,41 @@ creating memory allocators requesting The following sections present notes on the offload-target specifics @menu +* @option{-foffload-memory=pinned}:: * AMD Radeon:: * nvptx:: @end menu +@node @option{-foffload-memory=pinned} +@section @option{-foffload-memory=pinned} + +@c The following paragraph is duplicated from +@c '../gcc/doc/invoke.texi', '-foffload-memory=pinned'. +If supported by the active offloading device, +@option{-foffload-memory=pinned} enables automatic use of page-locked +host memory for memory objects participating in host <-> device memory +transfers, for both OpenACC and OpenMP offloading. +Such memory is allocated or registered using the respective offloading +device interfaces, which potentially helps optimization of host <-> +device data transfers. +This option is experimental. +Beware that use of a lot of pinned memory may degrade overall system +performance, as it does reduce the amount of host memory available for +paging. + +An OpenACC @emph{async} @code{enter data}-like operation may register +a memory object as pinned. After the corresponding @emph{async} +@code{exit data}-like operation, this registration does last until +next synchronization point (such as @code{acc_async_synchronize}). +During this time, the user code must not "touch" the host-side memory +allocation -- but that does correspond to the @emph{async} semantics +anyway. + +We don't consider @code{-foffload-memory=pinned} for one-time internal +data transfers, such as setup during device initialization. + + + @node AMD Radeon @section AMD Radeon (GCN) @@ -4459,6 +4490,8 @@ The implementation remark: @item OpenMP @emph{pinned} memory (@code{omp_atk_pinned}, @code{ompx_pinned_mem_alloc}, for example) is allocated via @code{mmap}, @code{mlock}. +@item @option{-foffload-memory=pinned} is not supported, + @pxref{@option{-foffload-memory=pinned}}. @end itemize @@ -4526,6 +4559,8 @@ The implementation remark: is allocated via @code{cuMemHostAlloc} (CUDA Driver API). This potentially helps optimization of host <-> device data transfers. +@item @option{-foffload-memory=pinned} is supported, + @pxref{@option{-foffload-memory=pinned}}. @end itemize diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h index fe66a53d94a..2a515ce7348 100644 --- a/libgomp/libgomp_g.h +++ b/libgomp/libgomp_g.h @@ -365,6 +365,7 @@ extern bool GOMP_teams4 (unsigned int, unsigned int, unsigned int, bool); extern bool GOMP_evaluate_target_device (int, const char *, const char *, const char *); +extern void GOMP_enable_pinned_mode (void); /* teams.c */ @@ -375,7 +376,6 @@ extern void GOMP_teams_reg (void (*) (void *), void *, unsigned, unsigned, extern void *GOMP_alloc (size_t, size_t, uintptr_t); extern void GOMP_free (void *, uintptr_t); -extern void GOMP_enable_pinned_mode (void); /* error.c */ diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c index bd82beefcdb..75ec8958501 100644 --- a/libgomp/oacc-mem.c +++ b/libgomp/oacc-mem.c @@ -199,11 +199,27 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async, } goacc_aq aq = get_goacc_asyncqueue (async); + + int h_page_locked_host_p = 0; + + if (always_pinned_mode + && s != 0) + { + h_page_locked_host_p = gomp_page_locked_host_register_dev + (thr->dev, h, s, from ? GOMP_MAP_FROM : GOMP_MAP_TO); + if (h_page_locked_host_p < 0) + exit (EXIT_FAILURE); + } + if (from) gomp_copy_dev2host (thr->dev, aq, h, d, s); else gomp_copy_host2dev (thr->dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL); + if (h_page_locked_host_p + && !gomp_page_locked_host_unregister_dev (thr->dev, h, s, aq)) + exit (EXIT_FAILURE); + if (profiling_p) { thr->prof_info = NULL; diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def index 9b786c9f2f6..062a141053f 100644 --- a/libgomp/plugin/cuda-lib.def +++ b/libgomp/plugin/cuda-lib.def @@ -31,6 +31,10 @@ CUDA_ONE_CALL (cuMemAlloc) CUDA_ONE_CALL (cuMemAllocHost) CUDA_ONE_CALL (cuMemAllocManaged) CUDA_ONE_CALL (cuMemHostAlloc) +CUDA_ONE_CALL (cuMemHostGetFlags) +CUDA_ONE_CALL_MAYBE_NULL (cuMemHostRegister_v2) +CUDA_ONE_CALL (cuMemHostRegister) +CUDA_ONE_CALL (cuMemHostUnregister) CUDA_ONE_CALL (cuMemcpy) CUDA_ONE_CALL (cuMemcpyDtoDAsync) CUDA_ONE_CALL (cuMemcpyDtoH) diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index 23f89b6fb34..e57a2b30e66 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -78,11 +78,14 @@ extern CUresult cuGetErrorString (CUresult, const char **); CUresult cuLinkAddData (CUlinkState, CUjitInputType, void *, size_t, const char *, unsigned, CUjit_option *, void **); CUresult cuLinkCreate (unsigned, CUjit_option *, void **, CUlinkState *); +#undef cuMemHostRegister +CUresult cuMemHostRegister (void *, size_t, unsigned int); #else typedef size_t (*CUoccupancyB2DSize)(int); CUresult cuLinkAddData_v2 (CUlinkState, CUjitInputType, void *, size_t, const char *, unsigned, CUjit_option *, void **); CUresult cuLinkCreate_v2 (unsigned, CUjit_option *, void **, CUlinkState *); +CUresult cuMemHostRegister_v2 (void *, size_t, unsigned int); CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction, CUoccupancyB2DSize, size_t, int); #endif @@ -218,6 +221,8 @@ static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER; struct goacc_asyncqueue { CUstream cuda_stream; + pthread_mutex_t page_locked_host_unregister_blocks_lock; + struct ptx_free_block *page_locked_host_unregister_blocks; }; struct nvptx_callback @@ -314,6 +319,7 @@ struct ptx_device int warp_size; int max_threads_per_block; int max_threads_per_multiprocessor; + bool read_only_host_register_supported; int default_dims[GOMP_DIM_MAX]; int compute_major, compute_minor; @@ -340,6 +346,33 @@ struct ptx_device static struct ptx_device **ptx_devices; +static struct ptx_free_block *free_host_blocks = NULL; +static pthread_mutex_t free_host_blocks_lock = PTHREAD_MUTEX_INITIALIZER; + +static bool +nvptx_run_deferred_page_locked_host_free (void) +{ + GOMP_PLUGIN_debug (0, "%s\n", + __FUNCTION__); + + pthread_mutex_lock (&free_host_blocks_lock); + struct ptx_free_block *b = free_host_blocks; + free_host_blocks = NULL; + pthread_mutex_unlock (&free_host_blocks_lock); + + while (b) + { + GOMP_PLUGIN_debug (0, " b=%p: cuMemFreeHost(b->ptr=%p)\n", + b, b->ptr); + + struct ptx_free_block *b_next = b->next; + CUDA_CALL (cuMemFreeHost, b->ptr); + free (b); + b = b_next; + } + return true; +} + /* OpenMP kernels reserve a small amount of ".shared" space for use by omp_alloc. The size is configured using GOMP_NVPTX_LOWLAT_POOL, but the default is set here. */ @@ -542,6 +575,19 @@ nvptx_open_device (int n) CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev); assert (r == CUDA_SUCCESS && pi); + /* This is a CUDA 11.1 feature. */ + r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi, + CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED, + dev); + if (r == CUDA_ERROR_INVALID_VALUE) + pi = false; + else if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("cuDeviceGetAttribute error: %s", cuda_error (r)); + return NULL; + } + ptx_dev->read_only_host_register_supported = pi; + for (int i = 0; i != GOMP_DIM_MAX; i++) ptx_dev->default_dims[i] = 0; @@ -1278,6 +1324,11 @@ GOMP_OFFLOAD_init_device (int n) bool GOMP_OFFLOAD_fini_device (int n) { + /* This isn't related to this specific 'ptx_devices[n]', but is a convenient + place to clean up. */ + if (!nvptx_run_deferred_page_locked_host_free ()) + return false; + pthread_mutex_lock (&ptx_dev_lock); if (ptx_devices[n] != NULL) @@ -1711,6 +1762,12 @@ GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size) GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu\n", __FUNCTION__, ptr, (unsigned long long) size); + /* TODO: Maybe running the deferred 'cuMemFreeHost's here is not the best + idea, given that we don't know what context we're called from? (See + 'GOMP_OFFLOAD_run' reverse offload handling.) But, where to do it? */ + if (!nvptx_run_deferred_page_locked_host_free ()) + return false; + CUresult r; unsigned int flags = 0; @@ -1729,16 +1786,243 @@ GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size) return true; } +static void +nvptx_page_locked_host_free_callback (CUstream stream, CUresult r, void *ptr) +{ + GOMP_PLUGIN_debug (0, "%s: stream=%p, r=%u, ptr=%p\n", + __FUNCTION__, stream, (unsigned) r, ptr); + + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (r)); + + /* We can't now call 'cuMemFreeHost': we're in a CUDA stream context, + where we "must not make any CUDA API calls". + And, in particular in an OpenMP 'target' reverse offload context, + this may even dead-lock?! */ + /* See 'nvptx_free'. */ + struct ptx_free_block *n + = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block)); + GOMP_PLUGIN_debug (0, " defer; n=%p\n", n); + n->ptr = ptr; + pthread_mutex_lock (&free_host_blocks_lock); + n->next = free_host_blocks; + free_host_blocks = n; + pthread_mutex_unlock (&free_host_blocks_lock); +} + +bool +GOMP_OFFLOAD_page_locked_host_free (void *ptr, struct goacc_asyncqueue *aq) +{ + GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, aq=%p\n", + __FUNCTION__, ptr, aq); + + if (aq) + { + GOMP_PLUGIN_debug (0, " aq <-" + " nvptx_page_locked_host_free_callback(ptr)\n"); + CUDA_CALL (cuStreamAddCallback, aq->cuda_stream, + nvptx_page_locked_host_free_callback, ptr, 0); + } + else + CUDA_CALL (cuMemFreeHost, ptr); + return true; +} + +static int +nvptx_page_locked_host_p (const void *ptr, size_t size) +{ + GOMP_PLUGIN_debug (0, "%s: ptr=%p, size=%llu\n", + __FUNCTION__, ptr, (unsigned long long) size); + + int ret; + + CUresult r; + + /* Apparently, there exists no CUDA call to query 'PTR + [0, SIZE)'. Instead + of invoking 'cuMemHostGetFlags' SIZE times, we deem it sufficient to only + query the base PTR. */ + unsigned int flags; + void *ptr_noconst = (void *) ptr; + r = CUDA_CALL_NOCHECK (cuMemHostGetFlags, &flags, ptr_noconst); + (void) flags; + if (r == CUDA_SUCCESS) + ret = 1; + else if (r == CUDA_ERROR_INVALID_VALUE) + ret = 0; + else + { + GOMP_PLUGIN_error ("cuMemHostGetFlags error: %s", cuda_error (r)); + ret = -1; + } + GOMP_PLUGIN_debug (0, " -> %d (with r = %u)\n", + ret, (unsigned) r); + return ret; +} + +int +GOMP_OFFLOAD_page_locked_host_register (int ord, + void *ptr, size_t size, int kind) +{ + bool try_read_only; + /* Magic number: if the actualy mapping kind is unknown... */ + if (kind == -1) + /* ..., allow for trying read-only registration here. */ + try_read_only = true; + else + try_read_only = !GOMP_MAP_COPY_FROM_P (kind); + GOMP_PLUGIN_debug (0, "nvptx %s: ord=%d, ptr=%p, size=%llu," + " kind=%d (try_read_only=%d)\n", + __FUNCTION__, ord, ptr, (unsigned long long) size, + kind, try_read_only); + assert (size != 0); + + if (!nvptx_attach_host_thread_to_device (ord)) + return -1; + struct ptx_device *ptx_dev = ptx_devices[ord]; + + int ret = -1; + + CUresult r; + + unsigned int flags = 0; + /* Given 'CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING', we don't need + 'flags |= CU_MEMHOSTREGISTER_PORTABLE;' here. */ + cuMemHostRegister: + if (CUDA_CALL_EXISTS (cuMemHostRegister_v2)) + r = CUDA_CALL_NOCHECK (cuMemHostRegister_v2, ptr, size, flags); + else + r = CUDA_CALL_NOCHECK (cuMemHostRegister, ptr, size, flags); + if (r == CUDA_SUCCESS) + ret = 1; + else if (r == CUDA_ERROR_INVALID_VALUE) + { + /* For example, for 'cuMemHostAlloc' (via the user code, for example) + followed by 'cuMemHostRegister' (via 'always_pinned_mode', for + example), we don't get 'CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED' but + 'CUDA_ERROR_INVALID_VALUE'. */ + if (nvptx_page_locked_host_p (ptr, size)) + /* Accept the case that the region already is page-locked. */ + ret = 0; + /* Depending on certain circumstances (see 'cuMemHostRegister' + documentation), for example, 'const' data that is placed in section + '.rodata' may need 'flags |= CU_MEMHOSTREGISTER_READ_ONLY;', to avoid + 'CUDA_ERROR_INVALID_VALUE'. If running into that, we now apply/re-try + lazily instead of actively setting it above, to avoid the following + problem. Supposedly/observably (but, not documented), if part of a + memory page has been registered without 'CU_MEMHOSTREGISTER_READ_ONLY' + and we then try to register another part with + 'CU_MEMHOSTREGISTER_READ_ONLY', we'll get 'CUDA_ERROR_INVALID_VALUE'. + In that case, we can solve the issue by re-trying with + 'CU_MEMHOSTREGISTER_READ_ONLY' masked out. However, if part of a + memory page has been registered with 'CU_MEMHOSTREGISTER_READ_ONLY' + and we then try to register another part without + 'CU_MEMHOSTREGISTER_READ_ONLY', that latter part apparently inherits + the former's 'CU_MEMHOSTREGISTER_READ_ONLY' (and any device to host + copy then fails). We can't easily resolve that situation + retroactively, that is, we can't easily re-register the first + 'CU_MEMHOSTREGISTER_READ_ONLY' part without that flag. */ + else if (!(flags & CU_MEMHOSTREGISTER_READ_ONLY) + && try_read_only + && ptx_dev->read_only_host_register_supported) + { + GOMP_PLUGIN_debug (0, " flags |= CU_MEMHOSTREGISTER_READ_ONLY;\n"); + flags |= CU_MEMHOSTREGISTER_READ_ONLY; + goto cuMemHostRegister; + } + /* We ought to use 'CU_MEMHOSTREGISTER_READ_ONLY', but it's not + available. */ + else if (try_read_only + && !ptx_dev->read_only_host_register_supported) + { + assert (!(flags & CU_MEMHOSTREGISTER_READ_ONLY)); + GOMP_PLUGIN_debug (0, " punt;" + " CU_MEMHOSTREGISTER_READ_ONLY not available\n"); + /* Accept this (legacy) case; we can't (easily) register page-locked + this region of host memory. */ + ret = 0; + } + } + else if (r == CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED) + { + /* 'cuMemHostRegister' (via the user code, for example) followed by + another (potentially partially overlapping) 'cuMemHostRegister' + (via 'always_pinned_mode', for example). */ + /* Accept this case in good faith; do not verify further. */ + ret = 0; + } + if (ret == -1) + GOMP_PLUGIN_error ("cuMemHostRegister error: %s", cuda_error (r)); + GOMP_PLUGIN_debug (0, " -> %d (with r = %u)\n", + ret, (unsigned) r); + return ret; +} + +static void +nvptx_page_locked_host_unregister_callback (CUstream stream, CUresult r, + void *b_) +{ + void **b = b_; + struct goacc_asyncqueue *aq = b[0]; + void *ptr = b[1]; + GOMP_PLUGIN_debug (0, "%s: stream=%p, r=%u, b_=%p (aq=%p, ptr=%p)\n", + __FUNCTION__, stream, (unsigned) r, b_, aq, ptr); + + free (b_); + + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (r)); + + /* We can't now call 'cuMemHostUnregister': we're in a CUDA stream context, + where we "must not make any CUDA API calls". */ + /* See 'nvptx_free'. */ + struct ptx_free_block *n + = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block)); + GOMP_PLUGIN_debug (0, " defer; n=%p\n", n); + n->ptr = ptr; + pthread_mutex_lock (&aq->page_locked_host_unregister_blocks_lock); + n->next = aq->page_locked_host_unregister_blocks; + aq->page_locked_host_unregister_blocks = n; + pthread_mutex_unlock (&aq->page_locked_host_unregister_blocks_lock); +} + bool -GOMP_OFFLOAD_page_locked_host_free (void *ptr) +GOMP_OFFLOAD_page_locked_host_unregister (void *ptr, size_t size, + struct goacc_asyncqueue *aq) { - GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p\n", - __FUNCTION__, ptr); + GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu, aq=%p\n", + __FUNCTION__, ptr, (unsigned long long) size, aq); + assert (size != 0); - CUDA_CALL (cuMemFreeHost, ptr); + if (aq) + { + /* We don't unregister right away, as in-flight operations may still + benefit from the registration. */ + void **b = GOMP_PLUGIN_malloc (2 * sizeof (*b)); + b[0] = aq; + b[1] = ptr; + GOMP_PLUGIN_debug (0, " aq <-" + " nvptx_page_locked_host_unregister_callback(b=%p)\n", + b); + CUDA_CALL (cuStreamAddCallback, aq->cuda_stream, + nvptx_page_locked_host_unregister_callback, b, 0); + } + else + CUDA_CALL (cuMemHostUnregister, ptr); return true; } +int +GOMP_OFFLOAD_page_locked_host_p (int ord, const void *ptr, size_t size) +{ + GOMP_PLUGIN_debug (0, "nvptx %s: ord=%d, ptr=%p, size=%llu\n", + __FUNCTION__, ord, ptr, (unsigned long long) size); + + if (!nvptx_attach_host_thread_to_device (ord)) + return -1; + + return nvptx_page_locked_host_p (ptr, size); +} + void GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), @@ -1841,12 +2125,19 @@ GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream) static struct goacc_asyncqueue * nvptx_goacc_asyncqueue_construct (unsigned int flags) { + GOMP_PLUGIN_debug (0, "%s: flags=%u\n", + __FUNCTION__, flags); + CUstream stream = NULL; CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, flags); struct goacc_asyncqueue *aq = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue)); aq->cuda_stream = stream; + pthread_mutex_init (&aq->page_locked_host_unregister_blocks_lock, NULL); + aq->page_locked_host_unregister_blocks = NULL; + GOMP_PLUGIN_debug (0, " -> aq=%p (with cuda_stream=%p)\n", + aq, aq->cuda_stream); return aq; } @@ -1859,9 +2150,24 @@ GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused))) static bool nvptx_goacc_asyncqueue_destruct (struct goacc_asyncqueue *aq) { + GOMP_PLUGIN_debug (0, "nvptx %s: aq=%p\n", + __FUNCTION__, aq); + CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream); + + bool ret = true; + pthread_mutex_lock (&aq->page_locked_host_unregister_blocks_lock); + if (aq->page_locked_host_unregister_blocks != NULL) + { + GOMP_PLUGIN_error ("aq->page_locked_host_unregister_blocks not empty"); + ret = false; + } + pthread_mutex_unlock (&aq->page_locked_host_unregister_blocks_lock); + pthread_mutex_destroy (&aq->page_locked_host_unregister_blocks_lock); + free (aq); - return true; + + return ret; } bool @@ -1870,12 +2176,50 @@ GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq) return nvptx_goacc_asyncqueue_destruct (aq); } +static bool +nvptx_run_deferred_page_locked_host_unregister (struct goacc_asyncqueue *aq) +{ + GOMP_PLUGIN_debug (0, "%s: aq=%p\n", + __FUNCTION__, aq); + + bool ret = true; + pthread_mutex_lock (&aq->page_locked_host_unregister_blocks_lock); + for (struct ptx_free_block *b = aq->page_locked_host_unregister_blocks; b;) + { + GOMP_PLUGIN_debug (0, " b=%p: cuMemHostUnregister(b->ptr=%p)\n", + b, b->ptr); + + struct ptx_free_block *b_next = b->next; + CUresult r = CUDA_CALL_NOCHECK (cuMemHostUnregister, b->ptr); + if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("cuMemHostUnregister error: %s", cuda_error (r)); + ret = false; + } + free (b); + b = b_next; + } + aq->page_locked_host_unregister_blocks = NULL; + pthread_mutex_unlock (&aq->page_locked_host_unregister_blocks_lock); + return ret; +} + int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq) { + GOMP_PLUGIN_debug (0, "nvptx %s: aq=%p\n", + __FUNCTION__, aq); + CUresult r = CUDA_CALL_NOCHECK (cuStreamQuery, aq->cuda_stream); if (r == CUDA_SUCCESS) - return 1; + { + /* As a user may expect that they don't need to 'wait' if + 'acc_async_test' returns 'true', clean up here, too. */ + if (!nvptx_run_deferred_page_locked_host_unregister (aq)) + return -1; + + return 1; + } if (r == CUDA_ERROR_NOT_READY) return 0; @@ -1886,7 +2230,17 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq) static bool nvptx_goacc_asyncqueue_synchronize (struct goacc_asyncqueue *aq) { + GOMP_PLUGIN_debug (0, "%s: aq=%p\n", + __FUNCTION__, aq); + CUDA_CALL_ERET (false, cuStreamSynchronize, aq->cuda_stream); + + /* This is called from a user code (non-stream) context, and upon returning, + we must've given up on any page-locked memory registrations, so unregister + any pending ones now. */ + if (!nvptx_run_deferred_page_locked_host_unregister (aq)) + return false; + return true; } @@ -1896,14 +2250,70 @@ GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq) return nvptx_goacc_asyncqueue_synchronize (aq); } +static void +nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback +(CUstream stream, CUresult r, void *b_) +{ + void **b = b_; + struct goacc_asyncqueue *aq1 = b[0]; + struct goacc_asyncqueue *aq2 = b[1]; + GOMP_PLUGIN_debug (0, "%s: stream=%p, r=%u, b_=%p (aq1=%p, aq2=%p)\n", + __FUNCTION__, stream, (unsigned) r, b_, aq1, aq2); + + free (b_); + + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (r)); + + pthread_mutex_lock (&aq1->page_locked_host_unregister_blocks_lock); + if (aq1->page_locked_host_unregister_blocks) + { + pthread_mutex_lock (&aq2->page_locked_host_unregister_blocks_lock); + GOMP_PLUGIN_debug (0, " page_locked_host_unregister_blocks:" + " aq1 -> aq2\n"); + if (aq2->page_locked_host_unregister_blocks == NULL) + aq2->page_locked_host_unregister_blocks + = aq1->page_locked_host_unregister_blocks; + else + { + struct ptx_free_block *b = aq2->page_locked_host_unregister_blocks; + while (b->next != NULL) + b = b->next; + b->next = aq1->page_locked_host_unregister_blocks; + } + pthread_mutex_unlock (&aq2->page_locked_host_unregister_blocks_lock); + aq1->page_locked_host_unregister_blocks = NULL; + } + pthread_mutex_unlock (&aq1->page_locked_host_unregister_blocks_lock); +} + bool GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1, struct goacc_asyncqueue *aq2) { + GOMP_PLUGIN_debug (0, "nvptx %s: aq1=%p, aq2=%p\n", + __FUNCTION__, aq1, aq2); + + if (aq1 != aq2) + { + void **b = GOMP_PLUGIN_malloc (2 * sizeof (*b)); + b[0] = aq1; + b[1] = aq2; + /* Enqueue on 'aq1': move 'page_locked_host_unregister_blocks' of 'aq1' + to 'aq2'. */ + GOMP_PLUGIN_debug (0, " aq1 <-" + " nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback" + "(b=%p)\n", b); + CUDA_CALL (cuStreamAddCallback, aq1->cuda_stream, + nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback, + b, 0); + } + CUevent e; CUDA_CALL_ERET (false, cuEventCreate, &e, CU_EVENT_DISABLE_TIMING); CUDA_CALL_ERET (false, cuEventRecord, e, aq1->cuda_stream); CUDA_CALL_ERET (false, cuStreamWaitEvent, aq2->cuda_stream, e, 0); + return true; } @@ -2238,6 +2648,19 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args) if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq)) exit (EXIT_FAILURE); __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE); + + /* Clean up here; otherwise we may run into the situation that + a following reverse offload does + 'GOMP_OFFLOAD_page_locked_host_alloc', and that then runs the + deferred 'cuMemFreeHost's -- which may dead-lock?! + TODO: This may need more considerations for the case that + different host threads do reverse offload? We could move + 'free_host_blocks' into 'aq' (which is separate per reverse + offload) instead of global, like + 'page_locked_host_unregister_blocks', but that doesn't seem the + right thing for OpenACC 'async' generally? */ + if (!nvptx_run_deferred_page_locked_host_free ()) + exit (EXIT_FAILURE); } usleep (1); } diff --git a/libgomp/target.c b/libgomp/target.c index b88b1ebaa13..ed2fc09cf44 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -108,6 +108,74 @@ static int num_devices_openmp; /* OpenMP requires mask. */ static int omp_requires_mask; + +static void *gomp_page_locked_host_alloc_dev (struct gomp_device_descr *, + size_t, bool); +static bool gomp_page_locked_host_free_dev (struct gomp_device_descr *, + void *, + struct goacc_asyncqueue *); +static void *gomp_page_locked_host_aligned_alloc_dev (struct gomp_device_descr *, + size_t, size_t); +static bool gomp_page_locked_host_aligned_free_dev (struct gomp_device_descr *, + void *, + struct goacc_asyncqueue *); + +/* Use (that is, allocate or register) page-locked host memory for memory + objects participating in host <-> device memory transfers. + + When this is enabled, there is no fallback to non-page-locked host + memory. */ + +attribute_hidden +bool always_pinned_mode = false; + +/* This function is called by the compiler when -foffload-memory=pinned + is used. */ + +void +GOMP_enable_pinned_mode () +{ + always_pinned_mode = true; +} + +/* Verify that page-locked host memory is used for memory objects participating + in host <-> device memory transfers. */ + +static const bool verify_always_pinned_mode = false; + +static bool +gomp_verify_always_pinned_mode (struct gomp_device_descr *device, + const void *ptr, size_t size) +{ + gomp_debug (0, "%s: device=%p (%s), ptr=%p, size=%llu\n", + __FUNCTION__, + device, device->name, ptr, (unsigned long long) size); + + if (size == 0) + /* Skip zero-size requests; for those we've got no actual region of + page-locked host memory. */ + ; + else if (device->page_locked_host_register_func) + { + int page_locked_host_p + = device->page_locked_host_p_func (device->target_id, ptr, size); + if (page_locked_host_p < 0) + { + gomp_error ("Failed to test page-locked host memory" + " via %s libgomp plugin", + device->name); + return false; + } + if (!page_locked_host_p) + { + gomp_error ("Failed page-locked host memory test"); + return false; + } + } + return true; +} + + /* Similar to gomp_realloc, but release register_lock before gomp_fatal. */ static void * @@ -402,6 +470,9 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep, if (__builtin_expect (aq != NULL, 0)) assert (ephemeral); + /* We're just filling the CBUF; 'always_pinned_mode' isn't + relevant. */ + memcpy ((char *) cbuf->buf + (doff - cbuf->chunks[0].start), h, sz); return; @@ -422,18 +493,92 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep, stack local in a function that is no longer executing). As we've not been able to use CBUF, make a copy of the data into a temporary buffer. */ - h_buf = gomp_malloc (sz); + if (always_pinned_mode) + { + h_buf = gomp_page_locked_host_alloc_dev (devicep, sz, false); + if (!h_buf) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + h_buf = gomp_malloc (sz); memcpy (h_buf, h, sz); } + + /* No 'gomp_verify_always_pinned_mode' for 'ephemeral'; have just + allocated. */ + if (!ephemeral + && verify_always_pinned_mode + && always_pinned_mode) + if (!gomp_verify_always_pinned_mode (devicep, h_buf, sz)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func, "dev", d, "host", h_buf, h, sz, aq); + if (ephemeral) - /* Free once the transfer has completed. */ - devicep->openacc.async.queue_callback_func (aq, free, h_buf); + { + if (always_pinned_mode) + { + if (!gomp_page_locked_host_free_dev (devicep, h_buf, aq)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + /* Free once the transfer has completed. */ + devicep->openacc.async.queue_callback_func (aq, free, h_buf); + } } else - gomp_device_copy (devicep, devicep->host2dev_func, - "dev", d, "host", h, sz); + { + if (ephemeral + && always_pinned_mode) + { + /* TODO: Page-locking on the spot probably doesn't make a lot of + sense (performance-wise). Should we instead use a "page-locked + host memory bounce buffer" (per host thread, or per device, + or...)? */ + void *ptr = (void *) h; + int page_locked_host_p + = gomp_page_locked_host_register_dev (devicep, + ptr, sz, GOMP_MAP_TO); + if (page_locked_host_p < 0) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + /* Ephemeral data isn't already page-locked host memory. */ + assert (page_locked_host_p); + } + else if (verify_always_pinned_mode + && always_pinned_mode) + if (!gomp_verify_always_pinned_mode (devicep, h, sz)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + + gomp_device_copy (devicep, devicep->host2dev_func, + "dev", d, "host", h, sz); + + if (ephemeral + && always_pinned_mode) + { + void *ptr = (void *) h; + if (!gomp_page_locked_host_unregister_dev (devicep, ptr, sz, aq)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + } } attribute_hidden void @@ -441,6 +586,14 @@ gomp_copy_dev2host (struct gomp_device_descr *devicep, struct goacc_asyncqueue *aq, void *h, const void *d, size_t sz) { + if (verify_always_pinned_mode + && always_pinned_mode) + if (!gomp_verify_always_pinned_mode (devicep, h, sz)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + if (__builtin_expect (aq != NULL, 0)) goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func, "host", h, "dev", d, NULL, sz, aq); @@ -1367,8 +1520,19 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, cbuf.chunk_cnt--; if (cbuf.chunk_cnt > 0) { - cbuf.buf - = malloc (cbuf.chunks[cbuf.chunk_cnt - 1].end - cbuf.chunks[0].start); + size_t sz + = cbuf.chunks[cbuf.chunk_cnt - 1].end - cbuf.chunks[0].start; + if (always_pinned_mode) + { + cbuf.buf = gomp_page_locked_host_alloc_dev (devicep, sz, false); + if (!cbuf.buf) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + cbuf.buf = malloc (sz); if (cbuf.buf) { cbuf.tgt = tgt; @@ -1671,6 +1835,23 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, k->tgt = tgt; k->refcount = 0; k->dynamic_refcount = 0; + k->page_locked_host_p = false; + if (always_pinned_mode) + { + void *ptr = (void *) k->host_start; + size_t size = k->host_end - k->host_start; + int page_locked_host_p = 0; + if (size != 0) + page_locked_host_p = gomp_page_locked_host_register_dev + (devicep, ptr, size, kind & typemask); + if (page_locked_host_p < 0) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + if (page_locked_host_p) + k->page_locked_host_p = true; + } if (field_tgt_clear != FIELD_TGT_EMPTY) { k->tgt_offset = k->host_start - field_tgt_base @@ -1976,11 +2157,22 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, - cbuf.chunks[0].start), cbuf.chunks[c].end - cbuf.chunks[c].start, false, NULL); - if (aq) - /* Free once the transfer has completed. */ - devicep->openacc.async.queue_callback_func (aq, free, cbuf.buf); + if (always_pinned_mode) + { + if (!gomp_page_locked_host_free_dev (devicep, cbuf.buf, aq)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } else - free (cbuf.buf); + { + if (aq) + /* Free once the transfer has completed. */ + devicep->openacc.async.queue_callback_func (aq, free, cbuf.buf); + else + free (cbuf.buf); + } cbuf.buf = NULL; cbufp = NULL; } @@ -2112,6 +2304,23 @@ gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k, /* Starting from the _FIRST key, and continue for all following sibling keys. */ gomp_remove_splay_tree_key (&devicep->mem_map, k); + + if (always_pinned_mode) + { + if (k->page_locked_host_p) + { + void *ptr = (void *) k->host_start; + size_t size = k->host_end - k->host_start; + if (!gomp_page_locked_host_unregister_dev (devicep, + ptr, size, aq)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + k->page_locked_host_p = false; + } + } + if (REFCOUNT_STRUCTELEM_LAST_P (k->refcount)) break; else @@ -2119,7 +2328,25 @@ gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k, } } else - gomp_remove_splay_tree_key (&devicep->mem_map, k); + { + gomp_remove_splay_tree_key (&devicep->mem_map, k); + + if (always_pinned_mode) + { + if (k->page_locked_host_p) + { + void *ptr = (void *) k->host_start; + size_t size = k->host_end - k->host_start; + if (!gomp_page_locked_host_unregister_dev (devicep, + ptr, size, aq)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + k->page_locked_host_p = false; + } + } + } if (aq) devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void, @@ -2211,6 +2438,8 @@ gomp_unmap_vars_internal (struct target_mem_desc *tgt, bool do_copyfrom, + tgt->list[i].offset), tgt->list[i].length); /* Queue all removals together for processing below. + We may unregister page-locked host memory only after all device to + host memory transfers have completed. See also 'gomp_exit_data'. */ if (do_remove) remove_vars[nrmvars++] = k; @@ -2392,8 +2621,17 @@ get_gomp_offload_icvs (int dev_num) if (offload_icvs != NULL) return &offload_icvs->icvs; - struct gomp_offload_icv_list *new - = (struct gomp_offload_icv_list *) gomp_malloc (sizeof (struct gomp_offload_icv_list)); + struct gomp_offload_icv_list *new; + size_t size = sizeof (struct gomp_offload_icv_list); + if (always_pinned_mode) + { + struct gomp_device_descr *device = &devices[dev_num]; + new = gomp_page_locked_host_alloc_dev (device, size, false); + if (!new) + exit (EXIT_FAILURE); + } + else + new = gomp_malloc (size); new->device_num = dev_num; new->icvs.device_num = dev_num; @@ -2447,6 +2685,8 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, const void *host_table, const void *target_data, bool is_register_lock) { + gomp_debug (0, "%s: devicep=%p (%s)\n", + __FUNCTION__, devicep, devicep->name); void **host_func_table = ((void ***) host_table)[0]; void **host_funcs_end = ((void ***) host_table)[1]; void **host_var_table = ((void ***) host_table)[2]; @@ -2511,6 +2751,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, k->refcount = REFCOUNT_INFINITY; k->dynamic_refcount = 0; k->aux = NULL; + k->page_locked_host_p = false; array->left = NULL; array->right = NULL; splay_tree_insert (&devicep->mem_map, array); @@ -2556,6 +2797,34 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, k->refcount = is_link_var ? REFCOUNT_LINK : REFCOUNT_INFINITY; k->dynamic_refcount = 0; k->aux = NULL; + k->page_locked_host_p = false; + if (always_pinned_mode) + { + void *ptr = (void *) k->host_start; + size_t size = k->host_end - k->host_start; + gomp_debug (0, " var %d: ptr=%p, size=%llu, is_link_var=%d\n", + i, ptr, (unsigned long long) size, is_link_var); + if (!is_link_var) + { + /* '#pragma omp declare target' variables typically are + read/write, but in particular artificial ones, like Fortran + array constructors, may be placed in section '.rodata'. + We don't have the actual mapping kind available here, so we + use a magic number. */ + const int kind = -1; + int page_locked_host_p = gomp_page_locked_host_register_dev + (devicep, ptr, size, kind); + if (page_locked_host_p < 0) + { + gomp_mutex_unlock (&devicep->lock); + if (is_register_lock) + gomp_mutex_unlock (®ister_lock); + exit (EXIT_FAILURE); + } + if (page_locked_host_p) + k->page_locked_host_p = true; + } + } array->left = NULL; array->right = NULL; splay_tree_insert (&devicep->mem_map, array); @@ -2577,6 +2846,13 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, devicep->target_id. */ int dev_num = (int) (devicep - &devices[0]); struct gomp_offload_icvs *icvs = get_gomp_offload_icvs (dev_num); + if (!icvs) + { + gomp_mutex_unlock (&devicep->lock); + if (is_register_lock) + gomp_mutex_unlock (®ister_lock); + gomp_fatal ("'get_gomp_offload_icvs' failed"); + } size_t var_size = var->end - var->start; if (var_size != sizeof (struct gomp_offload_icvs)) { @@ -2599,6 +2875,8 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, k->refcount = REFCOUNT_INFINITY; k->dynamic_refcount = 0; k->aux = NULL; + /* 'always_pinned_mode' handled via 'get_gomp_offload_icvs'. */ + k->page_locked_host_p = always_pinned_mode; array->left = NULL; array->right = NULL; splay_tree_insert (&devicep->mem_map, array); @@ -3261,6 +3539,12 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum, flags = clear_unsupported_flags (devicep, flags); + /* For 'nowait' we supposedly have to unregister/free page-locked host memory + via 'GOMP_PLUGIN_target_task_completion'. There is no current + configuration exercising this (and thus, infeasible to test). */ + assert (!(flags & GOMP_TARGET_FLAG_NOWAIT) + || !(devicep && devicep->page_locked_host_register_func)); + if (flags & GOMP_TARGET_FLAG_NOWAIT) { struct gomp_thread *thr = gomp_thread (); @@ -3572,18 +3856,37 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, } else { - devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t)); - sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t)); - kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short)); + size_t devaddrs_size = mapnum * sizeof (uint64_t); + size_t sizes_size = mapnum * sizeof (uint64_t); + size_t kinds_size = mapnum * sizeof (unsigned short); + if (always_pinned_mode) + { + if (!(devaddrs = gomp_page_locked_host_alloc_dev (devicep, + devaddrs_size, + false)) + || !(sizes = gomp_page_locked_host_alloc_dev (devicep, + sizes_size, + false)) + || !(kinds = gomp_page_locked_host_alloc_dev (devicep, + kinds_size, + false))) + exit (EXIT_FAILURE); + } + else + { + devaddrs = gomp_malloc (devaddrs_size); + sizes = gomp_malloc (sizes_size); + kinds = gomp_malloc (kinds_size); + } gomp_copy_dev2host (devicep, aq, devaddrs, (const void *) (uintptr_t) devaddrs_ptr, - mapnum * sizeof (uint64_t)); + devaddrs_size); gomp_copy_dev2host (devicep, aq, sizes, (const void *) (uintptr_t) sizes_ptr, - mapnum * sizeof (uint64_t)); + sizes_size); gomp_copy_dev2host (devicep, aq, kinds, (const void *) (uintptr_t) kinds_ptr, - mapnum * sizeof (unsigned short)); + kinds_size); if (aq && !devicep->openacc.async.synchronize_func (aq)) exit (EXIT_FAILURE); } @@ -3598,7 +3901,23 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, if (tgt_align) { - char *tgt = gomp_alloca (tgt_size + tgt_align - 1); + size_t tgt_alloc_size = tgt_size + tgt_align - 1; + char *tgt = gomp_alloca (tgt_alloc_size); + if (always_pinned_mode) + { + /* TODO: See 'gomp_copy_host2dev' re "page-locking on the spot". + On the other hand, performance isn't really a concern, here. */ + int page_locked_host_p = 0; + if (tgt_alloc_size != 0) + { + page_locked_host_p = gomp_page_locked_host_register_dev + (devicep, tgt, tgt_alloc_size, GOMP_MAP_TOFROM); + if (page_locked_host_p < 0) + exit (EXIT_FAILURE); + /* 'gomp_alloca' isn't already page-locked host memory. */ + assert (page_locked_host_p); + } + } uintptr_t al = (uintptr_t) tgt & (tgt_align - 1); if (al) tgt += tgt_align - al; @@ -3632,6 +3951,14 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, ++i; } } + if (always_pinned_mode) + { + if (tgt_alloc_size != 0 + && !gomp_page_locked_host_unregister_dev (devicep, + tgt, tgt_alloc_size, + NULL)) + exit (EXIT_FAILURE); + } } if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM) && mapnum > 0) @@ -3718,9 +4045,20 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, { cdata[i].aligned = true; size_t align = (size_t) 1 << (kinds[i] >> 8); - devaddrs[i] - = (uint64_t) (uintptr_t) gomp_aligned_alloc (align, - sizes[i]); + void *ptr; + if (always_pinned_mode) + { + ptr = gomp_page_locked_host_aligned_alloc_dev + (devicep, align, sizes[i]); + if (!ptr) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + ptr = gomp_aligned_alloc (align, sizes[i]); + devaddrs[i] = (uint64_t) (uintptr_t) ptr; } else if (n2 != NULL) devaddrs[i] = (n2->host_start + cdata[i].devaddr @@ -3770,7 +4108,23 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, } } if (!cdata[i].present) - devaddrs[i] = (uintptr_t) gomp_malloc (sizeof (void*)); + { + void *ptr; + size_t size = sizeof (void *); + if (always_pinned_mode) + { + ptr = gomp_page_locked_host_alloc_dev (devicep, + size, false); + if (!ptr) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + ptr = gomp_malloc (size); + devaddrs[i] = (uintptr_t) ptr; + } /* Assume that when present, the pointer is already correct. */ if (!n2) *(uint64_t *) (uintptr_t) (devaddrs[i] + sizes[i]) @@ -3803,9 +4157,20 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, { cdata[i].aligned = true; size_t align = (size_t) 1 << (kinds[i] >> 8); - devaddrs[i] - = (uint64_t) (uintptr_t) gomp_aligned_alloc (align, - sizes[i]); + void *ptr; + if (always_pinned_mode) + { + ptr = gomp_page_locked_host_aligned_alloc_dev + (devicep, align, sizes[i]); + if (!ptr) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + ptr = gomp_aligned_alloc (align, sizes[i]); + devaddrs[i] = (uint64_t) (uintptr_t) ptr; gomp_copy_dev2host (devicep, aq, (void *) (uintptr_t) devaddrs[i], (void *) (uintptr_t) cdata[i].devaddr, @@ -3881,7 +4246,20 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, + sizes[i + sizes[i]]); size_t align = (size_t) 1 << (kinds[i] >> 8); cdata[i].aligned = true; - devaddrs[i] = (uintptr_t) gomp_aligned_alloc (align, sz); + void *ptr; + if (always_pinned_mode) + { + ptr = gomp_page_locked_host_aligned_alloc_dev + (devicep, align, sz); + if (!ptr) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + ptr = gomp_aligned_alloc (align, sz); + devaddrs[i] = (uintptr_t) ptr; devaddrs[i] -= devaddrs[i+1] - cdata[i].devaddr; } else @@ -3945,9 +4323,29 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, struct_cpy = sizes[i]; } else if (!cdata[i].present && cdata[i].aligned) - gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]); + { + void *ptr = (void *) (uintptr_t) devaddrs[i]; + if (always_pinned_mode) + { + if (!gomp_page_locked_host_aligned_free_dev (devicep, + ptr, + aq)) + exit (EXIT_FAILURE); + } + else + gomp_aligned_free (ptr); + } else if (!cdata[i].present) - free ((void *) (uintptr_t) devaddrs[i]); + { + void *ptr = (void *) (uintptr_t) devaddrs[i]; + if (always_pinned_mode) + { + if (!gomp_page_locked_host_free_dev (devicep, ptr, aq)) + exit (EXIT_FAILURE); + } + else + free (ptr); + } } if (clean_struct) for (uint64_t i = 0; i < mapnum; i++) @@ -3956,12 +4354,30 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr, == GOMP_MAP_STRUCT)) { devaddrs[i] += cdata[i+1].devaddr - cdata[i].devaddr; - gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]); + void *ptr = (void *) (uintptr_t) devaddrs[i]; + if (always_pinned_mode) + { + if (!gomp_page_locked_host_aligned_free_dev (devicep, + ptr, aq)) + exit (EXIT_FAILURE); + } + else + gomp_aligned_free (ptr); } - free (devaddrs); - free (sizes); - free (kinds); + if (always_pinned_mode) + { + if (!gomp_page_locked_host_free_dev (devicep, devaddrs, aq) + || !gomp_page_locked_host_free_dev (devicep, sizes, aq) + || !gomp_page_locked_host_free_dev (devicep, kinds, aq)) + exit (EXIT_FAILURE); + } + else + { + free (devaddrs); + free (sizes); + free (kinds); + } } } @@ -4585,6 +5001,160 @@ gomp_usm_free (void *device_ptr, int device_num) } +/* Allocate page-locked host memory via DEVICE. */ + +static void * +gomp_page_locked_host_alloc_dev (struct gomp_device_descr *device, + size_t size, bool allow_null) +{ + gomp_debug (0, "%s: device=%p (%s), size=%llu\n", + __FUNCTION__, device, device->name, (unsigned long long) size); + + void *ret; + if (!device->page_locked_host_alloc_func (&ret, size)) + { + const char *fmt + = "Failed to allocate page-locked host memory via %s libgomp plugin"; + if (allow_null) + gomp_fatal (fmt, device->name); + else + gomp_error (fmt, device->name); + ret = NULL; + } + else if (ret == NULL && !allow_null) + gomp_error ("Out of memory allocating %lu bytes" + " page-locked host memory" + " via %s libgomp plugin", + (unsigned long) size, device->name); + else + gomp_debug (0, " -> ret=[%p, %p)\n", + ret, ret + size); + return ret; +} + +/* Free page-locked host memory via DEVICE. */ + +static bool +gomp_page_locked_host_free_dev (struct gomp_device_descr *device, + void *ptr, + struct goacc_asyncqueue *aq) +{ + gomp_debug (0, "%s: device=%p (%s), ptr=%p, aq=%p\n", + __FUNCTION__, device, device->name, ptr, aq); + + if (!device->page_locked_host_free_func (ptr, aq)) + { + gomp_error ("Failed to free page-locked host memory" + " via %s libgomp plugin", + device->name); + return false; + } + return true; +} + +/* Allocate aligned page-locked host memory via DEVICE. + + That is, 'gomp_aligned_alloc' (see 'alloc.c') for page-locked host + memory. */ + +static void * +gomp_page_locked_host_aligned_alloc_dev (struct gomp_device_descr *device, + size_t al, size_t size) +{ + gomp_debug (0, "%s: device=%p (%s), al=%llu, size=%llu\n", + __FUNCTION__, device, device->name, + (unsigned long long) al, (unsigned long long) size); + + void *ret; + if (al < sizeof (void *)) + al = sizeof (void *); + ret = NULL; + if ((al & (al - 1)) == 0 && size) + { + void *p = gomp_page_locked_host_alloc_dev (device, size + al, true); + if (p) + { + void *ap = (void *) (((uintptr_t) p + al) & -al); + ((void **) ap)[-1] = p; + ret = ap; + } + } + if (ret == NULL) + gomp_error ("Out of memory allocating %lu bytes", (unsigned long) size); + else + gomp_debug (0, " -> ret=[%p, %p)\n", + ret, ret + size); + return ret; +} + +/* Free aligned page-locked host memory via DEVICE. + + That is, 'gomp_aligned_free' (see 'alloc.c') for page-locked host + memory. */ + +static bool +gomp_page_locked_host_aligned_free_dev (struct gomp_device_descr *device, + void *ptr, + struct goacc_asyncqueue *aq) +{ + gomp_debug (0, "%s: device=%p (%s), ptr=%p, aq=%p\n", + __FUNCTION__, device, device->name, ptr, aq); + + if (ptr) + { + ptr = ((void **) ptr)[-1]; + gomp_debug (0, " ptr=%p\n", + ptr); + + if (!gomp_page_locked_host_free_dev (device, ptr, aq)) + return false; + } + return true; +} + +/* Register page-locked host memory via DEVICE. */ + +attribute_hidden int +gomp_page_locked_host_register_dev (struct gomp_device_descr *device, + void *ptr, size_t size, int kind) +{ + gomp_debug (0, "%s: device=%p (%s), ptr=%p, size=%llu, kind=%d\n", + __FUNCTION__, device, device->name, + ptr, (unsigned long long) size, kind); + assert (size != 0); + + int ret = device->page_locked_host_register_func (device->target_id, + ptr, size, kind); + if (ret < 0) + gomp_error ("Failed to register page-locked host memory" + " via %s libgomp plugin", + device->name); + return ret; +} + +/* Unregister page-locked host memory via DEVICE. */ + +attribute_hidden bool +gomp_page_locked_host_unregister_dev (struct gomp_device_descr *device, + void *ptr, size_t size, + struct goacc_asyncqueue *aq) +{ + gomp_debug (0, "%s: device=%p (%s), ptr=%p, size=%llu, aq=%p\n", + __FUNCTION__, device, device->name, + ptr, (unsigned long long) size, aq); + assert (size != 0); + + if (!device->page_locked_host_unregister_func (ptr, size, aq)) + { + gomp_error ("Failed to unregister page-locked host memory" + " via %s libgomp plugin", + device->name); + return false; + } + return true; +} + + /* Device (really: libgomp plugin) to use for paged-locked memory. We assume there is either none or exactly one such device for the lifetime of the process. */ @@ -4681,10 +5251,7 @@ gomp_page_locked_host_alloc (void **ptr, size_t size) } gomp_mutex_unlock (&device->lock); - if (!device->page_locked_host_alloc_func (ptr, size)) - gomp_fatal ("Failed to allocate page-locked host memory" - " via %s libgomp plugin", - device->name); + *ptr = gomp_page_locked_host_alloc_dev (device, size, true); } return device != NULL; } @@ -4713,10 +5280,8 @@ gomp_page_locked_host_free (void *ptr) } gomp_mutex_unlock (&device->lock); - if (!device->page_locked_host_free_func (ptr)) - gomp_fatal ("Failed to free page-locked host memory" - " via %s libgomp plugin", - device->name); + if (!gomp_page_locked_host_free_dev (device, ptr, NULL)) + exit (EXIT_FAILURE); } @@ -4792,30 +5357,84 @@ omp_target_memcpy_copy (void *dst, const void *src, size_t length, bool ret; if (src_devicep == NULL && dst_devicep == NULL) { + /* No 'gomp_verify_always_pinned_mode' here. */ memcpy ((char *) dst + dst_offset, (char *) src + src_offset, length); return 0; } if (src_devicep == NULL) { gomp_mutex_lock (&dst_devicep->lock); + + void *src_ptr = (void *) src + src_offset; + int src_ptr_page_locked_host_p = 0; + + if (always_pinned_mode) + { + if (length != 0) + src_ptr_page_locked_host_p = gomp_page_locked_host_register_dev + (dst_devicep, src_ptr, length, GOMP_MAP_TO); + if (src_ptr_page_locked_host_p < 0) + { + gomp_mutex_unlock (&dst_devicep->lock); + return ENOMEM; + } + } + + /* No 'gomp_verify_always_pinned_mode' here; have just registered. */ ret = dst_devicep->host2dev_func (dst_devicep->target_id, (char *) dst + dst_offset, - (char *) src + src_offset, length); + src_ptr, length); + + if (src_ptr_page_locked_host_p + && !gomp_page_locked_host_unregister_dev (dst_devicep, + src_ptr, length, NULL)) + { + gomp_mutex_unlock (&dst_devicep->lock); + return ENOMEM; + } + gomp_mutex_unlock (&dst_devicep->lock); return (ret ? 0 : EINVAL); } if (dst_devicep == NULL) { gomp_mutex_lock (&src_devicep->lock); + + void *dst_ptr = (void *) dst + dst_offset; + int dst_ptr_page_locked_host_p = 0; + + if (always_pinned_mode) + { + if (length != 0) + dst_ptr_page_locked_host_p = gomp_page_locked_host_register_dev + (src_devicep, dst_ptr, length, GOMP_MAP_FROM); + if (dst_ptr_page_locked_host_p < 0) + { + gomp_mutex_unlock (&src_devicep->lock); + return ENOMEM; + } + } + + /* No 'gomp_verify_always_pinned_mode' here; have just registered. */ ret = src_devicep->dev2host_func (src_devicep->target_id, - (char *) dst + dst_offset, + dst_ptr, (char *) src + src_offset, length); + + if (dst_ptr_page_locked_host_p + && !gomp_page_locked_host_unregister_dev (src_devicep, + dst_ptr, length, NULL)) + { + gomp_mutex_unlock (&src_devicep->lock); + return ENOMEM; + } + gomp_mutex_unlock (&src_devicep->lock); return (ret ? 0 : EINVAL); } if (src_devicep == dst_devicep) { gomp_mutex_lock (&src_devicep->lock); + /* No 'gomp_verify_always_pinned_mode' here. */ ret = src_devicep->dev2dev_func (src_devicep->target_id, (char *) dst + dst_offset, (char *) src + src_offset, length); @@ -4927,21 +5546,63 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size, return EINVAL; if (dst_devicep == NULL && src_devicep == NULL) { + /* No 'gomp_verify_always_pinned_mode' here. */ memcpy ((char *) dst + dst_off, (const char *) src + src_off, length); ret = 1; } else if (src_devicep == NULL) - ret = dst_devicep->host2dev_func (dst_devicep->target_id, - (char *) dst + dst_off, - (const char *) src + src_off, - length); + { + void *src_ptr = (void *) src + src_off; + int src_ptr_page_locked_host_p = 0; + + if (always_pinned_mode) + { + if (length != 0) + src_ptr_page_locked_host_p = gomp_page_locked_host_register_dev + (dst_devicep, src_ptr, length, GOMP_MAP_TO); + if (src_ptr_page_locked_host_p < 0) + return ENOMEM; + } + + /* No 'gomp_verify_always_pinned_mode' here; have just registered. */ + ret = dst_devicep->host2dev_func (dst_devicep->target_id, + (char *) dst + dst_off, + src_ptr, + length); + + if (src_ptr_page_locked_host_p + && !gomp_page_locked_host_unregister_dev (dst_devicep, + src_ptr, length, NULL)) + return ENOMEM; + } else if (dst_devicep == NULL) - ret = src_devicep->dev2host_func (src_devicep->target_id, - (char *) dst + dst_off, - (const char *) src + src_off, - length); + { + void *dst_ptr = (void *) dst + dst_off; + int dst_ptr_page_locked_host_p = 0; + + if (always_pinned_mode) + { + if (length != 0) + dst_ptr_page_locked_host_p = gomp_page_locked_host_register_dev + (src_devicep, dst_ptr, length, GOMP_MAP_FROM); + if (dst_ptr_page_locked_host_p < 0) + return ENOMEM; + } + + /* No 'gomp_verify_always_pinned_mode' here; have just registered. */ + ret = src_devicep->dev2host_func (src_devicep->target_id, + dst_ptr, + (const char *) src + src_off, + length); + + if (dst_ptr_page_locked_host_p + && !gomp_page_locked_host_unregister_dev (src_devicep, + dst_ptr, length, NULL)) + return ENOMEM; + } else if (src_devicep == dst_devicep) + /* No 'gomp_verify_always_pinned_mode' here. */ ret = src_devicep->dev2dev_func (src_devicep->target_id, (char *) dst + dst_off, (const char *) src + src_off, @@ -5184,6 +5845,7 @@ omp_target_associate_ptr (const void *host_ptr, const void *device_ptr, k->refcount = REFCOUNT_INFINITY; k->dynamic_refcount = 0; k->aux = NULL; + k->page_locked_host_p = false; array->left = NULL; array->right = NULL; splay_tree_insert (&devicep->mem_map, array); @@ -5406,6 +6068,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, DLSYM_OPT (is_usm_ptr, is_usm_ptr); DLSYM_OPT (page_locked_host_alloc, page_locked_host_alloc); DLSYM_OPT (page_locked_host_free, page_locked_host_free); + DLSYM_OPT (page_locked_host_register, page_locked_host_register); + DLSYM_OPT (page_locked_host_unregister, page_locked_host_unregister); + DLSYM_OPT (page_locked_host_p, page_locked_host_p); DLSYM (dev2host); DLSYM (host2dev); DLSYM (evaluate_device); diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c deleted file mode 100644 index 8dc19055038..00000000000 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c +++ /dev/null @@ -1,63 +0,0 @@ -/* { dg-do run } */ -/* { dg-additional-options "-foffload-memory=pinned" } */ - -/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */ - -/* Test that pinned memory works. */ - -#include <stdio.h> -#include <stdlib.h> - -#ifdef __linux__ -#include <sys/types.h> -#include <unistd.h> - -#include <sys/mman.h> - -int -get_pinned_mem () -{ - int pid = getpid (); - char buf[100]; - sprintf (buf, "/proc/%d/status", pid); - - FILE *proc = fopen (buf, "r"); - if (!proc) - abort (); - while (fgets (buf, 100, proc)) - { - int val; - if (sscanf (buf, "VmLck: %d", &val)) - { - fclose (proc); - return val; - } - } - abort (); -} -#else -int -get_pinned_mem () -{ - return 0; -} - -#define mlockall(...) 0 -#endif - -#include <omp.h> - -int -main () -{ - // Sanity check - if (get_pinned_mem () == 0) - { - /* -foffload-memory=pinned has failed, but maybe that's because - isufficient pinned memory was available. */ - if (mlockall (MCL_CURRENT | MCL_FUTURE) == 0) - abort (); - } - - return 0; -} -- 2.25.1 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #3: 0002-foffload-memory-pinned-using-offloading-device-inter.patch --] [-- Type: text/x-diff, Size: 3428 bytes --] From 694bbd399c1323975b4a6735646e46c6914de63d Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <thomas@codesourcery.com> Date: Thu, 30 Mar 2023 10:08:12 +0200 Subject: [PATCH 2/2] '-foffload-memory=pinned' using offloading device interfaces for non-contiguous array support Changes related to og12 commit 15d0f61a7fecdc8fd12857c40879ea3730f6d99f "Merge non-contiguous array support patches". libgomp/ * target.c (gomp_map_vars_internal) <non-contiguous array support>: Handle 'always_pinned_mode'. --- libgomp/ChangeLog.omp | 3 +++ libgomp/target.c | 55 +++++++++++++++++++++++++++++++++++++++---- 2 files changed, 53 insertions(+), 5 deletions(-) diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp index 1b02c057562..09cf9c6f3c1 100644 --- a/libgomp/ChangeLog.omp +++ b/libgomp/ChangeLog.omp @@ -1,5 +1,8 @@ 2023-04-03 Thomas Schwinge <thomas@codesourcery.com> + * target.c (gomp_map_vars_internal) + <non-contiguous array support>: Handle 'always_pinned_mode'. + * libgomp-plugin.h (GOMP_OFFLOAD_page_locked_host_free): Add 'struct goacc_asyncqueue *' formal parameter. (GOMP_OFFLOAD_page_locked_host_register) diff --git a/libgomp/target.c b/libgomp/target.c index ed2fc09cf44..38eb5d1aa5b 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -2087,6 +2087,23 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, k->dynamic_refcount = 0; k->aux = NULL; k->tgt_offset = tgt_size; + k->page_locked_host_p = false; + if (always_pinned_mode) + { + void *ptr = (void *) k->host_start; + size_t size = k->host_end - k->host_start; + int page_locked_host_p = 0; + if (size != 0) + page_locked_host_p = gomp_page_locked_host_register_dev + (devicep, ptr, size, kind & typemask); + if (page_locked_host_p < 0) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + if (page_locked_host_p) + k->page_locked_host_p = true; + } tgt_size += nca->data_row_size; @@ -2120,16 +2137,44 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, accelerator side ptrblock and copy it in. */ if (nca->ptrblock_size) { - void *ptrblock = gomp_malloc (nca->ptrblock_size); + void *ptrblock; + if (always_pinned_mode) + { + ptrblock + = gomp_page_locked_host_alloc_dev (devicep, + nca->ptrblock_size, + false); + if (!ptrblock) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } + else + ptrblock = gomp_malloc (nca->ptrblock_size); goacc_noncontig_array_create_ptrblock (nca, ptrblock, target_ptrblock); gomp_copy_host2dev (devicep, aq, target_ptrblock, ptrblock, nca->ptrblock_size, false, cbufp); - if (aq) - /* Free once the transfer has completed. */ - devicep->openacc.async.queue_callback_func (aq, free, ptrblock); + if (always_pinned_mode) + { + if (!gomp_page_locked_host_free_dev (devicep, + ptrblock, + aq)) + { + gomp_mutex_unlock (&devicep->lock); + exit (EXIT_FAILURE); + } + } else - free (ptrblock); + { + if (aq) + /* Free once the transfer has completed. */ + devicep->openacc.async.queue_callback_func + (aq, free, ptrblock); + else + free (ptrblock); + } } } } -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 2/5] openmp: allow requires unified_shared_memory 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer 2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer @ 2022-03-08 11:30 ` Hafiz Abid Qadeer 2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer ` (4 subsequent siblings) 6 siblings, 0 replies; 18+ messages in thread From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw) To: gcc-patches, fortran; +Cc: jakub, ams, joseph From: Andrew Stubbs <ams@codesourcery.com> This is the front-end portion of the Unified Shared Memory implementation. It removes the "sorry, unimplemented message" in C, C++, and Fortran, and sets flag_offload_memory, but is otherwise inactive, for now. It also checks that -foffload-memory isn't set to an incompatible mode. gcc/c/ChangeLog: * c-parser.cc (c_parser_omp_requires): Allow "requires unified_share_memory". gcc/cp/ChangeLog: * parser.cc (cp_parser_omp_requires): Allow "requires unified_share_memory". gcc/fortran/ChangeLog: * openmp.cc (gfc_match_omp_requires): Allow "requires unified_share_memory". gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-1.c: New test. * gfortran.dg/gomp/usm-1.f90: New test. --- gcc/c/c-parser.cc | 13 ++++++++++++- gcc/cp/parser.cc | 13 ++++++++++++- gcc/fortran/openmp.cc | 10 +++++++++- gcc/testsuite/c-c++-common/gomp/usm-1.c | 4 ++++ gcc/testsuite/gfortran.dg/gomp/usm-1.f90 | 6 ++++++ 5 files changed, 43 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90 diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index 84deac04c44..dc834158d1c 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -22542,7 +22542,16 @@ c_parser_omp_requires (c_parser *parser) if (!strcmp (p, "unified_address")) this_req = OMP_REQUIRES_UNIFIED_ADDRESS; else if (!strcmp (p, "unified_shared_memory")) + { this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "unified_shared_memory is incompatible with the " + "selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) @@ -22609,7 +22618,9 @@ c_parser_omp_requires (c_parser *parser) c_parser_skip_to_pragma_eol (parser, false); return; } - if (p && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS) + if (p + && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS + && this_req != OMP_REQUIRES_UNIFIED_SHARED_MEMORY) sorry_at (cloc, "%qs clause on %<requires%> directive not " "supported yet", p); if (p) diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index 03d99aba13e..ba263152aaf 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -46464,7 +46464,16 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok) if (!strcmp (p, "unified_address")) this_req = OMP_REQUIRES_UNIFIED_ADDRESS; else if (!strcmp (p, "unified_shared_memory")) + { this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "unified_shared_memory is incompatible with the " + "selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) @@ -46537,7 +46546,9 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok) cp_parser_skip_to_pragma_eol (parser, pragma_tok); return false; } - if (p && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS) + if (p + && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS + && this_req != OMP_REQUIRES_UNIFIED_SHARED_MEMORY) sorry_at (cloc, "%qs clause on %<requires%> directive not " "supported yet", p); if (p) diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index 16cd03a3d67..1f434857719 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -29,6 +29,7 @@ along with GCC; see the file COPYING3. If not see #include "diagnostic.h" #include "gomp-constants.h" #include "target-memory.h" /* For gfc_encode_character. */ +#include "options.h" /* Match an end of OpenMP directive. End of OpenMP directive is optional whitespace, followed by '\n' or comment '!'. */ @@ -5373,6 +5374,12 @@ gfc_match_omp_requires (void) requires_clause = OMP_REQ_UNIFIED_SHARED_MEMORY; if (requires_clauses & OMP_REQ_UNIFIED_SHARED_MEMORY) goto duplicate_clause; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + gfc_error_now ("unified_shared_memory at %C is incompatible with " + "the selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; } else if (gfc_match (clauses[3]) == MATCH_YES) { @@ -5412,7 +5419,8 @@ gfc_match_omp_requires (void) goto error; if (requires_clause & ~(OMP_REQ_ATOMIC_MEM_ORDER_MASK - | OMP_REQ_DYNAMIC_ALLOCATORS)) + | OMP_REQ_DYNAMIC_ALLOCATORS + | OMP_REQ_UNIFIED_SHARED_MEMORY)) gfc_error_now ("Sorry, %qs clause at %L on REQUIRES directive is not " "yet supported", clause, &old_loc); if (!gfc_omp_requires_add_clause (requires_clause, clause, &old_loc, NULL)) diff --git a/gcc/testsuite/c-c++-common/gomp/usm-1.c b/gcc/testsuite/c-c++-common/gomp/usm-1.c new file mode 100644 index 00000000000..619c21a83f4 --- /dev/null +++ b/gcc/testsuite/c-c++-common/gomp/usm-1.c @@ -0,0 +1,4 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-foffload-memory=pinned" } */ + +#pragma omp requires unified_shared_memory /* { dg-error "unified_shared_memory is incompatible with the selected -foffload-memory option" } */ diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-1.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-1.f90 new file mode 100644 index 00000000000..340f6bb50a5 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/gomp/usm-1.f90 @@ -0,0 +1,6 @@ +! { dg-do compile } +! { dg-additional-options "-foffload-memory=pinned" } + +!$omp requires unified_shared_memory ! { dg-error "unified_shared_memory at .* is incompatible with the selected -foffload-memory option" } + +end -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer 2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer 2022-03-08 11:30 ` [PATCH 2/5] openmp: allow requires unified_shared_memory Hafiz Abid Qadeer @ 2022-03-08 11:30 ` Hafiz Abid Qadeer 2023-02-10 14:21 ` Thomas Schwinge 2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer ` (3 subsequent siblings) 6 siblings, 1 reply; 18+ messages in thread From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw) To: gcc-patches, fortran; +Cc: jakub, ams, joseph From: Andrew Stubbs <ams@codesourcery.com> This adds support for using Cuda Managed Memory with omp_alloc. It will be used as the underpinnings for "requires unified_shared_memory" in a later patch. There are two new predefined allocators, ompx_unified_shared_mem_alloc and ompx_host_mem_alloc, plus corresponding memory spaces, which can be used to allocate memory in the "managed" space and explicitly on the host (it is intended that "malloc" will be intercepted by the compiler). The nvptx plugin is modified to make the necessary Cuda calls, and libgomp is modified to switch to shared-memory mode for USM allocated mappings. libgomp/ChangeLog: * allocator.c (omp_max_predefined_alloc): Update. (omp_aligned_alloc): Don't fallback ompx_host_mem_alloc. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * config/linux/allocator.c (linux_memspace_alloc): Handle USM. (linux_memspace_calloc): Handle USM. (linux_memspace_free): Handle USM. (linux_memspace_realloc): Handle USM. * config/nvptx/allocator.c (nvptx_memspace_alloc): Reject ompx_host_mem_alloc. (nvptx_memspace_calloc): Likewise. (nvptx_memspace_realloc): Likewise. * libgomp-plugin.h (GOMP_OFFLOAD_usm_alloc): New prototype. (GOMP_OFFLOAD_usm_free): New prototype. (GOMP_OFFLOAD_is_usm_ptr): New prototype. * libgomp.h (gomp_usm_alloc): New prototype. (gomp_usm_free): New prototype. (gomp_is_usm_ptr): New prototype. (struct gomp_device_descr): Add USM functions. * omp.h.in (omp_memspace_handle_t): Add ompx_unified_shared_mem_space and ompx_host_mem_space. (omp_allocator_handle_t): Add ompx_unified_shared_mem_alloc and ompx_host_mem_alloc. * omp_lib.f90.in: Likewise. * plugin/plugin-nvptx.c (nvptx_alloc): Add "usm" parameter. Call cuMemAllocManaged as appropriate. (GOMP_OFFLOAD_alloc): Move internals to ... (GOMP_OFFLOAD_alloc_1): ... this, and add usm parameter. (GOMP_OFFLOAD_usm_alloc): New function. (GOMP_OFFLOAD_usm_free): New function. (GOMP_OFFLOAD_is_usm_ptr): New function. * target.c (gomp_map_vars_internal): Add USM support. (gomp_usm_alloc): New function. (gomp_usm_free): New function. (gomp_load_plugin_for_device): New function. * testsuite/libgomp.c/usm-1.c: New test. * testsuite/libgomp.c/usm-2.c: New test. * testsuite/libgomp.c/usm-3.c: New test. * testsuite/libgomp.c/usm-4.c: New test. * testsuite/libgomp.c/usm-5.c: New test. --- libgomp/allocator.c | 13 ++++-- libgomp/config/linux/allocator.c | 48 ++++++++++++-------- libgomp/config/nvptx/allocator.c | 6 +++ libgomp/libgomp-plugin.h | 3 ++ libgomp/libgomp.h | 6 +++ libgomp/omp.h.in | 4 ++ libgomp/omp_lib.f90.in | 8 ++++ libgomp/plugin/plugin-nvptx.c | 45 ++++++++++++++++--- libgomp/target.c | 70 +++++++++++++++++++++++++++++ libgomp/testsuite/libgomp.c/usm-1.c | 24 ++++++++++ libgomp/testsuite/libgomp.c/usm-2.c | 32 +++++++++++++ libgomp/testsuite/libgomp.c/usm-3.c | 35 +++++++++++++++ libgomp/testsuite/libgomp.c/usm-4.c | 36 +++++++++++++++ libgomp/testsuite/libgomp.c/usm-5.c | 28 ++++++++++++ 14 files changed, 330 insertions(+), 28 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/usm-1.c create mode 100644 libgomp/testsuite/libgomp.c/usm-2.c create mode 100644 libgomp/testsuite/libgomp.c/usm-3.c create mode 100644 libgomp/testsuite/libgomp.c/usm-4.c create mode 100644 libgomp/testsuite/libgomp.c/usm-5.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index 000ccc2dd9c..18045dbe0c4 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -32,7 +32,7 @@ #include <stdlib.h> #include <string.h> -#define omp_max_predefined_alloc ompx_pinned_mem_alloc +#define omp_max_predefined_alloc ompx_host_mem_alloc /* These macros may be overridden in config/<target>/allocator.c. */ #ifndef MEMSPACE_ALLOC @@ -68,6 +68,8 @@ static const omp_memspace_handle_t predefined_alloc_mapping[] = { omp_low_lat_mem_space, /* omp_pteam_mem_alloc. */ omp_low_lat_mem_space, /* omp_thread_mem_alloc. */ omp_default_mem_space, /* ompx_pinned_mem_alloc. */ + ompx_unified_shared_mem_space, /* ompx_unified_shared_mem_alloc. */ + ompx_host_mem_space, /* ompx_host_mem_alloc. */ }; struct omp_allocator_data @@ -367,7 +369,8 @@ fail: int fallback = (allocator_data ? allocator_data->fallback : (allocator == omp_default_mem_alloc - || allocator == ompx_pinned_mem_alloc) + || allocator == ompx_pinned_mem_alloc + || allocator == ompx_host_mem_alloc) ? omp_atv_null_fb : omp_atv_default_mem_fb); switch (fallback) @@ -597,7 +600,8 @@ fail: int fallback = (allocator_data ? allocator_data->fallback : (allocator == omp_default_mem_alloc - || allocator == ompx_pinned_mem_alloc) + || allocator == ompx_pinned_mem_alloc + || allocator == ompx_host_mem_alloc) ? omp_atv_null_fb : omp_atv_default_mem_fb); switch (fallback) @@ -855,7 +859,8 @@ fail: int fallback = (allocator_data ? allocator_data->fallback : (allocator == omp_default_mem_alloc - || allocator == ompx_pinned_mem_alloc) + || allocator == ompx_pinned_mem_alloc + || allocator == ompx_host_mem_alloc) ? omp_atv_null_fb : omp_atv_default_mem_fb); switch (fallback) diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c index 5f3ae491f07..face524259c 100644 --- a/libgomp/config/linux/allocator.c +++ b/libgomp/config/linux/allocator.c @@ -42,9 +42,11 @@ static void * linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin) { - (void)memspace; - - if (pin) + if (memspace == ompx_unified_shared_mem_space) + { + return gomp_usm_alloc (size, GOMP_DEVICE_ICV); + } + else if (pin) { void *addr = mmap (NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); @@ -67,7 +69,14 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin) static void * linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin) { - if (pin) + if (memspace == ompx_unified_shared_mem_space) + { + void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV); + memset (ret, 0, size); + return ret; + } + else if (memspace == ompx_unified_shared_mem_space + || pin) return linux_memspace_alloc (memspace, size, pin); else return calloc (1, size); @@ -77,9 +86,9 @@ static void linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size, int pin) { - (void)memspace; - - if (pin) + if (memspace == ompx_unified_shared_mem_space) + gomp_usm_free (addr, GOMP_DEVICE_ICV); + else if (pin) munmap (addr, size); else free (addr); @@ -89,7 +98,9 @@ static void * linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, size_t oldsize, size_t size, int oldpin, int pin) { - if (oldpin && pin) + if (memspace == ompx_unified_shared_mem_space) + goto manual_realloc; + else if (oldpin && pin) { void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE); if (newaddr == MAP_FAILED) @@ -98,18 +109,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, return newaddr; } else if (oldpin || pin) - { - void *newaddr = linux_memspace_alloc (memspace, size, pin); - if (newaddr) - { - memcpy (newaddr, addr, oldsize < size ? oldsize : size); - linux_memspace_free (memspace, addr, oldsize, oldpin); - } - - return newaddr; - } + goto manual_realloc; else return realloc (addr, size); + +manual_realloc: + void *newaddr = linux_memspace_alloc (memspace, size, pin); + if (newaddr) + { + memcpy (newaddr, addr, oldsize < size ? oldsize : size); + linux_memspace_free (memspace, addr, oldsize, oldpin); + } + + return newaddr; } #define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \ diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c index 0102680b717..c1a73511623 100644 --- a/libgomp/config/nvptx/allocator.c +++ b/libgomp/config/nvptx/allocator.c @@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); return result; } + else if (memspace == ompx_host_mem_space) + return NULL; else return malloc (size); } @@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size) return result; } + else if (memspace == ompx_host_mem_space) + return NULL; else return calloc (1, size); } @@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, } return result; } + else if (memspace == ompx_host_mem_space) + return NULL; else return realloc (addr, size); } diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h index 07ab700b80c..104f375bc1b 100644 --- a/libgomp/libgomp-plugin.h +++ b/libgomp/libgomp-plugin.h @@ -134,6 +134,9 @@ extern int GOMP_OFFLOAD_load_image (int, unsigned, const void *, extern bool GOMP_OFFLOAD_unload_image (int, unsigned, const void *); extern void *GOMP_OFFLOAD_alloc (int, size_t); extern bool GOMP_OFFLOAD_free (int, void *); +extern void *GOMP_OFFLOAD_usm_alloc (int, size_t); +extern bool GOMP_OFFLOAD_usm_free (int, void *); +extern bool GOMP_OFFLOAD_is_usm_ptr (void *); extern bool GOMP_OFFLOAD_dev2host (int, void *, const void *, size_t); extern bool GOMP_OFFLOAD_host2dev (int, void *, const void *, size_t); extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t); diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index b9e03919993..1cbde607794 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -1013,6 +1013,9 @@ extern int gomp_pause_host (void); extern void gomp_init_targets_once (void); extern int gomp_get_num_devices (void); extern bool gomp_target_task_fn (void *); +extern void * gomp_usm_alloc (size_t size, int device_num); +extern void gomp_usm_free (void *device_ptr, int device_num); +extern bool gomp_is_usm_ptr (void *ptr); /* Splay tree definitions. */ typedef struct splay_tree_node_s *splay_tree_node; @@ -1238,6 +1241,9 @@ struct gomp_device_descr __typeof (GOMP_OFFLOAD_unload_image) *unload_image_func; __typeof (GOMP_OFFLOAD_alloc) *alloc_func; __typeof (GOMP_OFFLOAD_free) *free_func; + __typeof (GOMP_OFFLOAD_usm_alloc) *usm_alloc_func; + __typeof (GOMP_OFFLOAD_usm_free) *usm_free_func; + __typeof (GOMP_OFFLOAD_is_usm_ptr) *is_usm_ptr_func; __typeof (GOMP_OFFLOAD_dev2host) *dev2host_func; __typeof (GOMP_OFFLOAD_host2dev) *host2dev_func; __typeof (GOMP_OFFLOAD_dev2dev) *dev2dev_func; diff --git a/libgomp/omp.h.in b/libgomp/omp.h.in index 1d002d36aae..4ec4475306b 100644 --- a/libgomp/omp.h.in +++ b/libgomp/omp.h.in @@ -120,6 +120,8 @@ typedef enum omp_memspace_handle_t __GOMP_UINTPTR_T_ENUM omp_const_mem_space = 2, omp_high_bw_mem_space = 3, omp_low_lat_mem_space = 4, + ompx_unified_shared_mem_space = 5, + ompx_host_mem_space = 6, __omp_memspace_handle_t_max__ = __UINTPTR_MAX__ } omp_memspace_handle_t; @@ -135,6 +137,8 @@ typedef enum omp_allocator_handle_t __GOMP_UINTPTR_T_ENUM omp_pteam_mem_alloc = 7, omp_thread_mem_alloc = 8, ompx_pinned_mem_alloc = 9, + ompx_unified_shared_mem_alloc = 10, + ompx_host_mem_alloc = 11, __omp_allocator_handle_t_max__ = __UINTPTR_MAX__ } omp_allocator_handle_t; diff --git a/libgomp/omp_lib.f90.in b/libgomp/omp_lib.f90.in index a095dad8962..e1c32aa78d2 100644 --- a/libgomp/omp_lib.f90.in +++ b/libgomp/omp_lib.f90.in @@ -160,6 +160,10 @@ parameter :: omp_thread_mem_alloc = 8 integer (kind=omp_allocator_handle_kind), & parameter :: ompx_pinned_mem_alloc = 9 + integer (kind=omp_allocator_handle_kind), & + parameter :: ompx_unified_shared_mem_alloc = 10 + integer (kind=omp_allocator_handle_kind), & + parameter :: ompx_host_mem_alloc = 11 integer (omp_memspace_handle_kind), & parameter :: omp_default_mem_space = 0 integer (omp_memspace_handle_kind), & @@ -170,6 +174,10 @@ parameter :: omp_high_bw_mem_space = 3 integer (omp_memspace_handle_kind), & parameter :: omp_low_lat_mem_space = 4 + integer (omp_memspace_handle_kind), & + parameter :: omp_unified_shared_mem_space = 5 + integer (omp_memspace_handle_kind), & + parameter :: omp_host_mem_space = 6 type omp_alloctrait integer (kind=omp_alloctrait_key_kind) key diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index 1b9a5e95c07..b664d652a45 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -1042,11 +1042,13 @@ nvptx_stacks_free (struct ptx_device *ptx_dev, bool force) } static void * -nvptx_alloc (size_t s, bool suppress_errors) +nvptx_alloc (size_t s, bool suppress_errors, bool usm) { CUdeviceptr d; - CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &d, s); + CUresult r = (usm ? CUDA_CALL_NOCHECK (cuMemAllocManaged, &d, s, + CU_MEM_ATTACH_GLOBAL) + : CUDA_CALL_NOCHECK (cuMemAlloc, &d, s)); if (suppress_errors && r == CUDA_ERROR_OUT_OF_MEMORY) return NULL; else if (r != CUDA_SUCCESS) @@ -1423,8 +1425,8 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data) return ret; } -void * -GOMP_OFFLOAD_alloc (int ord, size_t size) +static void * +GOMP_OFFLOAD_alloc_1 (int ord, size_t size, bool usm) { if (!nvptx_attach_host_thread_to_device (ord)) return NULL; @@ -1447,7 +1449,7 @@ GOMP_OFFLOAD_alloc (int ord, size_t size) blocks = tmp; } - void *d = nvptx_alloc (size, true); + void *d = nvptx_alloc (size, true, usm); if (d) return d; else @@ -1455,10 +1457,22 @@ GOMP_OFFLOAD_alloc (int ord, size_t size) /* Memory allocation failed. Try freeing the stacks block, and retrying. */ nvptx_stacks_free (ptx_dev, true); - return nvptx_alloc (size, false); + return nvptx_alloc (size, false, usm); } } +void * +GOMP_OFFLOAD_alloc (int ord, size_t size) +{ + return GOMP_OFFLOAD_alloc_1 (ord, size, false); +} + +void * +GOMP_OFFLOAD_usm_alloc (int ord, size_t size) +{ + return GOMP_OFFLOAD_alloc_1 (ord, size, true); +} + bool GOMP_OFFLOAD_free (int ord, void *ptr) { @@ -1466,6 +1480,25 @@ GOMP_OFFLOAD_free (int ord, void *ptr) && nvptx_free (ptr, ptx_devices[ord])); } +bool +GOMP_OFFLOAD_usm_free (int ord, void *ptr) +{ + return GOMP_OFFLOAD_free (ord, ptr); +} + +bool +GOMP_OFFLOAD_is_usm_ptr (void *ptr) +{ + bool managed = false; + /* This returns 3 outcomes ... + CUDA_ERROR_INVALID_VALUE - Not a Cuda allocated pointer. + CUDA_SUCCESS, managed:false - Cuda allocated, but not USM. + CUDA_SUCCESS, managed:true - USM. */ + CUDA_CALL_NOCHECK (cuPointerGetAttribute, &managed, + CU_POINTER_ATTRIBUTE_IS_MANAGED, (CUdeviceptr)ptr); + return managed; +} + void GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum, void **hostaddrs, void **devaddrs, diff --git a/libgomp/target.c b/libgomp/target.c index 9017458885e..f98e8da2526 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -1030,6 +1030,15 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, tgt->list[i].offset = 0; continue; } + else if (devicep->is_usm_ptr_func + && devicep->is_usm_ptr_func (hostaddrs[i])) + { + /* The memory is visible from both host and target + so nothing needs to be moved. */ + tgt->list[i].key = NULL; + tgt->list[i].offset = OFFSET_INLINED; + continue; + } else if ((kind & typemask) == GOMP_MAP_STRUCT) { size_t first = i + 1; @@ -1488,6 +1497,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, continue; } default: + if (tgt->list[i].offset == OFFSET_INLINED + && !array) + continue; break; } splay_tree_key k = &array->key; @@ -3323,6 +3335,61 @@ omp_target_free (void *device_ptr, int device_num) gomp_mutex_unlock (&devicep->lock); } +void * +gomp_usm_alloc (size_t size, int device_num) +{ + if (device_num == gomp_get_num_devices ()) + return malloc (size); + + struct gomp_device_descr *devicep = resolve_device (device_num); + if (devicep == NULL) + return NULL; + + if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400) + || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM) + return malloc (size); + + void *ret = NULL; + gomp_mutex_lock (&devicep->lock); + if (devicep->usm_alloc_func) + ret = devicep->usm_alloc_func (devicep->target_id, size); + gomp_mutex_unlock (&devicep->lock); + return ret; +} + +void +gomp_usm_free (void *device_ptr, int device_num) +{ + if (device_ptr == NULL) + return; + + if (device_num == gomp_get_num_devices ()) + { + free (device_ptr); + return; + } + + struct gomp_device_descr *devicep = resolve_device (device_num); + if (devicep == NULL) + return; + + if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400) + || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM) + { + free (device_ptr); + return; + } + + gomp_mutex_lock (&devicep->lock); + if (devicep->usm_free_func + && !devicep->usm_free_func (devicep->target_id, device_ptr)) + { + gomp_mutex_unlock (&devicep->lock); + gomp_fatal ("error in freeing device memory block at %p", device_ptr); + } + gomp_mutex_unlock (&devicep->lock); +} + int omp_target_is_present (const void *ptr, int device_num) { @@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, DLSYM (unload_image); DLSYM (alloc); DLSYM (free); + DLSYM_OPT (usm_alloc, usm_alloc); + DLSYM_OPT (usm_free, usm_free); + DLSYM_OPT (is_usm_ptr, is_usm_ptr); DLSYM (dev2host); DLSYM (host2dev); device->capabilities = device->get_caps_func (); diff --git a/libgomp/testsuite/libgomp.c/usm-1.c b/libgomp/testsuite/libgomp.c/usm-1.c new file mode 100644 index 00000000000..1b35f19c45b --- /dev/null +++ b/libgomp/testsuite/libgomp.c/usm-1.c @@ -0,0 +1,24 @@ +/* { dg-do run } */ + +#include <omp.h> +#include <stdint.h> + +int +main () +{ + int *a = (int *) omp_alloc(sizeof(int), ompx_unified_shared_mem_alloc); + if (!a) + __builtin_abort (); + + *a = 42; + uintptr_t a_p = (uintptr_t)a; + + #pragma omp target is_device_ptr(a) + { + if (*a != 42 || a_p != (uintptr_t)a) + __builtin_abort (); + } + + omp_free(a, ompx_unified_shared_mem_alloc); + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/usm-2.c b/libgomp/testsuite/libgomp.c/usm-2.c new file mode 100644 index 00000000000..689cee7e456 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/usm-2.c @@ -0,0 +1,32 @@ +/* { dg-do run } */ + +#include <omp.h> +#include <stdint.h> + +int +main () +{ + int *a = (int *) omp_alloc(sizeof(int)*2, ompx_unified_shared_mem_alloc); + if (!a) + __builtin_abort (); + + a[0] = 42; + a[1] = 43; + + uintptr_t a_p = (uintptr_t)a; + + #pragma omp target map(a[0]) + { + if (a[0] != 42 || a_p != (uintptr_t)a) + __builtin_abort (); + } + + #pragma omp target map(a[1]) + { + if (a[1] != 43 || a_p != (uintptr_t)a) + __builtin_abort (); + } + + omp_free(a, ompx_unified_shared_mem_alloc); + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/usm-3.c b/libgomp/testsuite/libgomp.c/usm-3.c new file mode 100644 index 00000000000..2ca66afe93f --- /dev/null +++ b/libgomp/testsuite/libgomp.c/usm-3.c @@ -0,0 +1,35 @@ +/* { dg-do run } */ + +#include <omp.h> +#include <stdint.h> + +int +main () +{ + int *a = (int *) omp_alloc(sizeof(int)*2, ompx_unified_shared_mem_alloc); + if (!a) + __builtin_abort (); + + a[0] = 42; + a[1] = 43; + + uintptr_t a_p = (uintptr_t)a; + +#pragma omp target data map(a[0:2]) + { +#pragma omp target + { + if (a[0] != 42 || a_p != (uintptr_t)a) + __builtin_abort (); + } + +#pragma omp target + { + if (a[1] != 43 || a_p != (uintptr_t)a) + __builtin_abort (); + } + } + + omp_free(a, ompx_unified_shared_mem_alloc); + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/usm-4.c b/libgomp/testsuite/libgomp.c/usm-4.c new file mode 100644 index 00000000000..753908c8440 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/usm-4.c @@ -0,0 +1,36 @@ +/* { dg-do run } */ + +#include <omp.h> +#include <stdint.h> + +int +main () +{ + int *a = (int *) omp_alloc(sizeof(int)*2, ompx_unified_shared_mem_alloc); + if (!a) + __builtin_abort (); + + a[0] = 42; + a[1] = 43; + + uintptr_t a_p = (uintptr_t)a; + +#pragma omp target enter data map(to:a[0:2]) + +#pragma omp target + { + if (a[0] != 42 || a_p != (uintptr_t)a) + __builtin_abort (); + } + +#pragma omp target + { + if (a[1] != 43 || a_p != (uintptr_t)a) + __builtin_abort (); + } + +#pragma omp target exit data map(delete:a[0:2]) + + omp_free(a, ompx_unified_shared_mem_alloc); + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/usm-5.c b/libgomp/testsuite/libgomp.c/usm-5.c new file mode 100644 index 00000000000..4d8b3cf71b1 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/usm-5.c @@ -0,0 +1,28 @@ +/* { dg-do run } */ +/* { dg-require-effective-target offload_device } */ + +#include <omp.h> +#include <stdint.h> + +#pragma omp requires unified_shared_memory + +int +main () +{ + int *a = (int *) omp_alloc(sizeof(int), ompx_host_mem_alloc); + if (!a) + __builtin_abort (); + + a[0] = 42; + + uintptr_t a_p = (uintptr_t)a; + +#pragma omp target map(a[0:1]) + { + if (a[0] != 42 || a_p == (uintptr_t)a) + __builtin_abort (); + } + + omp_free(a, ompx_host_mem_alloc); + return 0; +} -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc 2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer @ 2023-02-10 14:21 ` Thomas Schwinge 2023-02-10 15:31 ` Andrew Stubbs 0 siblings, 1 reply; 18+ messages in thread From: Thomas Schwinge @ 2023-02-10 14:21 UTC (permalink / raw) To: ams; +Cc: Hafiz Abid Qadeer, gcc-patches Hi Andrew! On 2022-03-08T11:30:57+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote: > From: Andrew Stubbs <ams@codesourcery.com> > > This adds support for using Cuda Managed Memory with omp_alloc. It will be > used as the underpinnings for "requires unified_shared_memory" in a later > patch. > > There are two new predefined allocators, ompx_unified_shared_mem_alloc and > ompx_host_mem_alloc, plus corresponding memory spaces, [...] > --- a/libgomp/config/linux/allocator.c > +++ b/libgomp/config/linux/allocator.c > @@ -42,9 +42,11 @@ > static void * > linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin) > { > - (void)memspace; > - > - if (pin) > + if (memspace == ompx_unified_shared_mem_space) > + { > + return gomp_usm_alloc (size, GOMP_DEVICE_ICV); > + } > + else if (pin) > { > void *addr = mmap (NULL, size, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); This I understand conceptually, but then: > @@ -67,7 +69,14 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin) > static void * > linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin) > { > - if (pin) > + if (memspace == ompx_unified_shared_mem_space) > + { > + void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV); > + memset (ret, 0, size); > + return ret; > + } > + else if (memspace == ompx_unified_shared_mem_space > + || pin) > return linux_memspace_alloc (memspace, size, pin); > else > return calloc (1, size); ..., here, we've got a duplicated (and thus always-false) expression 'memspace == ompx_unified_shared_mem_space' (..., which '-Wduplicated-cond' fails to report; <https://gcc.gnu.org/PR108753> "'-Wduplicated-cond' doesn't diagnose duplicated subexpressions"...). Is the correct fix the following (conceptually like 'linux_memspace_alloc' cited above), or is there something that I fail to understand? static void * linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin) { if (memspace == ompx_unified_shared_mem_space) { void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV); memset (ret, 0, size); return ret; } - else if (memspace == ompx_unified_shared_mem_space - || pin) + else if (pin) return linux_memspace_alloc (memspace, size, pin); else return calloc (1, size); The following ones then again are conceptually like 'linux_memspace_alloc' cited above: > @@ -77,9 +86,9 @@ static void > linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size, > int pin) > { > - (void)memspace; > - > - if (pin) > + if (memspace == ompx_unified_shared_mem_space) > + gomp_usm_free (addr, GOMP_DEVICE_ICV); > + else if (pin) > munmap (addr, size); > else > free (addr); > @@ -89,7 +98,9 @@ static void * > linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, > size_t oldsize, size_t size, int oldpin, int pin) > { > - if (oldpin && pin) > + if (memspace == ompx_unified_shared_mem_space) > + goto manual_realloc; > + else if (oldpin && pin) > { > void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE); > if (newaddr == MAP_FAILED) > @@ -98,18 +109,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, > [...] ..., and similar those here: > --- a/libgomp/config/nvptx/allocator.c > +++ b/libgomp/config/nvptx/allocator.c > @@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) > __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); > return result; > } > + else if (memspace == ompx_host_mem_space) > + return NULL; > else > return malloc (size); > } > @@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size) > > return result; > } > + else if (memspace == ompx_host_mem_space) > + return NULL; > else > return calloc (1, size); > } > @@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, > } > return result; > } > + else if (memspace == ompx_host_mem_space) > + return NULL; > else > return realloc (addr, size); > } (I'd have added an explicit no-op (or, 'abort'?) to 'nvptx_memspace_free', but that's maybe just me...) ;-\ > --- a/libgomp/libgomp.h > +++ b/libgomp/libgomp.h > +extern void * gomp_usm_alloc (size_t size, int device_num); > +extern void gomp_usm_free (void *device_ptr, int device_num); > +extern bool gomp_is_usm_ptr (void *ptr); 'gomp_is_usm_ptr' isn't defined/used anywhere; I'll remove it. > --- a/libgomp/target.c > +++ b/libgomp/target.c > @@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, > DLSYM (unload_image); > DLSYM (alloc); > DLSYM (free); > + DLSYM_OPT (usm_alloc, usm_alloc); > + DLSYM_OPT (usm_free, usm_free); > + DLSYM_OPT (is_usm_ptr, is_usm_ptr); > DLSYM (dev2host); > DLSYM (host2dev); As a sanity check, shouldn't we check that either none or all three of those are defined, like in the 'if (cuda && cuda != 4) { [error] }' check a bit further down? Note that these remarks likewise apply to the current upstream submission: <https://inbox.sourceware.org/gcc-patches/ef374d055251b2bc65b97d7e54a0a72d811b869d.1657188329.git.ams@codesourcery.com> "openmp, nvptx: ompx_unified_shared_mem_alloc". Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc 2023-02-10 14:21 ` Thomas Schwinge @ 2023-02-10 15:31 ` Andrew Stubbs 2023-02-16 21:24 ` [og12] Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space' (was: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc) Thomas Schwinge 0 siblings, 1 reply; 18+ messages in thread From: Andrew Stubbs @ 2023-02-10 15:31 UTC (permalink / raw) To: Thomas Schwinge; +Cc: Hafiz Abid Qadeer, gcc-patches On 10/02/2023 14:21, Thomas Schwinge wrote: > Is the correct fix the following (conceptually like > 'linux_memspace_alloc' cited above), or is there something that I fail to > understand? > > static void * > linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin) > { > if (memspace == ompx_unified_shared_mem_space) > { > void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV); > memset (ret, 0, size); > return ret; > } > - else if (memspace == ompx_unified_shared_mem_space > - || pin) > + else if (pin) > return linux_memspace_alloc (memspace, size, pin); > else > return calloc (1, size); Yes, I think that is what was intended (and what actually happens). You can have your memory both unified and pinned (well, maybe it's possible, but there's no one Cuda API for that), so the USM takes precedence. > The following ones then again are conceptually like > 'linux_memspace_alloc' cited above: > >> @@ -77,9 +86,9 @@ static void >> linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size, >> int pin) >> { >> - (void)memspace; >> - >> - if (pin) >> + if (memspace == ompx_unified_shared_mem_space) >> + gomp_usm_free (addr, GOMP_DEVICE_ICV); >> + else if (pin) >> munmap (addr, size); >> else >> free (addr); >> @@ -89,7 +98,9 @@ static void * >> linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, >> size_t oldsize, size_t size, int oldpin, int pin) >> { >> - if (oldpin && pin) >> + if (memspace == ompx_unified_shared_mem_space) >> + goto manual_realloc; >> + else if (oldpin && pin) >> { >> void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE); >> if (newaddr == MAP_FAILED) >> @@ -98,18 +109,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, >> [...] Yes. > ..., and similar those here: > >> --- a/libgomp/config/nvptx/allocator.c >> +++ b/libgomp/config/nvptx/allocator.c >> @@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) >> __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); >> return result; >> } >> + else if (memspace == ompx_host_mem_space) >> + return NULL; >> else >> return malloc (size); >> } >> @@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size) >> >> return result; >> } >> + else if (memspace == ompx_host_mem_space) >> + return NULL; >> else >> return calloc (1, size); >> } >> @@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, >> } >> return result; >> } >> + else if (memspace == ompx_host_mem_space) >> + return NULL; >> else >> return realloc (addr, size); >> } > > (I'd have added an explicit no-op (or, 'abort'?) to > 'nvptx_memspace_free', but that's maybe just me...) ;-\ Why? The host memspace is just the regular heap, which can be a thing on any device. It's an extension though so we can define it either way. >> --- a/libgomp/libgomp.h >> +++ b/libgomp/libgomp.h > >> +extern void * gomp_usm_alloc (size_t size, int device_num); >> +extern void gomp_usm_free (void *device_ptr, int device_num); >> +extern bool gomp_is_usm_ptr (void *ptr); > > 'gomp_is_usm_ptr' isn't defined/used anywhere; I'll remove it. I think I started that and then decided against. Thanks. >> --- a/libgomp/target.c >> +++ b/libgomp/target.c > >> @@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, >> DLSYM (unload_image); >> DLSYM (alloc); >> DLSYM (free); >> + DLSYM_OPT (usm_alloc, usm_alloc); >> + DLSYM_OPT (usm_free, usm_free); >> + DLSYM_OPT (is_usm_ptr, is_usm_ptr); >> DLSYM (dev2host); >> DLSYM (host2dev); > > As a sanity check, shouldn't we check that either none or all three of > those are defined, like in the 'if (cuda && cuda != 4) { [error] }' check > a bit further down? This is only going to happen when somebody writes a new plugin, and then they'll discover very quickly that there are issues. I've wasted more time writing this sentence than it's worth already. :) > Note that these remarks likewise apply to the current upstream > submission: > <https://inbox.sourceware.org/gcc-patches/ef374d055251b2bc65b97d7e54a0a72d811b869d.1657188329.git.ams@codesourcery.com>> "openmp, nvptx: ompx_unified_shared_mem_alloc". I have new patches to heap on top of this set already on OG12, and more planned, plus these ones you're working on; the whole patchset is going to have to get a rebase, squash, and tidy "soonish". Andrew ^ permalink raw reply [flat|nested] 18+ messages in thread
* [og12] Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space' (was: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc) 2023-02-10 15:31 ` Andrew Stubbs @ 2023-02-16 21:24 ` Thomas Schwinge 0 siblings, 0 replies; 18+ messages in thread From: Thomas Schwinge @ 2023-02-16 21:24 UTC (permalink / raw) To: Andrew Stubbs, gcc-patches; +Cc: Hafiz Abid Qadeer [-- Attachment #1: Type: text/plain, Size: 3355 bytes --] Hi! On 2023-02-10T15:31:47+0000, Andrew Stubbs <ams@codesourcery.com> wrote: > On 10/02/2023 14:21, Thomas Schwinge wrote: >> Is the correct fix the following [...] > > Yes, [...] >>> --- a/libgomp/config/nvptx/allocator.c >>> +++ b/libgomp/config/nvptx/allocator.c >>> @@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) >>> __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); >>> return result; >>> } >>> + else if (memspace == ompx_host_mem_space) >>> + return NULL; >>> else >>> return malloc (size); >>> } >>> @@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size) >>> >>> return result; >>> } >>> + else if (memspace == ompx_host_mem_space) >>> + return NULL; >>> else >>> return calloc (1, size); >>> } >>> @@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, >>> } >>> return result; >>> } >>> + else if (memspace == ompx_host_mem_space) >>> + return NULL; >>> else >>> return realloc (addr, size); >>> } >> >> (I'd have added an explicit no-op (or, 'abort'?) to >> 'nvptx_memspace_free', but that's maybe just me...) ;-\ > > Why? The host memspace is just the regular heap, which can be a thing on > any device. It's an extension though so we can define it either way. My point was: for nvptx libgomp, all 'ompx_host_mem_space' allocator functions (cited above) 'return NULL', and it's a cheap check to verify that in 'nvptx_memspace_free'. >>> --- a/libgomp/libgomp.h >>> +++ b/libgomp/libgomp.h >> >>> +extern void * gomp_usm_alloc (size_t size, int device_num); >>> +extern void gomp_usm_free (void *device_ptr, int device_num); >>> +extern bool gomp_is_usm_ptr (void *ptr); >> >> 'gomp_is_usm_ptr' isn't defined/used anywhere; I'll remove it. > > I think I started that and then decided against. Thanks. These three combined, I've pushed to devel/omp/gcc-12 branch commit 23f52e49368d7b26a1b1a72d6bb903d31666e961 "Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space'", see attached. >>> --- a/libgomp/target.c >>> +++ b/libgomp/target.c >> >>> @@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, >>> DLSYM (unload_image); >>> DLSYM (alloc); >>> DLSYM (free); >>> + DLSYM_OPT (usm_alloc, usm_alloc); >>> + DLSYM_OPT (usm_free, usm_free); >>> + DLSYM_OPT (is_usm_ptr, is_usm_ptr); >>> DLSYM (dev2host); >>> DLSYM (host2dev); >> >> As a sanity check, shouldn't we check that either none or all three of >> those are defined, like in the 'if (cuda && cuda != 4) { [error] }' check >> a bit further down? > > This is only going to happen when somebody writes a new plugin, and then > they'll discover very quickly that there are issues. I've wasted more > time writing this sentence than it's worth already. :) Eh. ;-) OK, outvoted. Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Miscellaneous-clean-up-re-OpenMP-ompx_unified_shared.patch --] [-- Type: text/x-diff, Size: 3153 bytes --] From 23f52e49368d7b26a1b1a72d6bb903d31666e961 Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <thomas@codesourcery.com> Date: Tue, 14 Feb 2023 17:10:57 +0100 Subject: [PATCH] Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space' Clean-up for og12 commit 84914e197d91a67b3d27db0e4c69a433462983a5 "openmp, nvptx: ompx_unified_shared_mem_alloc". No functional change. libgomp/ * config/linux/allocator.c (linux_memspace_calloc): Elide (innocuous) duplicate 'if' condition. * config/nvptx/allocator.c (nvptx_memspace_free): Explicitly handle 'memspace == ompx_host_mem_space'. * libgomp.h (gomp_is_usm_ptr): Remove. --- libgomp/ChangeLog.omp | 6 ++++++ libgomp/config/linux/allocator.c | 3 +-- libgomp/config/nvptx/allocator.c | 4 ++++ libgomp/libgomp.h | 1 - 4 files changed, 11 insertions(+), 3 deletions(-) diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp index b667c72b8ca..1c4b1833c0b 100644 --- a/libgomp/ChangeLog.omp +++ b/libgomp/ChangeLog.omp @@ -1,5 +1,11 @@ 2023-02-16 Thomas Schwinge <thomas@codesourcery.com> + * config/linux/allocator.c (linux_memspace_calloc): Elide + (innocuous) duplicate 'if' condition. + * config/nvptx/allocator.c (nvptx_memspace_free): Explicitly + handle 'memspace == ompx_host_mem_space'. + * libgomp.h (gomp_is_usm_ptr): Remove. + * basic-allocator.c (BASIC_ALLOC_YIELD): instead of '#deine', '#define' it. diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c index 07af3a2821a..8a9171c36df 100644 --- a/libgomp/config/linux/allocator.c +++ b/libgomp/config/linux/allocator.c @@ -95,8 +95,7 @@ linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin) memset (ret, 0, size); return ret; } - else if (memspace == ompx_unified_shared_mem_space - || pin) + else if (pin) return linux_memspace_alloc (memspace, size, pin); else return calloc (1, size); diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c index 7c2a7463bf7..cbf86b8a2ec 100644 --- a/libgomp/config/nvptx/allocator.c +++ b/libgomp/config/nvptx/allocator.c @@ -42,6 +42,7 @@ chunks. */ #include "libgomp.h" +#include <assert.h> #include <stdlib.h> #define BASIC_ALLOC_PREFIX __nvptx_lowlat @@ -93,6 +94,9 @@ nvptx_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size) __nvptx_lowlat_free (shared_pool, addr, size); } + else if (memspace == ompx_host_mem_space) + /* Just verify what all allocator functions return. */ + assert (addr == NULL); else free (addr); } diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index d1e45cc584e..c001b468252 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -1133,7 +1133,6 @@ extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t, void *); extern void * gomp_usm_alloc (size_t size, int device_num); extern void gomp_usm_free (void *device_ptr, int device_num); -extern bool gomp_is_usm_ptr (void *ptr); /* Splay tree definitions. */ typedef struct splay_tree_node_s *splay_tree_node; -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory. 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer ` (2 preceding siblings ...) 2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer @ 2022-03-08 11:30 ` Hafiz Abid Qadeer 2022-04-02 12:04 ` Andrew Stubbs 2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer ` (2 subsequent siblings) 6 siblings, 1 reply; 18+ messages in thread From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw) To: gcc-patches, fortran; +Cc: jakub, ams, joseph This patches changes calls to malloc/free/calloc/realloc and operator new to memory allocation functions in libgomp with allocator=ompx_unified_shared_mem_alloc. This helps existing code to benefit from the unified shared memory. The libgomp does the correct thing with all the mapping constructs and there is no memory copies if the pointer is pointing to unified shared memory. We only replace replacable new operator and not the class member or placement new. gcc/ChangeLog: * omp-low.cc (usm_transform): New function. (make_pass_usm_transform): Likewise. (class pass_usm_transform): New. * passes.def: Add pass_usm_transform. * tree-pass.h (make_pass_usm_transform): New declaration. gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-2.c: New test. * c-c++-common/gomp/usm-3.c: New test. * g++.dg/gomp/usm-1.C: New test. * g++.dg/gomp/usm-2.C: New test. * g++.dg/gomp/usm-3.C: New test. * gfortran.dg/gomp/usm-2.f90: New test. * gfortran.dg/gomp/usm-3.f90: New test. libgomp/ChangeLog: * testsuite/libgomp.c/usm-6.c: New test. * testsuite/libgomp.c++/usm-1.C: Likewise. --- gcc/omp-low.cc | 152 +++++++++++++++++++++++ gcc/passes.def | 1 + gcc/testsuite/c-c++-common/gomp/usm-2.c | 34 +++++ gcc/testsuite/c-c++-common/gomp/usm-3.c | 32 +++++ gcc/testsuite/g++.dg/gomp/usm-1.C | 32 +++++ gcc/testsuite/g++.dg/gomp/usm-2.C | 30 +++++ gcc/testsuite/g++.dg/gomp/usm-3.C | 38 ++++++ gcc/testsuite/gfortran.dg/gomp/usm-2.f90 | 16 +++ gcc/testsuite/gfortran.dg/gomp/usm-3.f90 | 13 ++ gcc/tree-pass.h | 1 + libgomp/testsuite/libgomp.c++/usm-1.C | 54 ++++++++ libgomp/testsuite/libgomp.c/usm-6.c | 70 +++++++++++ 12 files changed, 473 insertions(+) create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index 5ce3a50709a..ec08d59f676 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -14849,6 +14849,158 @@ make_pass_diagnose_omp_blocks (gcc::context *ctxt) { return new pass_diagnose_omp_blocks (ctxt); } + +/* Provide transformation required for using unified shared memory + by replacing calls to standard memory allocation functions with + function provided by the libgomp. */ + +static tree +usm_transform (gimple_stmt_iterator *gsi_p, bool *, + struct walk_stmt_info *wi) +{ + gimple *stmt = gsi_stmt (*gsi_p); + /* ompx_unified_shared_mem_alloc is 10. */ + const unsigned int unified_shared_mem_alloc = 10; + + switch (gimple_code (stmt)) + { + case GIMPLE_CALL: + { + gcall *gs = as_a <gcall *> (stmt); + tree fndecl = gimple_call_fndecl (gs); + if (fndecl) + { + tree allocator = build_int_cst (pointer_sized_int_node, + unified_shared_mem_alloc); + const char *name = IDENTIFIER_POINTER (DECL_NAME (fndecl)); + if ((strcmp (name, "malloc") == 0) + || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) + && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_MALLOC) + || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl)) + { + tree omp_alloc_type + = build_function_type_list (ptr_type_node, size_type_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_alloc", omp_alloc_type); + tree size = gimple_call_arg (gs, 0); + gimple *g = gimple_build_call (repl, 2, size, allocator); + gimple_call_set_lhs (g, gimple_call_lhs (gs)); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + else if ((strcmp (name, "calloc") == 0) + || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) + && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_CALLOC)) + { + tree omp_calloc_type + = build_function_type_list (ptr_type_node, size_type_node, + size_type_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_calloc", omp_calloc_type); + tree num = gimple_call_arg (gs, 0); + tree size = gimple_call_arg (gs, 1); + gimple *g = gimple_build_call (repl, 3, num, size, allocator); + gimple_call_set_lhs (g, gimple_call_lhs (gs)); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + else if ((strcmp (name, "realloc") == 0) + || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) + && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_REALLOC)) + { + tree omp_realloc_type + = build_function_type_list (ptr_type_node, ptr_type_node, + size_type_node, + pointer_sized_int_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_realloc", omp_realloc_type); + tree ptr = gimple_call_arg (gs, 0); + tree size = gimple_call_arg (gs, 1); + gimple *g = gimple_build_call (repl, 4, ptr, size, allocator, + allocator); + gimple_call_set_lhs (g, gimple_call_lhs (gs)); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + else if ((strcmp (name, "free") == 0) + || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) + && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_FREE) + || (DECL_IS_OPERATOR_DELETE_P (fndecl) + && DECL_IS_REPLACEABLE_OPERATOR (fndecl))) + { + tree omp_free_type + = build_function_type_list (void_type_node, ptr_type_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_free", omp_free_type); + tree ptr = gimple_call_arg (gs, 0); + gimple *g = gimple_build_call (repl, 2, ptr, allocator); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + } + } + break; + + default: + break; + } + + return NULL_TREE; +} + +namespace { + +const pass_data pass_data_usm_transform = +{ + GIMPLE_PASS, /* type */ + "usm_transform", /* name */ + OPTGROUP_OMP, /* optinfo_flags */ + TV_NONE, /* tv_id */ + PROP_gimple_any, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + 0, /* todo_flags_finish */ +}; + +class pass_usm_transform : public gimple_opt_pass +{ +public: + pass_usm_transform (gcc::context *ctxt) + : gimple_opt_pass (pass_data_usm_transform, ctxt) + {} + + /* opt_pass methods: */ + virtual bool gate (function *) + { + return (flag_openmp || flag_openmp_simd) + && (flag_offload_memory == OFFLOAD_MEMORY_UNIFIED + || omp_requires_mask & OMP_REQUIRES_UNIFIED_SHARED_MEMORY); + } + virtual unsigned int execute (function *) + { + struct walk_stmt_info wi; + gimple_seq body = gimple_body (current_function_decl); + + memset (&wi, 0, sizeof (wi)); + walk_gimple_seq (body, usm_transform, NULL, &wi); + + return 0; + } + +}; // class pass_usm_transform + +} // anon namespace + +gimple_opt_pass * +make_pass_usm_transform (gcc::context *ctxt) +{ + return new pass_usm_transform (ctxt); +} \f #include "gt-omp-low.h" diff --git a/gcc/passes.def b/gcc/passes.def index f7718181038..98c7736bb8b 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -36,6 +36,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_diagnose_tm_blocks); NEXT_PASS (pass_omp_oacc_kernels_decompose); NEXT_PASS (pass_lower_omp); + NEXT_PASS (pass_usm_transform); NEXT_PASS (pass_lower_cf); NEXT_PASS (pass_lower_tm); NEXT_PASS (pass_refactor_eh); diff --git a/gcc/testsuite/c-c++-common/gomp/usm-2.c b/gcc/testsuite/c-c++-common/gomp/usm-2.c new file mode 100644 index 00000000000..2f3f986012c --- /dev/null +++ b/gcc/testsuite/c-c++-common/gomp/usm-2.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-fdump-tree-usm_transform" } */ + +#pragma omp requires unified_shared_memory + +#ifdef __cplusplus +extern "C" { +#endif + +void *malloc (__SIZE_TYPE__); +void *calloc(__SIZE_TYPE__, __SIZE_TYPE__); +void *realloc(void *, __SIZE_TYPE__); +void free (void *); + +#ifdef __cplusplus +} +#endif + +void +foo () +{ + void *p1 = malloc(20); + void *p2 = realloc(p1, 30); + void *p3 = calloc(4, 15); + free (p2); + free (p3); +} + +/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_free" 2 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " free" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " malloc" "usm_transform" } } */ diff --git a/gcc/testsuite/c-c++-common/gomp/usm-3.c b/gcc/testsuite/c-c++-common/gomp/usm-3.c new file mode 100644 index 00000000000..c8230e7ff7c --- /dev/null +++ b/gcc/testsuite/c-c++-common/gomp/usm-3.c @@ -0,0 +1,32 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-foffload-memory=unified -fdump-tree-usm_transform" } */ + +#ifdef __cplusplus +extern "C" { +#endif + +void *malloc (__SIZE_TYPE__); +void *calloc(__SIZE_TYPE__, __SIZE_TYPE__); +void *realloc(void *, __SIZE_TYPE__); +void free (void *); + +#ifdef __cplusplus +} +#endif + +void +foo () +{ + void *p1 = malloc(20); + void *p2 = realloc(p1, 30); + void *p3 = calloc(4, 15); + free (p2); + free (p3); +} + +/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_free" 2 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " free" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " malloc" "usm_transform" } } */ diff --git a/gcc/testsuite/g++.dg/gomp/usm-1.C b/gcc/testsuite/g++.dg/gomp/usm-1.C new file mode 100644 index 00000000000..bd70a81b5bb --- /dev/null +++ b/gcc/testsuite/g++.dg/gomp/usm-1.C @@ -0,0 +1,32 @@ +// { dg-do compile } +// { dg-options "-fopenmp -fdump-tree-usm_transform" } + +#pragma omp requires unified_shared_memory + +struct t1 +{ + int a; + int b; +}; + +typedef unsigned char uint8_t; + +void +foo (__SIZE_TYPE__ x, __SIZE_TYPE__ y) +{ + uint8_t *p1 = new uint8_t; + uint8_t *p2 = new uint8_t[20]; + t1 *p3 = new t1; + t1 *p4 = new t1[y]; + delete p1; + delete p3; + delete [] p2; + delete [] p4; +} + +/* { dg-final { scan-tree-dump-times "omp_alloc \\(1, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_alloc" 4 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not "operator new" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not "operator delete" "usm_transform" } } */ diff --git a/gcc/testsuite/g++.dg/gomp/usm-2.C b/gcc/testsuite/g++.dg/gomp/usm-2.C new file mode 100644 index 00000000000..f6ab155c6de --- /dev/null +++ b/gcc/testsuite/g++.dg/gomp/usm-2.C @@ -0,0 +1,30 @@ +// { dg-do compile } +// { dg-options "-fopenmp -foffload-memory=unified -fdump-tree-usm_transform" } + +struct t1 +{ + int a; + int b; +}; + +typedef unsigned char uint8_t; + +void +foo (__SIZE_TYPE__ x, __SIZE_TYPE__ y) +{ + uint8_t *p1 = new uint8_t; + uint8_t *p2 = new uint8_t[20]; + t1 *p3 = new t1; + t1 *p4 = new t1[y]; + delete p1; + delete p3; + delete [] p2; + delete [] p4; +} + +/* { dg-final { scan-tree-dump-times "omp_alloc \\(1, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_alloc" 4 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not "operator new" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not "operator delete" "usm_transform" } } */ diff --git a/gcc/testsuite/g++.dg/gomp/usm-3.C b/gcc/testsuite/g++.dg/gomp/usm-3.C new file mode 100644 index 00000000000..50ac9302c8b --- /dev/null +++ b/gcc/testsuite/g++.dg/gomp/usm-3.C @@ -0,0 +1,38 @@ +// { dg-do compile } +// { dg-options "-fopenmp -fdump-tree-usm_transform" } + +#pragma omp requires unified_shared_memory + +#include <new> + + +struct X { + static void* operator new(std::size_t count) + { + static char buf[10]; + return &buf[0]; + } + static void* operator new[](std::size_t count) + { + static char buf[10]; + return &buf[0]; + } + static void operator delete(void*) + { + } + static void operator delete[](void*) + { + } +}; +void foo() { + X* p1 = new X; + delete p1; + X* p2 = new X[10]; + delete[] p2; + unsigned char buf[24] ; + int *p3 = new (buf) int(3); + p3[0] = 1; +} + +/* { dg-final { scan-tree-dump-not "omp_alloc" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not "omp_free" "usm_transform" } } */ diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-2.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-2.f90 new file mode 100644 index 00000000000..dc775260cb7 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/gomp/usm-2.f90 @@ -0,0 +1,16 @@ +! { dg-do compile } +! { dg-additional-options "-fdump-tree-usm_transform" } + +!$omp requires unified_shared_memory +end + +subroutine foo() + implicit none + integer, allocatable :: var1 + + allocate(var1) + +end subroutine + +! { dg-final { scan-tree-dump-times "omp_alloc" 1 "usm_transform" } } +! { dg-final { scan-tree-dump-times "omp_free" 1 "usm_transform" } } \ No newline at end of file diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-3.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-3.f90 new file mode 100644 index 00000000000..7983444ebff --- /dev/null +++ b/gcc/testsuite/gfortran.dg/gomp/usm-3.f90 @@ -0,0 +1,13 @@ +! { dg-do compile } +! { dg-additional-options "-foffload-memory=unified -fdump-tree-usm_transform" } + +subroutine foo() + implicit none + integer, allocatable :: var1 + + allocate(var1) + +end subroutine + +! { dg-final { scan-tree-dump-times "omp_alloc" 1 "usm_transform" } } +! { dg-final { scan-tree-dump-times "omp_free" 1 "usm_transform" } } \ No newline at end of file diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index 606d1d60b85..494a9662afa 100644 --- a/gcc/tree-pass.h +++ b/gcc/tree-pass.h @@ -424,6 +424,7 @@ extern gimple_opt_pass *make_pass_lower_vector_ssa (gcc::context *ctxt); extern gimple_opt_pass *make_pass_omp_oacc_kernels_decompose (gcc::context *ctxt); extern gimple_opt_pass *make_pass_lower_omp (gcc::context *ctxt); extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt); +extern gimple_opt_pass *make_pass_usm_transform (gcc::context *ctxt); extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt); extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt); extern gimple_opt_pass *make_pass_omp_target_link (gcc::context *ctxt); diff --git a/libgomp/testsuite/libgomp.c++/usm-1.C b/libgomp/testsuite/libgomp.c++/usm-1.C new file mode 100644 index 00000000000..fea25e5f10b --- /dev/null +++ b/libgomp/testsuite/libgomp.c++/usm-1.C @@ -0,0 +1,54 @@ +/* { dg-do run } */ +/* { dg-skip-if "Only valid for nvptx" { ! offload_target_nvptx } } */ +#include <stdint.h> + +#pragma omp requires unified_shared_memory + +int g1 = 0; + +struct s1 +{ + s1() { a = g1++;} + ~s1() { g1--;} + int a; +}; + +int +main () +{ + s1 *p1 = new s1; + s1 *p2 = new s1[10]; + + if (!p1 || !p2 || p1->a != 0) + __builtin_abort (); + + for (int i = 0; i < 10; i++) + if (p2[i].a != i+1) + __builtin_abort (); + + uintptr_t pp1 = (uintptr_t)p1; + uintptr_t pp2 = (uintptr_t)p2; + +#pragma omp target firstprivate(pp1, pp2) + { + s1 *t1 = (s1*)pp1; + s1 *t2 = (s1*)pp2; + if (t1->a != 0) + __builtin_abort (); + + for (int i = 0; i < 10; i++) + if (t2[i].a != i+1) + __builtin_abort (); + + t1->a = 42; + } + + if (p1->a != 42) + __builtin_abort (); + + delete [] p2; + delete p1; + if (g1 != 0) + __builtin_abort (); + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/usm-6.c b/libgomp/testsuite/libgomp.c/usm-6.c new file mode 100644 index 00000000000..d98da68a1ed --- /dev/null +++ b/libgomp/testsuite/libgomp.c/usm-6.c @@ -0,0 +1,70 @@ +/* { dg-do run } */ +/* { dg-skip-if "Only valid for nvptx" { ! offload_target_nvptx } } */ + +#include <stdint.h> +#include <stdlib.h> + +#pragma omp requires unified_shared_memory + +int +main () +{ + int *a = (int *) malloc(sizeof(int)*2); + int *b = (int *) calloc(sizeof(int), 3); + int *c = (int *) realloc(NULL, sizeof(int) * 4); + if (!a || !b || !c) + __builtin_abort (); + + a[0] = 42; + a[1] = 43; + b[0] = 52; + b[1] = 53; + b[2] = 54; + c[0] = 62; + c[1] = 63; + c[2] = 64; + c[3] = 65; + + uintptr_t a_p = (uintptr_t)a; + uintptr_t b_p = (uintptr_t)b; + uintptr_t c_p = (uintptr_t)c; + +#pragma omp target enter data map(to:a[0:2]) + +#pragma omp target is_device_ptr(c) + { + if (a[0] != 42 || a_p != (uintptr_t)a) + __builtin_abort (); + if (b[0] != 52 || b[2] != 54 || b_p != (uintptr_t)b) + __builtin_abort (); + if (c[0] != 62 || c[3] != 65 || c_p != (uintptr_t)c) + __builtin_abort (); + a[0] = 72; + b[0] = 82; + c[0] = 92; + } + +#pragma omp target + { + if (a[1] != 43 || a_p != (uintptr_t)a) + __builtin_abort (); + if (b[1] != 53 || b_p != (uintptr_t)b) + __builtin_abort (); + if (c[1] != 63 || c[2] != 64 || c_p != (uintptr_t)c) + __builtin_abort (); + a[1] = 73; + b[1] = 83; + c[1] = 93; + } + +#pragma omp target exit data map(delete:a[0:2]) + + if (a[0] != 72 || a[1] != 73 + || b[0] != 82 || b[1] != 83 + || c[0] != 92 || c[1] != 93) + __builtin_abort (); + free(a); + free(b); + free(c); + return 0; +} -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory. 2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer @ 2022-04-02 12:04 ` Andrew Stubbs 2022-04-02 12:42 ` Andrew Stubbs 0 siblings, 1 reply; 18+ messages in thread From: Andrew Stubbs @ 2022-04-02 12:04 UTC (permalink / raw) To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph [-- Attachment #1: Type: text/plain, Size: 491 bytes --] On 08/03/2022 11:30, Hafiz Abid Qadeer wrote: > This patches changes calls to malloc/free/calloc/realloc and operator new to > memory allocation functions in libgomp with > allocator=ompx_unified_shared_mem_alloc. This additional patch adds transformation for omp_target_alloc. The OpenMP 5.0 document says that addresses allocated this way needs to work without is_device_ptr. The easiest way to make that work is to make them USM addresses. I will commit this to OG11 shortly. Andrew [-- Attachment #2: 220401-usm-omp_target_alloc.patch --] [-- Type: text/plain, Size: 6595 bytes --] openmp: Do USM transform for omp_target_alloc OpenMP 5.0 says that omp_target_alloc should return USM addresses. gcc/ChangeLog: * omp-low.c (usm_transform): Transform omp_target_alloc and omp_target_free. libgomp/ChangeLog: * testsuite/libgomp.c/usm-6.c: Add omp_target_alloc. gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-2.c: Add omp_target_alloc. * c-c++-common/gomp/usm-3.c: Add omp_target_alloc. diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 4e8ab9e4ca0..9235eafd1d7 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -15880,7 +15880,8 @@ usm_transform (gimple_stmt_iterator *gsi_p, bool *, if ((strcmp (name, "malloc") == 0) || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_MALLOC) - || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl)) + || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl) + || strcmp (name, "omp_target_alloc") == 0) { tree omp_alloc_type = build_function_type_list (ptr_type_node, size_type_node, @@ -15952,7 +15953,8 @@ usm_transform (gimple_stmt_iterator *gsi_p, bool *, || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_FREE) || (DECL_IS_OPERATOR_DELETE_P (fndecl) - && DECL_IS_REPLACEABLE_OPERATOR (fndecl))) + && DECL_IS_REPLACEABLE_OPERATOR (fndecl)) + || strcmp (name, "omp_target_free") == 0) { tree omp_free_type = build_function_type_list (void_type_node, ptr_type_node, diff --git a/gcc/testsuite/c-c++-common/gomp/usm-2.c b/gcc/testsuite/c-c++-common/gomp/usm-2.c index 64dbb6be131..8c20ef94e69 100644 --- a/gcc/testsuite/c-c++-common/gomp/usm-2.c +++ b/gcc/testsuite/c-c++-common/gomp/usm-2.c @@ -12,6 +12,8 @@ void *aligned_alloc (__SIZE_TYPE__, __SIZE_TYPE__); void *calloc(__SIZE_TYPE__, __SIZE_TYPE__); void *realloc(void *, __SIZE_TYPE__); void free (void *); +void *omp_target_alloc (__SIZE_TYPE__, int); +void omp_target_free (void *, int); #ifdef __cplusplus } @@ -24,16 +26,21 @@ foo () void *p2 = realloc(p1, 30); void *p3 = calloc(4, 15); void *p4 = aligned_alloc(16, 40); + void *p5 = omp_target_alloc(50, 1); free (p2); free (p3); free (p4); + omp_target_free (p5, 1); } /* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform" } } */ /* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform" } } */ /* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform" } } */ /* { dg-final { scan-tree-dump-times "omp_aligned_alloc \\(16, 40, 10\\)" 1 "usm_transform" } } */ -/* { dg-final { scan-tree-dump-times "omp_free" 3 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_alloc \\(50, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform" } } */ /* { dg-final { scan-tree-dump-not " free" "usm_transform" } } */ /* { dg-final { scan-tree-dump-not " aligned_alloc" "usm_transform" } } */ /* { dg-final { scan-tree-dump-not " malloc" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " omp_target_alloc" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " omp_target_free" "usm_transform" } } */ diff --git a/gcc/testsuite/c-c++-common/gomp/usm-3.c b/gcc/testsuite/c-c++-common/gomp/usm-3.c index 934582ea5fd..2b0cbb45e27 100644 --- a/gcc/testsuite/c-c++-common/gomp/usm-3.c +++ b/gcc/testsuite/c-c++-common/gomp/usm-3.c @@ -10,6 +10,8 @@ void *aligned_alloc (__SIZE_TYPE__, __SIZE_TYPE__); void *calloc(__SIZE_TYPE__, __SIZE_TYPE__); void *realloc(void *, __SIZE_TYPE__); void free (void *); +void *omp_target_alloc (__SIZE_TYPE__, int); +void omp_target_free (void *, int); #ifdef __cplusplus } @@ -22,16 +24,21 @@ foo () void *p2 = realloc(p1, 30); void *p3 = calloc(4, 15); void *p4 = aligned_alloc(16, 40); + void *p5 = omp_target_alloc(50, 1); free (p2); free (p3); free (p4); + omp_target_free (p5, 1); } /* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform" } } */ /* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform" } } */ /* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform" } } */ /* { dg-final { scan-tree-dump-times "omp_aligned_alloc \\(16, 40, 10\\)" 1 "usm_transform" } } */ -/* { dg-final { scan-tree-dump-times "omp_free" 3 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_alloc \\(50, 10\\)" 1 "usm_transform" } } */ +/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform" } } */ /* { dg-final { scan-tree-dump-not " free" "usm_transform" } } */ /* { dg-final { scan-tree-dump-not " aligned_alloc" "usm_transform" } } */ /* { dg-final { scan-tree-dump-not " malloc" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " omp_target_alloc" "usm_transform" } } */ +/* { dg-final { scan-tree-dump-not " omp_target_free" "usm_transform" } } */ diff --git a/libgomp/testsuite/libgomp.c/usm-6.c b/libgomp/testsuite/libgomp.c/usm-6.c index d2c828fdc9d..c207140092a 100644 --- a/libgomp/testsuite/libgomp.c/usm-6.c +++ b/libgomp/testsuite/libgomp.c/usm-6.c @@ -4,6 +4,8 @@ #include <stdint.h> #include <stdlib.h> +#include <omp.h> + /* On old systems, the declaraition may not be present in stdlib.h which will generate a warning. This function is going to be replaced with omp_aligned_alloc so the purpose of this declaration is to avoid that @@ -19,7 +21,8 @@ main () int *b = (int *) calloc(sizeof(int), 3); int *c = (int *) realloc(NULL, sizeof(int) * 4); int *d = (int *) aligned_alloc(32, sizeof(int)); - if (!a || !b || !c || !d) + int *e = (int *) omp_target_alloc(sizeof(int), 1); + if (!a || !b || !c || !d || !e) __builtin_abort (); a[0] = 42; @@ -36,6 +39,7 @@ main () uintptr_t b_p = (uintptr_t)b; uintptr_t c_p = (uintptr_t)c; uintptr_t d_p = (uintptr_t)d; + uintptr_t e_p = (uintptr_t)e; if (d_p & 31 != 0) __builtin_abort (); @@ -52,9 +56,12 @@ main () __builtin_abort (); if (d_p != (uintptr_t)d) __builtin_abort (); + if (e_p != (uintptr_t)e) + __builtin_abort (); a[0] = 72; b[0] = 82; c[0] = 92; + e[0] = 102; } #pragma omp target @@ -74,10 +81,12 @@ main () if (a[0] != 72 || a[1] != 73 || b[0] != 82 || b[1] != 83 - || c[0] != 92 || c[1] != 93) + || c[0] != 92 || c[1] != 93 + || e[0] != 102) __builtin_abort (); free(a); free(b); free(c); + omp_target_free(e, 1); return 0; } ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory. 2022-04-02 12:04 ` Andrew Stubbs @ 2022-04-02 12:42 ` Andrew Stubbs 0 siblings, 0 replies; 18+ messages in thread From: Andrew Stubbs @ 2022-04-02 12:42 UTC (permalink / raw) To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph On 02/04/2022 13:04, Andrew Stubbs wrote: > This additional patch adds transformation for omp_target_alloc. The > OpenMP 5.0 document says that addresses allocated this way needs to work > without is_device_ptr. The easiest way to make that work is to make them > USM addresses. Actually, reading on, it says "Every device address allocated through OpenMP device memory routines is a valid host pointer", so USM is the correct answer. Andrew ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 5/5] openmp: -foffload-memory=pinned 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer ` (3 preceding siblings ...) 2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer @ 2022-03-08 11:30 ` Hafiz Abid Qadeer 2022-03-30 22:40 ` Andrew Stubbs 2023-02-09 11:16 ` [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned) Thomas Schwinge 2022-04-13 13:14 ` [PATCH 0/5] openmp: Handle pinned and unified shared memory Andrew Stubbs 2022-04-20 13:25 ` [PATCH] openmp: Handle unified address memory Andrew Stubbs 6 siblings, 2 replies; 18+ messages in thread From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw) To: gcc-patches, fortran; +Cc: jakub, ams, joseph From: Andrew Stubbs <ams@codesourcery.com> Implement the -foffload-memory=pinned option such that libgomp is instructed to enable fully-pinned memory at start-up. The option is intended to provide a performance boost to certain offload programs without modifying the code. This feature only works on Linux, at present, and simply calls mlockall to enable always-on memory pinning. It requires that the ulimit feature is set high enough to accommodate all the program's memory usage. In this mode the ompx_pinned_memory_alloc feature is disabled as it is not needed and may conflict. gcc/ChangeLog: * omp-low.cc (omp_enable_pinned_mode): New function. (execute_lower_omp): Call omp_enable_pinned_mode. libgomp/ChangeLog: * config/linux/allocator.c (always_pinned_mode): New variable. (GOMP_enable_pinned_mode): New function. (linux_memspace_alloc): Disable pinning when always_pinned_mode set. (linux_memspace_calloc): Likewise. (linux_memspace_free): Likewise. (linux_memspace_realloc): Likewise. * libgomp.map (GOMP_5.1.1): New version space with GOMP_enable_pinned_mode. * testsuite/libgomp.c/alloc-pinned-7.c: New test. gcc/testsuite/ChangeLog: * c-c++-common/gomp/alloc-pinned-1.c: New test. --- gcc/omp-low.cc | 68 +++++++++++++++++++ .../c-c++-common/gomp/alloc-pinned-1.c | 28 ++++++++ libgomp/config/linux/allocator.c | 26 +++++++ libgomp/libgomp.map | 5 ++ libgomp/testsuite/libgomp.c/alloc-pinned-7.c | 66 ++++++++++++++++++ 5 files changed, 193 insertions(+) create mode 100644 gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index ec08d59f676..ce21b3bd6f8 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -14441,6 +14441,70 @@ lower_omp (gimple_seq *body, omp_context *ctx) input_location = saved_location; } +/* Emit a constructor function to enable -foffload-memory=pinned + at runtime. Libgomp handles the OS mode setting, but we need to trigger + it by calling GOMP_enable_pinned mode before the program proper runs. */ + +static void +omp_enable_pinned_mode () +{ + static bool visited = false; + if (visited) + return; + visited = true; + + /* Create a new function like this: + + static void __attribute__((constructor)) + __set_pinned_mode () + { + GOMP_enable_pinned_mode (); + } + */ + + tree name = get_identifier ("__set_pinned_mode"); + tree voidfntype = build_function_type_list (void_type_node, NULL_TREE); + tree decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, name, voidfntype); + + TREE_STATIC (decl) = 1; + TREE_USED (decl) = 1; + DECL_ARTIFICIAL (decl) = 1; + DECL_IGNORED_P (decl) = 0; + TREE_PUBLIC (decl) = 0; + DECL_UNINLINABLE (decl) = 1; + DECL_EXTERNAL (decl) = 0; + DECL_CONTEXT (decl) = NULL_TREE; + DECL_INITIAL (decl) = make_node (BLOCK); + BLOCK_SUPERCONTEXT (DECL_INITIAL (decl)) = decl; + DECL_STATIC_CONSTRUCTOR (decl) = 1; + DECL_ATTRIBUTES (decl) = tree_cons (get_identifier ("constructor"), + NULL_TREE, NULL_TREE); + + tree t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE, + void_type_node); + DECL_ARTIFICIAL (t) = 1; + DECL_IGNORED_P (t) = 1; + DECL_CONTEXT (t) = decl; + DECL_RESULT (decl) = t; + + push_struct_function (decl); + init_tree_ssa (cfun); + + tree callname = get_identifier ("GOMP_enable_pinned_mode"); + tree calldecl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, callname, + voidfntype); + gcall *call = gimple_build_call (calldecl, 0); + + gimple_seq seq = NULL; + gimple_seq_add_stmt (&seq, call); + gimple_set_body (decl, gimple_build_bind (NULL_TREE, seq, NULL)); + + cfun->function_end_locus = UNKNOWN_LOCATION; + cfun->curr_properties |= PROP_gimple_any; + pop_cfun (); + cgraph_node::add_new_function (decl, true); +} + /* Main entry point. */ static unsigned int @@ -14497,6 +14561,10 @@ execute_lower_omp (void) for (auto task_stmt : task_cpyfns) finalize_task_copyfn (task_stmt); task_cpyfns.release (); + + if (flag_offload_memory == OFFLOAD_MEMORY_PINNED) + omp_enable_pinned_mode (); + return 0; } diff --git a/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c b/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c new file mode 100644 index 00000000000..e0e08019bff --- /dev/null +++ b/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c @@ -0,0 +1,28 @@ +/* { dg-do run } */ +/* { dg-additional-options "-foffload-memory=pinned" } */ +/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */ + +#if __cplusplus +#define EXTERNC extern "C" +#else +#define EXTERNC +#endif + +/* Intercept the libgomp initialization call to check it happens. */ + +int good = 0; + +EXTERNC void +GOMP_enable_pinned_mode () +{ + good = 1; +} + +int +main () +{ + if (!good) + __builtin_exit (1); + + return 0; +} diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c index face524259c..4bd5bd6b930 100644 --- a/libgomp/config/linux/allocator.c +++ b/libgomp/config/linux/allocator.c @@ -39,9 +39,26 @@ #include <string.h> #include "libgomp.h" +static bool always_pinned_mode = false; + +/* This function is called by the compiler when -foffload-memory=pinned + is used. */ + +void +GOMP_enable_pinned_mode () +{ + if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0) + gomp_error ("failed to pin all memory (ulimit too low?)"); + else + always_pinned_mode = true; +} + static void * linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin) { + /* Explicit pinning may not be required. */ + pin = pin && !always_pinned_mode; + if (memspace == ompx_unified_shared_mem_space) { return gomp_usm_alloc (size, GOMP_DEVICE_ICV); @@ -69,6 +86,9 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin) static void * linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin) { + /* Explicit pinning may not be required. */ + pin = pin && !always_pinned_mode; + if (memspace == ompx_unified_shared_mem_space) { void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV); @@ -86,6 +106,9 @@ static void linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size, int pin) { + /* Explicit pinning may not be required. */ + pin = pin && !always_pinned_mode; + if (memspace == ompx_unified_shared_mem_space) gomp_usm_free (addr, GOMP_DEVICE_ICV); else if (pin) @@ -98,6 +121,9 @@ static void * linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr, size_t oldsize, size_t size, int oldpin, int pin) { + /* Explicit pinning may not be required. */ + pin = pin && !always_pinned_mode; + if (memspace == ompx_unified_shared_mem_space) goto manual_realloc; else if (oldpin && pin) diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map index 2ac58094169..40402dc9893 100644 --- a/libgomp/libgomp.map +++ b/libgomp/libgomp.map @@ -402,6 +402,11 @@ GOMP_5.1 { GOMP_teams4; } GOMP_5.0.1; +GOMP_5.1.1 { + global: + GOMP_enable_pinned_mode; +} GOMP_5.1; + OACC_2.0 { global: acc_get_num_devices; diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c new file mode 100644 index 00000000000..6fd19b46a5c --- /dev/null +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c @@ -0,0 +1,66 @@ +/* { dg-do run } */ +/* { dg-additional-options "-foffload-memory=pinned" } */ + +/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */ + +/* Test that pinned memory works. */ + +#ifdef __linux__ +#include <sys/types.h> +#include <unistd.h> +#include <stdio.h> +#include <stdlib.h> + +#include <sys/mman.h> + +int +get_pinned_mem () +{ + int pid = getpid (); + char buf[100]; + sprintf (buf, "/proc/%d/status", pid); + + FILE *proc = fopen (buf, "r"); + if (!proc) + abort (); + while (fgets (buf, 100, proc)) + { + int val; + if (sscanf (buf, "VmLck: %d", &val)) + { + printf ("lock %d\n", val); + fclose (proc); + return val; + } + } + abort (); +} +#else +int +get_pinned_mem () +{ + return 0; +} + +#define mlockall(...) 0 +#endif + +#include <omp.h> + +/* Allocate more than a page each time, but stay within the ulimit. */ +#define SIZE 10*1024 + +int +main () +{ + // Sanity check + if (get_pinned_mem () == 0) + { + /* -foffload-memory=pinned has failed, but maybe that's because + isufficient pinned memory was available. */ + if (mlockall (MCL_CURRENT | MCL_FUTURE) == 0) + abort (); + } + + return 0; +} -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 5/5] openmp: -foffload-memory=pinned 2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer @ 2022-03-30 22:40 ` Andrew Stubbs 2023-02-09 11:16 ` [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned) Thomas Schwinge 1 sibling, 0 replies; 18+ messages in thread From: Andrew Stubbs @ 2022-03-30 22:40 UTC (permalink / raw) To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph [-- Attachment #1: Type: text/plain, Size: 324 bytes --] On 08/03/2022 11:30, Hafiz Abid Qadeer wrote: > gcc/ChangeLog: > > * omp-low.cc (omp_enable_pinned_mode): New function. > (execute_lower_omp): Call omp_enable_pinned_mode. This worked for x86_64, but I needed to make the attached adjustment to work on powerpc without a linker error. I've committed it to OG11. Andrew [-- Attachment #2: 220330-gomp_enable_pinned_mode.patch --] [-- Type: text/plain, Size: 1360 bytes --] openmp: BUILT_IN_GOMP_ENABLE_PINNED_MODE Rework the GOMP_enable_pinned_mode call so that it works on powerpc where the old way gave a local call. gcc/ChangeLog: * omp-builtins.def (BUILT_IN_GOMP_ENABLE_PINNED_MODE): New. * omp-low.c (omp_enable_pinned_mode): Use BUILT_IN_GOMP_ENABLE_PINNED_MODE. diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def index c591d79fa07..e442b0b5c94 100644 --- a/gcc/omp-builtins.def +++ b/gcc/omp-builtins.def @@ -468,3 +468,6 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ERROR, "GOMP_error", DEF_GOMP_BUILTIN (BUILT_IN_GOMP_EVALUATE_TARGET_DEVICE, "GOMP_evaluate_target_device", BT_FN_BOOL_INT_CONST_PTR_CONST_PTR_CONST_PTR, ATTR_NOTHROW_LEAF_LIST) +DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ENABLE_PINNED_MODE, + "GOMP_enable_pinned_mode", + BT_FN_VOID, ATTR_NOTHROW_LIST) diff --git a/gcc/omp-low.c b/gcc/omp-low.c index f7ecfb52c73..4e8ab9e4ca0 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -15432,9 +15432,7 @@ omp_enable_pinned_mode () push_struct_function (decl); init_tree_ssa (cfun); - tree callname = get_identifier ("GOMP_enable_pinned_mode"); - tree calldecl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, callname, - voidfntype); + tree calldecl = builtin_decl_explicit (BUILT_IN_GOMP_ENABLE_PINNED_MODE); gcall *call = gimple_build_call (calldecl, 0); gimple_seq seq = NULL; ^ permalink raw reply [flat|nested] 18+ messages in thread
* [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned) 2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer 2022-03-30 22:40 ` Andrew Stubbs @ 2023-02-09 11:16 ` Thomas Schwinge 1 sibling, 0 replies; 18+ messages in thread From: Thomas Schwinge @ 2023-02-09 11:16 UTC (permalink / raw) To: gcc-patches, ams; +Cc: Hafiz Abid Qadeer, fortran, jakub, joseph [-- Attachment #1: Type: text/plain, Size: 967 bytes --] Hi! On 2022-03-08T11:30:59+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote: > From: Andrew Stubbs <ams@codesourcery.com> > > [...] > --- /dev/null > +++ b/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c > @@ -0,0 +1,28 @@ > +/* { dg-do run } */ Pushed to devel/omp/gcc-12 branch commit 9c0ffa3776a135a69697253a0bd75ebf9b9d0150 "'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c'", see attached. Note that this likewise applies to the current upstream submission: <inbox.sourceware.org/gcc-patches/8011a994bb38db60f37127880b0fc682564f6e8d.1657188329.git.ams@codesourcery.com> "openmp: -foffload-memory=pinned". Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-c-c-common-gomp-alloc-pinned-1.c-libgomp.c-c-common-.patch --] [-- Type: text/x-diff, Size: 2046 bytes --] From 9c0ffa3776a135a69697253a0bd75ebf9b9d0150 Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <thomas@codesourcery.com> Date: Mon, 30 Jan 2023 17:46:29 +0100 Subject: [PATCH] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' Otherwise, for build-tree testing: xgcc: fatal error: cannot read spec file 'libgomp.spec': No such file or directory ..., and thus corresponding FAILs, UNRESOLVEDs. Fix-up for og12 commit 842df187487f5b16ae29bbe7e9acd79661a9df48 "openmp: -foffload-memory=pinned". gcc/testsuite/ * c-c++-common/gomp/alloc-pinned-1.c: Cut. libgomp/ * testsuite/libgomp.c-c++-common/alloc-pinned-1.c: Paste. --- gcc/testsuite/ChangeLog.omp | 2 ++ libgomp/ChangeLog.omp | 4 ++++ .../testsuite/libgomp.c-c++-common}/alloc-pinned-1.c | 0 3 files changed, 6 insertions(+) rename {gcc/testsuite/c-c++-common/gomp => libgomp/testsuite/libgomp.c-c++-common}/alloc-pinned-1.c (100%) diff --git a/gcc/testsuite/ChangeLog.omp b/gcc/testsuite/ChangeLog.omp index 42769c7dae5..9f9d5a10ac3 100644 --- a/gcc/testsuite/ChangeLog.omp +++ b/gcc/testsuite/ChangeLog.omp @@ -1,5 +1,7 @@ 2023-02-09 Thomas Schwinge <thomas@codesourcery.com> + * c-c++-common/gomp/alloc-pinned-1.c: Cut. + * gfortran.dg/gomp/allocate-4.f90: Fix 'omp_allocator_handle_kind' example. diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp index d319d43ceb0..39165173884 100644 --- a/libgomp/ChangeLog.omp +++ b/libgomp/ChangeLog.omp @@ -1,3 +1,7 @@ +2023-02-09 Thomas Schwinge <thomas@codesourcery.com> + + * testsuite/libgomp.c-c++-common/alloc-pinned-1.c: Paste. + 2023-02-08 Tobias Burnus <tobias@codesourcery.com> Backported from master: diff --git a/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c-c++-common/alloc-pinned-1.c similarity index 100% rename from gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c rename to libgomp/testsuite/libgomp.c-c++-common/alloc-pinned-1.c -- 2.25.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] openmp: Handle pinned and unified shared memory. 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer ` (4 preceding siblings ...) 2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer @ 2022-04-13 13:14 ` Andrew Stubbs 2022-04-20 13:25 ` [PATCH] openmp: Handle unified address memory Andrew Stubbs 6 siblings, 0 replies; 18+ messages in thread From: Andrew Stubbs @ 2022-04-13 13:14 UTC (permalink / raw) To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph [-- Attachment #1: Type: text/plain, Size: 338 bytes --] This patch adjusts the testcases, previously proposed, to allow for testing on machines with varying page sizes and default amounts of lockable memory. There turns out to be more variation than I had thought. This should go on mainline at the same time as the previous patches in this thread. I'll commit it to OG11 shortly. Andrew [-- Attachment #2: 220413-pinned-tests.patch --] [-- Type: text/plain, Size: 8831 bytes --] libgomp: autodetect page sizes in pinned memory tests There's not one number that works everywhere. This also fixes the failure mode on non-Linux hosts. libgomp/ChangeLog: * testsuite/libgomp.c/alloc-pinned-1.c: Autodetect page size. * testsuite/libgomp.c/alloc-pinned-2.c: Likewise. * testsuite/libgomp.c/alloc-pinned-3.c: Likewise. * testsuite/libgomp.c/alloc-pinned-4.c: Likewise. * testsuite/libgomp.c/alloc-pinned-5.c: Likewise. * testsuite/libgomp.c/alloc-pinned-6.c: Likewise. * testsuite/libgomp.c/alloc-pinned-7.c: Clean up. diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c index 0a6360cda29..79792b16d83 100644 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c @@ -4,13 +4,23 @@ /* Test that pinned memory works. */ +#include <stdio.h> +#include <stdlib.h> + #ifdef __linux__ #include <sys/types.h> #include <unistd.h> -#include <stdio.h> -#include <stdlib.h> #include <sys/mman.h> +#include <sys/resource.h> + +#define PAGE_SIZE sysconf(_SC_PAGESIZE) +#define CHECK_SIZE(SIZE) { \ + struct rlimit limit; \ + if (getrlimit (RLIMIT_MEMLOCK, &limit) \ + || limit.rlim_cur <= SIZE) \ + fprintf (stderr, "unsufficient lockable memory; please increase ulimit\n"); \ + } int get_pinned_mem () @@ -34,6 +44,9 @@ get_pinned_mem () abort (); } #else +#define PAGE_SIZE 1 /* unknown */ +#define CHECK_SIZE(SIZE) fprintf (stderr, "OS unsupported\n"); + int get_pinned_mem () { @@ -43,12 +56,13 @@ get_pinned_mem () #include <omp.h> -/* Allocate more than a page each time, but stay within the ulimit. */ -#define SIZE 10*1024 - int main () { + /* Allocate at least a page each time, but stay within the ulimit. */ + const int SIZE = PAGE_SIZE; + CHECK_SIZE (SIZE*3); + const omp_alloctrait_t traits[] = { { omp_atk_pinned, 1 } }; diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c index 8fdb4ff5cfd..228c656b715 100644 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c @@ -4,13 +4,23 @@ /* Test that pinned memory works (pool_size code path). */ +#include <stdio.h> +#include <stdlib.h> + #ifdef __linux__ #include <sys/types.h> #include <unistd.h> -#include <stdio.h> -#include <stdlib.h> #include <sys/mman.h> +#include <sys/resource.h> + +#define PAGE_SIZE sysconf(_SC_PAGESIZE) +#define CHECK_SIZE(SIZE) { \ + struct rlimit limit; \ + if (getrlimit (RLIMIT_MEMLOCK, &limit) \ + || limit.rlim_cur <= SIZE) \ + fprintf (stderr, "unsufficient lockable memory; please increase ulimit\n"); \ + } int get_pinned_mem () @@ -34,6 +44,9 @@ get_pinned_mem () abort (); } #else +#define PAGE_SIZE 1 /* unknown */ +#define CHECK_SIZE(SIZE) fprintf (stderr, "OS unsupported\n"); + int get_pinned_mem () { @@ -43,12 +56,13 @@ get_pinned_mem () #include <omp.h> -/* Allocate more than a page each time, but stay within the ulimit. */ -#define SIZE 10*1024 - int main () { + /* Allocate at least a page each time, but stay within the ulimit. */ + const int SIZE = PAGE_SIZE; + CHECK_SIZE (SIZE*3); + const omp_alloctrait_t traits[] = { { omp_atk_pinned, 1 }, { omp_atk_pool_size, SIZE*8 } diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c index 943dfea5c9b..90539ffe3e0 100644 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c @@ -2,15 +2,18 @@ /* Test that pinned memory fails correctly. */ +#include <stdio.h> +#include <stdlib.h> + #ifdef __linux__ #include <sys/types.h> #include <unistd.h> -#include <stdio.h> -#include <stdlib.h> #include <sys/mman.h> #include <sys/resource.h> +#define PAGE_SIZE sysconf(_SC_PAGESIZE) + int get_pinned_mem () { @@ -45,6 +48,8 @@ set_pin_limit (int size) } #else int +#define PAGE_SIZE 10000*1024 /* unknown */ + get_pinned_mem () { return 0; @@ -58,12 +63,12 @@ set_pin_limit () #include <omp.h> -/* This should be large enough to cover multiple pages. */ -#define SIZE 10000*1024 - int main () { + /* This needs to be large enough to cover multiple pages. */ + const int SIZE = PAGE_SIZE*4; + /* Pinned memory, no fallback. */ const omp_alloctrait_t traits1[] = { { omp_atk_pinned, 1 }, diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c index d9cb8dfe1fd..534e49eefc4 100644 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c @@ -2,15 +2,18 @@ /* Test that pinned memory fails correctly, pool_size code path. */ +#include <stdio.h> +#include <stdlib.h> + #ifdef __linux__ #include <sys/types.h> #include <unistd.h> -#include <stdio.h> -#include <stdlib.h> #include <sys/mman.h> #include <sys/resource.h> +#define PAGE_SIZE sysconf(_SC_PAGESIZE) + int get_pinned_mem () { @@ -45,6 +48,8 @@ set_pin_limit (int size) } #else int +#define PAGE_SIZE 10000*1024 /* unknown */ + get_pinned_mem () { return 0; @@ -58,12 +63,12 @@ set_pin_limit () #include <omp.h> -/* This should be large enough to cover multiple pages. */ -#define SIZE 10000*1024 - int main () { + /* This needs to be large enough to cover multiple pages. */ + const int SIZE = PAGE_SIZE*4; + /* Pinned memory, no fallback. */ const omp_alloctrait_t traits1[] = { { omp_atk_pinned, 1 }, diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c index 8355ca83790..315c7161a39 100644 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c @@ -4,13 +4,23 @@ /* Test that ompx_pinned_mem_alloc works. */ +#include <stdio.h> +#include <stdlib.h> + #ifdef __linux__ #include <sys/types.h> #include <unistd.h> -#include <stdio.h> -#include <stdlib.h> #include <sys/mman.h> +#include <sys/resource.h> + +#define PAGE_SIZE sysconf(_SC_PAGESIZE) +#define CHECK_SIZE(SIZE) { \ + struct rlimit limit; \ + if (getrlimit (RLIMIT_MEMLOCK, &limit) \ + || limit.rlim_cur <= SIZE) \ + fprintf (stderr, "unsufficient lockable memory; please increase ulimit\n"); \ + } int get_pinned_mem () @@ -34,6 +44,9 @@ get_pinned_mem () abort (); } #else +#define PAGE_SIZE 1 /* unknown */ +#define CHECK_SIZE(SIZE) fprintf (stderr, "OS unsupported\n"); + int get_pinned_mem () { @@ -43,12 +56,13 @@ get_pinned_mem () #include <omp.h> -/* Allocate more than a page each time, but stay within the ulimit. */ -#define SIZE 10*1024 - int main () { + /* Allocate at least a page each time, but stay within the ulimit. */ + const int SIZE = PAGE_SIZE; + CHECK_SIZE (SIZE*3); + // Sanity check if (get_pinned_mem () != 0) abort (); diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c index 80fd37ab875..bbe20c04875 100644 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c @@ -2,15 +2,18 @@ /* Test that ompx_pinned_mem_alloc fails correctly. */ +#include <stdio.h> +#include <stdlib.h> + #ifdef __linux__ #include <sys/types.h> #include <unistd.h> -#include <stdio.h> -#include <stdlib.h> #include <sys/mman.h> #include <sys/resource.h> +#define PAGE_SIZE sysconf(_SC_PAGESIZE) + int get_pinned_mem () { @@ -44,6 +47,8 @@ set_pin_limit (int size) abort (); } #else +#define PAGE_SIZE 10000*1024 /* unknown */ + int get_pinned_mem () { @@ -58,12 +63,12 @@ set_pin_limit () #include <omp.h> -/* This should be large enough to cover multiple pages. */ -#define SIZE 10000*1024 - int main () { + /* Allocate at least a page each time, but stay within the ulimit. */ + const int SIZE = PAGE_SIZE*4; + /* Ensure that the limit is smaller than the allocation. */ set_pin_limit (SIZE/2); diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c index 6fd19b46a5c..8dc19055038 100644 --- a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c @@ -5,11 +5,12 @@ /* Test that pinned memory works. */ +#include <stdio.h> +#include <stdlib.h> + #ifdef __linux__ #include <sys/types.h> #include <unistd.h> -#include <stdio.h> -#include <stdlib.h> #include <sys/mman.h> @@ -28,7 +29,6 @@ get_pinned_mem () int val; if (sscanf (buf, "VmLck: %d", &val)) { - printf ("lock %d\n", val); fclose (proc); return val; } @@ -47,9 +47,6 @@ get_pinned_mem () #include <omp.h> -/* Allocate more than a page each time, but stay within the ulimit. */ -#define SIZE 10*1024 - int main () { ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH] openmp: Handle unified address memory. 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer ` (5 preceding siblings ...) 2022-04-13 13:14 ` [PATCH 0/5] openmp: Handle pinned and unified shared memory Andrew Stubbs @ 2022-04-20 13:25 ` Andrew Stubbs 6 siblings, 0 replies; 18+ messages in thread From: Andrew Stubbs @ 2022-04-20 13:25 UTC (permalink / raw) To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph [-- Attachment #1: Type: text/plain, Size: 393 bytes --] This patch adds enough support for "requires unified_address" to make the sollve_vv testcases pass. It implements unified_address as a synonym of unified_shared_memory, which is both valid and the only way I know of to unify addresses with Cuda (could be wrong). This patch should be applied on to of the previous patch set for USM. OK for stage 1? I'll apply it to OG11 shortly. Andrew [-- Attachment #2: 220420-unified_address.patch --] [-- Type: text/plain, Size: 6037 bytes --] openmp: unified_address support This makes "requires unified_address" work by making it eqivalent to "requires unified_shared_memory". This is more than is strictly necessary, but should be standard compliant. gcc/c/ChangeLog: * c-parser.c (c_parser_omp_requires): Check requires unified_address for conflict with -foffload-memory=shared. gcc/cp/ChangeLog: * parser.c (cp_parser_omp_requires): Check requires unified_address for conflict with -foffload-memory=shared. gcc/fortran/ChangeLog: * openmp.c (gfc_match_omp_requires): Check requires unified_address for conflict with -foffload-memory=shared. gcc/ChangeLog: * omp-low.c: Do USM transformations for "unified_address". gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-4.c: New test. * gfortran.dg/gomp/usm-4.f90: New test. diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c index 12408770193..9a3d0cb8cea 100644 --- a/gcc/c/c-parser.c +++ b/gcc/c/c-parser.c @@ -22531,18 +22531,27 @@ c_parser_omp_requires (c_parser *parser) enum omp_requires this_req = (enum omp_requires) 0; if (!strcmp (p, "unified_address")) - this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + { + this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "unified_address is incompatible with the " + "selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "unified_shared_memory")) - { - this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; - - if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED - && flag_offload_memory != OFFLOAD_MEMORY_NONE) - error_at (cloc, - "unified_shared_memory is incompatible with the " - "selected -foffload-memory option"); - flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; - } + { + this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "unified_shared_memory is incompatible with the " + "selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c index fd9f62f4543..3a9ea272f10 100644 --- a/gcc/cp/parser.c +++ b/gcc/cp/parser.c @@ -46406,18 +46406,27 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok) enum omp_requires this_req = (enum omp_requires) 0; if (!strcmp (p, "unified_address")) - this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + { + this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "unified_address is incompatible with the " + "selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "unified_shared_memory")) - { - this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; - - if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED - && flag_offload_memory != OFFLOAD_MEMORY_NONE) - error_at (cloc, - "unified_shared_memory is incompatible with the " - "selected -foffload-memory option"); - flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; - } + { + this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "unified_shared_memory is incompatible with the " + "selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) diff --git a/gcc/fortran/openmp.c b/gcc/fortran/openmp.c index ac4126bd7ea..ece04c03a68 100644 --- a/gcc/fortran/openmp.c +++ b/gcc/fortran/openmp.c @@ -5546,6 +5546,12 @@ gfc_match_omp_requires (void) requires_clause = OMP_REQ_UNIFIED_ADDRESS; if (requires_clauses & OMP_REQ_UNIFIED_ADDRESS) goto duplicate_clause; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + gfc_error_now ("unified_address at %C is incompatible with " + "the selected -foffload-memory option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; } else if (gfc_match (clauses[2]) == MATCH_YES) { diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 4653370aa41..ce30f53dbb5 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -16008,7 +16008,8 @@ public: { return (flag_openmp || flag_openmp_simd) && (flag_offload_memory == OFFLOAD_MEMORY_UNIFIED - || omp_requires_mask & OMP_REQUIRES_UNIFIED_SHARED_MEMORY); + || omp_requires_mask & OMP_REQUIRES_UNIFIED_SHARED_MEMORY + || omp_requires_mask & OMP_REQUIRES_UNIFIED_ADDRESS); } virtual unsigned int execute (function *) { diff --git a/gcc/testsuite/c-c++-common/gomp/usm-4.c b/gcc/testsuite/c-c++-common/gomp/usm-4.c new file mode 100644 index 00000000000..b19664e9b66 --- /dev/null +++ b/gcc/testsuite/c-c++-common/gomp/usm-4.c @@ -0,0 +1,4 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-foffload-memory=pinned" } */ + +#pragma omp requires unified_address /* { dg-error "unified_address is incompatible with the selected -foffload-memory option" } */ diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-4.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-4.f90 new file mode 100644 index 00000000000..725b07f2f88 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/gomp/usm-4.f90 @@ -0,0 +1,6 @@ +! { dg-do compile } +! { dg-additional-options "-foffload-memory=pinned" } + +!$omp requires unified_address ! { dg-error "unified_address at .* is incompatible with the selected -foffload-memory option" } + +end ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2023-04-03 14:57 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer 2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer 2023-02-13 14:38 ` -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) Thomas Schwinge 2023-02-13 15:20 ` Andrew Stubbs 2023-04-03 14:56 ` [og12] '-foffload-memory=pinned' using offloading device interfaces (was: -foffload-memory=pinned) Thomas Schwinge 2022-03-08 11:30 ` [PATCH 2/5] openmp: allow requires unified_shared_memory Hafiz Abid Qadeer 2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer 2023-02-10 14:21 ` Thomas Schwinge 2023-02-10 15:31 ` Andrew Stubbs 2023-02-16 21:24 ` [og12] Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space' (was: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc) Thomas Schwinge 2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer 2022-04-02 12:04 ` Andrew Stubbs 2022-04-02 12:42 ` Andrew Stubbs 2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer 2022-03-30 22:40 ` Andrew Stubbs 2023-02-09 11:16 ` [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned) Thomas Schwinge 2022-04-13 13:14 ` [PATCH 0/5] openmp: Handle pinned and unified shared memory Andrew Stubbs 2022-04-20 13:25 ` [PATCH] openmp: Handle unified address memory Andrew Stubbs
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).