public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH 0/5] openmp: Handle pinned and unified shared memory.
@ 2022-03-08 11:30 Hafiz Abid Qadeer
  2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw)
  To: gcc-patches, fortran; +Cc: jakub, ams, joseph

This patch series add support for unified shared memory (USM) and pinned
memory. The support in libgomp is for nvptx offloading only.  A new
command line option -foffload-memory allows user to choose either USM
or pinned memory. The USM can also be enabled using requires construct.

When USM us in use, calls to memory allocation function like malloc are
changed to omp_alloc with appropriate allocator.  No transformations are
required for the pinned memory which is implemented using mlockall so is
only available on Linux.

Andrew Stubbs (4):
  openmp: Add -foffload-memory
  openmp: allow requires unified_shared_memory
  openmp, nvptx: ompx_unified_shared_mem_alloc
  openmp: -foffload-memory=pinned

Hafiz Abid Qadeer (1):
  openmp: Use libgomp memory allocation functions with unified shared
    memory.

 gcc/c/c-parser.cc                             |  13 +-
 gcc/common.opt                                |  16 ++
 gcc/coretypes.h                               |   7 +
 gcc/cp/parser.cc                              |  13 +-
 gcc/doc/invoke.texi                           |  16 +-
 gcc/fortran/openmp.cc                         |  10 +-
 gcc/omp-low.cc                                | 220 ++++++++++++++++++
 gcc/passes.def                                |   1 +
 .../c-c++-common/gomp/alloc-pinned-1.c        |  28 +++
 gcc/testsuite/c-c++-common/gomp/usm-1.c       |   4 +
 gcc/testsuite/c-c++-common/gomp/usm-2.c       |  34 +++
 gcc/testsuite/c-c++-common/gomp/usm-3.c       |  32 +++
 gcc/testsuite/g++.dg/gomp/usm-1.C             |  32 +++
 gcc/testsuite/g++.dg/gomp/usm-2.C             |  30 +++
 gcc/testsuite/g++.dg/gomp/usm-3.C             |  38 +++
 gcc/testsuite/gfortran.dg/gomp/usm-1.f90      |   6 +
 gcc/testsuite/gfortran.dg/gomp/usm-2.f90      |  16 ++
 gcc/testsuite/gfortran.dg/gomp/usm-3.f90      |  13 ++
 gcc/tree-pass.h                               |   1 +
 libgomp/allocator.c                           |  13 +-
 libgomp/config/linux/allocator.c              |  70 ++++--
 libgomp/config/nvptx/allocator.c              |   6 +
 libgomp/libgomp-plugin.h                      |   3 +
 libgomp/libgomp.h                             |   6 +
 libgomp/libgomp.map                           |   5 +
 libgomp/omp.h.in                              |   4 +
 libgomp/omp_lib.f90.in                        |   8 +
 libgomp/plugin/plugin-nvptx.c                 |  45 +++-
 libgomp/target.c                              |  70 ++++++
 libgomp/testsuite/libgomp.c++/usm-1.C         |  54 +++++
 libgomp/testsuite/libgomp.c/alloc-pinned-7.c  |  66 ++++++
 libgomp/testsuite/libgomp.c/usm-1.c           |  24 ++
 libgomp/testsuite/libgomp.c/usm-2.c           |  32 +++
 libgomp/testsuite/libgomp.c/usm-3.c           |  35 +++
 libgomp/testsuite/libgomp.c/usm-4.c           |  36 +++
 libgomp/testsuite/libgomp.c/usm-5.c           |  28 +++
 libgomp/testsuite/libgomp.c/usm-6.c           |  70 ++++++
 37 files changed, 1075 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90
 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] openmp: Add -foffload-memory
  2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
@ 2022-03-08 11:30 ` Hafiz Abid Qadeer
  2023-02-13 14:38   ` -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) Thomas Schwinge
  2022-03-08 11:30 ` [PATCH 2/5] openmp: allow requires unified_shared_memory Hafiz Abid Qadeer
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw)
  To: gcc-patches, fortran; +Cc: jakub, ams, joseph

From: Andrew Stubbs <ams@codesourcery.com>

Add a new option.  It will be used in follow-up patches.

gcc/ChangeLog:

	* common.opt: Add -foffload-memory and its enum values.
	* coretypes.h (enum offload_memory): New.
	* doc/invoke.texi: Document -foffload-memory.
---
 gcc/common.opt      | 16 ++++++++++++++++
 gcc/coretypes.h     |  7 +++++++
 gcc/doc/invoke.texi | 16 +++++++++++++++-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 8b6513de47c..17426523e23 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2182,6 +2182,22 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
 EnumValue
 Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
 
+foffload-memory=
+Common Joined RejectNegative Enum(offload_memory) Var(flag_offload_memory) Init(OFFLOAD_MEMORY_NONE)
+-foffload-memory=[none|unified|pinned]	Use an offload memory optimization.
+
+Enum
+Name(offload_memory) Type(enum offload_memory) UnknownError(Unknown offload memory option %qs)
+
+EnumValue
+Enum(offload_memory) String(none) Value(OFFLOAD_MEMORY_NONE)
+
+EnumValue
+Enum(offload_memory) String(unified) Value(OFFLOAD_MEMORY_UNIFIED)
+
+EnumValue
+Enum(offload_memory) String(pinned) Value(OFFLOAD_MEMORY_PINNED)
+
 fomit-frame-pointer
 Common Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
diff --git a/gcc/coretypes.h b/gcc/coretypes.h
index 08b9ac9094c..dd52d5bb113 100644
--- a/gcc/coretypes.h
+++ b/gcc/coretypes.h
@@ -206,6 +206,13 @@ enum offload_abi {
   OFFLOAD_ABI_ILP32
 };
 
+/* Types of memory optimization for an offload device.  */
+enum offload_memory {
+  OFFLOAD_MEMORY_NONE,
+  OFFLOAD_MEMORY_UNIFIED,
+  OFFLOAD_MEMORY_PINNED
+};
+
 /* Types of profile update methods.  */
 enum profile_update {
   PROFILE_UPDATE_SINGLE,
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 248ed534aee..d16019fc8c3 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -202,7 +202,7 @@ in the following sections.
 -fno-builtin  -fno-builtin-@var{function}  -fcond-mismatch @gol
 -ffreestanding  -fgimple  -fgnu-tm  -fgnu89-inline  -fhosted @gol
 -flax-vector-conversions  -fms-extensions @gol
--foffload=@var{arg}  -foffload-options=@var{arg} @gol
+-foffload=@var{arg}  -foffload-options=@var{arg} -foffload-memory=@var{arg} @gol
 -fopenacc  -fopenacc-dim=@var{geom} @gol
 -fopenmp  -fopenmp-simd @gol
 -fpermitted-flt-eval-methods=@var{standard} @gol
@@ -2694,6 +2694,20 @@ Typical command lines are
 -foffload-options=amdgcn-amdhsa=-march=gfx906 -foffload-options=-lm
 @end smallexample
 
+@item -foffload-memory=none
+@itemx -foffload-memory=unified
+@itemx -foffload-memory=pinned
+@opindex foffload-memory
+@cindex OpenMP offloading memory modes
+Enable a memory optimization mode to use with OpenMP.  The default behavior,
+@option{-foffload-memory=none}, is to do nothing special (unless enabled via
+a requires directive in the code).  @option{-foffload-memory=unified} is
+equivalent to @code{#pragma omp requires unified_shared_memory}.
+@option{-foffload-memory=pinned} forces all host memory to be pinned (this
+mode may require the user to increase the ulimit setting for locked memory).
+All translation units must select the same setting to avoid undefined
+behavior.
+
 @item -fopenacc
 @opindex fopenacc
 @cindex OpenACC accelerator programming
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/5] openmp: allow requires unified_shared_memory
  2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
  2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer
@ 2022-03-08 11:30 ` Hafiz Abid Qadeer
  2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw)
  To: gcc-patches, fortran; +Cc: jakub, ams, joseph

From: Andrew Stubbs <ams@codesourcery.com>

This is the front-end portion of the Unified Shared Memory implementation.
It removes the "sorry, unimplemented message" in C, C++, and Fortran, and sets
flag_offload_memory, but is otherwise inactive, for now.

It also checks that -foffload-memory isn't set to an incompatible mode.

gcc/c/ChangeLog:

	* c-parser.cc (c_parser_omp_requires): Allow "requires
	  unified_share_memory".

gcc/cp/ChangeLog:

	* parser.cc (cp_parser_omp_requires): Allow "requires
	unified_share_memory".

gcc/fortran/ChangeLog:

	* openmp.cc (gfc_match_omp_requires): Allow "requires
	unified_share_memory".

gcc/testsuite/ChangeLog:

	* c-c++-common/gomp/usm-1.c: New test.
	* gfortran.dg/gomp/usm-1.f90: New test.
---
 gcc/c/c-parser.cc                        | 13 ++++++++++++-
 gcc/cp/parser.cc                         | 13 ++++++++++++-
 gcc/fortran/openmp.cc                    | 10 +++++++++-
 gcc/testsuite/c-c++-common/gomp/usm-1.c  |  4 ++++
 gcc/testsuite/gfortran.dg/gomp/usm-1.f90 |  6 ++++++
 5 files changed, 43 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 84deac04c44..dc834158d1c 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -22542,7 +22542,16 @@ c_parser_omp_requires (c_parser *parser)
 	  if (!strcmp (p, "unified_address"))
 	    this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
 	  else if (!strcmp (p, "unified_shared_memory"))
+	  {
 	    this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+
+	    if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+		&& flag_offload_memory != OFFLOAD_MEMORY_NONE)
+	      error_at (cloc,
+			"unified_shared_memory is incompatible with the "
+			"selected -foffload-memory option");
+	    flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
+	  }
 	  else if (!strcmp (p, "dynamic_allocators"))
 	    this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS;
 	  else if (!strcmp (p, "reverse_offload"))
@@ -22609,7 +22618,9 @@ c_parser_omp_requires (c_parser *parser)
 	      c_parser_skip_to_pragma_eol (parser, false);
 	      return;
 	    }
-	  if (p && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS)
+	  if (p
+	      && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS
+	      && this_req != OMP_REQUIRES_UNIFIED_SHARED_MEMORY)
 	    sorry_at (cloc, "%qs clause on %<requires%> directive not "
 			    "supported yet", p);
 	  if (p)
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 03d99aba13e..ba263152aaf 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -46464,7 +46464,16 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok)
 	  if (!strcmp (p, "unified_address"))
 	    this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
 	  else if (!strcmp (p, "unified_shared_memory"))
+	  {
 	    this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+
+	    if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+		&& flag_offload_memory != OFFLOAD_MEMORY_NONE)
+	      error_at (cloc,
+			"unified_shared_memory is incompatible with the "
+			"selected -foffload-memory option");
+	    flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
+	  }
 	  else if (!strcmp (p, "dynamic_allocators"))
 	    this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS;
 	  else if (!strcmp (p, "reverse_offload"))
@@ -46537,7 +46546,9 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok)
 	      cp_parser_skip_to_pragma_eol (parser, pragma_tok);
 	      return false;
 	    }
-	  if (p && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS)
+	  if (p
+	      && this_req != OMP_REQUIRES_DYNAMIC_ALLOCATORS
+	      && this_req != OMP_REQUIRES_UNIFIED_SHARED_MEMORY)
 	    sorry_at (cloc, "%qs clause on %<requires%> directive not "
 			    "supported yet", p);
 	  if (p)
diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc
index 16cd03a3d67..1f434857719 100644
--- a/gcc/fortran/openmp.cc
+++ b/gcc/fortran/openmp.cc
@@ -29,6 +29,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "diagnostic.h"
 #include "gomp-constants.h"
 #include "target-memory.h"  /* For gfc_encode_character.  */
+#include "options.h"
 
 /* Match an end of OpenMP directive.  End of OpenMP directive is optional
    whitespace, followed by '\n' or comment '!'.  */
@@ -5373,6 +5374,12 @@ gfc_match_omp_requires (void)
 	  requires_clause = OMP_REQ_UNIFIED_SHARED_MEMORY;
 	  if (requires_clauses & OMP_REQ_UNIFIED_SHARED_MEMORY)
 	    goto duplicate_clause;
+
+	  if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+	      && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+	    gfc_error_now ("unified_shared_memory at %C is incompatible with "
+			   "the selected -foffload-memory option");
+	  flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
 	}
       else if (gfc_match (clauses[3]) == MATCH_YES)
 	{
@@ -5412,7 +5419,8 @@ gfc_match_omp_requires (void)
 	goto error;
 
       if (requires_clause & ~(OMP_REQ_ATOMIC_MEM_ORDER_MASK
-			      | OMP_REQ_DYNAMIC_ALLOCATORS))
+			      | OMP_REQ_DYNAMIC_ALLOCATORS
+			      | OMP_REQ_UNIFIED_SHARED_MEMORY))
 	gfc_error_now ("Sorry, %qs clause at %L on REQUIRES directive is not "
 		       "yet supported", clause, &old_loc);
       if (!gfc_omp_requires_add_clause (requires_clause, clause, &old_loc, NULL))
diff --git a/gcc/testsuite/c-c++-common/gomp/usm-1.c b/gcc/testsuite/c-c++-common/gomp/usm-1.c
new file mode 100644
index 00000000000..619c21a83f4
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/gomp/usm-1.c
@@ -0,0 +1,4 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-foffload-memory=pinned" } */
+
+#pragma omp requires unified_shared_memory  /* { dg-error "unified_shared_memory is incompatible with the selected -foffload-memory option" } */
diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-1.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-1.f90
new file mode 100644
index 00000000000..340f6bb50a5
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/gomp/usm-1.f90
@@ -0,0 +1,6 @@
+! { dg-do compile }
+! { dg-additional-options "-foffload-memory=pinned" }
+
+!$omp requires unified_shared_memory  ! { dg-error "unified_shared_memory at .* is incompatible with the selected -foffload-memory option" }
+
+end
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc
  2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
  2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer
  2022-03-08 11:30 ` [PATCH 2/5] openmp: allow requires unified_shared_memory Hafiz Abid Qadeer
@ 2022-03-08 11:30 ` Hafiz Abid Qadeer
  2023-02-10 14:21   ` Thomas Schwinge
  2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw)
  To: gcc-patches, fortran; +Cc: jakub, ams, joseph

From: Andrew Stubbs <ams@codesourcery.com>

This adds support for using Cuda Managed Memory with omp_alloc.  It will be
used as the underpinnings for "requires unified_shared_memory" in a later
patch.

There are two new predefined allocators, ompx_unified_shared_mem_alloc and
ompx_host_mem_alloc, plus corresponding memory spaces, which can be used to
allocate memory in the "managed" space and explicitly on the host (it is
intended that "malloc" will be intercepted by the compiler).

The nvptx plugin is modified to make the necessary Cuda calls, and libgomp
is modified to switch to shared-memory mode for USM allocated mappings.

libgomp/ChangeLog:

	* allocator.c (omp_max_predefined_alloc): Update.
	(omp_aligned_alloc): Don't fallback ompx_host_mem_alloc.
	(omp_aligned_calloc): Likewise.
	(omp_realloc): Likewise.
	* config/linux/allocator.c (linux_memspace_alloc): Handle USM.
	(linux_memspace_calloc): Handle USM.
	(linux_memspace_free): Handle USM.
	(linux_memspace_realloc): Handle USM.
	* config/nvptx/allocator.c (nvptx_memspace_alloc): Reject
	ompx_host_mem_alloc.
	(nvptx_memspace_calloc): Likewise.
	(nvptx_memspace_realloc): Likewise.
	* libgomp-plugin.h (GOMP_OFFLOAD_usm_alloc): New prototype.
	(GOMP_OFFLOAD_usm_free): New prototype.
	(GOMP_OFFLOAD_is_usm_ptr): New prototype.
	* libgomp.h (gomp_usm_alloc): New prototype.
	(gomp_usm_free): New prototype.
	(gomp_is_usm_ptr): New prototype.
	(struct gomp_device_descr): Add USM functions.
	* omp.h.in (omp_memspace_handle_t): Add ompx_unified_shared_mem_space
	and ompx_host_mem_space.
	(omp_allocator_handle_t): Add ompx_unified_shared_mem_alloc and
	ompx_host_mem_alloc.
	* omp_lib.f90.in: Likewise.
	* plugin/plugin-nvptx.c (nvptx_alloc): Add "usm" parameter.
	Call cuMemAllocManaged as appropriate.
	(GOMP_OFFLOAD_alloc): Move internals to ...
	(GOMP_OFFLOAD_alloc_1): ... this, and add usm parameter.
	(GOMP_OFFLOAD_usm_alloc): New function.
	(GOMP_OFFLOAD_usm_free): New function.
	(GOMP_OFFLOAD_is_usm_ptr): New function.
	* target.c (gomp_map_vars_internal): Add USM support.
	(gomp_usm_alloc): New function.
	(gomp_usm_free): New function.
	(gomp_load_plugin_for_device): New function.
	* testsuite/libgomp.c/usm-1.c: New test.
	* testsuite/libgomp.c/usm-2.c: New test.
	* testsuite/libgomp.c/usm-3.c: New test.
	* testsuite/libgomp.c/usm-4.c: New test.
	* testsuite/libgomp.c/usm-5.c: New test.
---
 libgomp/allocator.c                 | 13 ++++--
 libgomp/config/linux/allocator.c    | 48 ++++++++++++--------
 libgomp/config/nvptx/allocator.c    |  6 +++
 libgomp/libgomp-plugin.h            |  3 ++
 libgomp/libgomp.h                   |  6 +++
 libgomp/omp.h.in                    |  4 ++
 libgomp/omp_lib.f90.in              |  8 ++++
 libgomp/plugin/plugin-nvptx.c       | 45 ++++++++++++++++---
 libgomp/target.c                    | 70 +++++++++++++++++++++++++++++
 libgomp/testsuite/libgomp.c/usm-1.c | 24 ++++++++++
 libgomp/testsuite/libgomp.c/usm-2.c | 32 +++++++++++++
 libgomp/testsuite/libgomp.c/usm-3.c | 35 +++++++++++++++
 libgomp/testsuite/libgomp.c/usm-4.c | 36 +++++++++++++++
 libgomp/testsuite/libgomp.c/usm-5.c | 28 ++++++++++++
 14 files changed, 330 insertions(+), 28 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c/usm-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-5.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index 000ccc2dd9c..18045dbe0c4 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -32,7 +32,7 @@
 #include <stdlib.h>
 #include <string.h>
 
-#define omp_max_predefined_alloc ompx_pinned_mem_alloc
+#define omp_max_predefined_alloc ompx_host_mem_alloc
 
 /* These macros may be overridden in config/<target>/allocator.c.  */
 #ifndef MEMSPACE_ALLOC
@@ -68,6 +68,8 @@ static const omp_memspace_handle_t predefined_alloc_mapping[] = {
   omp_low_lat_mem_space,   /* omp_pteam_mem_alloc. */
   omp_low_lat_mem_space,   /* omp_thread_mem_alloc. */
   omp_default_mem_space,   /* ompx_pinned_mem_alloc. */
+  ompx_unified_shared_mem_space,  /* ompx_unified_shared_mem_alloc. */
+  ompx_host_mem_space,     /* ompx_host_mem_alloc.  */
 };
 
 struct omp_allocator_data
@@ -367,7 +369,8 @@ fail:
   int fallback = (allocator_data
 		  ? allocator_data->fallback
 		  : (allocator == omp_default_mem_alloc
-		     || allocator == ompx_pinned_mem_alloc)
+		     || allocator == ompx_pinned_mem_alloc
+		     || allocator == ompx_host_mem_alloc)
 		  ? omp_atv_null_fb
 		  : omp_atv_default_mem_fb);
   switch (fallback)
@@ -597,7 +600,8 @@ fail:
   int fallback = (allocator_data
 		  ? allocator_data->fallback
 		  : (allocator == omp_default_mem_alloc
-		     || allocator == ompx_pinned_mem_alloc)
+		     || allocator == ompx_pinned_mem_alloc
+		     || allocator == ompx_host_mem_alloc)
 		  ? omp_atv_null_fb
 		  : omp_atv_default_mem_fb);
   switch (fallback)
@@ -855,7 +859,8 @@ fail:
   int fallback = (allocator_data
 		  ? allocator_data->fallback
 		  : (allocator == omp_default_mem_alloc
-		     || allocator == ompx_pinned_mem_alloc)
+		     || allocator == ompx_pinned_mem_alloc
+		     || allocator == ompx_host_mem_alloc)
 		  ? omp_atv_null_fb
 		  : omp_atv_default_mem_fb);
   switch (fallback)
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index 5f3ae491f07..face524259c 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -42,9 +42,11 @@
 static void *
 linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
-  (void)memspace;
-
-  if (pin)
+  if (memspace == ompx_unified_shared_mem_space)
+    {
+      return gomp_usm_alloc (size, GOMP_DEVICE_ICV);
+    }
+  else if (pin)
     {
       void *addr = mmap (NULL, size, PROT_READ | PROT_WRITE,
 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
@@ -67,7 +69,14 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 static void *
 linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
-  if (pin)
+  if (memspace == ompx_unified_shared_mem_space)
+    {
+      void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV);
+      memset (ret, 0, size);
+      return ret;
+    }
+  else if (memspace == ompx_unified_shared_mem_space
+      || pin)
     return linux_memspace_alloc (memspace, size, pin);
   else
     return calloc (1, size);
@@ -77,9 +86,9 @@ static void
 linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
 		     int pin)
 {
-  (void)memspace;
-
-  if (pin)
+  if (memspace == ompx_unified_shared_mem_space)
+    gomp_usm_free (addr, GOMP_DEVICE_ICV);
+  else if (pin)
     munmap (addr, size);
   else
     free (addr);
@@ -89,7 +98,9 @@ static void *
 linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
 			size_t oldsize, size_t size, int oldpin, int pin)
 {
-  if (oldpin && pin)
+  if (memspace == ompx_unified_shared_mem_space)
+    goto manual_realloc;
+  else if (oldpin && pin)
     {
       void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
       if (newaddr == MAP_FAILED)
@@ -98,18 +109,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
       return newaddr;
     }
   else if (oldpin || pin)
-    {
-      void *newaddr = linux_memspace_alloc (memspace, size, pin);
-      if (newaddr)
-	{
-	  memcpy (newaddr, addr, oldsize < size ? oldsize : size);
-	  linux_memspace_free (memspace, addr, oldsize, oldpin);
-	}
-
-      return newaddr;
-    }
+    goto manual_realloc;
   else
     return realloc (addr, size);
+
+manual_realloc:
+  void *newaddr = linux_memspace_alloc (memspace, size, pin);
+  if (newaddr)
+    {
+      memcpy (newaddr, addr, oldsize < size ? oldsize : size);
+      linux_memspace_free (memspace, addr, oldsize, oldpin);
+    }
+
+  return newaddr;
 }
 
 #define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \
diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c
index 0102680b717..c1a73511623 100644
--- a/libgomp/config/nvptx/allocator.c
+++ b/libgomp/config/nvptx/allocator.c
@@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
       __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE);
       return result;
     }
+  else if (memspace == ompx_host_mem_space)
+    return NULL;
   else
     return malloc (size);
 }
@@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
 
       return result;
     }
+  else if (memspace == ompx_host_mem_space)
+    return NULL;
   else
     return calloc (1, size);
 }
@@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
 	}
       return result;
     }
+  else if (memspace == ompx_host_mem_space)
+    return NULL;
   else
     return realloc (addr, size);
 }
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 07ab700b80c..104f375bc1b 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -134,6 +134,9 @@ extern int GOMP_OFFLOAD_load_image (int, unsigned, const void *,
 extern bool GOMP_OFFLOAD_unload_image (int, unsigned, const void *);
 extern void *GOMP_OFFLOAD_alloc (int, size_t);
 extern bool GOMP_OFFLOAD_free (int, void *);
+extern void *GOMP_OFFLOAD_usm_alloc (int, size_t);
+extern bool GOMP_OFFLOAD_usm_free (int, void *);
+extern bool GOMP_OFFLOAD_is_usm_ptr (void *);
 extern bool GOMP_OFFLOAD_dev2host (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_host2dev (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index b9e03919993..1cbde607794 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1013,6 +1013,9 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void * gomp_usm_alloc (size_t size, int device_num);
+extern void gomp_usm_free (void *device_ptr, int device_num);
+extern bool gomp_is_usm_ptr (void *ptr);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
@@ -1238,6 +1241,9 @@ struct gomp_device_descr
   __typeof (GOMP_OFFLOAD_unload_image) *unload_image_func;
   __typeof (GOMP_OFFLOAD_alloc) *alloc_func;
   __typeof (GOMP_OFFLOAD_free) *free_func;
+  __typeof (GOMP_OFFLOAD_usm_alloc) *usm_alloc_func;
+  __typeof (GOMP_OFFLOAD_usm_free) *usm_free_func;
+  __typeof (GOMP_OFFLOAD_is_usm_ptr) *is_usm_ptr_func;
   __typeof (GOMP_OFFLOAD_dev2host) *dev2host_func;
   __typeof (GOMP_OFFLOAD_host2dev) *host2dev_func;
   __typeof (GOMP_OFFLOAD_dev2dev) *dev2dev_func;
diff --git a/libgomp/omp.h.in b/libgomp/omp.h.in
index 1d002d36aae..4ec4475306b 100644
--- a/libgomp/omp.h.in
+++ b/libgomp/omp.h.in
@@ -120,6 +120,8 @@ typedef enum omp_memspace_handle_t __GOMP_UINTPTR_T_ENUM
   omp_const_mem_space = 2,
   omp_high_bw_mem_space = 3,
   omp_low_lat_mem_space = 4,
+  ompx_unified_shared_mem_space = 5,
+  ompx_host_mem_space = 6,
   __omp_memspace_handle_t_max__ = __UINTPTR_MAX__
 } omp_memspace_handle_t;
 
@@ -135,6 +137,8 @@ typedef enum omp_allocator_handle_t __GOMP_UINTPTR_T_ENUM
   omp_pteam_mem_alloc = 7,
   omp_thread_mem_alloc = 8,
   ompx_pinned_mem_alloc = 9,
+  ompx_unified_shared_mem_alloc = 10,
+  ompx_host_mem_alloc = 11,
   __omp_allocator_handle_t_max__ = __UINTPTR_MAX__
 } omp_allocator_handle_t;
 
diff --git a/libgomp/omp_lib.f90.in b/libgomp/omp_lib.f90.in
index a095dad8962..e1c32aa78d2 100644
--- a/libgomp/omp_lib.f90.in
+++ b/libgomp/omp_lib.f90.in
@@ -160,6 +160,10 @@
                  parameter :: omp_thread_mem_alloc = 8
         integer (kind=omp_allocator_handle_kind), &
                  parameter :: ompx_pinned_mem_alloc = 9
+        integer (kind=omp_allocator_handle_kind), &
+                 parameter :: ompx_unified_shared_mem_alloc = 10
+        integer (kind=omp_allocator_handle_kind), &
+                 parameter :: ompx_host_mem_alloc = 11
         integer (omp_memspace_handle_kind), &
                  parameter :: omp_default_mem_space = 0
         integer (omp_memspace_handle_kind), &
@@ -170,6 +174,10 @@
                  parameter :: omp_high_bw_mem_space = 3
         integer (omp_memspace_handle_kind), &
                  parameter :: omp_low_lat_mem_space = 4
+        integer (omp_memspace_handle_kind), &
+                 parameter :: omp_unified_shared_mem_space = 5
+        integer (omp_memspace_handle_kind), &
+                 parameter :: omp_host_mem_space = 6
 
         type omp_alloctrait
           integer (kind=omp_alloctrait_key_kind) key
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 1b9a5e95c07..b664d652a45 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1042,11 +1042,13 @@ nvptx_stacks_free (struct ptx_device *ptx_dev, bool force)
 }
 
 static void *
-nvptx_alloc (size_t s, bool suppress_errors)
+nvptx_alloc (size_t s, bool suppress_errors, bool usm)
 {
   CUdeviceptr d;
 
-  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &d, s);
+  CUresult r = (usm ? CUDA_CALL_NOCHECK (cuMemAllocManaged, &d, s,
+					 CU_MEM_ATTACH_GLOBAL)
+		: CUDA_CALL_NOCHECK (cuMemAlloc, &d, s));
   if (suppress_errors && r == CUDA_ERROR_OUT_OF_MEMORY)
     return NULL;
   else if (r != CUDA_SUCCESS)
@@ -1423,8 +1425,8 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
   return ret;
 }
 
-void *
-GOMP_OFFLOAD_alloc (int ord, size_t size)
+static void *
+GOMP_OFFLOAD_alloc_1 (int ord, size_t size, bool usm)
 {
   if (!nvptx_attach_host_thread_to_device (ord))
     return NULL;
@@ -1447,7 +1449,7 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
       blocks = tmp;
     }
 
-  void *d = nvptx_alloc (size, true);
+  void *d = nvptx_alloc (size, true, usm);
   if (d)
     return d;
   else
@@ -1455,10 +1457,22 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
       /* Memory allocation failed.  Try freeing the stacks block, and
 	 retrying.  */
       nvptx_stacks_free (ptx_dev, true);
-      return nvptx_alloc (size, false);
+      return nvptx_alloc (size, false, usm);
     }
 }
 
+void *
+GOMP_OFFLOAD_alloc (int ord, size_t size)
+{
+  return GOMP_OFFLOAD_alloc_1 (ord, size, false);
+}
+
+void *
+GOMP_OFFLOAD_usm_alloc (int ord, size_t size)
+{
+  return GOMP_OFFLOAD_alloc_1 (ord, size, true);
+}
+
 bool
 GOMP_OFFLOAD_free (int ord, void *ptr)
 {
@@ -1466,6 +1480,25 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
+bool
+GOMP_OFFLOAD_usm_free (int ord, void *ptr)
+{
+  return GOMP_OFFLOAD_free (ord, ptr);
+}
+
+bool
+GOMP_OFFLOAD_is_usm_ptr (void *ptr)
+{
+  bool managed = false;
+  /* This returns 3 outcomes ...
+     CUDA_ERROR_INVALID_VALUE    - Not a Cuda allocated pointer.
+     CUDA_SUCCESS, managed:false - Cuda allocated, but not USM.
+     CUDA_SUCCESS, managed:true  - USM.  */
+  CUDA_CALL_NOCHECK (cuPointerGetAttribute, &managed,
+		     CU_POINTER_ATTRIBUTE_IS_MANAGED, (CUdeviceptr)ptr);
+  return managed;
+}
+
 void
 GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 			   void **hostaddrs, void **devaddrs,
diff --git a/libgomp/target.c b/libgomp/target.c
index 9017458885e..f98e8da2526 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1030,6 +1030,15 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	    tgt->list[i].offset = 0;
 	  continue;
 	}
+      else if (devicep->is_usm_ptr_func
+	       && devicep->is_usm_ptr_func (hostaddrs[i]))
+	{
+	  /* The memory is visible from both host and target
+	     so nothing needs to be moved.  */
+	  tgt->list[i].key = NULL;
+	  tgt->list[i].offset = OFFSET_INLINED;
+	  continue;
+	}
       else if ((kind & typemask) == GOMP_MAP_STRUCT)
 	{
 	  size_t first = i + 1;
@@ -1488,6 +1497,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		  continue;
 		}
 	      default:
+		if (tgt->list[i].offset == OFFSET_INLINED
+		    && !array)
+		  continue;
 		break;
 	      }
 	    splay_tree_key k = &array->key;
@@ -3323,6 +3335,61 @@ omp_target_free (void *device_ptr, int device_num)
   gomp_mutex_unlock (&devicep->lock);
 }
 
+void *
+gomp_usm_alloc (size_t size, int device_num)
+{
+  if (device_num == gomp_get_num_devices ())
+    return malloc (size);
+
+  struct gomp_device_descr *devicep = resolve_device (device_num);
+  if (devicep == NULL)
+    return NULL;
+
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+    return malloc (size);
+
+  void *ret = NULL;
+  gomp_mutex_lock (&devicep->lock);
+  if (devicep->usm_alloc_func)
+    ret = devicep->usm_alloc_func (devicep->target_id, size);
+  gomp_mutex_unlock (&devicep->lock);
+  return ret;
+}
+
+void
+gomp_usm_free (void *device_ptr, int device_num)
+{
+  if (device_ptr == NULL)
+    return;
+
+  if (device_num == gomp_get_num_devices ())
+    {
+      free (device_ptr);
+      return;
+    }
+
+  struct gomp_device_descr *devicep = resolve_device (device_num);
+  if (devicep == NULL)
+    return;
+
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+    {
+      free (device_ptr);
+      return;
+    }
+
+  gomp_mutex_lock (&devicep->lock);
+  if (devicep->usm_free_func
+      && !devicep->usm_free_func (devicep->target_id, device_ptr))
+    {
+      gomp_mutex_unlock (&devicep->lock);
+      gomp_fatal ("error in freeing device memory block at %p", device_ptr);
+    }
+  gomp_mutex_unlock (&devicep->lock);
+}
+
 int
 omp_target_is_present (const void *ptr, int device_num)
 {
@@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (unload_image);
   DLSYM (alloc);
   DLSYM (free);
+  DLSYM_OPT (usm_alloc, usm_alloc);
+  DLSYM_OPT (usm_free, usm_free);
+  DLSYM_OPT (is_usm_ptr, is_usm_ptr);
   DLSYM (dev2host);
   DLSYM (host2dev);
   device->capabilities = device->get_caps_func ();
diff --git a/libgomp/testsuite/libgomp.c/usm-1.c b/libgomp/testsuite/libgomp.c/usm-1.c
new file mode 100644
index 00000000000..1b35f19c45b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/usm-1.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <stdint.h>
+
+int
+main ()
+{
+  int *a = (int *) omp_alloc(sizeof(int), ompx_unified_shared_mem_alloc);
+  if (!a)
+    __builtin_abort ();
+
+  *a = 42;
+  uintptr_t a_p = (uintptr_t)a;
+
+  #pragma omp target is_device_ptr(a)
+    {
+      if (*a != 42 || a_p != (uintptr_t)a)
+	__builtin_abort ();
+    }
+
+  omp_free(a, ompx_unified_shared_mem_alloc);
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/usm-2.c b/libgomp/testsuite/libgomp.c/usm-2.c
new file mode 100644
index 00000000000..689cee7e456
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/usm-2.c
@@ -0,0 +1,32 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <stdint.h>
+
+int
+main ()
+{
+  int *a = (int *) omp_alloc(sizeof(int)*2, ompx_unified_shared_mem_alloc);
+  if (!a)
+    __builtin_abort ();
+
+  a[0] = 42;
+  a[1] = 43;
+
+  uintptr_t a_p = (uintptr_t)a;
+
+  #pragma omp target map(a[0])
+    {
+      if (a[0] != 42 || a_p != (uintptr_t)a)
+	__builtin_abort ();
+    }
+
+  #pragma omp target map(a[1])
+    {
+      if (a[1] != 43 || a_p != (uintptr_t)a)
+	__builtin_abort ();
+    }
+
+  omp_free(a, ompx_unified_shared_mem_alloc);
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/usm-3.c b/libgomp/testsuite/libgomp.c/usm-3.c
new file mode 100644
index 00000000000..2ca66afe93f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/usm-3.c
@@ -0,0 +1,35 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <stdint.h>
+
+int
+main ()
+{
+  int *a = (int *) omp_alloc(sizeof(int)*2, ompx_unified_shared_mem_alloc);
+  if (!a)
+    __builtin_abort ();
+
+  a[0] = 42;
+  a[1] = 43;
+
+  uintptr_t a_p = (uintptr_t)a;
+
+#pragma omp target data map(a[0:2])
+    {
+#pragma omp target
+	{
+	  if (a[0] != 42 || a_p != (uintptr_t)a)
+	    __builtin_abort ();
+	}
+
+#pragma omp target
+	{
+	  if (a[1] != 43 || a_p != (uintptr_t)a)
+	    __builtin_abort ();
+	}
+    }
+
+  omp_free(a, ompx_unified_shared_mem_alloc);
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/usm-4.c b/libgomp/testsuite/libgomp.c/usm-4.c
new file mode 100644
index 00000000000..753908c8440
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/usm-4.c
@@ -0,0 +1,36 @@
+/* { dg-do run } */
+
+#include <omp.h>
+#include <stdint.h>
+
+int
+main ()
+{
+  int *a = (int *) omp_alloc(sizeof(int)*2, ompx_unified_shared_mem_alloc);
+  if (!a)
+    __builtin_abort ();
+
+  a[0] = 42;
+  a[1] = 43;
+
+  uintptr_t a_p = (uintptr_t)a;
+
+#pragma omp target enter data map(to:a[0:2])
+
+#pragma omp target
+    {
+      if (a[0] != 42 || a_p != (uintptr_t)a)
+	__builtin_abort ();
+    }
+
+#pragma omp target
+    {
+      if (a[1] != 43 || a_p != (uintptr_t)a)
+	__builtin_abort ();
+    }
+
+#pragma omp target exit data map(delete:a[0:2])
+
+  omp_free(a, ompx_unified_shared_mem_alloc);
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/usm-5.c b/libgomp/testsuite/libgomp.c/usm-5.c
new file mode 100644
index 00000000000..4d8b3cf71b1
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/usm-5.c
@@ -0,0 +1,28 @@
+/* { dg-do run } */
+/* { dg-require-effective-target offload_device } */
+
+#include <omp.h>
+#include <stdint.h>
+
+#pragma omp requires unified_shared_memory
+
+int
+main ()
+{
+  int *a = (int *) omp_alloc(sizeof(int), ompx_host_mem_alloc);
+  if (!a)
+    __builtin_abort ();
+
+  a[0] = 42;
+
+  uintptr_t a_p = (uintptr_t)a;
+
+#pragma omp target map(a[0:1])
+    {
+      if (a[0] != 42 || a_p == (uintptr_t)a)
+	__builtin_abort ();
+    }
+
+  omp_free(a, ompx_host_mem_alloc);
+  return 0;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory.
  2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
                   ` (2 preceding siblings ...)
  2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer
@ 2022-03-08 11:30 ` Hafiz Abid Qadeer
  2022-04-02 12:04   ` Andrew Stubbs
  2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw)
  To: gcc-patches, fortran; +Cc: jakub, ams, joseph

This patches changes calls to malloc/free/calloc/realloc and operator new to
memory allocation functions in libgomp with
allocator=ompx_unified_shared_mem_alloc.  This helps existing code to benefit
from the unified shared memory.  The libgomp does the correct thing with all
the mapping constructs and there is no memory copies if the pointer is pointing
to unified shared memory.

We only replace replacable new operator and not the class member or placement new.

gcc/ChangeLog:

	* omp-low.cc (usm_transform): New function.
	(make_pass_usm_transform): Likewise.
	(class pass_usm_transform): New.
	* passes.def: Add pass_usm_transform.
	* tree-pass.h (make_pass_usm_transform): New declaration.

gcc/testsuite/ChangeLog:

	* c-c++-common/gomp/usm-2.c: New test.
	* c-c++-common/gomp/usm-3.c: New test.
	* g++.dg/gomp/usm-1.C: New test.
	* g++.dg/gomp/usm-2.C: New test.
	* g++.dg/gomp/usm-3.C: New test.
	* gfortran.dg/gomp/usm-2.f90: New test.
	* gfortran.dg/gomp/usm-3.f90: New test.

libgomp/ChangeLog:

	* testsuite/libgomp.c/usm-6.c: New test.
	* testsuite/libgomp.c++/usm-1.C: Likewise.
---
 gcc/omp-low.cc                           | 152 +++++++++++++++++++++++
 gcc/passes.def                           |   1 +
 gcc/testsuite/c-c++-common/gomp/usm-2.c  |  34 +++++
 gcc/testsuite/c-c++-common/gomp/usm-3.c  |  32 +++++
 gcc/testsuite/g++.dg/gomp/usm-1.C        |  32 +++++
 gcc/testsuite/g++.dg/gomp/usm-2.C        |  30 +++++
 gcc/testsuite/g++.dg/gomp/usm-3.C        |  38 ++++++
 gcc/testsuite/gfortran.dg/gomp/usm-2.f90 |  16 +++
 gcc/testsuite/gfortran.dg/gomp/usm-3.f90 |  13 ++
 gcc/tree-pass.h                          |   1 +
 libgomp/testsuite/libgomp.c++/usm-1.C    |  54 ++++++++
 libgomp/testsuite/libgomp.c/usm-6.c      |  70 +++++++++++
 12 files changed, 473 insertions(+)
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90
 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C
 create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c

diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index 5ce3a50709a..ec08d59f676 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -14849,6 +14849,158 @@ make_pass_diagnose_omp_blocks (gcc::context *ctxt)
 {
   return new pass_diagnose_omp_blocks (ctxt);
 }
+
+/* Provide transformation required for using unified shared memory
+   by replacing calls to standard memory allocation functions with
+   function provided by the libgomp.  */
+
+static tree
+usm_transform (gimple_stmt_iterator *gsi_p, bool *,
+	       struct walk_stmt_info *wi)
+{
+  gimple *stmt = gsi_stmt (*gsi_p);
+  /* ompx_unified_shared_mem_alloc is 10.  */
+  const unsigned int unified_shared_mem_alloc = 10;
+
+  switch (gimple_code (stmt))
+    {
+    case GIMPLE_CALL:
+      {
+	gcall *gs = as_a <gcall *> (stmt);
+	tree fndecl = gimple_call_fndecl (gs);
+	if (fndecl)
+	  {
+	    tree allocator = build_int_cst (pointer_sized_int_node,
+					    unified_shared_mem_alloc);
+	    const char *name = IDENTIFIER_POINTER (DECL_NAME (fndecl));
+	    if ((strcmp (name, "malloc") == 0)
+		 || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL)
+		     && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_MALLOC)
+		 || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl))
+	      {
+		  tree omp_alloc_type
+		    = build_function_type_list (ptr_type_node, size_type_node,
+						pointer_sized_int_node,
+						NULL_TREE);
+		tree repl = build_fn_decl ("omp_alloc", omp_alloc_type);
+		tree size = gimple_call_arg (gs, 0);
+		gimple *g = gimple_build_call (repl, 2, size, allocator);
+		gimple_call_set_lhs (g, gimple_call_lhs (gs));
+		gimple_set_location (g, gimple_location (stmt));
+		gsi_replace (gsi_p, g, true);
+	      }
+	    else if ((strcmp (name, "calloc") == 0)
+		      || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL)
+			  && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_CALLOC))
+	      {
+		tree omp_calloc_type
+		  = build_function_type_list (ptr_type_node, size_type_node,
+					      size_type_node,
+					      pointer_sized_int_node,
+					      NULL_TREE);
+		tree repl = build_fn_decl ("omp_calloc", omp_calloc_type);
+		tree num = gimple_call_arg (gs, 0);
+		tree size = gimple_call_arg (gs, 1);
+		gimple *g = gimple_build_call (repl, 3, num, size, allocator);
+		gimple_call_set_lhs (g, gimple_call_lhs (gs));
+		gimple_set_location (g, gimple_location (stmt));
+		gsi_replace (gsi_p, g, true);
+	      }
+	    else if ((strcmp (name, "realloc") == 0)
+		      || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL)
+			  && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_REALLOC))
+	      {
+		tree omp_realloc_type
+		  = build_function_type_list (ptr_type_node, ptr_type_node,
+					      size_type_node,
+					      pointer_sized_int_node,
+					      pointer_sized_int_node,
+					      NULL_TREE);
+		tree repl = build_fn_decl ("omp_realloc", omp_realloc_type);
+		tree ptr = gimple_call_arg (gs, 0);
+		tree size = gimple_call_arg (gs, 1);
+		gimple *g = gimple_build_call (repl, 4, ptr, size, allocator,
+					       allocator);
+		gimple_call_set_lhs (g, gimple_call_lhs (gs));
+		gimple_set_location (g, gimple_location (stmt));
+		gsi_replace (gsi_p, g, true);
+	      }
+	    else  if ((strcmp (name, "free") == 0)
+		       || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL)
+			   && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_FREE)
+		       || (DECL_IS_OPERATOR_DELETE_P (fndecl)
+			   && DECL_IS_REPLACEABLE_OPERATOR (fndecl)))
+	      {
+		tree omp_free_type
+		  = build_function_type_list (void_type_node, ptr_type_node,
+					      pointer_sized_int_node,
+					      NULL_TREE);
+		tree repl = build_fn_decl ("omp_free", omp_free_type);
+		tree ptr = gimple_call_arg (gs, 0);
+		gimple *g = gimple_build_call (repl, 2, ptr, allocator);
+		gimple_set_location (g, gimple_location (stmt));
+		gsi_replace (gsi_p, g, true);
+	      }
+	  }
+      }
+      break;
+
+    default:
+      break;
+    }
+
+  return NULL_TREE;
+}
+
+namespace {
+
+const pass_data pass_data_usm_transform =
+{
+  GIMPLE_PASS, /* type */
+  "usm_transform", /* name */
+  OPTGROUP_OMP, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_gimple_any, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_usm_transform : public gimple_opt_pass
+{
+public:
+  pass_usm_transform (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_usm_transform, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return (flag_openmp || flag_openmp_simd)
+	    && (flag_offload_memory == OFFLOAD_MEMORY_UNIFIED
+		|| omp_requires_mask & OMP_REQUIRES_UNIFIED_SHARED_MEMORY);
+  }
+  virtual unsigned int execute (function *)
+  {
+    struct walk_stmt_info wi;
+    gimple_seq body = gimple_body (current_function_decl);
+
+    memset (&wi, 0, sizeof (wi));
+    walk_gimple_seq (body, usm_transform, NULL, &wi);
+
+    return 0;
+  }
+
+}; // class pass_usm_transform
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_usm_transform (gcc::context *ctxt)
+{
+  return new pass_usm_transform (ctxt);
+}
 \f
 
 #include "gt-omp-low.h"
diff --git a/gcc/passes.def b/gcc/passes.def
index f7718181038..98c7736bb8b 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -36,6 +36,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_diagnose_tm_blocks);
   NEXT_PASS (pass_omp_oacc_kernels_decompose);
   NEXT_PASS (pass_lower_omp);
+  NEXT_PASS (pass_usm_transform);
   NEXT_PASS (pass_lower_cf);
   NEXT_PASS (pass_lower_tm);
   NEXT_PASS (pass_refactor_eh);
diff --git a/gcc/testsuite/c-c++-common/gomp/usm-2.c b/gcc/testsuite/c-c++-common/gomp/usm-2.c
new file mode 100644
index 00000000000..2f3f986012c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/gomp/usm-2.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-fdump-tree-usm_transform" } */
+
+#pragma omp requires unified_shared_memory
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+void *malloc (__SIZE_TYPE__);
+void *calloc(__SIZE_TYPE__, __SIZE_TYPE__);
+void *realloc(void *, __SIZE_TYPE__);
+void free (void *);
+
+#ifdef __cplusplus
+}
+#endif
+
+void
+foo ()
+{
+  void *p1 = malloc(20);
+  void *p2 = realloc(p1, 30);
+  void *p3 = calloc(4, 15);
+  free (p2);
+  free (p3);
+}
+
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_free" 2 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " free"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " malloc"  "usm_transform"  } } */
diff --git a/gcc/testsuite/c-c++-common/gomp/usm-3.c b/gcc/testsuite/c-c++-common/gomp/usm-3.c
new file mode 100644
index 00000000000..c8230e7ff7c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/gomp/usm-3.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-foffload-memory=unified -fdump-tree-usm_transform" } */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+void *malloc (__SIZE_TYPE__);
+void *calloc(__SIZE_TYPE__, __SIZE_TYPE__);
+void *realloc(void *, __SIZE_TYPE__);
+void free (void *);
+
+#ifdef __cplusplus
+}
+#endif
+
+void
+foo ()
+{
+  void *p1 = malloc(20);
+  void *p2 = realloc(p1, 30);
+  void *p3 = calloc(4, 15);
+  free (p2);
+  free (p3);
+}
+
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_free" 2 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " free"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " malloc"  "usm_transform"  } } */
diff --git a/gcc/testsuite/g++.dg/gomp/usm-1.C b/gcc/testsuite/g++.dg/gomp/usm-1.C
new file mode 100644
index 00000000000..bd70a81b5bb
--- /dev/null
+++ b/gcc/testsuite/g++.dg/gomp/usm-1.C
@@ -0,0 +1,32 @@
+// { dg-do compile }
+// { dg-options "-fopenmp -fdump-tree-usm_transform" }
+
+#pragma omp requires unified_shared_memory
+
+struct t1
+{
+  int a;
+  int b;
+};
+
+typedef unsigned char uint8_t;
+
+void
+foo (__SIZE_TYPE__ x, __SIZE_TYPE__ y)
+{
+  uint8_t *p1 = new uint8_t;
+  uint8_t *p2 = new uint8_t[20];
+  t1 *p3 = new t1;
+  t1 *p4 = new t1[y];
+  delete p1;
+  delete p3;
+  delete [] p2;
+  delete [] p4;
+}
+
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(1, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_alloc" 4 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not "operator new"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not "operator delete"  "usm_transform"  } } */
diff --git a/gcc/testsuite/g++.dg/gomp/usm-2.C b/gcc/testsuite/g++.dg/gomp/usm-2.C
new file mode 100644
index 00000000000..f6ab155c6de
--- /dev/null
+++ b/gcc/testsuite/g++.dg/gomp/usm-2.C
@@ -0,0 +1,30 @@
+// { dg-do compile }
+// { dg-options "-fopenmp -foffload-memory=unified -fdump-tree-usm_transform" }
+
+struct t1
+{
+  int a;
+  int b;
+};
+
+typedef unsigned char uint8_t;
+
+void
+foo (__SIZE_TYPE__ x, __SIZE_TYPE__ y)
+{
+  uint8_t *p1 = new uint8_t;
+  uint8_t *p2 = new uint8_t[20];
+  t1 *p3 = new t1;
+  t1 *p4 = new t1[y];
+  delete p1;
+  delete p3;
+  delete [] p2;
+  delete [] p4;
+}
+
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(1, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_alloc" 4 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not "operator new"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not "operator delete"  "usm_transform"  } } */
diff --git a/gcc/testsuite/g++.dg/gomp/usm-3.C b/gcc/testsuite/g++.dg/gomp/usm-3.C
new file mode 100644
index 00000000000..50ac9302c8b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/gomp/usm-3.C
@@ -0,0 +1,38 @@
+// { dg-do compile }
+// { dg-options "-fopenmp -fdump-tree-usm_transform" }
+
+#pragma omp requires unified_shared_memory
+
+#include <new>
+
+
+struct X {
+    static void* operator new(std::size_t count)
+    {
+      static char buf[10];
+      return &buf[0];
+    }
+    static void* operator new[](std::size_t count)
+    {
+      static char buf[10];
+      return &buf[0];
+    }
+    static void operator delete(void*)
+    {
+    }
+    static void operator delete[](void*)
+    {
+    }
+};
+void foo() {
+  X* p1 = new X;
+  delete p1;
+  X* p2 = new X[10];
+  delete[] p2;
+  unsigned char buf[24] ;
+  int *p3 = new (buf) int(3);
+  p3[0] = 1;
+}
+
+/* { dg-final { scan-tree-dump-not "omp_alloc"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not "omp_free"  "usm_transform"  } } */
diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-2.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-2.f90
new file mode 100644
index 00000000000..dc775260cb7
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/gomp/usm-2.f90
@@ -0,0 +1,16 @@
+! { dg-do compile }
+! { dg-additional-options "-fdump-tree-usm_transform" }
+
+!$omp requires unified_shared_memory
+end
+
+subroutine foo()
+  implicit none
+  integer, allocatable :: var1
+
+  allocate(var1)
+
+end subroutine
+
+! { dg-final { scan-tree-dump-times "omp_alloc" 1 "usm_transform"  } } 
+! { dg-final { scan-tree-dump-times "omp_free" 1 "usm_transform"  } } 
\ No newline at end of file
diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-3.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-3.f90
new file mode 100644
index 00000000000..7983444ebff
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/gomp/usm-3.f90
@@ -0,0 +1,13 @@
+! { dg-do compile }
+! { dg-additional-options "-foffload-memory=unified -fdump-tree-usm_transform" }
+
+subroutine foo()
+  implicit none
+  integer, allocatable :: var1
+
+  allocate(var1)
+
+end subroutine
+
+! { dg-final { scan-tree-dump-times "omp_alloc" 1 "usm_transform"  } } 
+! { dg-final { scan-tree-dump-times "omp_free" 1 "usm_transform"  } } 
\ No newline at end of file
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 606d1d60b85..494a9662afa 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -424,6 +424,7 @@ extern gimple_opt_pass *make_pass_lower_vector_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_oacc_kernels_decompose (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lower_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_usm_transform (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_target_link (gcc::context *ctxt);
diff --git a/libgomp/testsuite/libgomp.c++/usm-1.C b/libgomp/testsuite/libgomp.c++/usm-1.C
new file mode 100644
index 00000000000..fea25e5f10b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c++/usm-1.C
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-skip-if "Only valid for nvptx" { ! offload_target_nvptx } } */
+#include <stdint.h>
+
+#pragma omp requires unified_shared_memory
+
+int g1 = 0;
+
+struct s1
+{
+  s1() { a = g1++;}
+  ~s1() { g1--;}
+  int a;
+};
+
+int
+main ()
+{
+  s1 *p1 = new s1;
+  s1 *p2 = new s1[10];
+
+  if (!p1 || !p2 || p1->a != 0)
+    __builtin_abort ();
+
+  for (int i = 0; i < 10; i++)
+    if (p2[i].a != i+1)
+      __builtin_abort ();
+
+  uintptr_t pp1 = (uintptr_t)p1;
+  uintptr_t pp2 = (uintptr_t)p2;
+
+#pragma omp target firstprivate(pp1, pp2)
+    {
+      s1 *t1 = (s1*)pp1;
+      s1 *t2 = (s1*)pp2;
+      if (t1->a != 0)
+	__builtin_abort ();
+
+      for (int i = 0; i < 10; i++)
+	if (t2[i].a != i+1)
+	  __builtin_abort ();
+
+      t1->a = 42;
+    }
+
+  if (p1->a != 42)
+    __builtin_abort ();
+
+  delete [] p2;
+  delete p1;
+  if (g1 != 0)
+    __builtin_abort ();
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/usm-6.c b/libgomp/testsuite/libgomp.c/usm-6.c
new file mode 100644
index 00000000000..d98da68a1ed
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/usm-6.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-skip-if "Only valid for nvptx" { ! offload_target_nvptx } } */
+
+#include <stdint.h>
+#include <stdlib.h>
+
+#pragma omp requires unified_shared_memory
+
+int
+main ()
+{
+  int *a = (int *) malloc(sizeof(int)*2);
+  int *b = (int *) calloc(sizeof(int), 3);
+  int *c = (int *) realloc(NULL, sizeof(int) * 4);
+  if (!a || !b || !c)
+    __builtin_abort ();
+
+  a[0] = 42;
+  a[1] = 43;
+  b[0] = 52;
+  b[1] = 53;
+  b[2] = 54;
+  c[0] = 62;
+  c[1] = 63;
+  c[2] = 64;
+  c[3] = 65;
+
+  uintptr_t a_p = (uintptr_t)a;
+  uintptr_t b_p = (uintptr_t)b;
+  uintptr_t c_p = (uintptr_t)c;
+
+#pragma omp target enter data map(to:a[0:2])
+
+#pragma omp target is_device_ptr(c)
+    {
+      if (a[0] != 42 || a_p != (uintptr_t)a)
+	__builtin_abort ();
+      if (b[0] != 52 || b[2] != 54 || b_p != (uintptr_t)b)
+	__builtin_abort ();
+      if (c[0] != 62 || c[3] != 65 || c_p != (uintptr_t)c)
+	__builtin_abort ();
+      a[0] = 72;
+      b[0] = 82;
+      c[0] = 92;
+    }
+
+#pragma omp target
+    {
+      if (a[1] != 43 || a_p != (uintptr_t)a)
+	__builtin_abort ();
+      if (b[1] != 53 || b_p != (uintptr_t)b)
+	__builtin_abort ();
+      if (c[1] != 63 || c[2] != 64 || c_p != (uintptr_t)c)
+	__builtin_abort ();
+      a[1] = 73;
+      b[1] = 83;
+      c[1] = 93;
+    }
+
+#pragma omp target exit data map(delete:a[0:2])
+
+  if (a[0] != 72 || a[1] != 73
+      || b[0] != 82 || b[1] != 83
+      || c[0] != 92 || c[1] != 93)
+	__builtin_abort ();
+  free(a);
+  free(b);
+  free(c);
+  return 0;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 5/5] openmp: -foffload-memory=pinned
  2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
                   ` (3 preceding siblings ...)
  2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer
@ 2022-03-08 11:30 ` Hafiz Abid Qadeer
  2022-03-30 22:40   ` Andrew Stubbs
  2023-02-09 11:16   ` [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned) Thomas Schwinge
  2022-04-13 13:14 ` [PATCH 0/5] openmp: Handle pinned and unified shared memory Andrew Stubbs
  2022-04-20 13:25 ` [PATCH] openmp: Handle unified address memory Andrew Stubbs
  6 siblings, 2 replies; 18+ messages in thread
From: Hafiz Abid Qadeer @ 2022-03-08 11:30 UTC (permalink / raw)
  To: gcc-patches, fortran; +Cc: jakub, ams, joseph

From: Andrew Stubbs <ams@codesourcery.com>

Implement the -foffload-memory=pinned option such that libgomp is
instructed to enable fully-pinned memory at start-up.  The option is
intended to provide a performance boost to certain offload programs without
modifying the code.

This feature only works on Linux, at present, and simply calls mlockall to
enable always-on memory pinning.  It requires that the ulimit feature is
set high enough to accommodate all the program's memory usage.

In this mode the ompx_pinned_memory_alloc feature is disabled as it is not
needed and may conflict.

gcc/ChangeLog:

	* omp-low.cc (omp_enable_pinned_mode): New function.
	(execute_lower_omp): Call omp_enable_pinned_mode.

libgomp/ChangeLog:

	* config/linux/allocator.c (always_pinned_mode): New variable.
	(GOMP_enable_pinned_mode): New function.
	(linux_memspace_alloc): Disable pinning when always_pinned_mode set.
	(linux_memspace_calloc): Likewise.
	(linux_memspace_free): Likewise.
	(linux_memspace_realloc): Likewise.
	* libgomp.map (GOMP_5.1.1): New version space with
	GOMP_enable_pinned_mode.
	* testsuite/libgomp.c/alloc-pinned-7.c: New test.

gcc/testsuite/ChangeLog:

	* c-c++-common/gomp/alloc-pinned-1.c: New test.
---
 gcc/omp-low.cc                                | 68 +++++++++++++++++++
 .../c-c++-common/gomp/alloc-pinned-1.c        | 28 ++++++++
 libgomp/config/linux/allocator.c              | 26 +++++++
 libgomp/libgomp.map                           |  5 ++
 libgomp/testsuite/libgomp.c/alloc-pinned-7.c  | 66 ++++++++++++++++++
 5 files changed, 193 insertions(+)
 create mode 100644 gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c

diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index ec08d59f676..ce21b3bd6f8 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -14441,6 +14441,70 @@ lower_omp (gimple_seq *body, omp_context *ctx)
   input_location = saved_location;
 }
 
+/* Emit a constructor function to enable -foffload-memory=pinned
+   at runtime.  Libgomp handles the OS mode setting, but we need to trigger
+   it by calling GOMP_enable_pinned mode before the program proper runs.  */
+
+static void
+omp_enable_pinned_mode ()
+{
+  static bool visited = false;
+  if (visited)
+    return;
+  visited = true;
+
+  /* Create a new function like this:
+
+       static void __attribute__((constructor))
+       __set_pinned_mode ()
+       {
+	 GOMP_enable_pinned_mode ();
+       }
+  */
+
+  tree name = get_identifier ("__set_pinned_mode");
+  tree voidfntype = build_function_type_list (void_type_node, NULL_TREE);
+  tree decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, name, voidfntype);
+
+  TREE_STATIC (decl) = 1;
+  TREE_USED (decl) = 1;
+  DECL_ARTIFICIAL (decl) = 1;
+  DECL_IGNORED_P (decl) = 0;
+  TREE_PUBLIC (decl) = 0;
+  DECL_UNINLINABLE (decl) = 1;
+  DECL_EXTERNAL (decl) = 0;
+  DECL_CONTEXT (decl) = NULL_TREE;
+  DECL_INITIAL (decl) = make_node (BLOCK);
+  BLOCK_SUPERCONTEXT (DECL_INITIAL (decl)) = decl;
+  DECL_STATIC_CONSTRUCTOR (decl) = 1;
+  DECL_ATTRIBUTES (decl) = tree_cons (get_identifier ("constructor"),
+				      NULL_TREE, NULL_TREE);
+
+  tree t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE,
+		       void_type_node);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_IGNORED_P (t) = 1;
+  DECL_CONTEXT (t) = decl;
+  DECL_RESULT (decl) = t;
+
+  push_struct_function (decl);
+  init_tree_ssa (cfun);
+
+  tree callname = get_identifier ("GOMP_enable_pinned_mode");
+  tree calldecl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, callname,
+			      voidfntype);
+  gcall *call = gimple_build_call (calldecl, 0);
+
+  gimple_seq seq = NULL;
+  gimple_seq_add_stmt (&seq, call);
+  gimple_set_body (decl, gimple_build_bind (NULL_TREE, seq, NULL));
+
+  cfun->function_end_locus = UNKNOWN_LOCATION;
+  cfun->curr_properties |= PROP_gimple_any;
+  pop_cfun ();
+  cgraph_node::add_new_function (decl, true);
+}
+
 /* Main entry point.  */
 
 static unsigned int
@@ -14497,6 +14561,10 @@ execute_lower_omp (void)
   for (auto task_stmt : task_cpyfns)
     finalize_task_copyfn (task_stmt);
   task_cpyfns.release ();
+
+  if (flag_offload_memory == OFFLOAD_MEMORY_PINNED)
+    omp_enable_pinned_mode ();
+
   return 0;
 }
 
diff --git a/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c b/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c
new file mode 100644
index 00000000000..e0e08019bff
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c
@@ -0,0 +1,28 @@
+/* { dg-do run } */
+/* { dg-additional-options "-foffload-memory=pinned" } */
+/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
+
+#if __cplusplus
+#define EXTERNC extern "C"
+#else
+#define EXTERNC
+#endif
+
+/* Intercept the libgomp initialization call to check it happens.  */
+
+int good = 0;
+
+EXTERNC void
+GOMP_enable_pinned_mode ()
+{
+  good = 1;
+}
+
+int
+main ()
+{
+  if (!good)
+    __builtin_exit (1);
+
+  return 0;
+}
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index face524259c..4bd5bd6b930 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -39,9 +39,26 @@
 #include <string.h>
 #include "libgomp.h"
 
+static bool always_pinned_mode = false;
+
+/* This function is called by the compiler when -foffload-memory=pinned
+   is used.  */
+
+void
+GOMP_enable_pinned_mode ()
+{
+  if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0)
+    gomp_error ("failed to pin all memory (ulimit too low?)");
+  else
+    always_pinned_mode = true;
+}
+
 static void *
 linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
+  /* Explicit pinning may not be required.  */
+  pin = pin && !always_pinned_mode;
+
   if (memspace == ompx_unified_shared_mem_space)
     {
       return gomp_usm_alloc (size, GOMP_DEVICE_ICV);
@@ -69,6 +86,9 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 static void *
 linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
+  /* Explicit pinning may not be required.  */
+  pin = pin && !always_pinned_mode;
+
   if (memspace == ompx_unified_shared_mem_space)
     {
       void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV);
@@ -86,6 +106,9 @@ static void
 linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
 		     int pin)
 {
+  /* Explicit pinning may not be required.  */
+  pin = pin && !always_pinned_mode;
+
   if (memspace == ompx_unified_shared_mem_space)
     gomp_usm_free (addr, GOMP_DEVICE_ICV);
   else if (pin)
@@ -98,6 +121,9 @@ static void *
 linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
 			size_t oldsize, size_t size, int oldpin, int pin)
 {
+  /* Explicit pinning may not be required.  */
+  pin = pin && !always_pinned_mode;
+
   if (memspace == ompx_unified_shared_mem_space)
     goto manual_realloc;
   else if (oldpin && pin)
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 2ac58094169..40402dc9893 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -402,6 +402,11 @@ GOMP_5.1 {
 	GOMP_teams4;
 } GOMP_5.0.1;
 
+GOMP_5.1.1 {
+  global:
+	GOMP_enable_pinned_mode;
+} GOMP_5.1;
+
 OACC_2.0 {
   global:
 	acc_get_num_devices;
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c
new file mode 100644
index 00000000000..6fd19b46a5c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c
@@ -0,0 +1,66 @@
+/* { dg-do run } */
+/* { dg-additional-options "-foffload-memory=pinned" } */
+
+/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
+
+/* Test that pinned memory works.  */
+
+#ifdef __linux__
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/mman.h>
+
+int
+get_pinned_mem ()
+{
+  int pid = getpid ();
+  char buf[100];
+  sprintf (buf, "/proc/%d/status", pid);
+
+  FILE *proc = fopen (buf, "r");
+  if (!proc)
+    abort ();
+  while (fgets (buf, 100, proc))
+    {
+      int val;
+      if (sscanf (buf, "VmLck: %d", &val))
+	{
+	  printf ("lock %d\n", val);
+	  fclose (proc);
+	  return val;
+	}
+    }
+  abort ();
+}
+#else
+int
+get_pinned_mem ()
+{
+  return 0;
+}
+
+#define mlockall(...) 0
+#endif
+
+#include <omp.h>
+
+/* Allocate more than a page each time, but stay within the ulimit.  */
+#define SIZE 10*1024
+
+int
+main ()
+{
+  // Sanity check
+  if (get_pinned_mem () == 0)
+    {
+      /* -foffload-memory=pinned has failed, but maybe that's because
+	 isufficient pinned memory was available.  */
+      if (mlockall (MCL_CURRENT | MCL_FUTURE) == 0)
+	abort ();
+    }
+
+  return 0;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] openmp: -foffload-memory=pinned
  2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer
@ 2022-03-30 22:40   ` Andrew Stubbs
  2023-02-09 11:16   ` [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned) Thomas Schwinge
  1 sibling, 0 replies; 18+ messages in thread
From: Andrew Stubbs @ 2022-03-30 22:40 UTC (permalink / raw)
  To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph

[-- Attachment #1: Type: text/plain, Size: 324 bytes --]

On 08/03/2022 11:30, Hafiz Abid Qadeer wrote:
> gcc/ChangeLog:
> 
> 	* omp-low.cc (omp_enable_pinned_mode): New function.
> 	(execute_lower_omp): Call omp_enable_pinned_mode.

This worked for x86_64, but I needed to make the attached adjustment to 
work on powerpc without a linker error.

I've committed it to OG11.

Andrew

[-- Attachment #2: 220330-gomp_enable_pinned_mode.patch --]
[-- Type: text/plain, Size: 1360 bytes --]

openmp: BUILT_IN_GOMP_ENABLE_PINNED_MODE

Rework the GOMP_enable_pinned_mode call so that it works on powerpc where
the old way gave a local call.

gcc/ChangeLog:

	* omp-builtins.def (BUILT_IN_GOMP_ENABLE_PINNED_MODE): New.
	* omp-low.c (omp_enable_pinned_mode): Use
	BUILT_IN_GOMP_ENABLE_PINNED_MODE.

diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index c591d79fa07..e442b0b5c94 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -468,3 +468,6 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ERROR, "GOMP_error",
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_EVALUATE_TARGET_DEVICE, "GOMP_evaluate_target_device",
 		  BT_FN_BOOL_INT_CONST_PTR_CONST_PTR_CONST_PTR,
 		  ATTR_NOTHROW_LEAF_LIST)
+DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ENABLE_PINNED_MODE,
+		  "GOMP_enable_pinned_mode",
+		  BT_FN_VOID, ATTR_NOTHROW_LIST)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index f7ecfb52c73..4e8ab9e4ca0 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -15432,9 +15432,7 @@ omp_enable_pinned_mode ()
   push_struct_function (decl);
   init_tree_ssa (cfun);
 
-  tree callname = get_identifier ("GOMP_enable_pinned_mode");
-  tree calldecl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, callname,
-			      voidfntype);
+  tree calldecl = builtin_decl_explicit (BUILT_IN_GOMP_ENABLE_PINNED_MODE);
   gcall *call = gimple_build_call (calldecl, 0);
 
   gimple_seq seq = NULL;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory.
  2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer
@ 2022-04-02 12:04   ` Andrew Stubbs
  2022-04-02 12:42     ` Andrew Stubbs
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Stubbs @ 2022-04-02 12:04 UTC (permalink / raw)
  To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph

[-- Attachment #1: Type: text/plain, Size: 491 bytes --]

On 08/03/2022 11:30, Hafiz Abid Qadeer wrote:
> This patches changes calls to malloc/free/calloc/realloc and operator new to
> memory allocation functions in libgomp with
> allocator=ompx_unified_shared_mem_alloc.

This additional patch adds transformation for omp_target_alloc. The 
OpenMP 5.0 document says that addresses allocated this way needs to work 
without is_device_ptr. The easiest way to make that work is to make them 
USM addresses.

I will commit this to OG11 shortly.

Andrew

[-- Attachment #2: 220401-usm-omp_target_alloc.patch --]
[-- Type: text/plain, Size: 6595 bytes --]

openmp: Do USM transform for omp_target_alloc

OpenMP 5.0 says that omp_target_alloc should return USM addresses.

gcc/ChangeLog:

	* omp-low.c (usm_transform): Transform omp_target_alloc and
	omp_target_free.

libgomp/ChangeLog:

	* testsuite/libgomp.c/usm-6.c: Add omp_target_alloc.

gcc/testsuite/ChangeLog:

	* c-c++-common/gomp/usm-2.c: Add omp_target_alloc.
	* c-c++-common/gomp/usm-3.c: Add omp_target_alloc.

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 4e8ab9e4ca0..9235eafd1d7 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -15880,7 +15880,8 @@ usm_transform (gimple_stmt_iterator *gsi_p, bool *,
 	    if ((strcmp (name, "malloc") == 0)
 		 || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL)
 		     && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_MALLOC)
-		 || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl))
+		 || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl)
+		 || strcmp (name, "omp_target_alloc") == 0)
 	      {
 		  tree omp_alloc_type
 		    = build_function_type_list (ptr_type_node, size_type_node,
@@ -15952,7 +15953,8 @@ usm_transform (gimple_stmt_iterator *gsi_p, bool *,
 		       || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL)
 			   && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_FREE)
 		       || (DECL_IS_OPERATOR_DELETE_P (fndecl)
-			   && DECL_IS_REPLACEABLE_OPERATOR (fndecl)))
+			   && DECL_IS_REPLACEABLE_OPERATOR (fndecl))
+		       || strcmp (name, "omp_target_free") == 0)
 	      {
 		tree omp_free_type
 		  = build_function_type_list (void_type_node, ptr_type_node,
diff --git a/gcc/testsuite/c-c++-common/gomp/usm-2.c b/gcc/testsuite/c-c++-common/gomp/usm-2.c
index 64dbb6be131..8c20ef94e69 100644
--- a/gcc/testsuite/c-c++-common/gomp/usm-2.c
+++ b/gcc/testsuite/c-c++-common/gomp/usm-2.c
@@ -12,6 +12,8 @@ void *aligned_alloc (__SIZE_TYPE__, __SIZE_TYPE__);
 void *calloc(__SIZE_TYPE__, __SIZE_TYPE__);
 void *realloc(void *, __SIZE_TYPE__);
 void free (void *);
+void *omp_target_alloc (__SIZE_TYPE__, int);
+void omp_target_free (void *, int);
 
 #ifdef __cplusplus
 }
@@ -24,16 +26,21 @@ foo ()
   void *p2 = realloc(p1, 30);
   void *p3 = calloc(4, 15);
   void *p4 = aligned_alloc(16, 40);
+  void *p5 = omp_target_alloc(50, 1);
   free (p2);
   free (p3);
   free (p4);
+  omp_target_free (p5, 1);
 }
 
 /* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-times "omp_aligned_alloc \\(16, 40, 10\\)" 1 "usm_transform"  } } */
-/* { dg-final { scan-tree-dump-times "omp_free" 3 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(50, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-not " free"  "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-not " aligned_alloc"  "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-not " malloc"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " omp_target_alloc"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " omp_target_free"  "usm_transform"  } } */
diff --git a/gcc/testsuite/c-c++-common/gomp/usm-3.c b/gcc/testsuite/c-c++-common/gomp/usm-3.c
index 934582ea5fd..2b0cbb45e27 100644
--- a/gcc/testsuite/c-c++-common/gomp/usm-3.c
+++ b/gcc/testsuite/c-c++-common/gomp/usm-3.c
@@ -10,6 +10,8 @@ void *aligned_alloc (__SIZE_TYPE__, __SIZE_TYPE__);
 void *calloc(__SIZE_TYPE__, __SIZE_TYPE__);
 void *realloc(void *, __SIZE_TYPE__);
 void free (void *);
+void *omp_target_alloc (__SIZE_TYPE__, int);
+void omp_target_free (void *, int);
 
 #ifdef __cplusplus
 }
@@ -22,16 +24,21 @@ foo ()
   void *p2 = realloc(p1, 30);
   void *p3 = calloc(4, 15);
   void *p4 = aligned_alloc(16, 40);
+  void *p5 = omp_target_alloc(50, 1);
   free (p2);
   free (p3);
   free (p4);
+  omp_target_free (p5, 1);
 }
 
 /* { dg-final { scan-tree-dump-times "omp_alloc \\(20, 10\\)" 1 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-times "omp_realloc \\(.*, 30, 10, 10\\)" 1 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-times "omp_calloc \\(4, 15, 10\\)" 1 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-times "omp_aligned_alloc \\(16, 40, 10\\)" 1 "usm_transform"  } } */
-/* { dg-final { scan-tree-dump-times "omp_free" 3 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_alloc \\(50, 10\\)" 1 "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-times "omp_free" 4 "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-not " free"  "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-not " aligned_alloc"  "usm_transform"  } } */
 /* { dg-final { scan-tree-dump-not " malloc"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " omp_target_alloc"  "usm_transform"  } } */
+/* { dg-final { scan-tree-dump-not " omp_target_free"  "usm_transform"  } } */
diff --git a/libgomp/testsuite/libgomp.c/usm-6.c b/libgomp/testsuite/libgomp.c/usm-6.c
index d2c828fdc9d..c207140092a 100644
--- a/libgomp/testsuite/libgomp.c/usm-6.c
+++ b/libgomp/testsuite/libgomp.c/usm-6.c
@@ -4,6 +4,8 @@
 #include <stdint.h>
 #include <stdlib.h>
 
+#include <omp.h>
+
 /* On old systems, the declaraition may not be present in stdlib.h which
    will generate a warning.  This function is going to be replaced with
    omp_aligned_alloc so the purpose of this declaration is to avoid that
@@ -19,7 +21,8 @@ main ()
   int *b = (int *) calloc(sizeof(int), 3);
   int *c = (int *) realloc(NULL, sizeof(int) * 4);
   int *d = (int *) aligned_alloc(32, sizeof(int));
-  if (!a || !b || !c || !d)
+  int *e = (int *) omp_target_alloc(sizeof(int), 1);
+  if (!a || !b || !c || !d || !e)
     __builtin_abort ();
 
   a[0] = 42;
@@ -36,6 +39,7 @@ main ()
   uintptr_t b_p = (uintptr_t)b;
   uintptr_t c_p = (uintptr_t)c;
   uintptr_t d_p = (uintptr_t)d;
+  uintptr_t e_p = (uintptr_t)e;
 
   if (d_p & 31 != 0)
     __builtin_abort ();
@@ -52,9 +56,12 @@ main ()
 	__builtin_abort ();
       if (d_p != (uintptr_t)d)
 	__builtin_abort ();
+      if (e_p != (uintptr_t)e)
+	__builtin_abort ();
       a[0] = 72;
       b[0] = 82;
       c[0] = 92;
+      e[0] = 102;
     }
 
 #pragma omp target
@@ -74,10 +81,12 @@ main ()
 
   if (a[0] != 72 || a[1] != 73
       || b[0] != 82 || b[1] != 83
-      || c[0] != 92 || c[1] != 93)
+      || c[0] != 92 || c[1] != 93
+      || e[0] != 102)
 	__builtin_abort ();
   free(a);
   free(b);
   free(c);
+  omp_target_free(e, 1);
   return 0;
 }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory.
  2022-04-02 12:04   ` Andrew Stubbs
@ 2022-04-02 12:42     ` Andrew Stubbs
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Stubbs @ 2022-04-02 12:42 UTC (permalink / raw)
  To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph

On 02/04/2022 13:04, Andrew Stubbs wrote:
> This additional patch adds transformation for omp_target_alloc. The 
> OpenMP 5.0 document says that addresses allocated this way needs to work 
> without is_device_ptr. The easiest way to make that work is to make them 
> USM addresses.

Actually, reading on, it says "Every device address allocated through 
OpenMP device memory routines is a valid host pointer", so USM is the 
correct answer.

Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] openmp: Handle pinned and unified shared memory.
  2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
                   ` (4 preceding siblings ...)
  2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer
@ 2022-04-13 13:14 ` Andrew Stubbs
  2022-04-20 13:25 ` [PATCH] openmp: Handle unified address memory Andrew Stubbs
  6 siblings, 0 replies; 18+ messages in thread
From: Andrew Stubbs @ 2022-04-13 13:14 UTC (permalink / raw)
  To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph

[-- Attachment #1: Type: text/plain, Size: 338 bytes --]

This patch adjusts the testcases, previously proposed, to allow for 
testing on machines with varying page sizes and default amounts of 
lockable memory.  There turns out to be more variation than I had thought.

This should go on mainline at the same time as the previous patches in 
this thread.

I'll commit it to OG11 shortly.

Andrew

[-- Attachment #2: 220413-pinned-tests.patch --]
[-- Type: text/plain, Size: 8831 bytes --]

libgomp: autodetect page sizes in pinned memory tests

There's not one number that works everywhere.
This also fixes the failure mode on non-Linux hosts.

libgomp/ChangeLog:

	* testsuite/libgomp.c/alloc-pinned-1.c: Autodetect page size.
	* testsuite/libgomp.c/alloc-pinned-2.c: Likewise.
	* testsuite/libgomp.c/alloc-pinned-3.c: Likewise.
	* testsuite/libgomp.c/alloc-pinned-4.c: Likewise.
	* testsuite/libgomp.c/alloc-pinned-5.c: Likewise.
	* testsuite/libgomp.c/alloc-pinned-6.c: Likewise.
	* testsuite/libgomp.c/alloc-pinned-7.c: Clean up.

diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
index 0a6360cda29..79792b16d83 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
@@ -4,13 +4,23 @@
 
 /* Test that pinned memory works.  */
 
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __linux__
 #include <sys/types.h>
 #include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
 
 #include <sys/mman.h>
+#include <sys/resource.h>
+
+#define PAGE_SIZE sysconf(_SC_PAGESIZE)
+#define CHECK_SIZE(SIZE) { \
+  struct rlimit limit; \
+  if (getrlimit (RLIMIT_MEMLOCK, &limit) \
+      || limit.rlim_cur <= SIZE) \
+    fprintf (stderr, "unsufficient lockable memory; please increase ulimit\n"); \
+  }
 
 int
 get_pinned_mem ()
@@ -34,6 +44,9 @@ get_pinned_mem ()
   abort ();
 }
 #else
+#define PAGE_SIZE 1 /* unknown */
+#define CHECK_SIZE(SIZE) fprintf (stderr, "OS unsupported\n");
+
 int
 get_pinned_mem ()
 {
@@ -43,12 +56,13 @@ get_pinned_mem ()
 
 #include <omp.h>
 
-/* Allocate more than a page each time, but stay within the ulimit.  */
-#define SIZE 10*1024
-
 int
 main ()
 {
+  /* Allocate at least a page each time, but stay within the ulimit.  */
+  const int SIZE = PAGE_SIZE;
+  CHECK_SIZE (SIZE*3);
+
   const omp_alloctrait_t traits[] = {
       { omp_atk_pinned, 1 }
   };
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
index 8fdb4ff5cfd..228c656b715 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
@@ -4,13 +4,23 @@
 
 /* Test that pinned memory works (pool_size code path).  */
 
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __linux__
 #include <sys/types.h>
 #include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
 
 #include <sys/mman.h>
+#include <sys/resource.h>
+
+#define PAGE_SIZE sysconf(_SC_PAGESIZE)
+#define CHECK_SIZE(SIZE) { \
+  struct rlimit limit; \
+  if (getrlimit (RLIMIT_MEMLOCK, &limit) \
+      || limit.rlim_cur <= SIZE) \
+    fprintf (stderr, "unsufficient lockable memory; please increase ulimit\n"); \
+  }
 
 int
 get_pinned_mem ()
@@ -34,6 +44,9 @@ get_pinned_mem ()
   abort ();
 }
 #else
+#define PAGE_SIZE 1 /* unknown */
+#define CHECK_SIZE(SIZE) fprintf (stderr, "OS unsupported\n");
+
 int
 get_pinned_mem ()
 {
@@ -43,12 +56,13 @@ get_pinned_mem ()
 
 #include <omp.h>
 
-/* Allocate more than a page each time, but stay within the ulimit.  */
-#define SIZE 10*1024
-
 int
 main ()
 {
+  /* Allocate at least a page each time, but stay within the ulimit.  */
+  const int SIZE = PAGE_SIZE;
+  CHECK_SIZE (SIZE*3);
+
   const omp_alloctrait_t traits[] = {
       { omp_atk_pinned, 1 },
       { omp_atk_pool_size, SIZE*8 }
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
index 943dfea5c9b..90539ffe3e0 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
@@ -2,15 +2,18 @@
 
 /* Test that pinned memory fails correctly.  */
 
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __linux__
 #include <sys/types.h>
 #include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
 
 #include <sys/mman.h>
 #include <sys/resource.h>
 
+#define PAGE_SIZE sysconf(_SC_PAGESIZE)
+
 int
 get_pinned_mem ()
 {
@@ -45,6 +48,8 @@ set_pin_limit (int size)
 }
 #else
 int
+#define PAGE_SIZE 10000*1024 /* unknown */
+
 get_pinned_mem ()
 {
   return 0;
@@ -58,12 +63,12 @@ set_pin_limit ()
 
 #include <omp.h>
 
-/* This should be large enough to cover multiple pages.  */
-#define SIZE 10000*1024
-
 int
 main ()
 {
+  /* This needs to be large enough to cover multiple pages.  */
+  const int SIZE = PAGE_SIZE*4;
+
   /* Pinned memory, no fallback.  */
   const omp_alloctrait_t traits1[] = {
       { omp_atk_pinned, 1 },
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
index d9cb8dfe1fd..534e49eefc4 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
@@ -2,15 +2,18 @@
 
 /* Test that pinned memory fails correctly, pool_size code path.  */
 
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __linux__
 #include <sys/types.h>
 #include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
 
 #include <sys/mman.h>
 #include <sys/resource.h>
 
+#define PAGE_SIZE sysconf(_SC_PAGESIZE)
+
 int
 get_pinned_mem ()
 {
@@ -45,6 +48,8 @@ set_pin_limit (int size)
 }
 #else
 int
+#define PAGE_SIZE 10000*1024 /* unknown */
+
 get_pinned_mem ()
 {
   return 0;
@@ -58,12 +63,12 @@ set_pin_limit ()
 
 #include <omp.h>
 
-/* This should be large enough to cover multiple pages.  */
-#define SIZE 10000*1024
-
 int
 main ()
 {
+  /* This needs to be large enough to cover multiple pages.  */
+  const int SIZE = PAGE_SIZE*4;
+
   /* Pinned memory, no fallback.  */
   const omp_alloctrait_t traits1[] = {
       { omp_atk_pinned, 1 },
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
index 8355ca83790..315c7161a39 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
@@ -4,13 +4,23 @@
 
 /* Test that ompx_pinned_mem_alloc works.  */
 
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __linux__
 #include <sys/types.h>
 #include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
 
 #include <sys/mman.h>
+#include <sys/resource.h>
+
+#define PAGE_SIZE sysconf(_SC_PAGESIZE)
+#define CHECK_SIZE(SIZE) { \
+  struct rlimit limit; \
+  if (getrlimit (RLIMIT_MEMLOCK, &limit) \
+      || limit.rlim_cur <= SIZE) \
+    fprintf (stderr, "unsufficient lockable memory; please increase ulimit\n"); \
+  }
 
 int
 get_pinned_mem ()
@@ -34,6 +44,9 @@ get_pinned_mem ()
   abort ();
 }
 #else
+#define PAGE_SIZE 1 /* unknown */
+#define CHECK_SIZE(SIZE) fprintf (stderr, "OS unsupported\n");
+
 int
 get_pinned_mem ()
 {
@@ -43,12 +56,13 @@ get_pinned_mem ()
 
 #include <omp.h>
 
-/* Allocate more than a page each time, but stay within the ulimit.  */
-#define SIZE 10*1024
-
 int
 main ()
 {
+  /* Allocate at least a page each time, but stay within the ulimit.  */
+  const int SIZE = PAGE_SIZE;
+  CHECK_SIZE (SIZE*3);
+
   // Sanity check
   if (get_pinned_mem () != 0)
     abort ();
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
index 80fd37ab875..bbe20c04875 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
@@ -2,15 +2,18 @@
 
 /* Test that ompx_pinned_mem_alloc fails correctly.  */
 
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __linux__
 #include <sys/types.h>
 #include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
 
 #include <sys/mman.h>
 #include <sys/resource.h>
 
+#define PAGE_SIZE sysconf(_SC_PAGESIZE)
+
 int
 get_pinned_mem ()
 {
@@ -44,6 +47,8 @@ set_pin_limit (int size)
     abort ();
 }
 #else
+#define PAGE_SIZE 10000*1024 /* unknown */
+
 int
 get_pinned_mem ()
 {
@@ -58,12 +63,12 @@ set_pin_limit ()
 
 #include <omp.h>
 
-/* This should be large enough to cover multiple pages.  */
-#define SIZE 10000*1024
-
 int
 main ()
 {
+  /* Allocate at least a page each time, but stay within the ulimit.  */
+  const int SIZE = PAGE_SIZE*4;
+
   /* Ensure that the limit is smaller than the allocation.  */
   set_pin_limit (SIZE/2);
 
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c
index 6fd19b46a5c..8dc19055038 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c
@@ -5,11 +5,12 @@
 
 /* Test that pinned memory works.  */
 
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __linux__
 #include <sys/types.h>
 #include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
 
 #include <sys/mman.h>
 
@@ -28,7 +29,6 @@ get_pinned_mem ()
       int val;
       if (sscanf (buf, "VmLck: %d", &val))
 	{
-	  printf ("lock %d\n", val);
 	  fclose (proc);
 	  return val;
 	}
@@ -47,9 +47,6 @@ get_pinned_mem ()
 
 #include <omp.h>
 
-/* Allocate more than a page each time, but stay within the ulimit.  */
-#define SIZE 10*1024
-
 int
 main ()
 {

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] openmp: Handle unified address memory.
  2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
                   ` (5 preceding siblings ...)
  2022-04-13 13:14 ` [PATCH 0/5] openmp: Handle pinned and unified shared memory Andrew Stubbs
@ 2022-04-20 13:25 ` Andrew Stubbs
  6 siblings, 0 replies; 18+ messages in thread
From: Andrew Stubbs @ 2022-04-20 13:25 UTC (permalink / raw)
  To: Hafiz Abid Qadeer, gcc-patches, fortran; +Cc: jakub, joseph

[-- Attachment #1: Type: text/plain, Size: 393 bytes --]

This patch adds enough support for "requires unified_address" to make 
the sollve_vv testcases pass. It implements unified_address as a synonym 
of unified_shared_memory, which is both valid and the only way I know of 
to unify addresses with Cuda (could be wrong).

This patch should be applied on to of the previous patch set for USM.

OK for stage 1?

I'll apply it to OG11 shortly.

Andrew

[-- Attachment #2: 220420-unified_address.patch --]
[-- Type: text/plain, Size: 6037 bytes --]

openmp: unified_address support

This makes "requires unified_address" work by making it eqivalent to
"requires unified_shared_memory".  This is more than is strictly necessary,
but should be standard compliant.

gcc/c/ChangeLog:

	* c-parser.c (c_parser_omp_requires): Check requires unified_address
	for conflict with -foffload-memory=shared.

gcc/cp/ChangeLog:

	* parser.c (cp_parser_omp_requires): Check requires unified_address
	for conflict with -foffload-memory=shared.

gcc/fortran/ChangeLog:

	* openmp.c (gfc_match_omp_requires): Check requires unified_address
	for conflict with -foffload-memory=shared.

gcc/ChangeLog:

	* omp-low.c: Do USM transformations for "unified_address".

gcc/testsuite/ChangeLog:

	* c-c++-common/gomp/usm-4.c: New test.
	* gfortran.dg/gomp/usm-4.f90: New test.

diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index 12408770193..9a3d0cb8cea 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -22531,18 +22531,27 @@ c_parser_omp_requires (c_parser *parser)
 	  enum omp_requires this_req = (enum omp_requires) 0;
 
 	  if (!strcmp (p, "unified_address"))
-	    this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
+	    {
+	      this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
+
+	      if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+		  && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+		error_at (cloc,
+			  "unified_address is incompatible with the "
+			  "selected -foffload-memory option");
+	      flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
+	    }
 	  else if (!strcmp (p, "unified_shared_memory"))
-	  {
-	    this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
-
-	    if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
-		&& flag_offload_memory != OFFLOAD_MEMORY_NONE)
-	      error_at (cloc,
-			"unified_shared_memory is incompatible with the "
-			"selected -foffload-memory option");
-	    flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
-	  }
+	    {
+	      this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+
+	      if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+		  && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+		error_at (cloc,
+			  "unified_shared_memory is incompatible with the "
+			  "selected -foffload-memory option");
+	      flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
+	    }
 	  else if (!strcmp (p, "dynamic_allocators"))
 	    this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS;
 	  else if (!strcmp (p, "reverse_offload"))
diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c
index fd9f62f4543..3a9ea272f10 100644
--- a/gcc/cp/parser.c
+++ b/gcc/cp/parser.c
@@ -46406,18 +46406,27 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok)
 	  enum omp_requires this_req = (enum omp_requires) 0;
 
 	  if (!strcmp (p, "unified_address"))
-	    this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
+	    {
+	      this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
+
+	      if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+		  && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+		error_at (cloc,
+			  "unified_address is incompatible with the "
+			  "selected -foffload-memory option");
+	      flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
+	    }
 	  else if (!strcmp (p, "unified_shared_memory"))
-	  {
-	    this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
-
-	    if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
-		&& flag_offload_memory != OFFLOAD_MEMORY_NONE)
-	      error_at (cloc,
-			"unified_shared_memory is incompatible with the "
-			"selected -foffload-memory option");
-	    flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
-	  }
+	    {
+	      this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+
+	      if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+		  && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+		error_at (cloc,
+			  "unified_shared_memory is incompatible with the "
+			  "selected -foffload-memory option");
+	      flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
+	    }
 	  else if (!strcmp (p, "dynamic_allocators"))
 	    this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS;
 	  else if (!strcmp (p, "reverse_offload"))
diff --git a/gcc/fortran/openmp.c b/gcc/fortran/openmp.c
index ac4126bd7ea..ece04c03a68 100644
--- a/gcc/fortran/openmp.c
+++ b/gcc/fortran/openmp.c
@@ -5546,6 +5546,12 @@ gfc_match_omp_requires (void)
 	  requires_clause = OMP_REQ_UNIFIED_ADDRESS;
 	  if (requires_clauses & OMP_REQ_UNIFIED_ADDRESS)
 	    goto duplicate_clause;
+
+	  if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+	      && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+	    gfc_error_now ("unified_address at %C is incompatible with "
+			   "the selected -foffload-memory option");
+	  flag_offload_memory = OFFLOAD_MEMORY_UNIFIED;
 	}
       else if (gfc_match (clauses[2]) == MATCH_YES)
 	{
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 4653370aa41..ce30f53dbb5 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -16008,7 +16008,8 @@ public:
   {
     return (flag_openmp || flag_openmp_simd)
 	    && (flag_offload_memory == OFFLOAD_MEMORY_UNIFIED
-		|| omp_requires_mask & OMP_REQUIRES_UNIFIED_SHARED_MEMORY);
+		|| omp_requires_mask & OMP_REQUIRES_UNIFIED_SHARED_MEMORY
+		|| omp_requires_mask & OMP_REQUIRES_UNIFIED_ADDRESS);
   }
   virtual unsigned int execute (function *)
   {
diff --git a/gcc/testsuite/c-c++-common/gomp/usm-4.c b/gcc/testsuite/c-c++-common/gomp/usm-4.c
new file mode 100644
index 00000000000..b19664e9b66
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/gomp/usm-4.c
@@ -0,0 +1,4 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-foffload-memory=pinned" } */
+
+#pragma omp requires unified_address        /* { dg-error "unified_address is incompatible with the selected -foffload-memory option" } */
diff --git a/gcc/testsuite/gfortran.dg/gomp/usm-4.f90 b/gcc/testsuite/gfortran.dg/gomp/usm-4.f90
new file mode 100644
index 00000000000..725b07f2f88
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/gomp/usm-4.f90
@@ -0,0 +1,6 @@
+! { dg-do compile }
+! { dg-additional-options "-foffload-memory=pinned" }
+
+!$omp requires unified_address  ! { dg-error "unified_address at .* is incompatible with the selected -foffload-memory option" }
+
+end

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned)
  2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer
  2022-03-30 22:40   ` Andrew Stubbs
@ 2023-02-09 11:16   ` Thomas Schwinge
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Schwinge @ 2023-02-09 11:16 UTC (permalink / raw)
  To: gcc-patches, ams; +Cc: Hafiz Abid Qadeer, fortran, jakub, joseph

[-- Attachment #1: Type: text/plain, Size: 967 bytes --]

Hi!

On 2022-03-08T11:30:59+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote:
> From: Andrew Stubbs <ams@codesourcery.com>
>
> [...]

> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c
> @@ -0,0 +1,28 @@
> +/* { dg-do run } */

Pushed to devel/omp/gcc-12 branch
commit 9c0ffa3776a135a69697253a0bd75ebf9b9d0150
"'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c'",
see attached.

Note that this likewise applies to the current upstream submission:
<inbox.sourceware.org/gcc-patches/8011a994bb38db60f37127880b0fc682564f6e8d.1657188329.git.ams@codesourcery.com>
"openmp: -foffload-memory=pinned".


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-c-c-common-gomp-alloc-pinned-1.c-libgomp.c-c-common-.patch --]
[-- Type: text/x-diff, Size: 2046 bytes --]

From 9c0ffa3776a135a69697253a0bd75ebf9b9d0150 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Mon, 30 Jan 2023 17:46:29 +0100
Subject: [PATCH] 'c-c++-common/gomp/alloc-pinned-1.c' ->
 'libgomp.c-c++-common/alloc-pinned-1.c'

Otherwise, for build-tree testing:

    xgcc: fatal error: cannot read spec file 'libgomp.spec': No such file or directory

..., and thus corresponding FAILs, UNRESOLVEDs.

Fix-up for og12 commit 842df187487f5b16ae29bbe7e9acd79661a9df48
"openmp: -foffload-memory=pinned".

	gcc/testsuite/
	* c-c++-common/gomp/alloc-pinned-1.c: Cut.
	libgomp/
	* testsuite/libgomp.c-c++-common/alloc-pinned-1.c: Paste.
---
 gcc/testsuite/ChangeLog.omp                                   | 2 ++
 libgomp/ChangeLog.omp                                         | 4 ++++
 .../testsuite/libgomp.c-c++-common}/alloc-pinned-1.c          | 0
 3 files changed, 6 insertions(+)
 rename {gcc/testsuite/c-c++-common/gomp => libgomp/testsuite/libgomp.c-c++-common}/alloc-pinned-1.c (100%)

diff --git a/gcc/testsuite/ChangeLog.omp b/gcc/testsuite/ChangeLog.omp
index 42769c7dae5..9f9d5a10ac3 100644
--- a/gcc/testsuite/ChangeLog.omp
+++ b/gcc/testsuite/ChangeLog.omp
@@ -1,5 +1,7 @@
 2023-02-09  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* c-c++-common/gomp/alloc-pinned-1.c: Cut.
+
 	* gfortran.dg/gomp/allocate-4.f90: Fix 'omp_allocator_handle_kind'
 	example.
 
diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index d319d43ceb0..39165173884 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,3 +1,7 @@
+2023-02-09  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.c-c++-common/alloc-pinned-1.c: Paste.
+
 2023-02-08  Tobias Burnus  <tobias@codesourcery.com>
 
 	Backported from master:
diff --git a/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c-c++-common/alloc-pinned-1.c
similarity index 100%
rename from gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c
rename to libgomp/testsuite/libgomp.c-c++-common/alloc-pinned-1.c
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc
  2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer
@ 2023-02-10 14:21   ` Thomas Schwinge
  2023-02-10 15:31     ` Andrew Stubbs
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Schwinge @ 2023-02-10 14:21 UTC (permalink / raw)
  To: ams; +Cc: Hafiz Abid Qadeer, gcc-patches

Hi Andrew!

On 2022-03-08T11:30:57+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote:
> From: Andrew Stubbs <ams@codesourcery.com>
>
> This adds support for using Cuda Managed Memory with omp_alloc.  It will be
> used as the underpinnings for "requires unified_shared_memory" in a later
> patch.
>
> There are two new predefined allocators, ompx_unified_shared_mem_alloc and
> ompx_host_mem_alloc, plus corresponding memory spaces, [...]

> --- a/libgomp/config/linux/allocator.c
> +++ b/libgomp/config/linux/allocator.c
> @@ -42,9 +42,11 @@
>  static void *
>  linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
>  {
> -  (void)memspace;
> -
> -  if (pin)
> +  if (memspace == ompx_unified_shared_mem_space)
> +    {
> +      return gomp_usm_alloc (size, GOMP_DEVICE_ICV);
> +    }
> +  else if (pin)
>      {
>        void *addr = mmap (NULL, size, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

This I understand conceptually, but then:

> @@ -67,7 +69,14 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
>  static void *
>  linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
>  {
> -  if (pin)
> +  if (memspace == ompx_unified_shared_mem_space)
> +    {
> +      void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV);
> +      memset (ret, 0, size);
> +      return ret;
> +    }
> +  else if (memspace == ompx_unified_shared_mem_space
> +      || pin)
>      return linux_memspace_alloc (memspace, size, pin);
>    else
>      return calloc (1, size);

..., here, we've got a duplicated (and thus always-false) expression
'memspace == ompx_unified_shared_mem_space' (..., which
'-Wduplicated-cond' fails to report; <https://gcc.gnu.org/PR108753>
"'-Wduplicated-cond' doesn't diagnose duplicated subexpressions"...).
Is the correct fix the following (conceptually like
'linux_memspace_alloc' cited above), or is there something that I fail to
understand?

     static void *
     linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
     {
       if (memspace == ompx_unified_shared_mem_space)
         {
           void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV);
           memset (ret, 0, size);
           return ret;
         }
    -  else if (memspace == ompx_unified_shared_mem_space
    -      || pin)
    +  else if (pin)
         return linux_memspace_alloc (memspace, size, pin);
       else
         return calloc (1, size);

The following ones then again are conceptually like
'linux_memspace_alloc' cited above:

> @@ -77,9 +86,9 @@ static void
>  linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
>                    int pin)
>  {
> -  (void)memspace;
> -
> -  if (pin)
> +  if (memspace == ompx_unified_shared_mem_space)
> +    gomp_usm_free (addr, GOMP_DEVICE_ICV);
> +  else if (pin)
>      munmap (addr, size);
>    else
>      free (addr);
> @@ -89,7 +98,9 @@ static void *
>  linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
>                       size_t oldsize, size_t size, int oldpin, int pin)
>  {
> -  if (oldpin && pin)
> +  if (memspace == ompx_unified_shared_mem_space)
> +    goto manual_realloc;
> +  else if (oldpin && pin)
>      {
>        void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
>        if (newaddr == MAP_FAILED)
> @@ -98,18 +109,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
> [...]

..., and similar those here:

> --- a/libgomp/config/nvptx/allocator.c
> +++ b/libgomp/config/nvptx/allocator.c
> @@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
>        __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE);
>        return result;
>      }
> +  else if (memspace == ompx_host_mem_space)
> +    return NULL;
>    else
>      return malloc (size);
>  }
> @@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
>
>        return result;
>      }
> +  else if (memspace == ompx_host_mem_space)
> +    return NULL;
>    else
>      return calloc (1, size);
>  }
> @@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
>       }
>        return result;
>      }
> +  else if (memspace == ompx_host_mem_space)
> +    return NULL;
>    else
>      return realloc (addr, size);
>  }

(I'd have added an explicit no-op (or, 'abort'?) to
'nvptx_memspace_free', but that's maybe just me...)  ;-\


> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h

> +extern void * gomp_usm_alloc (size_t size, int device_num);
> +extern void gomp_usm_free (void *device_ptr, int device_num);
> +extern bool gomp_is_usm_ptr (void *ptr);

'gomp_is_usm_ptr' isn't defined/used anywhere; I'll remove it.


> --- a/libgomp/target.c
> +++ b/libgomp/target.c

> @@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
>    DLSYM (unload_image);
>    DLSYM (alloc);
>    DLSYM (free);
> +  DLSYM_OPT (usm_alloc, usm_alloc);
> +  DLSYM_OPT (usm_free, usm_free);
> +  DLSYM_OPT (is_usm_ptr, is_usm_ptr);
>    DLSYM (dev2host);
>    DLSYM (host2dev);

As a sanity check, shouldn't we check that either none or all three of
those are defined, like in the 'if (cuda && cuda != 4) { [error] }' check
a bit further down?


Note that these remarks likewise apply to the current upstream
submission:
<https://inbox.sourceware.org/gcc-patches/ef374d055251b2bc65b97d7e54a0a72d811b869d.1657188329.git.ams@codesourcery.com>
"openmp, nvptx: ompx_unified_shared_mem_alloc".


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc
  2023-02-10 14:21   ` Thomas Schwinge
@ 2023-02-10 15:31     ` Andrew Stubbs
  2023-02-16 21:24       ` [og12] Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space' (was: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc) Thomas Schwinge
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Stubbs @ 2023-02-10 15:31 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Hafiz Abid Qadeer, gcc-patches

On 10/02/2023 14:21, Thomas Schwinge wrote:
> Is the correct fix the following (conceptually like
> 'linux_memspace_alloc' cited above), or is there something that I fail to
> understand?
> 
>       static void *
>       linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
>       {
>         if (memspace == ompx_unified_shared_mem_space)
>           {
>             void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV);
>             memset (ret, 0, size);
>             return ret;
>           }
>      -  else if (memspace == ompx_unified_shared_mem_space
>      -      || pin)
>      +  else if (pin)
>           return linux_memspace_alloc (memspace, size, pin);
>         else
>           return calloc (1, size);

Yes, I think that is what was intended (and what actually happens). You 
can have your memory both unified and pinned (well, maybe it's possible, 
but there's no one Cuda API for that), so the USM takes precedence.

> The following ones then again are conceptually like
> 'linux_memspace_alloc' cited above:
> 
>> @@ -77,9 +86,9 @@ static void
>>   linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
>>                     int pin)
>>   {
>> -  (void)memspace;
>> -
>> -  if (pin)
>> +  if (memspace == ompx_unified_shared_mem_space)
>> +    gomp_usm_free (addr, GOMP_DEVICE_ICV);
>> +  else if (pin)
>>       munmap (addr, size);
>>     else
>>       free (addr);
>> @@ -89,7 +98,9 @@ static void *
>>   linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
>>                        size_t oldsize, size_t size, int oldpin, int pin)
>>   {
>> -  if (oldpin && pin)
>> +  if (memspace == ompx_unified_shared_mem_space)
>> +    goto manual_realloc;
>> +  else if (oldpin && pin)
>>       {
>>         void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
>>         if (newaddr == MAP_FAILED)
>> @@ -98,18 +109,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
>> [...]

Yes.
> ..., and similar those here:
> 
>> --- a/libgomp/config/nvptx/allocator.c
>> +++ b/libgomp/config/nvptx/allocator.c
>> @@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
>>         __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE);
>>         return result;
>>       }
>> +  else if (memspace == ompx_host_mem_space)
>> +    return NULL;
>>     else
>>       return malloc (size);
>>   }
>> @@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
>>
>>         return result;
>>       }
>> +  else if (memspace == ompx_host_mem_space)
>> +    return NULL;
>>     else
>>       return calloc (1, size);
>>   }
>> @@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
>>        }
>>         return result;
>>       }
>> +  else if (memspace == ompx_host_mem_space)
>> +    return NULL;
>>     else
>>       return realloc (addr, size);
>>   }
> 
> (I'd have added an explicit no-op (or, 'abort'?) to
> 'nvptx_memspace_free', but that's maybe just me...)  ;-\

Why? The host memspace is just the regular heap, which can be a thing on 
any device. It's an extension though so we can define it either way.

>> --- a/libgomp/libgomp.h
>> +++ b/libgomp/libgomp.h
> 
>> +extern void * gomp_usm_alloc (size_t size, int device_num);
>> +extern void gomp_usm_free (void *device_ptr, int device_num);
>> +extern bool gomp_is_usm_ptr (void *ptr);
> 
> 'gomp_is_usm_ptr' isn't defined/used anywhere; I'll remove it.

I think I started that and then decided against. Thanks.

>> --- a/libgomp/target.c
>> +++ b/libgomp/target.c
> 
>> @@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
>>     DLSYM (unload_image);
>>     DLSYM (alloc);
>>     DLSYM (free);
>> +  DLSYM_OPT (usm_alloc, usm_alloc);
>> +  DLSYM_OPT (usm_free, usm_free);
>> +  DLSYM_OPT (is_usm_ptr, is_usm_ptr);
>>     DLSYM (dev2host);
>>     DLSYM (host2dev);
> 
> As a sanity check, shouldn't we check that either none or all three of
> those are defined, like in the 'if (cuda && cuda != 4) { [error] }' check
> a bit further down?

This is only going to happen when somebody writes a new plugin, and then 
they'll discover very quickly that there are issues. I've wasted more 
time writing this sentence than it's worth already. :)

> Note that these remarks likewise apply to the current upstream
> submission:
> <https://inbox.sourceware.org/gcc-patches/ef374d055251b2bc65b97d7e54a0a72d811b869d.1657188329.git.ams@codesourcery.com>> "openmp, nvptx: ompx_unified_shared_mem_alloc".

I have new patches to heap on top of this set already on OG12, and more 
planned, plus these ones you're working on; the whole patchset is going 
to have to get a rebase, squash, and tidy "soonish".

Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory)
  2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer
@ 2023-02-13 14:38   ` Thomas Schwinge
  2023-02-13 15:20     ` Andrew Stubbs
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Schwinge @ 2023-02-13 14:38 UTC (permalink / raw)
  To: Andrew Stubbs, Hafiz Abid Qadeer, Jakub Jelinek, Tobias Burnus
  Cc: gcc-patches

Hi!

On 2022-03-08T11:30:55+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote:
> From: Andrew Stubbs <ams@codesourcery.com>
>
> Add a new option.  It will be used in follow-up patches.

> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi

> +@option{-foffload-memory=pinned} forces all host memory to be pinned (this
> +mode may require the user to increase the ulimit setting for locked memory).

So, this is currently implemented via 'mlockall', which, as discussed,
(a) has issues ('ulimit -l'), and (b) doesn't actually achieve what it
meant to achieve (because it doesn't register the page-locked memory with
the GPU driver).

So one idea was to re-purpose the unified shared memory
'gcc/omp-low.cc:pass_usm_transform' (compiler pass that "changes calls to
malloc/free/calloc/realloc and operator new to memory allocation
functions in libgomp with allocator=ompx_unified_shared_mem_alloc"),
<https://inbox.sourceware.org/gcc-patches/20220308113059.688551-5-abidh@codesourcery.com>.
(I have not yet looked into that in detail.)

Here's now a different idea.  As '-foffload-memory=pinned', per the name
of the option, concerns itself with memory used in offloading but not
host execution generally, why are we actually attempting to "[force] all
host memory to be pinned" -- why not just the memory that's being used
with offloading?  That is, if '-foffload-memory=pinned' is set, register
as page-locked with the GPU driver all memory that appears in OMP
offloading data regions, such as OpenMP 'target' 'map' clauses etc.  That
way, this is directed at the offloading data transfers, as itended, but
at the same time we don't "waste" page-locked memory for generic host
memory allocations.  What do you think -- you, who've spent a lot more
time on this topic than I have, so it's likely possible that I fail to
realize some "details"?


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory)
  2023-02-13 14:38   ` -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) Thomas Schwinge
@ 2023-02-13 15:20     ` Andrew Stubbs
  2023-04-03 14:56       ` [og12] '-foffload-memory=pinned' using offloading device interfaces (was: -foffload-memory=pinned) Thomas Schwinge
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Stubbs @ 2023-02-13 15:20 UTC (permalink / raw)
  To: Thomas Schwinge, Hafiz Abid Qadeer, Jakub Jelinek, Tobias Burnus
  Cc: gcc-patches

On 13/02/2023 14:38, Thomas Schwinge wrote:
> Hi!
> 
> On 2022-03-08T11:30:55+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote:
>> From: Andrew Stubbs <ams@codesourcery.com>
>>
>> Add a new option.  It will be used in follow-up patches.
> 
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
> 
>> +@option{-foffload-memory=pinned} forces all host memory to be pinned (this
>> +mode may require the user to increase the ulimit setting for locked memory).
> 
> So, this is currently implemented via 'mlockall', which, as discussed,
> (a) has issues ('ulimit -l'), and (b) doesn't actually achieve what it
> meant to achieve (because it doesn't register the page-locked memory with
> the GPU driver).
> 
> So one idea was to re-purpose the unified shared memory
> 'gcc/omp-low.cc:pass_usm_transform' (compiler pass that "changes calls to
> malloc/free/calloc/realloc and operator new to memory allocation
> functions in libgomp with allocator=ompx_unified_shared_mem_alloc"),
> <https://inbox.sourceware.org/gcc-patches/20220308113059.688551-5-abidh@codesourcery.com>> (I have not yet looked into that in detail.)
> 
> Here's now a different idea.  As '-foffload-memory=pinned', per the name
> of the option, concerns itself with memory used in offloading but not
> host execution generally, why are we actually attempting to "[force] all
> host memory to be pinned" -- why not just the memory that's being used
> with offloading?  That is, if '-foffload-memory=pinned' is set, register
> as page-locked with the GPU driver all memory that appears in OMP
> offloading data regions, such as OpenMP 'target' 'map' clauses etc.  That
> way, this is directed at the offloading data transfers, as itended, but
> at the same time we don't "waste" page-locked memory for generic host
> memory allocations.  What do you think -- you, who've spent a lot more
> time on this topic than I have, so it's likely possible that I fail to
> realize some "details"?

The main reason it is the way it is is because in general it's not 
possible to know what memory is going to be offloaded at the time it is 
allocated (and stack/static memory is never allocated that way).

If there's a way to pin it after the fact then maybe that's not a 
terrible idea? The downside is that the memory might already have been 
paged out at that point, and we'd have to track what we'd previously 
pinned, or else re-pin it every time we launch a kernel. We'd also have 
no way to unpin previously pinned memory (not that that's relevant to 
the "lock all" case).

My original plan was to use omp_alloc for both the standard OpenMP 
support and the -foffload-memory option (to get the benefit of pinning 
without modifying any source), but then I decided that the mlockall 
option was much less invasive. This is still the best way to implement 
target-independent pinning, when there's no driver registration option.

Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [og12] Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space' (was: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc)
  2023-02-10 15:31     ` Andrew Stubbs
@ 2023-02-16 21:24       ` Thomas Schwinge
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Schwinge @ 2023-02-16 21:24 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches; +Cc: Hafiz Abid Qadeer

[-- Attachment #1: Type: text/plain, Size: 3355 bytes --]

Hi!

On 2023-02-10T15:31:47+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
> On 10/02/2023 14:21, Thomas Schwinge wrote:
>> Is the correct fix the following [...]
>
> Yes, [...]

>>> --- a/libgomp/config/nvptx/allocator.c
>>> +++ b/libgomp/config/nvptx/allocator.c
>>> @@ -125,6 +125,8 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
>>>         __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE);
>>>         return result;
>>>       }
>>> +  else if (memspace == ompx_host_mem_space)
>>> +    return NULL;
>>>     else
>>>       return malloc (size);
>>>   }
>>> @@ -145,6 +147,8 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
>>>
>>>         return result;
>>>       }
>>> +  else if (memspace == ompx_host_mem_space)
>>> +    return NULL;
>>>     else
>>>       return calloc (1, size);
>>>   }
>>> @@ -354,6 +358,8 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
>>>        }
>>>         return result;
>>>       }
>>> +  else if (memspace == ompx_host_mem_space)
>>> +    return NULL;
>>>     else
>>>       return realloc (addr, size);
>>>   }
>>
>> (I'd have added an explicit no-op (or, 'abort'?) to
>> 'nvptx_memspace_free', but that's maybe just me...)  ;-\
>
> Why? The host memspace is just the regular heap, which can be a thing on
> any device. It's an extension though so we can define it either way.

My point was: for nvptx libgomp, all 'ompx_host_mem_space' allocator
functions (cited above) 'return NULL', and it's a cheap check to verify
that in 'nvptx_memspace_free'.

>>> --- a/libgomp/libgomp.h
>>> +++ b/libgomp/libgomp.h
>>
>>> +extern void * gomp_usm_alloc (size_t size, int device_num);
>>> +extern void gomp_usm_free (void *device_ptr, int device_num);
>>> +extern bool gomp_is_usm_ptr (void *ptr);
>>
>> 'gomp_is_usm_ptr' isn't defined/used anywhere; I'll remove it.
>
> I think I started that and then decided against. Thanks.

These three combined, I've pushed to devel/omp/gcc-12 branch
commit 23f52e49368d7b26a1b1a72d6bb903d31666e961
"Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space'",
see attached.


>>> --- a/libgomp/target.c
>>> +++ b/libgomp/target.c
>>
>>> @@ -3740,6 +3807,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
>>>     DLSYM (unload_image);
>>>     DLSYM (alloc);
>>>     DLSYM (free);
>>> +  DLSYM_OPT (usm_alloc, usm_alloc);
>>> +  DLSYM_OPT (usm_free, usm_free);
>>> +  DLSYM_OPT (is_usm_ptr, is_usm_ptr);
>>>     DLSYM (dev2host);
>>>     DLSYM (host2dev);
>>
>> As a sanity check, shouldn't we check that either none or all three of
>> those are defined, like in the 'if (cuda && cuda != 4) { [error] }' check
>> a bit further down?
>
> This is only going to happen when somebody writes a new plugin, and then
> they'll discover very quickly that there are issues. I've wasted more
> time writing this sentence than it's worth already. :)

Eh.  ;-) OK, outvoted.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Miscellaneous-clean-up-re-OpenMP-ompx_unified_shared.patch --]
[-- Type: text/x-diff, Size: 3153 bytes --]

From 23f52e49368d7b26a1b1a72d6bb903d31666e961 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 14 Feb 2023 17:10:57 +0100
Subject: [PATCH] Miscellaneous clean-up re OpenMP
 'ompx_unified_shared_mem_space', 'ompx_host_mem_space'

Clean-up for og12 commit 84914e197d91a67b3d27db0e4c69a433462983a5
"openmp, nvptx: ompx_unified_shared_mem_alloc".  No functional change.

	libgomp/
	* config/linux/allocator.c (linux_memspace_calloc): Elide
	(innocuous) duplicate 'if' condition.
	* config/nvptx/allocator.c (nvptx_memspace_free): Explicitly
	handle 'memspace == ompx_host_mem_space'.
	* libgomp.h (gomp_is_usm_ptr): Remove.
---
 libgomp/ChangeLog.omp            | 6 ++++++
 libgomp/config/linux/allocator.c | 3 +--
 libgomp/config/nvptx/allocator.c | 4 ++++
 libgomp/libgomp.h                | 1 -
 4 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index b667c72b8ca..1c4b1833c0b 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,11 @@
 2023-02-16  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* config/linux/allocator.c (linux_memspace_calloc): Elide
+	(innocuous) duplicate 'if' condition.
+	* config/nvptx/allocator.c (nvptx_memspace_free): Explicitly
+	handle 'memspace == ompx_host_mem_space'.
+	* libgomp.h (gomp_is_usm_ptr): Remove.
+
 	* basic-allocator.c (BASIC_ALLOC_YIELD): instead of '#deine',
 	'#define' it.
 
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index 07af3a2821a..8a9171c36df 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -95,8 +95,7 @@ linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
       memset (ret, 0, size);
       return ret;
     }
-  else if (memspace == ompx_unified_shared_mem_space
-      || pin)
+  else if (pin)
     return linux_memspace_alloc (memspace, size, pin);
   else
     return calloc (1, size);
diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c
index 7c2a7463bf7..cbf86b8a2ec 100644
--- a/libgomp/config/nvptx/allocator.c
+++ b/libgomp/config/nvptx/allocator.c
@@ -42,6 +42,7 @@
    chunks.  */
 
 #include "libgomp.h"
+#include <assert.h>
 #include <stdlib.h>
 
 #define BASIC_ALLOC_PREFIX __nvptx_lowlat
@@ -93,6 +94,9 @@ nvptx_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size)
 
       __nvptx_lowlat_free (shared_pool, addr, size);
     }
+  else if (memspace == ompx_host_mem_space)
+    /* Just verify what all allocator functions return.  */
+    assert (addr == NULL);
   else
     free (addr);
 }
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index d1e45cc584e..c001b468252 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1133,7 +1133,6 @@ extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
 			     void *);
 extern void * gomp_usm_alloc (size_t size, int device_num);
 extern void gomp_usm_free (void *device_ptr, int device_num);
-extern bool gomp_is_usm_ptr (void *ptr);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [og12] '-foffload-memory=pinned' using offloading device interfaces (was: -foffload-memory=pinned)
  2023-02-13 15:20     ` Andrew Stubbs
@ 2023-04-03 14:56       ` Thomas Schwinge
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Schwinge @ 2023-04-03 14:56 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches
  Cc: Hafiz Abid Qadeer, Jakub Jelinek, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 2727 bytes --]

Hi!

On 2023-02-13T15:20:07+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
> On 13/02/2023 14:38, Thomas Schwinge wrote:
>> On 2022-03-08T11:30:55+0000, Hafiz Abid Qadeer <abidh@codesourcery.com> wrote:
>>> From: Andrew Stubbs <ams@codesourcery.com>
>>>
>>> Add a new option.  It will be used in follow-up patches.
>>
>>> --- a/gcc/doc/invoke.texi
>>> +++ b/gcc/doc/invoke.texi
>>
>>> +@option{-foffload-memory=pinned} forces all host memory to be pinned (this
>>> +mode may require the user to increase the ulimit setting for locked memory).
>>
>> So, this is currently implemented via 'mlockall', which, as discussed,
>> (a) has issues ('ulimit -l'), and (b) doesn't actually achieve what it
>> meant to achieve (because it doesn't register the page-locked memory with
>> the GPU driver).
>> [...]
>> As '-foffload-memory=pinned', per the name
>> of the option, concerns itself with memory used in offloading but not
>> host execution generally, why are we actually attempting to "[force] all
>> host memory to be pinned" -- why not just the memory that's being used
>> with offloading?  That is, if '-foffload-memory=pinned' is set, register
>> as page-locked with the GPU driver all memory that appears in OMP
>> offloading data regions, such as OpenMP 'target' 'map' clauses etc.  That
>> way, this is directed at the offloading data transfers, as itended, but
>> at the same time we don't "waste" page-locked memory for generic host
>> memory allocations.  What do you think -- you, who've spent a lot more
>> time on this topic than I have, so it's likely possible that I fail to
>> realize some "details"?
>
> The main reason it is the way it is is because in general it's not
> possible to know what memory is going to be offloaded at the time it is
> allocated (and stack/static memory is never allocated that way).
>
> If there's a way to pin it after the fact then maybe that's not a
> terrible idea?  [...]

I've now pushed to devel/omp/gcc-12 branch my take on this in
commit 43095690ea519205bf56fc148b346edaa43e0f0f
"'-foffload-memory=pinned' using offloading device interfaces", and for
changes related to og12 commit 15d0f61a7fecdc8fd12857c40879ea3730f6d99f
"Merge non-contiguous array support patches":
commit 694bbd399c1323975b4a6735646e46c6914de63d
"'-foffload-memory=pinned' using offloading device interfaces for non-contiguous array support",
see attached.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-foffload-memory-pinned-using-offloading-device-inter.patch --]
[-- Type: text/x-diff, Size: 77424 bytes --]

From 43095690ea519205bf56fc148b346edaa43e0f0f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 30 Mar 2023 10:08:12 +0200
Subject: [PATCH 1/2] '-foffload-memory=pinned' using offloading device
 interfaces

Implemented for nvptx offloading via 'cuMemHostAlloc', 'cuMemHostRegister'.

	gcc/
	* doc/invoke.texi (-foffload-memory=pinned): Document.
	include/
	* cuda/cuda.h (CUresult): Add
	'CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED'.
	(CUdevice_attribute): Add
	'CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED'.
	(CU_MEMHOSTREGISTER_READ_ONLY): Add.
	(cuMemHostGetFlags, cuMemHostRegister, cuMemHostUnregister): Add.
	libgomp/
	* libgomp-plugin.h (GOMP_OFFLOAD_page_locked_host_free): Add
	'struct goacc_asyncqueue *' formal parameter.
	(GOMP_OFFLOAD_page_locked_host_register)
	(GOMP_OFFLOAD_page_locked_host_unregister)
	(GOMP_OFFLOAD_page_locked_host_p): Add.
	* libgomp.h (always_pinned_mode)
	(gomp_page_locked_host_register_dev)
	(gomp_page_locked_host_unregister_dev): Add.
	(struct splay_tree_key_s): Add 'page_locked_host_p'.
	(struct gomp_device_descr): Add
	'GOMP_OFFLOAD_page_locked_host_register',
	'GOMP_OFFLOAD_page_locked_host_unregister',
	'GOMP_OFFLOAD_page_locked_host_p'.
	* libgomp.texi (-foffload-memory=pinned): Document.
	* plugin/cuda-lib.def (cuMemHostGetFlags, cuMemHostRegister_v2)
	(cuMemHostRegister, cuMemHostUnregister): Add.
	* plugin/plugin-nvptx.c (struct ptx_device): Add
	'read_only_host_register_supported'.
	(nvptx_open_device): Initialize it.
	(free_host_blocks, free_host_blocks_lock)
	(nvptx_run_deferred_page_locked_host_free)
	(nvptx_page_locked_host_free_callback, nvptx_page_locked_host_p)
	(GOMP_OFFLOAD_page_locked_host_register)
	(nvptx_page_locked_host_unregister_callback)
	(GOMP_OFFLOAD_page_locked_host_unregister)
	(GOMP_OFFLOAD_page_locked_host_p)
	(nvptx_run_deferred_page_locked_host_unregister)
	(nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback):
	Add.
	(GOMP_OFFLOAD_fini_device, GOMP_OFFLOAD_page_locked_host_alloc)
	(GOMP_OFFLOAD_run): Call
	'nvptx_run_deferred_page_locked_host_free'.
	(struct goacc_asyncqueue): Add
	'page_locked_host_unregister_blocks_lock',
	'page_locked_host_unregister_blocks'.
	(nvptx_goacc_asyncqueue_construct)
	(nvptx_goacc_asyncqueue_destruct): Handle those.
	(GOMP_OFFLOAD_page_locked_host_free): Handle
	'struct goacc_asyncqueue *' formal parameter.
	(GOMP_OFFLOAD_openacc_async_test)
	(nvptx_goacc_asyncqueue_synchronize): Call
	'nvptx_run_deferred_page_locked_host_unregister'.
	(GOMP_OFFLOAD_openacc_async_serialize): Call
	'nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback'.
	* config/linux/allocator.c (linux_memspace_alloc)
	(linux_memspace_calloc, linux_memspace_free)
	(linux_memspace_realloc): Remove 'always_pinned_mode' handling.
	(GOMP_enable_pinned_mode): Move...
	* target.c: ... here.
	(always_pinned_mode, verify_always_pinned_mode)
	(gomp_verify_always_pinned_mode, gomp_page_locked_host_alloc_dev)
	(gomp_page_locked_host_free_dev)
	(gomp_page_locked_host_aligned_alloc_dev)
	(gomp_page_locked_host_aligned_free_dev)
	(gomp_page_locked_host_register_dev)
	(gomp_page_locked_host_unregister_dev): Add.
	(gomp_copy_host2dev, gomp_map_vars_internal)
	(gomp_remove_var_internal, gomp_unmap_vars_internal)
	(get_gomp_offload_icvs, gomp_load_image_to_device)
	(gomp_target_rev, omp_target_memcpy_copy)
	(omp_target_memcpy_rect_worker): Handle 'always_pinned_mode'.
	(gomp_copy_host2dev, gomp_copy_dev2host): Handle
	'verify_always_pinned_mode'.
	(GOMP_target_ext): Add 'assert'.
	(gomp_page_locked_host_alloc): Use
	'gomp_page_locked_host_alloc_dev'.
	(gomp_page_locked_host_free): Use
	'gomp_page_locked_host_free_dev'.
	(omp_target_associate_ptr): Adjust.
	(gomp_load_plugin_for_device): Handle 'page_locked_host_register',
	'page_locked_host_unregister', 'page_locked_host_p'.
	* oacc-mem.c (memcpy_tofrom_device): Handle 'always_pinned_mode'.
	* libgomp_g.h (GOMP_enable_pinned_mode): Adjust.
	* testsuite/libgomp.c/alloc-pinned-7.c: Remove.
---
 gcc/ChangeLog.omp                            |   4 +
 gcc/doc/invoke.texi                          |  19 +-
 include/ChangeLog.omp                        |   9 +
 include/cuda/cuda.h                          |  11 +-
 libgomp/ChangeLog.omp                        |  75 ++
 libgomp/config/linux/allocator.c             |  26 -
 libgomp/libgomp-plugin.h                     |   7 +-
 libgomp/libgomp.h                            |  15 +
 libgomp/libgomp.texi                         |  35 +
 libgomp/libgomp_g.h                          |   2 +-
 libgomp/oacc-mem.c                           |  16 +
 libgomp/plugin/cuda-lib.def                  |   4 +
 libgomp/plugin/plugin-nvptx.c                | 435 ++++++++++-
 libgomp/target.c                             | 771 +++++++++++++++++--
 libgomp/testsuite/libgomp.c/alloc-pinned-7.c |  63 --
 15 files changed, 1339 insertions(+), 153 deletions(-)
 delete mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c

diff --git a/gcc/ChangeLog.omp b/gcc/ChangeLog.omp
index 5e76158db06..d8aa0ab51bf 100644
--- a/gcc/ChangeLog.omp
+++ b/gcc/ChangeLog.omp
@@ -1,3 +1,7 @@
+2023-04-03  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* doc/invoke.texi (-foffload-memory=pinned): Document.
+
 2023-03-31  Frederik Harwath  <frederik@codesourcery.com>
 
 	* omp-transform-loops.cc (walk_omp_for_loops): Handle
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 1fe047042ae..070b63030f8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -2711,13 +2711,28 @@ Typical command lines are
 @itemx -foffload-memory=unified
 @itemx -foffload-memory=pinned
 @opindex foffload-memory
+@cindex Offloading memory modes
 @cindex OpenMP offloading memory modes
+
 Enable a memory optimization mode to use with OpenMP.  The default behavior,
 @option{-foffload-memory=none}, is to do nothing special (unless enabled via
 a requires directive in the code).  @option{-foffload-memory=unified} is
 equivalent to @code{#pragma omp requires unified_shared_memory}.
-@option{-foffload-memory=pinned} forces all host memory to be pinned (this
-mode may require the user to increase the ulimit setting for locked memory).
+
+@c The following paragraph is duplicated in
+@c '../../libgomp/libgomp.texi', '-foffload-memory=pinned'.
+If supported by the active offloading device,
+@option{-foffload-memory=pinned} enables automatic use of page-locked
+host memory for memory objects participating in host <-> device memory
+transfers, for both OpenACC and OpenMP offloading.
+Such memory is allocated or registered using the respective offloading
+device interfaces, which potentially helps optimization of host <->
+device data transfers.
+This option is experimental.
+Beware that use of a lot of pinned memory may degrade overall system
+performance, as it does reduce the amount of host memory available for
+paging.
+
 All translation units must select the same setting to avoid undefined
 behavior.
 
diff --git a/include/ChangeLog.omp b/include/ChangeLog.omp
index 244d67e6608..655377a6d0d 100644
--- a/include/ChangeLog.omp
+++ b/include/ChangeLog.omp
@@ -1,3 +1,12 @@
+2023-04-03  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* cuda/cuda.h (CUresult): Add
+	'CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED'.
+	(CUdevice_attribute): Add
+	'CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED'.
+	(CU_MEMHOSTREGISTER_READ_ONLY): Add.
+	(cuMemHostGetFlags, cuMemHostRegister, cuMemHostUnregister): Add.
+
 2023-02-20  Thomas Schwinge  <thomas@codesourcery.com>
 
 	* cuda/cuda.h (cuMemHostRegister, cuMemHostUnregister): Remove.
diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 062d394b95f..f8f464607db 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -57,6 +57,7 @@ typedef enum {
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_NOT_FOUND = 500,
   CUDA_ERROR_NOT_READY = 600,
+  CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED = 712,
   CUDA_ERROR_LAUNCH_FAILED = 719,
   CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720,
   CUDA_ERROR_NOT_PERMITTED = 800,
@@ -80,7 +81,8 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
   CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,
   CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76,
-  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+  CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED = 113
 } CUdevice_attribute;
 
 typedef enum {
@@ -124,8 +126,11 @@ enum {
 #define CU_LAUNCH_PARAM_END ((void *) 0)
 #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
 #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+
 #define CU_MEMHOSTALLOC_DEVICEMAP 0x02U
 
+#define CU_MEMHOSTREGISTER_READ_ONLY 0x08
+
 enum {
   CU_STREAM_DEFAULT = 0,
   CU_STREAM_NON_BLOCKING = 1
@@ -183,6 +188,10 @@ CUresult cuMemAlloc (CUdeviceptr *, size_t);
 CUresult cuMemAllocHost (void **, size_t);
 CUresult cuMemAllocManaged(CUdeviceptr *, size_t, unsigned int);
 CUresult cuMemHostAlloc (void **, size_t, unsigned int);
+CUresult cuMemHostGetFlags (unsigned int *, void *);
+#define cuMemHostRegister cuMemHostRegister_v2
+CUresult cuMemHostRegister(void *, size_t, unsigned int);
+CUresult cuMemHostUnregister(void *);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 7afb5f43c04..1b02c057562 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,80 @@
 2023-04-03  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* libgomp-plugin.h (GOMP_OFFLOAD_page_locked_host_free): Add
+	'struct goacc_asyncqueue *' formal parameter.
+	(GOMP_OFFLOAD_page_locked_host_register)
+	(GOMP_OFFLOAD_page_locked_host_unregister)
+	(GOMP_OFFLOAD_page_locked_host_p): Add.
+	* libgomp.h (always_pinned_mode)
+	(gomp_page_locked_host_register_dev)
+	(gomp_page_locked_host_unregister_dev): Add.
+	(struct splay_tree_key_s): Add 'page_locked_host_p'.
+	(struct gomp_device_descr): Add
+	'GOMP_OFFLOAD_page_locked_host_register',
+	'GOMP_OFFLOAD_page_locked_host_unregister',
+	'GOMP_OFFLOAD_page_locked_host_p'.
+	* libgomp.texi (-foffload-memory=pinned): Document.
+	* plugin/cuda-lib.def (cuMemHostGetFlags, cuMemHostRegister_v2)
+	(cuMemHostRegister, cuMemHostUnregister): Add.
+	* plugin/plugin-nvptx.c (struct ptx_device): Add
+	'read_only_host_register_supported'.
+	(nvptx_open_device): Initialize it.
+	(free_host_blocks, free_host_blocks_lock)
+	(nvptx_run_deferred_page_locked_host_free)
+	(nvptx_page_locked_host_free_callback, nvptx_page_locked_host_p)
+	(GOMP_OFFLOAD_page_locked_host_register)
+	(nvptx_page_locked_host_unregister_callback)
+	(GOMP_OFFLOAD_page_locked_host_unregister)
+	(GOMP_OFFLOAD_page_locked_host_p)
+	(nvptx_run_deferred_page_locked_host_unregister)
+	(nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback):
+	Add.
+	(GOMP_OFFLOAD_fini_device, GOMP_OFFLOAD_page_locked_host_alloc)
+	(GOMP_OFFLOAD_run): Call
+	'nvptx_run_deferred_page_locked_host_free'.
+	(struct goacc_asyncqueue): Add
+	'page_locked_host_unregister_blocks_lock',
+	'page_locked_host_unregister_blocks'.
+	(nvptx_goacc_asyncqueue_construct)
+	(nvptx_goacc_asyncqueue_destruct): Handle those.
+	(GOMP_OFFLOAD_page_locked_host_free): Handle
+	'struct goacc_asyncqueue *' formal parameter.
+	(GOMP_OFFLOAD_openacc_async_test)
+	(nvptx_goacc_asyncqueue_synchronize): Call
+	'nvptx_run_deferred_page_locked_host_unregister'.
+	(GOMP_OFFLOAD_openacc_async_serialize): Call
+	'nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback'.
+	* config/linux/allocator.c (linux_memspace_alloc)
+	(linux_memspace_calloc, linux_memspace_free)
+	(linux_memspace_realloc): Remove 'always_pinned_mode' handling.
+	(GOMP_enable_pinned_mode): Move...
+	* target.c: ... here.
+	(always_pinned_mode, verify_always_pinned_mode)
+	(gomp_verify_always_pinned_mode, gomp_page_locked_host_alloc_dev)
+	(gomp_page_locked_host_free_dev)
+	(gomp_page_locked_host_aligned_alloc_dev)
+	(gomp_page_locked_host_aligned_free_dev)
+	(gomp_page_locked_host_register_dev)
+	(gomp_page_locked_host_unregister_dev): Add.
+	(gomp_copy_host2dev, gomp_map_vars_internal)
+	(gomp_remove_var_internal, gomp_unmap_vars_internal)
+	(get_gomp_offload_icvs, gomp_load_image_to_device)
+	(gomp_target_rev, omp_target_memcpy_copy)
+	(omp_target_memcpy_rect_worker): Handle 'always_pinned_mode'.
+	(gomp_copy_host2dev, gomp_copy_dev2host): Handle
+	'verify_always_pinned_mode'.
+	(GOMP_target_ext): Add 'assert'.
+	(gomp_page_locked_host_alloc): Use
+	'gomp_page_locked_host_alloc_dev'.
+	(gomp_page_locked_host_free): Use
+	'gomp_page_locked_host_free_dev'.
+	(omp_target_associate_ptr): Adjust.
+	(gomp_load_plugin_for_device): Handle 'page_locked_host_register',
+	'page_locked_host_unregister', 'page_locked_host_p'.
+	* oacc-mem.c (memcpy_tofrom_device): Handle 'always_pinned_mode'.
+	* libgomp_g.h (GOMP_enable_pinned_mode): Adjust.
+	* testsuite/libgomp.c/alloc-pinned-7.c: Remove.
+
 	PR other/76739
 	* target.c (gomp_map_vars_internal): Pass pre-allocated 'ptrblock'
 	to 'goacc_noncontig_array_create_ptrblock'.
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index 3e1bd5a1285..62649f64221 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -45,20 +45,6 @@
 #include <assert.h>
 #include "libgomp.h"
 
-static bool always_pinned_mode = false;
-
-/* This function is called by the compiler when -foffload-memory=pinned
-   is used.  */
-
-void
-GOMP_enable_pinned_mode ()
-{
-  if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0)
-    gomp_error ("failed to pin all memory (ulimit too low?)");
-  else
-    always_pinned_mode = true;
-}
-
 static int using_device_for_page_locked
   = /* uninitialized */ -1;
 
@@ -70,9 +56,6 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin,
 	      __FUNCTION__, (unsigned long long) memspace,
 	      (unsigned long long) size, pin, init0);
 
-  /* Explicit pinning may not be required.  */
-  pin = pin && !always_pinned_mode;
-
   void *addr;
 
   if (memspace == ompx_unified_shared_mem_space)
@@ -137,9 +120,6 @@ linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
   gomp_debug (0, "%s: memspace=%llu, size=%llu, pin=%d\n",
 	      __FUNCTION__, (unsigned long long) memspace, (unsigned long long) size, pin);
 
-  /* Explicit pinning may not be required.  */
-  pin = pin && !always_pinned_mode;
-
   if (memspace == ompx_unified_shared_mem_space)
     {
       void *ret = gomp_usm_alloc (size, GOMP_DEVICE_ICV);
@@ -159,9 +139,6 @@ linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
   gomp_debug (0, "%s: memspace=%llu, addr=%p, size=%llu, pin=%d\n",
 	      __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) size, pin);
 
-  /* Explicit pinning may not be required.  */
-  pin = pin && !always_pinned_mode;
-
   if (memspace == ompx_unified_shared_mem_space)
     gomp_usm_free (addr, GOMP_DEVICE_ICV);
   else if (pin)
@@ -188,9 +165,6 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
   gomp_debug (0, "%s: memspace=%llu, addr=%p, oldsize=%llu, size=%llu, oldpin=%d, pin=%d\n",
 	      __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) oldsize, (unsigned long long) size, oldpin, pin);
 
-  /* Explicit pinning may not be required.  */
-  pin = pin && !always_pinned_mode;
-
   if (memspace == ompx_unified_shared_mem_space)
     goto manual_realloc;
   else if (oldpin && pin)
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index ca557a79380..7456b7d1026 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -141,7 +141,12 @@ extern void *GOMP_OFFLOAD_usm_alloc (int, size_t);
 extern bool GOMP_OFFLOAD_usm_free (int, void *);
 extern bool GOMP_OFFLOAD_is_usm_ptr (void *);
 extern bool GOMP_OFFLOAD_page_locked_host_alloc (void **, size_t);
-extern bool GOMP_OFFLOAD_page_locked_host_free (void *);
+extern bool GOMP_OFFLOAD_page_locked_host_free (void *,
+						struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_page_locked_host_register (int, void *, size_t, int);
+extern bool GOMP_OFFLOAD_page_locked_host_unregister (void *, size_t,
+						      struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_page_locked_host_p (int, const void *, size_t);
 extern bool GOMP_OFFLOAD_dev2host (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_host2dev (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3b2b4aa9534..b7ac9d3da5b 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1123,6 +1123,8 @@ extern int gomp_pause_host (void);
 
 /* target.c */
 
+extern bool always_pinned_mode;
+
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
@@ -1130,6 +1132,11 @@ extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
 			     int, struct goacc_asyncqueue *);
 extern void * gomp_usm_alloc (size_t size, int device_num);
 extern void gomp_usm_free (void *device_ptr, int device_num);
+extern int gomp_page_locked_host_register_dev (struct gomp_device_descr *,
+					       void *, size_t, int);
+extern bool gomp_page_locked_host_unregister_dev (struct gomp_device_descr *,
+						  void *, size_t,
+						  struct goacc_asyncqueue *);
 extern bool gomp_page_locked_host_alloc (void **, size_t);
 extern void gomp_page_locked_host_free (void *);
 
@@ -1232,6 +1239,9 @@ struct splay_tree_key_s {
     uintptr_t *structelem_refcount_ptr;
   };
   struct splay_tree_aux *aux;
+  /* Whether we have registered page-locked host memory for
+     '[host_start, host_end)'.  */
+  bool page_locked_host_p;
 };
 
 /* The comparison function.  */
@@ -1393,6 +1403,11 @@ struct gomp_device_descr
   __typeof (GOMP_OFFLOAD_is_usm_ptr) *is_usm_ptr_func;
   __typeof (GOMP_OFFLOAD_page_locked_host_alloc) *page_locked_host_alloc_func;
   __typeof (GOMP_OFFLOAD_page_locked_host_free) *page_locked_host_free_func;
+  __typeof (GOMP_OFFLOAD_page_locked_host_register)
+       *page_locked_host_register_func;
+  __typeof (GOMP_OFFLOAD_page_locked_host_unregister)
+       *page_locked_host_unregister_func;
+  __typeof (GOMP_OFFLOAD_page_locked_host_p) *page_locked_host_p_func;
   __typeof (GOMP_OFFLOAD_dev2host) *dev2host_func;
   __typeof (GOMP_OFFLOAD_host2dev) *host2dev_func;
   __typeof (GOMP_OFFLOAD_dev2dev) *dev2dev_func;
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 6355ce2a37b..df52fd3039c 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -4402,10 +4402,41 @@ creating memory allocators requesting
 The following sections present notes on the offload-target specifics
 
 @menu
+* @option{-foffload-memory=pinned}::
 * AMD Radeon::
 * nvptx::
 @end menu
 
+@node @option{-foffload-memory=pinned}
+@section @option{-foffload-memory=pinned}
+
+@c The following paragraph is duplicated from
+@c '../gcc/doc/invoke.texi', '-foffload-memory=pinned'.
+If supported by the active offloading device,
+@option{-foffload-memory=pinned} enables automatic use of page-locked
+host memory for memory objects participating in host <-> device memory
+transfers, for both OpenACC and OpenMP offloading.
+Such memory is allocated or registered using the respective offloading
+device interfaces, which potentially helps optimization of host <->
+device data transfers.
+This option is experimental.
+Beware that use of a lot of pinned memory may degrade overall system
+performance, as it does reduce the amount of host memory available for
+paging.
+
+An OpenACC @emph{async} @code{enter data}-like operation may register
+a memory object as pinned.  After the corresponding @emph{async}
+@code{exit data}-like operation, this registration does last until
+next synchronization point (such as @code{acc_async_synchronize}).
+During this time, the user code must not "touch" the host-side memory
+allocation -- but that does correspond to the @emph{async} semantics
+anyway.
+
+We don't consider @code{-foffload-memory=pinned} for one-time internal
+data transfers, such as setup during device initialization.
+
+
+
 @node AMD Radeon
 @section AMD Radeon (GCN)
 
@@ -4459,6 +4490,8 @@ The implementation remark:
 @item OpenMP @emph{pinned} memory (@code{omp_atk_pinned},
       @code{ompx_pinned_mem_alloc}, for example)
       is allocated via @code{mmap}, @code{mlock}.
+@item @option{-foffload-memory=pinned} is not supported,
+      @pxref{@option{-foffload-memory=pinned}}.
 @end itemize
 
 
@@ -4526,6 +4559,8 @@ The implementation remark:
       is allocated via @code{cuMemHostAlloc} (CUDA Driver API).
       This potentially helps optimization of host <-> device data
       transfers.
+@item @option{-foffload-memory=pinned} is supported,
+      @pxref{@option{-foffload-memory=pinned}}.
 @end itemize
 
 
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index fe66a53d94a..2a515ce7348 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -365,6 +365,7 @@ extern bool GOMP_teams4 (unsigned int, unsigned int, unsigned int, bool);
 
 extern bool GOMP_evaluate_target_device (int, const char *, const char *,
 					 const char *);
+extern void GOMP_enable_pinned_mode (void);
 
 /* teams.c */
 
@@ -375,7 +376,6 @@ extern void GOMP_teams_reg (void (*) (void *), void *, unsigned, unsigned,
 
 extern void *GOMP_alloc (size_t, size_t, uintptr_t);
 extern void GOMP_free (void *, uintptr_t);
-extern void GOMP_enable_pinned_mode (void);
 
 /* error.c */
 
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index bd82beefcdb..75ec8958501 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -199,11 +199,27 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
     }
 
   goacc_aq aq = get_goacc_asyncqueue (async);
+
+  int h_page_locked_host_p = 0;
+
+  if (always_pinned_mode
+      && s != 0)
+    {
+      h_page_locked_host_p = gomp_page_locked_host_register_dev
+	(thr->dev, h, s, from ? GOMP_MAP_FROM : GOMP_MAP_TO);
+      if (h_page_locked_host_p < 0)
+	exit (EXIT_FAILURE);
+    }
+
   if (from)
     gomp_copy_dev2host (thr->dev, aq, h, d, s);
   else
     gomp_copy_host2dev (thr->dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL);
 
+  if (h_page_locked_host_p
+      && !gomp_page_locked_host_unregister_dev (thr->dev, h, s, aq))
+    exit (EXIT_FAILURE);
+
   if (profiling_p)
     {
       thr->prof_info = NULL;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index 9b786c9f2f6..062a141053f 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -31,6 +31,10 @@ CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
 CUDA_ONE_CALL (cuMemAllocManaged)
 CUDA_ONE_CALL (cuMemHostAlloc)
+CUDA_ONE_CALL (cuMemHostGetFlags)
+CUDA_ONE_CALL_MAYBE_NULL (cuMemHostRegister_v2)
+CUDA_ONE_CALL (cuMemHostRegister)
+CUDA_ONE_CALL (cuMemHostUnregister)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 23f89b6fb34..e57a2b30e66 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -78,11 +78,14 @@ extern CUresult cuGetErrorString (CUresult, const char **);
 CUresult cuLinkAddData (CUlinkState, CUjitInputType, void *, size_t,
 			const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate (unsigned, CUjit_option *, void **, CUlinkState *);
+#undef cuMemHostRegister
+CUresult cuMemHostRegister (void *, size_t, unsigned int);
 #else
 typedef size_t (*CUoccupancyB2DSize)(int);
 CUresult cuLinkAddData_v2 (CUlinkState, CUjitInputType, void *, size_t,
 			   const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate_v2 (unsigned, CUjit_option *, void **, CUlinkState *);
+CUresult cuMemHostRegister_v2 (void *, size_t, unsigned int);
 CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
 					  CUoccupancyB2DSize, size_t, int);
 #endif
@@ -218,6 +221,8 @@ static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 struct goacc_asyncqueue
 {
   CUstream cuda_stream;
+  pthread_mutex_t page_locked_host_unregister_blocks_lock;
+  struct ptx_free_block *page_locked_host_unregister_blocks;
 };
 
 struct nvptx_callback
@@ -314,6 +319,7 @@ struct ptx_device
   int warp_size;
   int max_threads_per_block;
   int max_threads_per_multiprocessor;
+  bool read_only_host_register_supported;
   int default_dims[GOMP_DIM_MAX];
   int compute_major, compute_minor;
 
@@ -340,6 +346,33 @@ struct ptx_device
 
 static struct ptx_device **ptx_devices;
 
+static struct ptx_free_block *free_host_blocks = NULL;
+static pthread_mutex_t free_host_blocks_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static bool
+nvptx_run_deferred_page_locked_host_free (void)
+{
+  GOMP_PLUGIN_debug (0, "%s\n",
+		     __FUNCTION__);
+
+  pthread_mutex_lock (&free_host_blocks_lock);
+  struct ptx_free_block *b = free_host_blocks;
+  free_host_blocks = NULL;
+  pthread_mutex_unlock (&free_host_blocks_lock);
+
+  while (b)
+    {
+      GOMP_PLUGIN_debug (0, "  b=%p: cuMemFreeHost(b->ptr=%p)\n",
+			 b, b->ptr);
+
+      struct ptx_free_block *b_next = b->next;
+      CUDA_CALL (cuMemFreeHost, b->ptr);
+      free (b);
+      b = b_next;
+    }
+  return true;
+}
+
 /* OpenMP kernels reserve a small amount of ".shared" space for use by
    omp_alloc.  The size is configured using GOMP_NVPTX_LOWLAT_POOL, but the
    default is set here.  */
@@ -542,6 +575,19 @@ nvptx_open_device (int n)
 			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
   assert (r == CUDA_SUCCESS && pi);
 
+  /* This is a CUDA 11.1 feature.  */
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED,
+			 dev);
+  if (r == CUDA_ERROR_INVALID_VALUE)
+    pi = false;
+  else if (r != CUDA_SUCCESS)
+    {
+      GOMP_PLUGIN_error ("cuDeviceGetAttribute error: %s", cuda_error (r));
+      return NULL;
+    }
+  ptx_dev->read_only_host_register_supported = pi;
+
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
@@ -1278,6 +1324,11 @@ GOMP_OFFLOAD_init_device (int n)
 bool
 GOMP_OFFLOAD_fini_device (int n)
 {
+  /* This isn't related to this specific 'ptx_devices[n]', but is a convenient
+     place to clean up.  */
+  if (!nvptx_run_deferred_page_locked_host_free ())
+    return false;
+
   pthread_mutex_lock (&ptx_dev_lock);
 
   if (ptx_devices[n] != NULL)
@@ -1711,6 +1762,12 @@ GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size)
   GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu\n",
 		     __FUNCTION__, ptr, (unsigned long long) size);
 
+  /* TODO: Maybe running the deferred 'cuMemFreeHost's here is not the best
+     idea, given that we don't know what context we're called from?  (See
+     'GOMP_OFFLOAD_run' reverse offload handling.)  But, where to do it?  */
+  if (!nvptx_run_deferred_page_locked_host_free ())
+    return false;
+
   CUresult r;
 
   unsigned int flags = 0;
@@ -1729,16 +1786,243 @@ GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size)
   return true;
 }
 
+static void
+nvptx_page_locked_host_free_callback (CUstream stream, CUresult r, void *ptr)
+{
+  GOMP_PLUGIN_debug (0, "%s: stream=%p, r=%u, ptr=%p\n",
+		     __FUNCTION__, stream, (unsigned) r, ptr);
+
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (r));
+
+  /* We can't now call 'cuMemFreeHost': we're in a CUDA stream context,
+     where we "must not make any CUDA API calls".
+     And, in particular in an OpenMP 'target' reverse offload context,
+     this may even dead-lock?!  */
+  /* See 'nvptx_free'.  */
+  struct ptx_free_block *n
+    = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
+  GOMP_PLUGIN_debug (0, "  defer; n=%p\n", n);
+  n->ptr = ptr;
+  pthread_mutex_lock (&free_host_blocks_lock);
+  n->next = free_host_blocks;
+  free_host_blocks = n;
+  pthread_mutex_unlock (&free_host_blocks_lock);
+}
+
+bool
+GOMP_OFFLOAD_page_locked_host_free (void *ptr, struct goacc_asyncqueue *aq)
+{
+  GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, aq=%p\n",
+		     __FUNCTION__, ptr, aq);
+
+  if (aq)
+    {
+      GOMP_PLUGIN_debug (0, "  aq <-"
+			 " nvptx_page_locked_host_free_callback(ptr)\n");
+      CUDA_CALL (cuStreamAddCallback, aq->cuda_stream,
+		 nvptx_page_locked_host_free_callback, ptr, 0);
+    }
+  else
+    CUDA_CALL (cuMemFreeHost, ptr);
+  return true;
+}
+
+static int
+nvptx_page_locked_host_p (const void *ptr, size_t size)
+{
+  GOMP_PLUGIN_debug (0, "%s: ptr=%p, size=%llu\n",
+		     __FUNCTION__, ptr, (unsigned long long) size);
+
+  int ret;
+
+  CUresult r;
+
+  /* Apparently, there exists no CUDA call to query 'PTR + [0, SIZE)'.  Instead
+     of invoking 'cuMemHostGetFlags' SIZE times, we deem it sufficient to only
+     query the base PTR.  */
+  unsigned int flags;
+  void *ptr_noconst = (void *) ptr;
+  r = CUDA_CALL_NOCHECK (cuMemHostGetFlags, &flags, ptr_noconst);
+  (void) flags;
+  if (r == CUDA_SUCCESS)
+    ret = 1;
+  else if (r == CUDA_ERROR_INVALID_VALUE)
+    ret = 0;
+  else
+    {
+      GOMP_PLUGIN_error ("cuMemHostGetFlags error: %s", cuda_error (r));
+      ret = -1;
+    }
+  GOMP_PLUGIN_debug (0, "  -> %d (with r = %u)\n",
+		     ret, (unsigned) r);
+  return ret;
+}
+
+int
+GOMP_OFFLOAD_page_locked_host_register (int ord,
+					void *ptr, size_t size, int kind)
+{
+  bool try_read_only;
+  /* Magic number: if the actualy mapping kind is unknown...  */
+  if (kind == -1)
+    /* ..., allow for trying read-only registration here.  */
+    try_read_only = true;
+  else
+    try_read_only = !GOMP_MAP_COPY_FROM_P (kind);
+  GOMP_PLUGIN_debug (0, "nvptx %s: ord=%d, ptr=%p, size=%llu,"
+		     " kind=%d (try_read_only=%d)\n",
+		     __FUNCTION__, ord, ptr, (unsigned long long) size,
+		     kind, try_read_only);
+  assert (size != 0);
+
+  if (!nvptx_attach_host_thread_to_device (ord))
+    return -1;
+  struct ptx_device *ptx_dev = ptx_devices[ord];
+
+  int ret = -1;
+
+  CUresult r;
+
+  unsigned int flags = 0;
+  /* Given 'CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING', we don't need
+     'flags |= CU_MEMHOSTREGISTER_PORTABLE;' here.  */
+ cuMemHostRegister:
+  if (CUDA_CALL_EXISTS (cuMemHostRegister_v2))
+    r = CUDA_CALL_NOCHECK (cuMemHostRegister_v2, ptr, size, flags);
+  else
+    r = CUDA_CALL_NOCHECK (cuMemHostRegister, ptr, size, flags);
+  if (r == CUDA_SUCCESS)
+    ret = 1;
+  else if (r == CUDA_ERROR_INVALID_VALUE)
+    {
+      /* For example, for 'cuMemHostAlloc' (via the user code, for example)
+	 followed by 'cuMemHostRegister' (via 'always_pinned_mode', for
+	 example), we don't get 'CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED' but
+	 'CUDA_ERROR_INVALID_VALUE'.  */
+      if (nvptx_page_locked_host_p (ptr, size))
+	/* Accept the case that the region already is page-locked.  */
+	ret = 0;
+      /* Depending on certain circumstances (see 'cuMemHostRegister'
+	 documentation), for example, 'const' data that is placed in section
+	 '.rodata' may need 'flags |= CU_MEMHOSTREGISTER_READ_ONLY;', to avoid
+	 'CUDA_ERROR_INVALID_VALUE'.  If running into that, we now apply/re-try
+	 lazily instead of actively setting it above, to avoid the following
+	 problem.  Supposedly/observably (but, not documented), if part of a
+	 memory page has been registered without 'CU_MEMHOSTREGISTER_READ_ONLY'
+	 and we then try to register another part with
+	 'CU_MEMHOSTREGISTER_READ_ONLY', we'll get 'CUDA_ERROR_INVALID_VALUE'.
+	 In that case, we can solve the issue by re-trying with
+	 'CU_MEMHOSTREGISTER_READ_ONLY' masked out.  However, if part of a
+	 memory page has been registered with 'CU_MEMHOSTREGISTER_READ_ONLY'
+	 and we then try to register another part without
+	 'CU_MEMHOSTREGISTER_READ_ONLY', that latter part apparently inherits
+	 the former's 'CU_MEMHOSTREGISTER_READ_ONLY' (and any device to host
+	 copy then fails).  We can't easily resolve that situation
+	 retroactively, that is, we can't easily re-register the first
+	 'CU_MEMHOSTREGISTER_READ_ONLY' part without that flag.  */
+      else if (!(flags & CU_MEMHOSTREGISTER_READ_ONLY)
+	       && try_read_only
+	       && ptx_dev->read_only_host_register_supported)
+	{
+	  GOMP_PLUGIN_debug (0, "  flags |= CU_MEMHOSTREGISTER_READ_ONLY;\n");
+	  flags |= CU_MEMHOSTREGISTER_READ_ONLY;
+	  goto cuMemHostRegister;
+	}
+      /* We ought to use 'CU_MEMHOSTREGISTER_READ_ONLY', but it's not
+	 available.  */
+      else if (try_read_only
+	       && !ptx_dev->read_only_host_register_supported)
+	{
+	  assert (!(flags & CU_MEMHOSTREGISTER_READ_ONLY));
+	  GOMP_PLUGIN_debug (0, "  punt;"
+			     " CU_MEMHOSTREGISTER_READ_ONLY not available\n");
+	  /* Accept this (legacy) case; we can't (easily) register page-locked
+	     this region of host memory.  */
+	  ret = 0;
+	}
+    }
+  else if (r == CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED)
+    {
+      /* 'cuMemHostRegister' (via the user code, for example) followed by
+	 another (potentially partially overlapping) 'cuMemHostRegister'
+	 (via 'always_pinned_mode', for example).  */
+      /* Accept this case in good faith; do not verify further.  */
+      ret = 0;
+    }
+  if (ret == -1)
+    GOMP_PLUGIN_error ("cuMemHostRegister error: %s", cuda_error (r));
+  GOMP_PLUGIN_debug (0, "  -> %d (with r = %u)\n",
+		     ret, (unsigned) r);
+  return ret;
+}
+
+static void
+nvptx_page_locked_host_unregister_callback (CUstream stream, CUresult r,
+					    void *b_)
+{
+  void **b = b_;
+  struct goacc_asyncqueue *aq = b[0];
+  void *ptr = b[1];
+  GOMP_PLUGIN_debug (0, "%s: stream=%p, r=%u, b_=%p (aq=%p, ptr=%p)\n",
+		     __FUNCTION__, stream, (unsigned) r, b_, aq, ptr);
+
+  free (b_);
+
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (r));
+
+  /* We can't now call 'cuMemHostUnregister': we're in a CUDA stream context,
+     where we "must not make any CUDA API calls".  */
+  /* See 'nvptx_free'.  */
+  struct ptx_free_block *n
+    = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
+  GOMP_PLUGIN_debug (0, "  defer; n=%p\n", n);
+  n->ptr = ptr;
+  pthread_mutex_lock (&aq->page_locked_host_unregister_blocks_lock);
+  n->next = aq->page_locked_host_unregister_blocks;
+  aq->page_locked_host_unregister_blocks = n;
+  pthread_mutex_unlock (&aq->page_locked_host_unregister_blocks_lock);
+}
+
 bool
-GOMP_OFFLOAD_page_locked_host_free (void *ptr)
+GOMP_OFFLOAD_page_locked_host_unregister (void *ptr, size_t size,
+					  struct goacc_asyncqueue *aq)
 {
-  GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p\n",
-		     __FUNCTION__, ptr);
+  GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu, aq=%p\n",
+		     __FUNCTION__, ptr, (unsigned long long) size, aq);
+  assert (size != 0);
 
-  CUDA_CALL (cuMemFreeHost, ptr);
+  if (aq)
+    {
+      /* We don't unregister right away, as in-flight operations may still
+	 benefit from the registration.  */
+      void **b = GOMP_PLUGIN_malloc (2 * sizeof (*b));
+      b[0] = aq;
+      b[1] = ptr;
+      GOMP_PLUGIN_debug (0, "  aq <-"
+			 " nvptx_page_locked_host_unregister_callback(b=%p)\n",
+			 b);
+      CUDA_CALL (cuStreamAddCallback, aq->cuda_stream,
+		 nvptx_page_locked_host_unregister_callback, b, 0);
+    }
+  else
+    CUDA_CALL (cuMemHostUnregister, ptr);
   return true;
 }
 
+int
+GOMP_OFFLOAD_page_locked_host_p (int ord, const void *ptr, size_t size)
+{
+  GOMP_PLUGIN_debug (0, "nvptx %s: ord=%d, ptr=%p, size=%llu\n",
+		     __FUNCTION__, ord, ptr, (unsigned long long) size);
+
+  if (!nvptx_attach_host_thread_to_device (ord))
+    return -1;
+
+  return nvptx_page_locked_host_p (ptr, size);
+}
+
 
 void
 GOMP_OFFLOAD_openacc_exec (void (*fn) (void *),
@@ -1841,12 +2125,19 @@ GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
 static struct goacc_asyncqueue *
 nvptx_goacc_asyncqueue_construct (unsigned int flags)
 {
+  GOMP_PLUGIN_debug (0, "%s: flags=%u\n",
+		     __FUNCTION__, flags);
+
   CUstream stream = NULL;
   CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, flags);
 
   struct goacc_asyncqueue *aq
     = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
   aq->cuda_stream = stream;
+  pthread_mutex_init (&aq->page_locked_host_unregister_blocks_lock, NULL);
+  aq->page_locked_host_unregister_blocks = NULL;
+  GOMP_PLUGIN_debug (0, "  -> aq=%p (with cuda_stream=%p)\n",
+		     aq, aq->cuda_stream);
   return aq;
 }
 
@@ -1859,9 +2150,24 @@ GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
 static bool
 nvptx_goacc_asyncqueue_destruct (struct goacc_asyncqueue *aq)
 {
+  GOMP_PLUGIN_debug (0, "nvptx %s: aq=%p\n",
+		     __FUNCTION__, aq);
+
   CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
+
+  bool ret = true;
+  pthread_mutex_lock (&aq->page_locked_host_unregister_blocks_lock);
+  if (aq->page_locked_host_unregister_blocks != NULL)
+    {
+      GOMP_PLUGIN_error ("aq->page_locked_host_unregister_blocks not empty");
+      ret = false;
+    }
+  pthread_mutex_unlock (&aq->page_locked_host_unregister_blocks_lock);
+  pthread_mutex_destroy (&aq->page_locked_host_unregister_blocks_lock);
+
   free (aq);
-  return true;
+
+  return ret;
 }
 
 bool
@@ -1870,12 +2176,50 @@ GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
   return nvptx_goacc_asyncqueue_destruct (aq);
 }
 
+static bool
+nvptx_run_deferred_page_locked_host_unregister (struct goacc_asyncqueue *aq)
+{
+  GOMP_PLUGIN_debug (0, "%s: aq=%p\n",
+		     __FUNCTION__, aq);
+
+  bool ret = true;
+  pthread_mutex_lock (&aq->page_locked_host_unregister_blocks_lock);
+  for (struct ptx_free_block *b = aq->page_locked_host_unregister_blocks; b;)
+    {
+      GOMP_PLUGIN_debug (0, "  b=%p: cuMemHostUnregister(b->ptr=%p)\n",
+			 b, b->ptr);
+
+      struct ptx_free_block *b_next = b->next;
+      CUresult r = CUDA_CALL_NOCHECK (cuMemHostUnregister, b->ptr);
+      if (r != CUDA_SUCCESS)
+	{
+	  GOMP_PLUGIN_error ("cuMemHostUnregister error: %s", cuda_error (r));
+	  ret = false;
+	}
+      free (b);
+      b = b_next;
+    }
+  aq->page_locked_host_unregister_blocks = NULL;
+  pthread_mutex_unlock (&aq->page_locked_host_unregister_blocks_lock);
+  return ret;
+}
+
 int
 GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 {
+  GOMP_PLUGIN_debug (0, "nvptx %s: aq=%p\n",
+		     __FUNCTION__, aq);
+
   CUresult r = CUDA_CALL_NOCHECK (cuStreamQuery, aq->cuda_stream);
   if (r == CUDA_SUCCESS)
-    return 1;
+    {
+      /* As a user may expect that they don't need to 'wait' if
+	 'acc_async_test' returns 'true', clean up here, too.  */
+      if (!nvptx_run_deferred_page_locked_host_unregister (aq))
+	return -1;
+
+      return 1;
+    }
   if (r == CUDA_ERROR_NOT_READY)
     return 0;
 
@@ -1886,7 +2230,17 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 static bool
 nvptx_goacc_asyncqueue_synchronize (struct goacc_asyncqueue *aq)
 {
+  GOMP_PLUGIN_debug (0, "%s: aq=%p\n",
+		     __FUNCTION__, aq);
+
   CUDA_CALL_ERET (false, cuStreamSynchronize, aq->cuda_stream);
+
+  /* This is called from a user code (non-stream) context, and upon returning,
+     we must've given up on any page-locked memory registrations, so unregister
+     any pending ones now.  */
+  if (!nvptx_run_deferred_page_locked_host_unregister (aq))
+    return false;
+
   return true;
 }
 
@@ -1896,14 +2250,70 @@ GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
   return nvptx_goacc_asyncqueue_synchronize (aq);
 }
 
+static void
+nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback
+(CUstream stream, CUresult r, void *b_)
+{
+  void **b = b_;
+  struct goacc_asyncqueue *aq1 = b[0];
+  struct goacc_asyncqueue *aq2 = b[1];
+  GOMP_PLUGIN_debug (0, "%s: stream=%p, r=%u, b_=%p (aq1=%p, aq2=%p)\n",
+		     __FUNCTION__, stream, (unsigned) r, b_, aq1, aq2);
+
+  free (b_);
+
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (r));
+
+  pthread_mutex_lock (&aq1->page_locked_host_unregister_blocks_lock);
+  if (aq1->page_locked_host_unregister_blocks)
+    {
+      pthread_mutex_lock (&aq2->page_locked_host_unregister_blocks_lock);
+      GOMP_PLUGIN_debug (0, "  page_locked_host_unregister_blocks:"
+			 " aq1 -> aq2\n");
+      if (aq2->page_locked_host_unregister_blocks == NULL)
+	aq2->page_locked_host_unregister_blocks
+	  = aq1->page_locked_host_unregister_blocks;
+      else
+	{
+	  struct ptx_free_block *b = aq2->page_locked_host_unregister_blocks;
+	  while (b->next != NULL)
+	    b = b->next;
+	  b->next = aq1->page_locked_host_unregister_blocks;
+	}
+      pthread_mutex_unlock (&aq2->page_locked_host_unregister_blocks_lock);
+      aq1->page_locked_host_unregister_blocks = NULL;
+    }
+  pthread_mutex_unlock (&aq1->page_locked_host_unregister_blocks_lock);
+}
+
 bool
 GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
 				      struct goacc_asyncqueue *aq2)
 {
+  GOMP_PLUGIN_debug (0, "nvptx %s: aq1=%p, aq2=%p\n",
+		     __FUNCTION__, aq1, aq2);
+
+  if (aq1 != aq2)
+    {
+      void **b = GOMP_PLUGIN_malloc (2 * sizeof (*b));
+      b[0] = aq1;
+      b[1] = aq2;
+      /* Enqueue on 'aq1': move 'page_locked_host_unregister_blocks' of 'aq1'
+	 to 'aq2'.  */
+      GOMP_PLUGIN_debug (0, "  aq1 <-"
+			 " nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback"
+			 "(b=%p)\n", b);
+      CUDA_CALL (cuStreamAddCallback, aq1->cuda_stream,
+		 nvptx_move_page_locked_host_unregister_blocks_aq1_aq2_callback,
+		 b, 0);
+    }
+
   CUevent e;
   CUDA_CALL_ERET (false, cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
   CUDA_CALL_ERET (false, cuEventRecord, e, aq1->cuda_stream);
   CUDA_CALL_ERET (false, cuStreamWaitEvent, aq2->cuda_stream, e, 0);
+
   return true;
 }
 
@@ -2238,6 +2648,19 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 	    if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq))
 	      exit (EXIT_FAILURE);
 	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+
+	    /* Clean up here; otherwise we may run into the situation that
+	       a following reverse offload does
+	       'GOMP_OFFLOAD_page_locked_host_alloc', and that then runs the
+	       deferred 'cuMemFreeHost's -- which may dead-lock?!
+	       TODO: This may need more considerations for the case that
+	       different host threads do reverse offload?  We could move
+	       'free_host_blocks' into 'aq' (which is separate per reverse
+	       offload) instead of global, like
+	       'page_locked_host_unregister_blocks', but that doesn't seem the
+	       right thing for OpenACC 'async' generally?  */
+	    if (!nvptx_run_deferred_page_locked_host_free ())
+	      exit (EXIT_FAILURE);
 	  }
 	usleep (1);
       }
diff --git a/libgomp/target.c b/libgomp/target.c
index b88b1ebaa13..ed2fc09cf44 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -108,6 +108,74 @@ static int num_devices_openmp;
 /* OpenMP requires mask.  */
 static int omp_requires_mask;
 
+
+static void *gomp_page_locked_host_alloc_dev (struct gomp_device_descr *,
+					      size_t, bool);
+static bool gomp_page_locked_host_free_dev (struct gomp_device_descr *,
+					    void *,
+					    struct goacc_asyncqueue *);
+static void *gomp_page_locked_host_aligned_alloc_dev (struct gomp_device_descr *,
+						      size_t, size_t);
+static bool gomp_page_locked_host_aligned_free_dev (struct gomp_device_descr *,
+						    void *,
+						    struct goacc_asyncqueue *);
+
+/* Use (that is, allocate or register) page-locked host memory for memory
+   objects participating in host <-> device memory transfers.
+
+   When this is enabled, there is no fallback to non-page-locked host
+   memory.  */
+
+attribute_hidden
+bool always_pinned_mode = false;
+
+/* This function is called by the compiler when -foffload-memory=pinned
+   is used.  */
+
+void
+GOMP_enable_pinned_mode ()
+{
+  always_pinned_mode = true;
+}
+
+/* Verify that page-locked host memory is used for memory objects participating
+   in host <-> device memory transfers.  */
+
+static const bool verify_always_pinned_mode = false;
+
+static bool
+gomp_verify_always_pinned_mode (struct gomp_device_descr *device,
+				const void *ptr, size_t size)
+{
+  gomp_debug (0, "%s: device=%p (%s), ptr=%p, size=%llu\n",
+	      __FUNCTION__,
+	      device, device->name, ptr, (unsigned long long) size);
+
+  if (size == 0)
+    /* Skip zero-size requests; for those we've got no actual region of
+       page-locked host memory.  */
+    ;
+  else if (device->page_locked_host_register_func)
+    {
+      int page_locked_host_p
+	= device->page_locked_host_p_func (device->target_id, ptr, size);
+      if (page_locked_host_p < 0)
+	{
+	  gomp_error ("Failed to test page-locked host memory"
+		      " via %s libgomp plugin",
+		      device->name);
+	  return false;
+	}
+      if (!page_locked_host_p)
+	{
+	  gomp_error ("Failed page-locked host memory test");
+	  return false;
+	}
+    }
+  return true;
+}
+
+
 /* Similar to gomp_realloc, but release register_lock before gomp_fatal.  */
 
 static void *
@@ -402,6 +470,9 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 		  if (__builtin_expect (aq != NULL, 0))
 		    assert (ephemeral);
 
+		  /* We're just filling the CBUF; 'always_pinned_mode' isn't
+		     relevant.  */
+
 		  memcpy ((char *) cbuf->buf + (doff - cbuf->chunks[0].start),
 			  h, sz);
 		  return;
@@ -422,18 +493,92 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 	     stack local in a function that is no longer executing).  As we've
 	     not been able to use CBUF, make a copy of the data into a
 	     temporary buffer.  */
-	  h_buf = gomp_malloc (sz);
+	  if (always_pinned_mode)
+	    {
+	      h_buf = gomp_page_locked_host_alloc_dev (devicep, sz, false);
+	      if (!h_buf)
+		{
+		  gomp_mutex_unlock (&devicep->lock);
+		  exit (EXIT_FAILURE);
+		}
+	    }
+	  else
+	    h_buf = gomp_malloc (sz);
 	  memcpy (h_buf, h, sz);
 	}
+
+      /* No 'gomp_verify_always_pinned_mode' for 'ephemeral'; have just
+	 allocated.  */
+      if (!ephemeral
+	  && verify_always_pinned_mode
+	  && always_pinned_mode)
+	if (!gomp_verify_always_pinned_mode (devicep, h_buf, sz))
+	  {
+	    gomp_mutex_unlock (&devicep->lock);
+	    exit (EXIT_FAILURE);
+	  }
+
       goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
 			       "dev", d, "host", h_buf, h, sz, aq);
+
       if (ephemeral)
-	/* Free once the transfer has completed.  */
-	devicep->openacc.async.queue_callback_func (aq, free, h_buf);
+	{
+	  if (always_pinned_mode)
+	    {
+	      if (!gomp_page_locked_host_free_dev (devicep, h_buf, aq))
+		{
+		  gomp_mutex_unlock (&devicep->lock);
+		  exit (EXIT_FAILURE);
+		}
+	    }
+	  else
+	    /* Free once the transfer has completed.  */
+	    devicep->openacc.async.queue_callback_func (aq, free, h_buf);
+	}
     }
   else
-    gomp_device_copy (devicep, devicep->host2dev_func,
-		      "dev", d, "host", h, sz);
+    {
+      if (ephemeral
+	  && always_pinned_mode)
+	{
+	  /* TODO: Page-locking on the spot probably doesn't make a lot of
+	     sense (performance-wise).  Should we instead use a "page-locked
+	     host memory bounce buffer" (per host thread, or per device,
+	     or...)?  */
+	  void *ptr = (void *) h;
+	  int page_locked_host_p
+	    = gomp_page_locked_host_register_dev (devicep,
+						  ptr, sz, GOMP_MAP_TO);
+	  if (page_locked_host_p < 0)
+	    {
+	      gomp_mutex_unlock (&devicep->lock);
+	      exit (EXIT_FAILURE);
+	    }
+	  /* Ephemeral data isn't already page-locked host memory.  */
+	  assert (page_locked_host_p);
+	}
+      else if (verify_always_pinned_mode
+	       && always_pinned_mode)
+	if (!gomp_verify_always_pinned_mode (devicep, h, sz))
+	  {
+	    gomp_mutex_unlock (&devicep->lock);
+	    exit (EXIT_FAILURE);
+	  }
+
+      gomp_device_copy (devicep, devicep->host2dev_func,
+			"dev", d, "host", h, sz);
+
+      if (ephemeral
+	  && always_pinned_mode)
+	{
+	  void *ptr = (void *) h;
+	  if (!gomp_page_locked_host_unregister_dev (devicep, ptr, sz, aq))
+	    {
+	      gomp_mutex_unlock (&devicep->lock);
+	      exit (EXIT_FAILURE);
+	    }
+	}
+    }
 }
 
 attribute_hidden void
@@ -441,6 +586,14 @@ gomp_copy_dev2host (struct gomp_device_descr *devicep,
 		    struct goacc_asyncqueue *aq,
 		    void *h, const void *d, size_t sz)
 {
+  if (verify_always_pinned_mode
+      && always_pinned_mode)
+    if (!gomp_verify_always_pinned_mode (devicep, h, sz))
+      {
+	gomp_mutex_unlock (&devicep->lock);
+	exit (EXIT_FAILURE);
+      }
+
   if (__builtin_expect (aq != NULL, 0))
     goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
 			     "host", h, "dev", d, NULL, sz, aq);
@@ -1367,8 +1520,19 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	cbuf.chunk_cnt--;
       if (cbuf.chunk_cnt > 0)
 	{
-	  cbuf.buf
-	    = malloc (cbuf.chunks[cbuf.chunk_cnt - 1].end - cbuf.chunks[0].start);
+	  size_t sz
+	    = cbuf.chunks[cbuf.chunk_cnt - 1].end - cbuf.chunks[0].start;
+	  if (always_pinned_mode)
+	    {
+	      cbuf.buf = gomp_page_locked_host_alloc_dev (devicep, sz, false);
+	      if (!cbuf.buf)
+		{
+		  gomp_mutex_unlock (&devicep->lock);
+		  exit (EXIT_FAILURE);
+		}
+	    }
+	  else
+	    cbuf.buf = malloc (sz);
 	  if (cbuf.buf)
 	    {
 	      cbuf.tgt = tgt;
@@ -1671,6 +1835,23 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		k->tgt = tgt;
 		k->refcount = 0;
 		k->dynamic_refcount = 0;
+		k->page_locked_host_p = false;
+		if (always_pinned_mode)
+		  {
+		    void *ptr = (void *) k->host_start;
+		    size_t size = k->host_end - k->host_start;
+		    int page_locked_host_p = 0;
+		    if (size != 0)
+		      page_locked_host_p = gomp_page_locked_host_register_dev
+			(devicep, ptr, size, kind & typemask);
+		    if (page_locked_host_p < 0)
+		      {
+			gomp_mutex_unlock (&devicep->lock);
+			exit (EXIT_FAILURE);
+		      }
+		    if (page_locked_host_p)
+		      k->page_locked_host_p = true;
+		  }
 		if (field_tgt_clear != FIELD_TGT_EMPTY)
 		  {
 		    k->tgt_offset = k->host_start - field_tgt_base
@@ -1976,11 +2157,22 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 						 - cbuf.chunks[0].start),
 			    cbuf.chunks[c].end - cbuf.chunks[c].start,
 			    false, NULL);
-      if (aq)
-	/* Free once the transfer has completed.  */
-	devicep->openacc.async.queue_callback_func (aq, free, cbuf.buf);
+      if (always_pinned_mode)
+	{
+	  if (!gomp_page_locked_host_free_dev (devicep, cbuf.buf, aq))
+	    {
+	      gomp_mutex_unlock (&devicep->lock);
+	      exit (EXIT_FAILURE);
+	    }
+	}
       else
-	free (cbuf.buf);
+	{
+	  if (aq)
+	    /* Free once the transfer has completed.  */
+	    devicep->openacc.async.queue_callback_func (aq, free, cbuf.buf);
+	  else
+	    free (cbuf.buf);
+	}
       cbuf.buf = NULL;
       cbufp = NULL;
     }
@@ -2112,6 +2304,23 @@ gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k,
 	  /* Starting from the _FIRST key, and continue for all following
 	     sibling keys.  */
 	  gomp_remove_splay_tree_key (&devicep->mem_map, k);
+
+	  if (always_pinned_mode)
+	    {
+	      if (k->page_locked_host_p)
+		{
+		  void *ptr = (void *) k->host_start;
+		  size_t size = k->host_end - k->host_start;
+		  if (!gomp_page_locked_host_unregister_dev (devicep,
+							     ptr, size, aq))
+		    {
+		      gomp_mutex_unlock (&devicep->lock);
+		      exit (EXIT_FAILURE);
+		    }
+		  k->page_locked_host_p = false;
+		}
+	    }
+
 	  if (REFCOUNT_STRUCTELEM_LAST_P (k->refcount))
 	    break;
 	  else
@@ -2119,7 +2328,25 @@ gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k,
 	}
     }
   else
-    gomp_remove_splay_tree_key (&devicep->mem_map, k);
+    {
+      gomp_remove_splay_tree_key (&devicep->mem_map, k);
+
+      if (always_pinned_mode)
+	{
+	  if (k->page_locked_host_p)
+	    {
+	      void *ptr = (void *) k->host_start;
+	      size_t size = k->host_end - k->host_start;
+	      if (!gomp_page_locked_host_unregister_dev (devicep,
+							 ptr, size, aq))
+		{
+		  gomp_mutex_unlock (&devicep->lock);
+		  exit (EXIT_FAILURE);
+		}
+	      k->page_locked_host_p = false;
+	    }
+	}
+    }
 
   if (aq)
     devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void,
@@ -2211,6 +2438,8 @@ gomp_unmap_vars_internal (struct target_mem_desc *tgt, bool do_copyfrom,
 				      + tgt->list[i].offset),
 			    tgt->list[i].length);
       /* Queue all removals together for processing below.
+	 We may unregister page-locked host memory only after all device to
+	 host memory transfers have completed.
 	 See also 'gomp_exit_data'.  */
       if (do_remove)
 	remove_vars[nrmvars++] = k;
@@ -2392,8 +2621,17 @@ get_gomp_offload_icvs (int dev_num)
   if (offload_icvs != NULL)
     return &offload_icvs->icvs;
 
-  struct gomp_offload_icv_list *new
-    = (struct gomp_offload_icv_list *) gomp_malloc (sizeof (struct gomp_offload_icv_list));
+  struct gomp_offload_icv_list *new;
+  size_t size = sizeof (struct gomp_offload_icv_list);
+  if (always_pinned_mode)
+    {
+      struct gomp_device_descr *device = &devices[dev_num];
+      new = gomp_page_locked_host_alloc_dev (device, size, false);
+      if (!new)
+	exit (EXIT_FAILURE);
+    }
+  else
+    new = gomp_malloc (size);
 
   new->device_num = dev_num;
   new->icvs.device_num = dev_num;
@@ -2447,6 +2685,8 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
 			   const void *host_table, const void *target_data,
 			   bool is_register_lock)
 {
+  gomp_debug (0, "%s: devicep=%p (%s)\n",
+	      __FUNCTION__, devicep, devicep->name);
   void **host_func_table = ((void ***) host_table)[0];
   void **host_funcs_end  = ((void ***) host_table)[1];
   void **host_var_table  = ((void ***) host_table)[2];
@@ -2511,6 +2751,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       k->refcount = REFCOUNT_INFINITY;
       k->dynamic_refcount = 0;
       k->aux = NULL;
+      k->page_locked_host_p = false;
       array->left = NULL;
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
@@ -2556,6 +2797,34 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       k->refcount = is_link_var ? REFCOUNT_LINK : REFCOUNT_INFINITY;
       k->dynamic_refcount = 0;
       k->aux = NULL;
+      k->page_locked_host_p = false;
+      if (always_pinned_mode)
+	{
+	  void *ptr = (void *) k->host_start;
+	  size_t size = k->host_end - k->host_start;
+	  gomp_debug (0, "  var %d: ptr=%p, size=%llu, is_link_var=%d\n",
+		      i, ptr, (unsigned long long) size, is_link_var);
+	  if (!is_link_var)
+	    {
+	      /* '#pragma omp declare target' variables typically are
+		 read/write, but in particular artificial ones, like Fortran
+		 array constructors, may be placed in section '.rodata'.
+		 We don't have the actual mapping kind available here, so we
+		 use a magic number.  */
+	      const int kind = -1;
+	      int page_locked_host_p = gomp_page_locked_host_register_dev
+		(devicep, ptr, size, kind);
+	      if (page_locked_host_p < 0)
+		{
+		  gomp_mutex_unlock (&devicep->lock);
+		  if (is_register_lock)
+		    gomp_mutex_unlock (&register_lock);
+		  exit (EXIT_FAILURE);
+		}
+	      if (page_locked_host_p)
+		k->page_locked_host_p = true;
+	    }
+	}
       array->left = NULL;
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
@@ -2577,6 +2846,13 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
 	     devicep->target_id.  */
 	  int dev_num = (int) (devicep - &devices[0]);
 	  struct gomp_offload_icvs *icvs = get_gomp_offload_icvs (dev_num);
+	  if (!icvs)
+	    {
+	      gomp_mutex_unlock (&devicep->lock);
+	      if (is_register_lock)
+		gomp_mutex_unlock (&register_lock);
+	      gomp_fatal ("'get_gomp_offload_icvs' failed");
+	    }
 	  size_t var_size = var->end - var->start;
 	  if (var_size != sizeof (struct gomp_offload_icvs))
 	    {
@@ -2599,6 +2875,8 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
 	  k->refcount = REFCOUNT_INFINITY;
 	  k->dynamic_refcount = 0;
 	  k->aux = NULL;
+	  /* 'always_pinned_mode' handled via 'get_gomp_offload_icvs'.  */
+	  k->page_locked_host_p = always_pinned_mode;
 	  array->left = NULL;
 	  array->right = NULL;
 	  splay_tree_insert (&devicep->mem_map, array);
@@ -3261,6 +3539,12 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 
   flags = clear_unsupported_flags (devicep, flags);
 
+  /* For 'nowait' we supposedly have to unregister/free page-locked host memory
+     via 'GOMP_PLUGIN_target_task_completion'.  There is no current
+     configuration exercising this (and thus, infeasible to test).  */
+  assert (!(flags & GOMP_TARGET_FLAG_NOWAIT)
+	  || !(devicep && devicep->page_locked_host_register_func));
+
   if (flags & GOMP_TARGET_FLAG_NOWAIT)
     {
       struct gomp_thread *thr = gomp_thread ();
@@ -3572,18 +3856,37 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
     }
   else
     {
-      devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
-      sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
-      kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
+      size_t devaddrs_size = mapnum * sizeof (uint64_t);
+      size_t sizes_size = mapnum * sizeof (uint64_t);
+      size_t kinds_size = mapnum * sizeof (unsigned short);
+      if (always_pinned_mode)
+	{
+	  if (!(devaddrs = gomp_page_locked_host_alloc_dev (devicep,
+							    devaddrs_size,
+							    false))
+	      || !(sizes = gomp_page_locked_host_alloc_dev (devicep,
+							    sizes_size,
+							    false))
+	      || !(kinds = gomp_page_locked_host_alloc_dev (devicep,
+							    kinds_size,
+							    false)))
+	    exit (EXIT_FAILURE);
+	}
+      else
+	{
+	  devaddrs = gomp_malloc (devaddrs_size);
+	  sizes = gomp_malloc (sizes_size);
+	  kinds = gomp_malloc (kinds_size);
+	}
       gomp_copy_dev2host (devicep, aq, devaddrs,
 			  (const void *) (uintptr_t) devaddrs_ptr,
-			  mapnum * sizeof (uint64_t));
+			  devaddrs_size);
       gomp_copy_dev2host (devicep, aq, sizes,
 			  (const void *) (uintptr_t) sizes_ptr,
-			  mapnum * sizeof (uint64_t));
+			  sizes_size);
       gomp_copy_dev2host (devicep, aq, kinds,
 			  (const void *) (uintptr_t) kinds_ptr,
-			  mapnum * sizeof (unsigned short));
+			  kinds_size);
       if (aq && !devicep->openacc.async.synchronize_func (aq))
 	exit (EXIT_FAILURE);
     }
@@ -3598,7 +3901,23 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 
   if (tgt_align)
     {
-      char *tgt = gomp_alloca (tgt_size + tgt_align - 1);
+      size_t tgt_alloc_size = tgt_size + tgt_align - 1;
+      char *tgt = gomp_alloca (tgt_alloc_size);
+      if (always_pinned_mode)
+	{
+	  /* TODO: See 'gomp_copy_host2dev' re "page-locking on the spot".
+	     On the other hand, performance isn't really a concern, here.  */
+	  int page_locked_host_p = 0;
+	  if (tgt_alloc_size != 0)
+	    {
+	      page_locked_host_p = gomp_page_locked_host_register_dev
+		(devicep, tgt, tgt_alloc_size, GOMP_MAP_TOFROM);
+	      if (page_locked_host_p < 0)
+		exit (EXIT_FAILURE);
+	      /* 'gomp_alloca' isn't already page-locked host memory.  */
+	      assert (page_locked_host_p);
+	    }
+	}
       uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
       if (al)
 	tgt += tgt_align - al;
@@ -3632,6 +3951,14 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		++i;
 	      }
 	  }
+      if (always_pinned_mode)
+	{
+	  if (tgt_alloc_size != 0
+	      && !gomp_page_locked_host_unregister_dev (devicep,
+							tgt, tgt_alloc_size,
+							NULL))
+	    exit (EXIT_FAILURE);
+	}
     }
 
   if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM) && mapnum > 0)
@@ -3718,9 +4045,20 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		  {
 		    cdata[i].aligned = true;
 		    size_t align = (size_t) 1 << (kinds[i] >> 8);
-		    devaddrs[i]
-		      = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
-								   sizes[i]);
+		    void *ptr;
+		    if (always_pinned_mode)
+		      {
+			ptr = gomp_page_locked_host_aligned_alloc_dev
+			  (devicep, align, sizes[i]);
+			if (!ptr)
+			  {
+			    gomp_mutex_unlock (&devicep->lock);
+			    exit (EXIT_FAILURE);
+			  }
+		      }
+		    else
+		      ptr = gomp_aligned_alloc (align, sizes[i]);
+		    devaddrs[i] = (uint64_t) (uintptr_t) ptr;
 		  }
 		else if (n2 != NULL)
 		  devaddrs[i] = (n2->host_start + cdata[i].devaddr
@@ -3770,7 +4108,23 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		      }
 		  }
 		if (!cdata[i].present)
-		  devaddrs[i] = (uintptr_t) gomp_malloc (sizeof (void*));
+		  {
+		    void *ptr;
+		    size_t size = sizeof (void *);
+		    if (always_pinned_mode)
+		      {
+			ptr = gomp_page_locked_host_alloc_dev (devicep,
+							       size, false);
+			if (!ptr)
+			  {
+			    gomp_mutex_unlock (&devicep->lock);
+			    exit (EXIT_FAILURE);
+			  }
+		      }
+		    else
+		      ptr = gomp_malloc (size);
+		    devaddrs[i] = (uintptr_t) ptr;
+		  }
 		/* Assume that when present, the pointer is already correct.  */
 		if (!n2)
 		  *(uint64_t *) (uintptr_t) (devaddrs[i] + sizes[i])
@@ -3803,9 +4157,20 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		  {
 		    cdata[i].aligned = true;
 		    size_t align = (size_t) 1 << (kinds[i] >> 8);
-		    devaddrs[i]
-		      = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
-								   sizes[i]);
+		    void *ptr;
+		    if (always_pinned_mode)
+		      {
+			ptr = gomp_page_locked_host_aligned_alloc_dev
+			  (devicep, align, sizes[i]);
+			if (!ptr)
+			  {
+			    gomp_mutex_unlock (&devicep->lock);
+			    exit (EXIT_FAILURE);
+			  }
+		      }
+		    else
+		      ptr = gomp_aligned_alloc (align, sizes[i]);
+		    devaddrs[i] = (uint64_t) (uintptr_t) ptr;
 		    gomp_copy_dev2host (devicep, aq,
 					(void *) (uintptr_t) devaddrs[i],
 					(void *) (uintptr_t) cdata[i].devaddr,
@@ -3881,7 +4246,20 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 					  + sizes[i + sizes[i]]);
 		    size_t align = (size_t) 1 << (kinds[i] >> 8);
 		    cdata[i].aligned = true;
-		    devaddrs[i] = (uintptr_t) gomp_aligned_alloc (align, sz);
+		    void *ptr;
+		    if (always_pinned_mode)
+		      {
+			ptr = gomp_page_locked_host_aligned_alloc_dev
+			  (devicep, align, sz);
+			if (!ptr)
+			  {
+			    gomp_mutex_unlock (&devicep->lock);
+			    exit (EXIT_FAILURE);
+			  }
+		      }
+		    else
+		      ptr = gomp_aligned_alloc (align, sz);
+		    devaddrs[i] = (uintptr_t) ptr;
 		    devaddrs[i] -= devaddrs[i+1] - cdata[i].devaddr;
 		  }
 		else
@@ -3945,9 +4323,29 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 	      struct_cpy = sizes[i];
 	    }
 	  else if (!cdata[i].present && cdata[i].aligned)
-	    gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]);
+	    {
+	      void *ptr = (void *) (uintptr_t) devaddrs[i];
+	      if (always_pinned_mode)
+		{
+		  if (!gomp_page_locked_host_aligned_free_dev (devicep,
+							       ptr,
+							       aq))
+		    exit (EXIT_FAILURE);
+		}
+	      else
+		gomp_aligned_free (ptr);
+	    }
 	  else if (!cdata[i].present)
-	    free ((void *) (uintptr_t) devaddrs[i]);
+	    {
+	      void *ptr = (void *) (uintptr_t) devaddrs[i];
+	      if (always_pinned_mode)
+		{
+		  if (!gomp_page_locked_host_free_dev (devicep, ptr, aq))
+		    exit (EXIT_FAILURE);
+		}
+	      else
+		free (ptr);
+	    }
 	}
       if (clean_struct)
 	for (uint64_t i = 0; i < mapnum; i++)
@@ -3956,12 +4354,30 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		  == GOMP_MAP_STRUCT))
 	    {
 	      devaddrs[i] += cdata[i+1].devaddr - cdata[i].devaddr;
-	      gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]);
+	      void *ptr = (void *) (uintptr_t) devaddrs[i];
+	      if (always_pinned_mode)
+		{
+		  if (!gomp_page_locked_host_aligned_free_dev (devicep,
+							       ptr, aq))
+		    exit (EXIT_FAILURE);
+		}
+	      else
+		gomp_aligned_free (ptr);
 	    }
 
-      free (devaddrs);
-      free (sizes);
-      free (kinds);
+      if (always_pinned_mode)
+	{
+	  if (!gomp_page_locked_host_free_dev (devicep, devaddrs, aq)
+	      || !gomp_page_locked_host_free_dev (devicep, sizes, aq)
+	      || !gomp_page_locked_host_free_dev (devicep, kinds, aq))
+	    exit (EXIT_FAILURE);
+	}
+      else
+	{
+	  free (devaddrs);
+	  free (sizes);
+	  free (kinds);
+	}
     }
 }
 
@@ -4585,6 +5001,160 @@ gomp_usm_free (void *device_ptr, int device_num)
 }
 
 
+/* Allocate page-locked host memory via DEVICE.  */
+
+static void *
+gomp_page_locked_host_alloc_dev (struct gomp_device_descr *device,
+				 size_t size, bool allow_null)
+{
+  gomp_debug (0, "%s: device=%p (%s), size=%llu\n",
+	      __FUNCTION__, device, device->name, (unsigned long long) size);
+
+  void *ret;
+  if (!device->page_locked_host_alloc_func (&ret, size))
+    {
+      const char *fmt
+	= "Failed to allocate page-locked host memory via %s libgomp plugin";
+      if (allow_null)
+	gomp_fatal (fmt, device->name);
+      else
+	gomp_error (fmt, device->name);
+      ret = NULL;
+    }
+  else if (ret == NULL && !allow_null)
+    gomp_error ("Out of memory allocating %lu bytes"
+		" page-locked host memory"
+		" via %s libgomp plugin",
+		(unsigned long) size, device->name);
+  else
+    gomp_debug (0, "  -> ret=[%p, %p)\n",
+		ret, ret + size);
+  return ret;
+}
+
+/* Free page-locked host memory via DEVICE.  */
+
+static bool
+gomp_page_locked_host_free_dev (struct gomp_device_descr *device,
+				void *ptr,
+				struct goacc_asyncqueue *aq)
+{
+  gomp_debug (0, "%s: device=%p (%s), ptr=%p, aq=%p\n",
+	      __FUNCTION__, device, device->name, ptr, aq);
+
+  if (!device->page_locked_host_free_func (ptr, aq))
+    {
+      gomp_error ("Failed to free page-locked host memory"
+		  " via %s libgomp plugin",
+		  device->name);
+      return false;
+    }
+  return true;
+}
+
+/* Allocate aligned page-locked host memory via DEVICE.
+
+   That is, 'gomp_aligned_alloc' (see 'alloc.c') for page-locked host
+   memory.  */
+
+static void *
+gomp_page_locked_host_aligned_alloc_dev (struct gomp_device_descr *device,
+					 size_t al, size_t size)
+{
+  gomp_debug (0, "%s: device=%p (%s), al=%llu, size=%llu\n",
+	      __FUNCTION__, device, device->name,
+	      (unsigned long long) al, (unsigned long long) size);
+
+  void *ret;
+  if (al < sizeof (void *))
+    al = sizeof (void *);
+  ret = NULL;
+  if ((al & (al - 1)) == 0 && size)
+    {
+      void *p = gomp_page_locked_host_alloc_dev (device, size + al, true);
+      if (p)
+	{
+	  void *ap = (void *) (((uintptr_t) p + al) & -al);
+	  ((void **) ap)[-1] = p;
+	  ret = ap;
+	}
+    }
+  if (ret == NULL)
+    gomp_error ("Out of memory allocating %lu bytes", (unsigned long) size);
+  else
+    gomp_debug (0, "  -> ret=[%p, %p)\n",
+		ret, ret + size);
+  return ret;
+}
+
+/* Free aligned page-locked host memory via DEVICE.
+
+   That is, 'gomp_aligned_free' (see 'alloc.c') for page-locked host
+   memory.  */
+
+static bool
+gomp_page_locked_host_aligned_free_dev (struct gomp_device_descr *device,
+					void *ptr,
+					struct goacc_asyncqueue *aq)
+{
+  gomp_debug (0, "%s: device=%p (%s), ptr=%p, aq=%p\n",
+	      __FUNCTION__, device, device->name, ptr, aq);
+
+  if (ptr)
+    {
+      ptr = ((void **) ptr)[-1];
+      gomp_debug (0, "  ptr=%p\n",
+		  ptr);
+
+      if (!gomp_page_locked_host_free_dev (device, ptr, aq))
+	return false;
+    }
+  return true;
+}
+
+/* Register page-locked host memory via DEVICE.  */
+
+attribute_hidden int
+gomp_page_locked_host_register_dev (struct gomp_device_descr *device,
+				    void *ptr, size_t size, int kind)
+{
+  gomp_debug (0, "%s: device=%p (%s), ptr=%p, size=%llu, kind=%d\n",
+	      __FUNCTION__, device, device->name,
+	      ptr, (unsigned long long) size, kind);
+  assert (size != 0);
+
+  int ret = device->page_locked_host_register_func (device->target_id,
+						    ptr, size, kind);
+  if (ret < 0)
+    gomp_error ("Failed to register page-locked host memory"
+		" via %s libgomp plugin",
+		device->name);
+  return ret;
+}
+
+/* Unregister page-locked host memory via DEVICE.  */
+
+attribute_hidden bool
+gomp_page_locked_host_unregister_dev (struct gomp_device_descr *device,
+				      void *ptr, size_t size,
+				      struct goacc_asyncqueue *aq)
+{
+  gomp_debug (0, "%s: device=%p (%s), ptr=%p, size=%llu, aq=%p\n",
+	      __FUNCTION__, device, device->name,
+	      ptr, (unsigned long long) size, aq);
+  assert (size != 0);
+
+  if (!device->page_locked_host_unregister_func (ptr, size, aq))
+    {
+      gomp_error ("Failed to unregister page-locked host memory"
+		  " via %s libgomp plugin",
+		  device->name);
+      return false;
+    }
+  return true;
+}
+
+
 /* Device (really: libgomp plugin) to use for paged-locked memory.  We
    assume there is either none or exactly one such device for the lifetime of
    the process.  */
@@ -4681,10 +5251,7 @@ gomp_page_locked_host_alloc (void **ptr, size_t size)
 	}
       gomp_mutex_unlock (&device->lock);
 
-      if (!device->page_locked_host_alloc_func (ptr, size))
-	gomp_fatal ("Failed to allocate page-locked host memory"
-		    " via %s libgomp plugin",
-		    device->name);
+      *ptr = gomp_page_locked_host_alloc_dev (device, size, true);
     }
   return device != NULL;
 }
@@ -4713,10 +5280,8 @@ gomp_page_locked_host_free (void *ptr)
     }
   gomp_mutex_unlock (&device->lock);
 
-  if (!device->page_locked_host_free_func (ptr))
-    gomp_fatal ("Failed to free page-locked host memory"
-		" via %s libgomp plugin",
-		device->name);
+  if (!gomp_page_locked_host_free_dev (device, ptr, NULL))
+    exit (EXIT_FAILURE);
 }
 
 
@@ -4792,30 +5357,84 @@ omp_target_memcpy_copy (void *dst, const void *src, size_t length,
   bool ret;
   if (src_devicep == NULL && dst_devicep == NULL)
     {
+      /* No 'gomp_verify_always_pinned_mode' here.  */
       memcpy ((char *) dst + dst_offset, (char *) src + src_offset, length);
       return 0;
     }
   if (src_devicep == NULL)
     {
       gomp_mutex_lock (&dst_devicep->lock);
+
+      void *src_ptr = (void *) src + src_offset;
+      int src_ptr_page_locked_host_p = 0;
+
+      if (always_pinned_mode)
+	{
+	  if (length != 0)
+	    src_ptr_page_locked_host_p = gomp_page_locked_host_register_dev
+	      (dst_devicep, src_ptr, length, GOMP_MAP_TO);
+	  if (src_ptr_page_locked_host_p < 0)
+	    {
+	      gomp_mutex_unlock (&dst_devicep->lock);
+	      return ENOMEM;
+	    }
+	}
+
+      /* No 'gomp_verify_always_pinned_mode' here; have just registered.  */
       ret = dst_devicep->host2dev_func (dst_devicep->target_id,
 					(char *) dst + dst_offset,
-					(char *) src + src_offset, length);
+					src_ptr, length);
+
+      if (src_ptr_page_locked_host_p
+	  && !gomp_page_locked_host_unregister_dev (dst_devicep,
+						    src_ptr, length, NULL))
+	    {
+	      gomp_mutex_unlock (&dst_devicep->lock);
+	      return ENOMEM;
+	    }
+
       gomp_mutex_unlock (&dst_devicep->lock);
       return (ret ? 0 : EINVAL);
     }
   if (dst_devicep == NULL)
     {
       gomp_mutex_lock (&src_devicep->lock);
+
+      void *dst_ptr = (void *) dst + dst_offset;
+      int dst_ptr_page_locked_host_p = 0;
+
+      if (always_pinned_mode)
+	{
+	  if (length != 0)
+	    dst_ptr_page_locked_host_p = gomp_page_locked_host_register_dev
+	      (src_devicep, dst_ptr, length, GOMP_MAP_FROM);
+	  if (dst_ptr_page_locked_host_p < 0)
+	    {
+	      gomp_mutex_unlock (&src_devicep->lock);
+	      return ENOMEM;
+	    }
+	}
+
+      /* No 'gomp_verify_always_pinned_mode' here; have just registered.  */
       ret = src_devicep->dev2host_func (src_devicep->target_id,
-					(char *) dst + dst_offset,
+					dst_ptr,
 					(char *) src + src_offset, length);
+
+      if (dst_ptr_page_locked_host_p
+	  && !gomp_page_locked_host_unregister_dev (src_devicep,
+						    dst_ptr, length, NULL))
+	    {
+	      gomp_mutex_unlock (&src_devicep->lock);
+	      return ENOMEM;
+	    }
+
       gomp_mutex_unlock (&src_devicep->lock);
       return (ret ? 0 : EINVAL);
     }
   if (src_devicep == dst_devicep)
     {
       gomp_mutex_lock (&src_devicep->lock);
+      /* No 'gomp_verify_always_pinned_mode' here.  */
       ret = src_devicep->dev2dev_func (src_devicep->target_id,
 				       (char *) dst + dst_offset,
 				       (char *) src + src_offset, length);
@@ -4927,21 +5546,63 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
 	return EINVAL;
       if (dst_devicep == NULL && src_devicep == NULL)
 	{
+	  /* No 'gomp_verify_always_pinned_mode' here.  */
 	  memcpy ((char *) dst + dst_off, (const char *) src + src_off,
 		  length);
 	  ret = 1;
 	}
       else if (src_devicep == NULL)
-	ret = dst_devicep->host2dev_func (dst_devicep->target_id,
-					  (char *) dst + dst_off,
-					  (const char *) src + src_off,
-					  length);
+	{
+	  void *src_ptr = (void *) src + src_off;
+	  int src_ptr_page_locked_host_p = 0;
+
+	  if (always_pinned_mode)
+	    {
+	      if (length != 0)
+		src_ptr_page_locked_host_p = gomp_page_locked_host_register_dev
+		  (dst_devicep, src_ptr, length, GOMP_MAP_TO);
+	      if (src_ptr_page_locked_host_p < 0)
+		return ENOMEM;
+	    }
+
+	  /* No 'gomp_verify_always_pinned_mode' here; have just registered.  */
+	  ret = dst_devicep->host2dev_func (dst_devicep->target_id,
+					    (char *) dst + dst_off,
+					    src_ptr,
+					    length);
+
+	  if (src_ptr_page_locked_host_p
+	      && !gomp_page_locked_host_unregister_dev (dst_devicep,
+							src_ptr, length, NULL))
+	    return ENOMEM;
+	}
       else if (dst_devicep == NULL)
-	ret = src_devicep->dev2host_func (src_devicep->target_id,
-					  (char *) dst + dst_off,
-					  (const char *) src + src_off,
-					  length);
+	{
+	  void *dst_ptr = (void *) dst + dst_off;
+	  int dst_ptr_page_locked_host_p = 0;
+
+	  if (always_pinned_mode)
+	    {
+	      if (length != 0)
+		dst_ptr_page_locked_host_p = gomp_page_locked_host_register_dev
+		  (src_devicep, dst_ptr, length, GOMP_MAP_FROM);
+	      if (dst_ptr_page_locked_host_p < 0)
+		return ENOMEM;
+	    }
+
+	  /* No 'gomp_verify_always_pinned_mode' here; have just registered.  */
+	  ret = src_devicep->dev2host_func (src_devicep->target_id,
+					    dst_ptr,
+					    (const char *) src + src_off,
+					    length);
+
+	  if (dst_ptr_page_locked_host_p
+	      && !gomp_page_locked_host_unregister_dev (src_devicep,
+							dst_ptr, length, NULL))
+	    return ENOMEM;
+	}
       else if (src_devicep == dst_devicep)
+	/* No 'gomp_verify_always_pinned_mode' here.  */
 	ret = src_devicep->dev2dev_func (src_devicep->target_id,
 					 (char *) dst + dst_off,
 					 (const char *) src + src_off,
@@ -5184,6 +5845,7 @@ omp_target_associate_ptr (const void *host_ptr, const void *device_ptr,
       k->refcount = REFCOUNT_INFINITY;
       k->dynamic_refcount = 0;
       k->aux = NULL;
+      k->page_locked_host_p = false;
       array->left = NULL;
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
@@ -5406,6 +6068,9 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM_OPT (is_usm_ptr, is_usm_ptr);
   DLSYM_OPT (page_locked_host_alloc, page_locked_host_alloc);
   DLSYM_OPT (page_locked_host_free, page_locked_host_free);
+  DLSYM_OPT (page_locked_host_register, page_locked_host_register);
+  DLSYM_OPT (page_locked_host_unregister, page_locked_host_unregister);
+  DLSYM_OPT (page_locked_host_p, page_locked_host_p);
   DLSYM (dev2host);
   DLSYM (host2dev);
   DLSYM (evaluate_device);
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c b/libgomp/testsuite/libgomp.c/alloc-pinned-7.c
deleted file mode 100644
index 8dc19055038..00000000000
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-7.c
+++ /dev/null
@@ -1,63 +0,0 @@
-/* { dg-do run } */
-/* { dg-additional-options "-foffload-memory=pinned" } */
-
-/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
-
-/* Test that pinned memory works.  */
-
-#include <stdio.h>
-#include <stdlib.h>
-
-#ifdef __linux__
-#include <sys/types.h>
-#include <unistd.h>
-
-#include <sys/mman.h>
-
-int
-get_pinned_mem ()
-{
-  int pid = getpid ();
-  char buf[100];
-  sprintf (buf, "/proc/%d/status", pid);
-
-  FILE *proc = fopen (buf, "r");
-  if (!proc)
-    abort ();
-  while (fgets (buf, 100, proc))
-    {
-      int val;
-      if (sscanf (buf, "VmLck: %d", &val))
-	{
-	  fclose (proc);
-	  return val;
-	}
-    }
-  abort ();
-}
-#else
-int
-get_pinned_mem ()
-{
-  return 0;
-}
-
-#define mlockall(...) 0
-#endif
-
-#include <omp.h>
-
-int
-main ()
-{
-  // Sanity check
-  if (get_pinned_mem () == 0)
-    {
-      /* -foffload-memory=pinned has failed, but maybe that's because
-	 isufficient pinned memory was available.  */
-      if (mlockall (MCL_CURRENT | MCL_FUTURE) == 0)
-	abort ();
-    }
-
-  return 0;
-}
-- 
2.25.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-foffload-memory-pinned-using-offloading-device-inter.patch --]
[-- Type: text/x-diff, Size: 3428 bytes --]

From 694bbd399c1323975b4a6735646e46c6914de63d Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 30 Mar 2023 10:08:12 +0200
Subject: [PATCH 2/2] '-foffload-memory=pinned' using offloading device
 interfaces for non-contiguous array support

Changes related to og12 commit 15d0f61a7fecdc8fd12857c40879ea3730f6d99f
"Merge non-contiguous array support patches".

	libgomp/
	* target.c (gomp_map_vars_internal)
	<non-contiguous array support>: Handle 'always_pinned_mode'.
---
 libgomp/ChangeLog.omp |  3 +++
 libgomp/target.c      | 55 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 1b02c057562..09cf9c6f3c1 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,8 @@
 2023-04-03  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* target.c (gomp_map_vars_internal)
+	<non-contiguous array support>: Handle 'always_pinned_mode'.
+
 	* libgomp-plugin.h (GOMP_OFFLOAD_page_locked_host_free): Add
 	'struct goacc_asyncqueue *' formal parameter.
 	(GOMP_OFFLOAD_page_locked_host_register)
diff --git a/libgomp/target.c b/libgomp/target.c
index ed2fc09cf44..38eb5d1aa5b 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2087,6 +2087,23 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		      k->dynamic_refcount = 0;
 		      k->aux = NULL;
 		      k->tgt_offset = tgt_size;
+		      k->page_locked_host_p = false;
+		      if (always_pinned_mode)
+			{
+			  void *ptr = (void *) k->host_start;
+			  size_t size = k->host_end - k->host_start;
+			  int page_locked_host_p = 0;
+			  if (size != 0)
+			    page_locked_host_p = gomp_page_locked_host_register_dev
+			      (devicep, ptr, size, kind & typemask);
+			  if (page_locked_host_p < 0)
+			    {
+			      gomp_mutex_unlock (&devicep->lock);
+			      exit (EXIT_FAILURE);
+			    }
+			  if (page_locked_host_p)
+			    k->page_locked_host_p = true;
+			}
 
 		      tgt_size += nca->data_row_size;
 
@@ -2120,16 +2137,44 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		 accelerator side ptrblock and copy it in.  */
 	      if (nca->ptrblock_size)
 		{
-		  void *ptrblock = gomp_malloc (nca->ptrblock_size);
+		  void *ptrblock;
+		  if (always_pinned_mode)
+		    {
+		      ptrblock
+			= gomp_page_locked_host_alloc_dev (devicep,
+							   nca->ptrblock_size,
+							   false);
+		      if (!ptrblock)
+			{
+			  gomp_mutex_unlock (&devicep->lock);
+			  exit (EXIT_FAILURE);
+			}
+		    }
+		  else
+		    ptrblock = gomp_malloc (nca->ptrblock_size);
 		  goacc_noncontig_array_create_ptrblock
 		    (nca, ptrblock, target_ptrblock);
 		  gomp_copy_host2dev (devicep, aq, target_ptrblock, ptrblock,
 				      nca->ptrblock_size, false, cbufp);
-		  if (aq)
-		    /* Free once the transfer has completed.  */
-		    devicep->openacc.async.queue_callback_func (aq, free, ptrblock);
+		  if (always_pinned_mode)
+		    {
+		      if (!gomp_page_locked_host_free_dev (devicep,
+							   ptrblock,
+							   aq))
+			{
+			  gomp_mutex_unlock (&devicep->lock);
+			  exit (EXIT_FAILURE);
+			}
+		    }
 		  else
-		    free (ptrblock);
+		    {
+		      if (aq)
+			/* Free once the transfer has completed.  */
+			devicep->openacc.async.queue_callback_func
+			  (aq, free, ptrblock);
+		      else
+			free (ptrblock);
+		    }
 		}
 	    }
 	}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-04-03 14:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-08 11:30 [PATCH 0/5] openmp: Handle pinned and unified shared memory Hafiz Abid Qadeer
2022-03-08 11:30 ` [PATCH 1/5] openmp: Add -foffload-memory Hafiz Abid Qadeer
2023-02-13 14:38   ` -foffload-memory=pinned (was: [PATCH 1/5] openmp: Add -foffload-memory) Thomas Schwinge
2023-02-13 15:20     ` Andrew Stubbs
2023-04-03 14:56       ` [og12] '-foffload-memory=pinned' using offloading device interfaces (was: -foffload-memory=pinned) Thomas Schwinge
2022-03-08 11:30 ` [PATCH 2/5] openmp: allow requires unified_shared_memory Hafiz Abid Qadeer
2022-03-08 11:30 ` [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc Hafiz Abid Qadeer
2023-02-10 14:21   ` Thomas Schwinge
2023-02-10 15:31     ` Andrew Stubbs
2023-02-16 21:24       ` [og12] Miscellaneous clean-up re OpenMP 'ompx_unified_shared_mem_space', 'ompx_host_mem_space' (was: [PATCH 3/5] openmp, nvptx: ompx_unified_shared_mem_alloc) Thomas Schwinge
2022-03-08 11:30 ` [PATCH 4/5] openmp: Use libgomp memory allocation functions with unified shared memory Hafiz Abid Qadeer
2022-04-02 12:04   ` Andrew Stubbs
2022-04-02 12:42     ` Andrew Stubbs
2022-03-08 11:30 ` [PATCH 5/5] openmp: -foffload-memory=pinned Hafiz Abid Qadeer
2022-03-30 22:40   ` Andrew Stubbs
2023-02-09 11:16   ` [og12] 'c-c++-common/gomp/alloc-pinned-1.c' -> 'libgomp.c-c++-common/alloc-pinned-1.c' (was: [PATCH 5/5] openmp: -foffload-memory=pinned) Thomas Schwinge
2022-04-13 13:14 ` [PATCH 0/5] openmp: Handle pinned and unified shared memory Andrew Stubbs
2022-04-20 13:25 ` [PATCH] openmp: Handle unified address memory Andrew Stubbs

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).