[PATCH v3 0/3] libgomp: OpenMP low-latency omp

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc
@ 2023-12-03  0:32 Andrew Stubbs
  2023-12-03  0:32 ` [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator Andrew Stubbs
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Andrew Stubbs @ 2023-12-03  0:32 UTC (permalink / raw)
  To: gcc-patches

This patch series is a rework of the patch series posted in August.

https://patchwork.sourceware.org/project/gcc/list/?series=23045&state=%2A&archive=both

The series implements device-specific allocators and adds a low-latency
allocator for both GPUs architectures.

This time the omp_low_lat_mem_alloc does not work because the default
traits are incompatible (GPU low-latency memory is not accessible to
other teams).  I've also included documentation and addressed the
comments from Tobias's review.

Andrew

Andrew Stubbs (3):
  libgomp, nvptx: low-latency memory allocator
  openmp, nvptx: low-lat memory access traits
  amdgcn, libgomp: low-latency allocator

 gcc/config/gcn/gcn-builtins.def               |   2 +
 gcc/config/gcn/gcn.cc                         |  16 +-
 libgomp/allocator.c                           | 266 +++++++-----
 libgomp/basic-allocator.c                     | 380 ++++++++++++++++++
 libgomp/config/gcn/allocator.c                | 127 ++++++
 libgomp/config/gcn/libgomp-gcn.h              |   6 +
 libgomp/config/gcn/team.c                     |  12 +
 libgomp/config/nvptx/allocator.c              | 141 +++++++
 libgomp/config/nvptx/team.c                   |  18 +
 libgomp/libgomp.h                             |   3 -
 libgomp/libgomp.texi                          |  40 +-
 libgomp/plugin/plugin-gcn.c                   |  35 +-
 libgomp/plugin/plugin-nvptx.c                 |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c     |  66 +++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c     |  72 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-3.c     |  49 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c     | 197 +++++++++
 libgomp/testsuite/libgomp.c/omp_alloc-5.c     |  71 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-6.c     | 118 ++++++
 .../testsuite/libgomp.c/omp_alloc-traits.c    |  66 +++
 20 files changed, 1595 insertions(+), 113 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/gcn/allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-traits.c

-- 
2.41.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
  2023-12-03  0:32 [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Andrew Stubbs
@ 2023-12-03  0:32 ` Andrew Stubbs
  2023-12-04 16:04   ` Tobias Burnus
  2023-12-05 11:25   ` Tobias Burnus
  2023-12-03  0:32 ` [PATCH v3 2/3] openmp, nvptx: low-lat memory access traits Andrew Stubbs
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: Andrew Stubbs @ 2023-12-03  0:32 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3315 bytes --]


This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.

For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.

libgomp/ChangeLog:

	* allocator.c (MEMSPACE_ALLOC): New macro.
	(MEMSPACE_CALLOC): New macro.
	(MEMSPACE_REALLOC): New macro.
	(MEMSPACE_FREE): New macro.
	(predefined_alloc_mapping): New array.  Add _Static_assert to match.
	(ARRAY_SIZE): New macro.
	(omp_aligned_alloc): Use MEMSPACE_ALLOC.
	Implement fall-backs for predefined allocators.  Simplify existing
	fall-backs.
	(omp_free): Use MEMSPACE_FREE.
	(omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for
	predefined allocators.  Simplify existing fall-backs.
	(omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE.
	Implement fall-backs for predefined allocators.  Simplify existing
	fall-backs.
	* config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable.
	(__nvptx_lowlat_init): New prototype.
	(gomp_nvptx_main): Call __nvptx_lowlat_init.
	* libgomp.texi: Update memory space table.
	* plugin/plugin-nvptx.c (lowlat_pool_size): New variable.
	(GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar.
	(GOMP_OFFLOAD_run): Apply lowlat_pool_size.
	* basic-allocator.c: New file.
	* config/nvptx/allocator.c: New file.
	* testsuite/libgomp.c/omp_alloc-1.c: New test.
	* testsuite/libgomp.c/omp_alloc-2.c: New test.
	* testsuite/libgomp.c/omp_alloc-3.c: New test.
	* testsuite/libgomp.c/omp_alloc-4.c: New test.
	* testsuite/libgomp.c/omp_alloc-5.c: New test.
	* testsuite/libgomp.c/omp_alloc-6.c: New test.

Co-authored-by: Kwok Cheung Yeung  <kcy@codesourcery.com>
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
---
 libgomp/allocator.c                       | 246 ++++++++------
 libgomp/basic-allocator.c                 | 380 ++++++++++++++++++++++
 libgomp/config/nvptx/allocator.c          | 120 +++++++
 libgomp/config/nvptx/team.c               |  18 +
 libgomp/libgomp.texi                      |   9 +-
 libgomp/plugin/plugin-nvptx.c             |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c |  56 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  64 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  42 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c | 196 +++++++++++
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  63 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 117 +++++++
 12 files changed, 1231 insertions(+), 103 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: v3-0001-libgomp-nvptx-low-latency-memory-allocator.patch --]
[-- Type: text/x-patch; name="v3-0001-libgomp-nvptx-low-latency-memory-allocator.patch", Size: 47432 bytes --]

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index b4e50e2ad72..fa398128368 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -37,6 +37,47 @@
 
 #define omp_max_predefined_alloc omp_thread_mem_alloc
 
+/* These macros may be overridden in config/<target>/allocator.c.
+   The following definitions (ab)use comma operators to avoid unused
+   variable errors.  */
+#ifndef MEMSPACE_ALLOC
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+  malloc (((void)(MEMSPACE), (SIZE)))
+#endif
+#ifndef MEMSPACE_CALLOC
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  calloc (1, (((void)(MEMSPACE), (SIZE))))
+#endif
+#ifndef MEMSPACE_REALLOC
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+  realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE))))
+#endif
+#ifndef MEMSPACE_FREE
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+  free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
+#endif
+
+/* Map the predefined allocators to the correct memory space.
+   The index to this table is the omp_allocator_handle_t enum value.
+   When the user calls omp_alloc with a predefined allocator this
+   table determines what memory they get.  */
+static const omp_memspace_handle_t predefined_alloc_mapping[] = {
+  omp_default_mem_space,   /* omp_null_allocator doesn't actually use this. */
+  omp_default_mem_space,   /* omp_default_mem_alloc. */
+  omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */
+  omp_const_mem_space,     /* omp_const_mem_alloc. */
+  omp_high_bw_mem_space,   /* omp_high_bw_mem_alloc. */
+  omp_low_lat_mem_space,   /* omp_low_lat_mem_alloc. */
+  omp_low_lat_mem_space,   /* omp_cgroup_mem_alloc (implementation defined). */
+  omp_low_lat_mem_space,   /* omp_pteam_mem_alloc (implementation defined). */
+  omp_low_lat_mem_space,   /* omp_thread_mem_alloc (implementation defined). */
+};
+
+#define ARRAY_SIZE(A) (sizeof (A) / sizeof ((A)[0]))
+_Static_assert (ARRAY_SIZE (predefined_alloc_mapping)
+		== omp_max_predefined_alloc + 1,
+		"predefined_alloc_mapping must match omp_memspace_handle_t");
+
 enum gomp_numa_memkind_kind
 {
   GOMP_MEMKIND_NONE = 0,
@@ -533,7 +574,7 @@ retry:
 	}
       else
 #endif
-	ptr = malloc (new_size);
+	ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size);
       if (ptr == NULL)
 	{
 #ifdef HAVE_SYNC_BUILTINS
@@ -565,7 +606,13 @@ retry:
 	}
       else
 #endif
-	ptr = malloc (new_size);
+	{
+	  omp_memspace_handle_t memspace;
+	  memspace = (allocator_data
+		      ? allocator_data->memspace
+		      : predefined_alloc_mapping[allocator]);
+	  ptr = MEMSPACE_ALLOC (memspace, new_size);
+	}
       if (ptr == NULL)
 	goto fail;
     }
@@ -582,36 +629,26 @@ retry:
   ((struct omp_mem_header *) ret)[-1].allocator = allocator;
   return ret;
 
-fail:
-  if (allocator_data)
+fail:;
+  int fallback = (allocator_data
+		  ? allocator_data->fallback
+		  : allocator == omp_default_mem_alloc
+		  ? omp_atv_null_fb
+		  : omp_atv_default_mem_fb);
+  switch (fallback)
     {
-      switch (allocator_data->fallback)
-	{
-	case omp_atv_default_mem_fb:
-	  if ((new_alignment > sizeof (void *) && new_alignment > alignment)
-#if defined(LIBGOMP_USE_MEMKIND) || defined(LIBGOMP_USE_LIBNUMA)
-	      || memkind
-#endif
-	      || (allocator_data
-		  && allocator_data->pool_size < ~(uintptr_t) 0))
-	    {
-	      allocator = omp_default_mem_alloc;
-	      goto retry;
-	    }
-	  /* Otherwise, we've already performed default mem allocation
-	     and if that failed, it won't succeed again (unless it was
-	     intermittent.  Return NULL then, as that is the fallback.  */
-	  break;
-	case omp_atv_null_fb:
-	  break;
-	default:
-	case omp_atv_abort_fb:
-	  gomp_fatal ("Out of memory allocating %lu bytes",
-		      (unsigned long) size);
-	case omp_atv_allocator_fb:
-	  allocator = allocator_data->fb_data;
-	  goto retry;
-	}
+    case omp_atv_default_mem_fb:
+      allocator = omp_default_mem_alloc;
+      goto retry;
+    case omp_atv_null_fb:
+      break;
+    default:
+    case omp_atv_abort_fb:
+      gomp_fatal ("Out of memory allocating %lu bytes",
+		  (unsigned long) size);
+    case omp_atv_allocator_fb:
+      allocator = allocator_data->fb_data;
+      goto retry;
     }
   return NULL;
 }
@@ -644,6 +681,7 @@ void
 omp_free (void *ptr, omp_allocator_handle_t allocator)
 {
   struct omp_mem_header *data;
+  omp_memspace_handle_t memspace = omp_default_mem_space;
 
   if (ptr == NULL)
     return;
@@ -683,10 +721,12 @@ omp_free (void *ptr, omp_allocator_handle_t allocator)
 	  return;
 	}
 #endif
+
+      memspace = allocator_data->memspace;
     }
-#ifdef LIBGOMP_USE_MEMKIND
   else
     {
+#ifdef LIBGOMP_USE_MEMKIND
       enum gomp_numa_memkind_kind memkind = GOMP_MEMKIND_NONE;
       if (data->allocator == omp_high_bw_mem_alloc)
 	memkind = GOMP_MEMKIND_HBW_PREFERRED;
@@ -702,9 +742,12 @@ omp_free (void *ptr, omp_allocator_handle_t allocator)
 	      return;
 	    }
 	}
-    }
 #endif
-  free (data->ptr);
+
+      memspace = predefined_alloc_mapping[data->allocator];
+    }
+
+  MEMSPACE_FREE (memspace, data->ptr, data->size);
 }
 
 ialias (omp_free)
@@ -831,7 +874,7 @@ retry:
 	}
       else
 #endif
-	ptr = calloc (1, new_size);
+	ptr = MEMSPACE_CALLOC (allocator_data->memspace, new_size);
       if (ptr == NULL)
 	{
 #ifdef HAVE_SYNC_BUILTINS
@@ -865,7 +908,13 @@ retry:
 	}
       else
 #endif
-	ptr = calloc (1, new_size);
+	{
+	  omp_memspace_handle_t memspace;
+	  memspace = (allocator_data
+		      ? allocator_data->memspace
+		      : predefined_alloc_mapping[allocator]);
+	  ptr = MEMSPACE_CALLOC (memspace, new_size);
+	}
       if (ptr == NULL)
 	goto fail;
     }
@@ -882,36 +931,26 @@ retry:
   ((struct omp_mem_header *) ret)[-1].allocator = allocator;
   return ret;
 
-fail:
-  if (allocator_data)
+fail:;
+  int fallback = (allocator_data
+		  ? allocator_data->fallback
+		  : allocator == omp_default_mem_alloc
+		  ? omp_atv_null_fb
+		  : omp_atv_default_mem_fb);
+  switch (fallback)
     {
-      switch (allocator_data->fallback)
-	{
-	case omp_atv_default_mem_fb:
-	  if ((new_alignment > sizeof (void *) && new_alignment > alignment)
-#if defined(LIBGOMP_USE_MEMKIND) || defined(LIBGOMP_USE_LIBNUMA)
-	      || memkind
-#endif
-	      || (allocator_data
-		  && allocator_data->pool_size < ~(uintptr_t) 0))
-	    {
-	      allocator = omp_default_mem_alloc;
-	      goto retry;
-	    }
-	  /* Otherwise, we've already performed default mem allocation
-	     and if that failed, it won't succeed again (unless it was
-	     intermittent.  Return NULL then, as that is the fallback.  */
-	  break;
-	case omp_atv_null_fb:
-	  break;
-	default:
-	case omp_atv_abort_fb:
-	  gomp_fatal ("Out of memory allocating %lu bytes",
-		      (unsigned long) (size * nmemb));
-	case omp_atv_allocator_fb:
-	  allocator = allocator_data->fb_data;
-	  goto retry;
-	}
+    case omp_atv_default_mem_fb:
+      allocator = omp_default_mem_alloc;
+      goto retry;
+    case omp_atv_null_fb:
+      break;
+    default:
+    case omp_atv_abort_fb:
+      gomp_fatal ("Out of memory allocating %lu bytes",
+		  (unsigned long) (size * nmemb));
+    case omp_atv_allocator_fb:
+      allocator = allocator_data->fb_data;
+      goto retry;
     }
   return NULL;
 }
@@ -1101,9 +1140,10 @@ retry:
       else
 #endif
       if (prev_size)
-	new_ptr = realloc (data->ptr, new_size);
+	new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr,
+				    data->size, new_size);
       else
-	new_ptr = malloc (new_size);
+	new_ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size);
       if (new_ptr == NULL)
 	{
 #ifdef HAVE_SYNC_BUILTINS
@@ -1151,7 +1191,13 @@ retry:
 	}
       else
 #endif
-	new_ptr = realloc (data->ptr, new_size);
+	{
+	  omp_memspace_handle_t memspace;
+	  memspace = (allocator_data
+		      ? allocator_data->memspace
+		      : predefined_alloc_mapping[allocator]);
+	  new_ptr = MEMSPACE_REALLOC (memspace, data->ptr, data->size, new_size);
+	}
       if (new_ptr == NULL)
 	goto fail;
       ret = (char *) new_ptr + sizeof (struct omp_mem_header);
@@ -1178,7 +1224,13 @@ retry:
 	}
       else
 #endif
-	new_ptr = malloc (new_size);
+	{
+	  omp_memspace_handle_t memspace;
+	  memspace = (allocator_data
+		      ? allocator_data->memspace
+		      : predefined_alloc_mapping[allocator]);
+	  new_ptr = MEMSPACE_ALLOC (memspace, new_size);
+	}
       if (new_ptr == NULL)
 	goto fail;
     }
@@ -1227,39 +1279,35 @@ retry:
       return ret;
     }
 #endif
-  free (data->ptr);
+  {
+    omp_memspace_handle_t was_memspace;
+    was_memspace = (free_allocator_data
+		    ? free_allocator_data->memspace
+		    : predefined_alloc_mapping[free_allocator]);
+    MEMSPACE_FREE (was_memspace, data->ptr, data->size);
+  }
   return ret;
 
-fail:
-  if (allocator_data)
+fail:;
+  int fallback = (allocator_data
+		  ? allocator_data->fallback
+		  : allocator == omp_default_mem_alloc
+		  ? omp_atv_null_fb
+		  : omp_atv_default_mem_fb);
+  switch (fallback)
     {
-      switch (allocator_data->fallback)
-	{
-	case omp_atv_default_mem_fb:
-	  if (new_alignment > sizeof (void *)
-#if defined(LIBGOMP_USE_MEMKIND) || defined(LIBGOMP_USE_LIBNUMA)
-	      || memkind
-#endif
-	      || (allocator_data
-		  && allocator_data->pool_size < ~(uintptr_t) 0))
-	    {
-	      allocator = omp_default_mem_alloc;
-	      goto retry;
-	    }
-	  /* Otherwise, we've already performed default mem allocation
-	     and if that failed, it won't succeed again (unless it was
-	     intermittent.  Return NULL then, as that is the fallback.  */
-	  break;
-	case omp_atv_null_fb:
-	  break;
-	default:
-	case omp_atv_abort_fb:
-	  gomp_fatal ("Out of memory allocating %lu bytes",
-		      (unsigned long) size);
-	case omp_atv_allocator_fb:
-	  allocator = allocator_data->fb_data;
-	  goto retry;
-	}
+    case omp_atv_default_mem_fb:
+      allocator = omp_default_mem_alloc;
+      goto retry;
+    case omp_atv_null_fb:
+      break;
+    default:
+    case omp_atv_abort_fb:
+      gomp_fatal ("Out of memory allocating %lu bytes",
+		  (unsigned long) size);
+    case omp_atv_allocator_fb:
+      allocator = allocator_data->fb_data;
+      goto retry;
     }
   return NULL;
 }
diff --git a/libgomp/basic-allocator.c b/libgomp/basic-allocator.c
new file mode 100644
index 00000000000..35c7439bed6
--- /dev/null
+++ b/libgomp/basic-allocator.c
@@ -0,0 +1,380 @@
+/* Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This is a basic "malloc" implementation intended for use with small,
+   low-latency memories.
+
+   To use this template, define BASIC_ALLOC_PREFIX, and then #include the
+   source file.  The other configuration macros are optional.
+
+   The root heap descriptor is stored in the first bytes of the heap, and each
+   free chunk contains a similar descriptor for the next free chunk in the
+   chain.
+
+   The descriptor is two values: offset and size, which describe the
+   location of a chunk of memory available for allocation. The offset is
+   relative to the base of the heap.  The special offset value 0xffffffff
+   indicates that the heap (free chain) is locked.  The offset and size are
+   32-bit values so the base alignment can be 8-bytes.
+
+   Memory is allocated to the first free chunk that fits.  The free chain
+   is always stored in order of the offset to assist coalescing adjacent
+   chunks.  */
+
+#include "libgomp.h"
+
+#ifndef BASIC_ALLOC_PREFIX
+#error "BASIC_ALLOC_PREFIX not defined."
+#endif
+
+#ifndef BASIC_ALLOC_YIELD
+#define BASIC_ALLOC_YIELD
+#endif
+
+#define ALIGN(VAR) (((VAR) + 7) & ~7)    /* 8-byte granularity.  */
+
+#define fn1(prefix, name) prefix ## _ ## name
+#define fn(prefix, name) fn1 (prefix, name)
+#define basic_alloc_init fn(BASIC_ALLOC_PREFIX,init)
+#define basic_alloc_alloc fn(BASIC_ALLOC_PREFIX,alloc)
+#define basic_alloc_calloc fn(BASIC_ALLOC_PREFIX,calloc)
+#define basic_alloc_free fn(BASIC_ALLOC_PREFIX,free)
+#define basic_alloc_realloc fn(BASIC_ALLOC_PREFIX,realloc)
+
+typedef struct {
+  uint32_t offset;
+  uint32_t size;
+} heapdesc;
+
+void
+basic_alloc_init (char *heap, size_t limit)
+{
+  if (heap == NULL)
+    return;
+
+  /* Initialize the head of the free chain.  */
+  heapdesc *root = (heapdesc *) heap;
+  root->offset = ALIGN(1);
+  root->size = limit - root->offset;
+
+  /* And terminate the chain.  */
+  heapdesc *next = (heapdesc *) (heap + root->offset);
+  next->offset = 0;
+  next->size = 0;
+}
+
+static void *
+basic_alloc_alloc (char *heap, size_t size)
+{
+  if (heap == NULL)
+    return NULL;
+
+  /* Memory is allocated in N-byte granularity.  */
+  size = ALIGN (size);
+
+  /* Acquire a lock on the low-latency heap.  */
+  heapdesc root, *root_ptr = (heapdesc *) heap;
+  do
+    {
+      root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, 
+					 MEMMODEL_ACQUIRE);
+      if (root.offset != 0xffffffff)
+	{
+	  root.size = root_ptr->size;
+	  break;
+	}
+      /* Spin.  */
+      BASIC_ALLOC_YIELD;
+    }
+  while (1);
+
+  /* Walk the free chain.  */
+  heapdesc chunk = root;
+  heapdesc *prev_chunkptr = NULL;
+  heapdesc *chunkptr = (heapdesc *) (heap + chunk.offset);
+  heapdesc onward_chain = *chunkptr;
+  while (chunk.size != 0 && (uint32_t) size > chunk.size)
+    {
+      chunk = onward_chain;
+      prev_chunkptr = chunkptr;
+      chunkptr = (heapdesc *) (heap + chunk.offset);
+      onward_chain = *chunkptr;
+    }
+
+  void *result = NULL;
+  if (chunk.size != 0)
+    {
+      /* Allocation successful.  */
+      result = chunkptr;
+
+      /* Update the free chain.  */
+      heapdesc stillfree = chunk;
+      stillfree.offset += size;
+      stillfree.size -= size;
+      heapdesc *stillfreeptr = (heapdesc *) (heap + stillfree.offset);
+
+      if (stillfree.size == 0)
+	/* The whole chunk was used.  */
+	stillfree = onward_chain;
+      else
+	/* The chunk was split, so restore the onward chain.  */
+	*stillfreeptr = onward_chain;
+
+      /* The previous free slot or root now points to stillfree.  */
+      if (prev_chunkptr)
+	*prev_chunkptr = stillfree;
+      else
+	root = stillfree;
+    }
+
+  /* Update the free chain root and release the lock.  */
+  root_ptr->size = root.size;
+  __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE);
+
+  return result;
+}
+
+static void *
+basic_alloc_calloc (char *heap, size_t size)
+{
+  /* Memory is allocated in N-byte granularity.  */
+  size = ALIGN (size);
+
+  uint64_t *result = basic_alloc_alloc (heap, size);
+  if (result)
+    /* Inline memset in which we know size is a multiple of 8.  */
+    for (unsigned i = 0; i < (unsigned) size / 8; i++)
+    result[i] = 0;
+
+  return result;
+}
+
+static void
+basic_alloc_free (char *heap, void *addr, size_t size)
+{
+  /* Memory is allocated in N-byte granularity.  */
+  size = ALIGN (size);
+
+  /* Acquire a lock on the low-latency heap.  */
+  heapdesc root, *root_ptr = (heapdesc *) heap;
+  do
+    {
+      root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, 
+					 MEMMODEL_ACQUIRE);
+      if (root.offset != 0xffffffff)
+	{
+	  root.size = root_ptr->size;
+	  break;
+	}
+      /* Spin.  */
+    }
+  while (1);
+
+  /* Walk the free chain to find where to insert a new entry.  */
+  heapdesc chunk = root, prev_chunk = {0};
+  heapdesc *prev_chunkptr = NULL, *prevprev_chunkptr = NULL;
+  heapdesc *chunkptr = (heapdesc *) (heap + chunk.offset);
+  heapdesc onward_chain = *chunkptr;
+  while (chunk.size != 0 && addr > (void *) chunkptr)
+    {
+      prev_chunk = chunk;
+      chunk = onward_chain;
+      prevprev_chunkptr = prev_chunkptr;
+      prev_chunkptr = chunkptr;
+      chunkptr = (heapdesc *) (heap + chunk.offset);
+      onward_chain = *chunkptr;
+    }
+
+  /* Create the new chunk descriptor.  */
+  heapdesc newfreechunk;
+  newfreechunk.offset = (uint32_t) ((uintptr_t) addr - (uintptr_t) heap);
+  newfreechunk.size = (uint32_t) size;
+
+  /* Coalesce adjacent free chunks.  */
+  if (newfreechunk.offset + size == chunk.offset)
+    {
+      /* Free chunk follows.  */
+      newfreechunk.size += chunk.size;
+      chunk = onward_chain;
+    }
+  if (prev_chunkptr)
+    {
+      if (prev_chunk.offset + prev_chunk.size
+	  == newfreechunk.offset)
+	{
+	  /* Free chunk precedes.  */
+	  newfreechunk.offset = prev_chunk.offset;
+	  newfreechunk.size += prev_chunk.size;
+	  addr = heap + prev_chunk.offset;
+	  prev_chunkptr = prevprev_chunkptr;
+	}
+    }
+
+  /* Update the free chain in the new and previous chunks.  */
+  *(heapdesc *) addr = chunk;
+  if (prev_chunkptr)
+    *prev_chunkptr = newfreechunk;
+  else
+    root = newfreechunk;
+
+  /* Update the free chain root and release the lock.  */
+  root_ptr->size = root.size;
+  __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE);
+
+}
+
+static void *
+basic_alloc_realloc (char *heap, void *addr, size_t oldsize,
+				      size_t size)
+{
+  /* Memory is allocated in N-byte granularity.  */
+  oldsize = ALIGN (oldsize);
+  size = ALIGN (size);
+
+  if (oldsize == size)
+    return addr;
+
+  /* Acquire a lock on the low-latency heap.  */
+  heapdesc root, *root_ptr = (heapdesc *) heap;
+  do
+    {
+      root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, 
+					 MEMMODEL_ACQUIRE);
+      if (root.offset != 0xffffffff)
+	{
+	  root.size = root_ptr->size;
+	  break;
+	}
+      /* Spin.  */
+    }
+  while (1);
+
+  /* Walk the free chain.  */
+  heapdesc chunk = root;
+  heapdesc *prev_chunkptr = NULL;
+  heapdesc *chunkptr = (heapdesc *) (heap + chunk.offset);
+  heapdesc onward_chain = *chunkptr;
+  while (chunk.size != 0 && (void *) chunkptr < addr)
+    {
+      chunk = onward_chain;
+      prev_chunkptr = chunkptr;
+      chunkptr = (heapdesc *) (heap + chunk.offset);
+      onward_chain = *chunkptr;
+    }
+
+  void *result = NULL;
+  if (size < oldsize)
+    {
+      /* The new allocation is smaller than the old; we can always
+	 shrink an allocation in place.  */
+      result = addr;
+
+      heapdesc *nowfreeptr = (heapdesc *) (addr + size);
+
+      /* Update the free chain.  */
+      heapdesc nowfree;
+      nowfree.offset = (char *) nowfreeptr - heap;
+      nowfree.size = oldsize - size;
+
+      if (nowfree.offset + size == chunk.offset)
+	{
+	  /* Coalesce following free chunk.  */
+	  nowfree.size += chunk.size;
+	  *nowfreeptr = onward_chain;
+	}
+      else
+	*nowfreeptr = chunk;
+
+      /* The previous free slot or root now points to nowfree.  */
+      if (prev_chunkptr)
+	*prev_chunkptr = nowfree;
+      else
+	root = nowfree;
+    }
+  else if (chunk.size != 0
+	   && (char *) addr + oldsize == (char *) chunkptr
+	   && chunk.size >= size-oldsize)
+    {
+      /* The new allocation is larger than the old, and we found a
+	 large enough free block right after the existing block,
+	 so we extend into that space.  */
+      result = addr;
+
+      uint32_t delta = size-oldsize;
+
+      /* Update the free chain.  */
+      heapdesc stillfree = chunk;
+      stillfree.offset += delta;
+      stillfree.size -= delta;
+      heapdesc *stillfreeptr = (heapdesc *) (heap + stillfree.offset);
+
+      if (stillfree.size == 0)
+	/* The whole chunk was used.  */
+	stillfree = onward_chain;
+      else
+	/* The chunk was split, so restore the onward chain.  */
+	*stillfreeptr = onward_chain;
+
+      /* The previous free slot or root now points to stillfree.  */
+      if (prev_chunkptr)
+	*prev_chunkptr = stillfree;
+      else
+	root = stillfree;
+    }
+  /* Else realloc in-place has failed and result remains NULL.  */
+
+  /* Update the free chain root and release the lock.  */
+  root_ptr->size = root.size;
+  __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE);
+
+  if (result == NULL)
+    {
+      /* The allocation could not be extended in place, so we simply
+	 allocate fresh memory and move the data.  If we can't allocate
+	 from low-latency memory then we leave the original alloaction
+	 intact and return NULL.
+	 We could do a fall-back to main memory, but we don't know what
+	 the fall-back trait said to do.  */
+      result = basic_alloc_alloc (heap, size);
+      if (result != NULL)
+	{
+	  /* Inline memcpy in which we know oldsize is a multiple of 8.  */
+	  uint64_t *from = addr, *to = result;
+	  for (unsigned i = 0; i < (unsigned) oldsize / 8; i++)
+	    to[i] = from[i];
+
+	  basic_alloc_free (heap, addr, oldsize);
+	}
+    }
+
+  return result;
+}
+
+#undef ALIGN
+#undef fn1
+#undef fn
+#undef basic_alloc_init
+#undef basic_alloc_alloc
+#undef basic_alloc_free
+#undef basic_alloc_realloc
diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c
new file mode 100644
index 00000000000..6014fba177f
--- /dev/null
+++ b/libgomp/config/nvptx/allocator.c
@@ -0,0 +1,120 @@
+/* Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The low-latency allocators use space reserved in .shared memory when the
+   kernel is launched.  The heap is initialized in gomp_nvptx_main and all
+   allocations are forgotten when the kernel exits.  Allocations to other
+   memory spaces all use the system malloc syscall.
+
+   The root heap descriptor is stored elsewhere in shared memory, and each
+   free chunk contains a similar descriptor for the next free chunk in the
+   chain.
+
+   The descriptor is two 16-bit values: offset and size, which describe the
+   location of a chunk of memory available for allocation. The offset is
+   relative to the base of the heap.  The special value 0xffff, 0xffff
+   indicates that the heap is locked.  The descriptor is encoded into a
+   single 32-bit integer so that it may be easily accessed atomically.
+
+   Memory is allocated to the first free chunk that fits.  The free chain
+   is always stored in order of the offset to assist coalescing adjacent
+   chunks.  */
+
+#include "libgomp.h"
+#include <stdlib.h>
+
+#define BASIC_ALLOC_PREFIX __nvptx_lowlat
+#include "../../basic-allocator.c"
+
+/* There should be some .shared space reserved for us.  There's no way to
+   express this magic extern sizeless array in C so use asm.  */
+asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n");
+
+static void *
+nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool;
+      asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool));
+
+      return __nvptx_lowlat_alloc (shared_pool, size);
+    }
+  else
+    return malloc (size);
+}
+
+static void *
+nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool;
+      asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool));
+
+      return __nvptx_lowlat_calloc (shared_pool, size);
+    }
+  else
+    return calloc (1, size);
+}
+
+static void
+nvptx_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool;
+      asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool));
+
+      __nvptx_lowlat_free (shared_pool, addr, size);
+    }
+  else
+    free (addr);
+}
+
+static void *
+nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
+			size_t oldsize, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool;
+      asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool));
+
+      return __nvptx_lowlat_realloc (shared_pool, addr, oldsize, size);
+    }
+  else
+    return realloc (addr, size);
+}
+
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+  nvptx_memspace_alloc (MEMSPACE, SIZE)
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  nvptx_memspace_calloc (MEMSPACE, SIZE)
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+  nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+  nvptx_memspace_free (MEMSPACE, ADDR, SIZE)
+
+#include "../../allocator.c"
diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
index 59521fabd99..9243774e41a 100644
--- a/libgomp/config/nvptx/team.c
+++ b/libgomp/config/nvptx/team.c
@@ -37,6 +37,12 @@ int __gomp_team_num __attribute__((shared,nocommon));
 static void gomp_thread_start (struct gomp_thread_pool *);
 extern void build_indirect_map (void);
 
+/* There should be some .shared space reserved for us.  There's no way to
+   express this magic extern sizeless array in C so use asm.  */
+asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n");
+
+/* Defined in basic-allocator.c via config/nvptx/allocator.c.  */
+void __nvptx_lowlat_init (void *heap, size_t size);
 
 /* This externally visible function handles target region entry.  It
    sets up a per-team thread pool and transfers control by calling FN (FN_DATA)
@@ -68,6 +74,18 @@ gomp_nvptx_main (void (*fn) (void *), void *fn_data)
       nvptx_thrs = alloca (ntids * sizeof (*nvptx_thrs));
       memset (nvptx_thrs, 0, ntids * sizeof (*nvptx_thrs));
 
+      /* Find the low-latency heap details ....  */
+      uint32_t *shared_pool;
+      uint32_t shared_pool_size = 0;
+      asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool));
+#if __PTX_ISA_VERSION_MAJOR__ > 4 \
+    || (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MINOR__ >= 1)
+      asm ("mov.u32\t%0, %%dynamic_smem_size;\n"
+	   : "=r"(shared_pool_size));
+#endif
+      __nvptx_lowlat_init (shared_pool, shared_pool_size);
+
+      /* Initialize the thread pool.  */
       struct gomp_thread_pool *pool = alloca (sizeof (*pool));
       pool->threads = alloca (ntids * sizeof (*pool->threads));
       for (tid = 0; tid < ntids; tid++)
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index e5fe7af76af..39d0749e7b3 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,11 +3012,14 @@ value.
 @item omp_const_mem_alloc       @tab omp_const_mem_space
 @item omp_high_bw_mem_alloc     @tab omp_high_bw_mem_space
 @item omp_low_lat_mem_alloc     @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc      @tab --
-@item omp_pteam_mem_alloc       @tab --
-@item omp_thread_mem_alloc      @tab --
+@item omp_cgroup_mem_alloc      @tab omp_low_lat_mem_space (implementation defined)
+@item omp_pteam_mem_alloc       @tab omp_low_lat_mem_space (implementation defined)
+@item omp_thread_mem_alloc      @tab omp_low_lat_mem_space (implementation defined)
 @end multitable
 
+The @code{omp_low_lat_mem_space} is only available on supported devices.
+See @ref{Offload-Target Specifics}.
+
 The predefined allocators use the default values for the traits,
 as listed below.  Except that the last three allocators have the
 @code{access} trait set to @code{cgroup}, @code{pteam}, and
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 0548e7e09e5..d4a254ed4f0 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -341,6 +341,11 @@ struct ptx_device
 
 static struct ptx_device **ptx_devices;
 
+/* OpenMP kernels reserve a small amount of ".shared" space for use by
+   omp_alloc.  The size is configured using GOMP_NVPTX_LOWLAT_POOL, but the
+   default is set here.  */
+static unsigned lowlat_pool_size = 8 * 1024;
+
 static inline struct nvptx_thread *
 nvptx_thread (void)
 {
@@ -1219,6 +1224,22 @@ GOMP_OFFLOAD_init_device (int n)
       instantiated_devices++;
     }
 
+  const char *var_name = "GOMP_NVPTX_LOWLAT_POOL";
+  const char *env_var = secure_getenv (var_name);
+  notify_var (var_name, env_var);
+
+  if (env_var != NULL)
+    {
+      char *endptr;
+      unsigned long val = strtoul (env_var, &endptr, 10);
+      if (endptr == NULL || *endptr != '\0'
+	  || errno == ERANGE || errno == EINVAL
+	  || val > UINT_MAX)
+	GOMP_PLUGIN_error ("Error parsing %s", var_name);
+      else
+	lowlat_pool_size = val;
+    }
+
   pthread_mutex_unlock (&ptx_dev_lock);
 
   return dev != NULL;
@@ -2178,7 +2199,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
-			 32, threads, 1, 0, NULL, NULL, config);
+			 32, threads, 1, lowlat_pool_size, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
   if (reverse_offload)
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-1.c b/libgomp/testsuite/libgomp.c/omp_alloc-1.c
new file mode 100644
index 00000000000..f4e594f1e98
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-1.c
@@ -0,0 +1,56 @@
+/* { dg-do run } */
+
+/* Test that omp_alloc returns usable memory.  */
+
+#include <omp.h>
+
+#pragma omp requires dynamic_allocators
+
+void
+test (int n, omp_allocator_handle_t allocator)
+{
+  #pragma omp target map(to:n) map(to:allocator)
+  {
+    int *a;
+    a = (int *) omp_alloc (n * sizeof (int), allocator);
+
+    #pragma omp parallel
+    for (int i = 0; i < n; i++)
+      a[i] = i;
+
+    for (int i = 0; i < n; i++)
+      if (a[i] != i)
+	{
+	  __builtin_printf ("data mismatch at %i\n", i);
+	  __builtin_abort ();
+	}
+
+    omp_free (a, allocator);
+  }
+}
+
+int
+main ()
+{
+  // Smaller than low-latency memory limit
+  test (10, omp_default_mem_alloc);
+  test (10, omp_large_cap_mem_alloc);
+  test (10, omp_const_mem_alloc);
+  test (10, omp_high_bw_mem_alloc);
+  test (10, omp_low_lat_mem_alloc);
+  test (10, omp_cgroup_mem_alloc);
+  test (10, omp_pteam_mem_alloc);
+  test (10, omp_thread_mem_alloc);
+
+  // Larger than low-latency memory limit
+  test (100000, omp_default_mem_alloc);
+  test (100000, omp_large_cap_mem_alloc);
+  test (100000, omp_const_mem_alloc);
+  test (100000, omp_high_bw_mem_alloc);
+  test (100000, omp_low_lat_mem_alloc);
+  test (100000, omp_cgroup_mem_alloc);
+  test (100000, omp_pteam_mem_alloc);
+  test (100000, omp_thread_mem_alloc);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-2.c b/libgomp/testsuite/libgomp.c/omp_alloc-2.c
new file mode 100644
index 00000000000..e9fd1602946
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-2.c
@@ -0,0 +1,64 @@
+/* { dg-do run } */
+
+/* Test concurrent and repeated allocations.  */
+
+#include <omp.h>
+
+#pragma omp requires dynamic_allocators
+
+void
+test (int n, omp_allocator_handle_t allocator)
+{
+  #pragma omp target map(to:n) map(to:allocator)
+  {
+    int **a;
+    a = (int **) omp_alloc (n * sizeof (int *), allocator);
+
+    #pragma omp parallel for
+    for (int i = 0; i < n; i++)
+      {
+	/*Use 10x to ensure we do activate low-latency fall-back.  */
+	a[i] = omp_alloc (sizeof (int) * 10, allocator);
+	a[i][0] = i;
+      }
+
+    for (int i = 0; i < n; i++)
+      if (a[i][0] != i)
+	{
+	  __builtin_printf ("data mismatch at %i\n", i);
+	  __builtin_abort ();
+	}
+
+    #pragma omp parallel for
+    for (int i = 0; i < n; i++)
+      omp_free (a[i], allocator);
+
+    omp_free (a, allocator);
+  }
+}
+
+int
+main ()
+{
+  // Smaller than low-latency memory limit
+  test (10, omp_default_mem_alloc);
+  test (10, omp_large_cap_mem_alloc);
+  test (10, omp_const_mem_alloc);
+  test (10, omp_high_bw_mem_alloc);
+  test (10, omp_low_lat_mem_alloc);
+  test (10, omp_cgroup_mem_alloc);
+  test (10, omp_pteam_mem_alloc);
+  test (10, omp_thread_mem_alloc);
+
+  // Larger than low-latency memory limit (on aggregate)
+  test (1000, omp_default_mem_alloc);
+  test (1000, omp_large_cap_mem_alloc);
+  test (1000, omp_const_mem_alloc);
+  test (1000, omp_high_bw_mem_alloc);
+  test (1000, omp_low_lat_mem_alloc);
+  test (1000, omp_cgroup_mem_alloc);
+  test (1000, omp_pteam_mem_alloc);
+  test (1000, omp_thread_mem_alloc);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-3.c b/libgomp/testsuite/libgomp.c/omp_alloc-3.c
new file mode 100644
index 00000000000..792e2200f30
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-3.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+
+/* Stress-test omp_alloc/omp_malloc under concurrency.  */
+
+#include <omp.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#pragma omp requires dynamic_allocators
+
+#define N 1000
+
+void
+test (omp_allocator_handle_t allocator)
+{
+  #pragma omp target map(to:allocator)
+  {
+    #pragma omp parallel for
+    for (int i = 0; i < N; i++)
+      for (int j = 0; j < N; j++)
+	{
+	  int *p = omp_alloc (sizeof (int), allocator);
+	  omp_free (p, allocator);
+	}
+  }
+}
+
+int
+main ()
+{
+  // Smaller than low-latency memory limit
+  test (omp_default_mem_alloc);
+  test (omp_large_cap_mem_alloc);
+  test (omp_const_mem_alloc);
+  test (omp_high_bw_mem_alloc);
+  test (omp_low_lat_mem_alloc);
+  test (omp_cgroup_mem_alloc);
+  test (omp_pteam_mem_alloc);
+  test (omp_thread_mem_alloc);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-4.c b/libgomp/testsuite/libgomp.c/omp_alloc-4.c
new file mode 100644
index 00000000000..66e13c09234
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-4.c
@@ -0,0 +1,196 @@
+/* { dg-do run } */
+
+/* Test that low-latency free chains are sound.  */
+
+#include <stddef.h>
+#include <omp.h>
+
+#pragma omp requires dynamic_allocators
+
+void
+check (int cond, const char *msg)
+{
+  if (!cond)
+    {
+      __builtin_printf ("%s\n", msg);
+      __builtin_abort ();
+    }
+}
+
+int
+main ()
+{
+  #pragma omp target
+  {
+    /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
+    omp_alloctrait_t traits[1]
+      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
+							1, traits);
+
+    int size = 4;
+
+    char *a = omp_alloc (size, lowlat);
+    char *b = omp_alloc (size, lowlat);
+    char *c = omp_alloc (size, lowlat);
+    char *d = omp_alloc (size, lowlat);
+
+    /* There are headers and padding to account for.  */
+    int size2 = size + (b-a);
+    int size3 = size + (c-a);
+    int size4 = size + (d-a) + 100; /* Random larger amount.  */
+
+    check (a != NULL && b != NULL && c != NULL && d != NULL,
+	   "omp_alloc returned NULL\n");
+
+    omp_free (a, lowlat);
+    char *p = omp_alloc (size, lowlat);
+    check (p == a, "allocate did not reuse first chunk");
+
+    omp_free (b, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not reuse second chunk");
+
+    omp_free (c, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not reuse third chunk");
+
+    omp_free (a, lowlat);
+    omp_free (b, lowlat);
+    p = omp_alloc (size2, lowlat);
+    check (p == a, "allocate did not coalesce first two chunks");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == a, "allocate did not split first chunk (1)");
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split first chunk (2)");
+
+    omp_free (b, lowlat);
+    omp_free (c, lowlat);
+    p = omp_alloc (size2, lowlat);
+    check (p == b, "allocate did not coalesce middle two chunks");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split second chunk (1)");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split second chunk (2)");
+
+    omp_free (b, lowlat);
+    omp_free (a, lowlat);
+    p = omp_alloc (size2, lowlat);
+    check (p == a, "allocate did not coalesce first two chunks, reverse free");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == a, "allocate did not split first chunk (1), reverse free");
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split first chunk (2), reverse free");
+
+    omp_free (c, lowlat);
+    omp_free (b, lowlat);
+    p = omp_alloc (size2, lowlat);
+    check (p == b, "allocate did not coalesce second two chunks, reverse free");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split second chunk (1), reverse free");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split second chunk (2), reverse free");
+
+    omp_free (a, lowlat);
+    omp_free (b, lowlat);
+    omp_free (c, lowlat);
+    p = omp_alloc (size3, lowlat);
+    check (p == a, "allocate did not coalesce first three chunks");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == a, "allocate did not split first chunk (1)");
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split first chunk (2)");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split first chunk (3)");
+
+    omp_free (b, lowlat);
+    omp_free (c, lowlat);
+    omp_free (d, lowlat);
+    p = omp_alloc (size3, lowlat);
+    check (p == b, "allocate did not coalesce last three chunks");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split second chunk (1)");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split second chunk (2)");
+    p = omp_alloc (size, lowlat);
+    check (p == d, "allocate did not split second chunk (3)");
+
+    omp_free (c, lowlat);
+    omp_free (b, lowlat);
+    omp_free (a, lowlat);
+    p = omp_alloc (size3, lowlat);
+    check (p == a, "allocate did not coalesce first three chunks, reverse free");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == a, "allocate did not split first chunk (1), reverse free");
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split first chunk (2), reverse free");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split first chunk (3), reverse free");
+
+    omp_free (d, lowlat);
+    omp_free (c, lowlat);
+    omp_free (b, lowlat);
+    p = omp_alloc (size3, lowlat);
+    check (p == b, "allocate did not coalesce second three chunks, reverse free");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split second chunk (1), reverse free");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split second chunk (2), reverse free");
+    p = omp_alloc (size, lowlat);
+    check (p == d, "allocate did not split second chunk (3), reverse free");
+
+    omp_free (c, lowlat);
+    omp_free (a, lowlat);
+    omp_free (b, lowlat);
+    p = omp_alloc (size3, lowlat);
+    check (p == a, "allocate did not coalesce first three chunks, mixed free");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == a, "allocate did not split first chunk (1), mixed free");
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split first chunk (2), mixed free");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split first chunk (3), mixed free");
+
+    omp_free (d, lowlat);
+    omp_free (b, lowlat);
+    omp_free (c, lowlat);
+    p = omp_alloc (size3, lowlat);
+    check (p == b, "allocate did not coalesce second three chunks, mixed free");
+
+    omp_free (p, lowlat);
+    p = omp_alloc (size, lowlat);
+    check (p == b, "allocate did not split second chunk (1), mixed free");
+    p = omp_alloc (size, lowlat);
+    check (p == c, "allocate did not split second chunk (2), mixed free");
+    p = omp_alloc (size, lowlat);
+    check (p == d, "allocate did not split second chunk (3), mixed free");
+
+    omp_free (a, lowlat);
+    omp_free (b, lowlat);
+    omp_free (c, lowlat);
+    omp_free (d, lowlat);
+    p = omp_alloc (size4, lowlat);
+    check (p == a, "allocate did not coalesce all memory");
+  }
+
+  return 0;
+}
+
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-5.c b/libgomp/testsuite/libgomp.c/omp_alloc-5.c
new file mode 100644
index 00000000000..10805ded6d0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-5.c
@@ -0,0 +1,63 @@
+/* { dg-do run } */
+
+/* Test calloc with omp_alloc.  */
+
+#include <omp.h>
+
+#pragma omp requires dynamic_allocators
+
+void
+test (int n, omp_allocator_handle_t allocator)
+{
+  #pragma omp target map(to:n) map(to:allocator)
+  {
+    int *a;
+    a = (int *) omp_calloc (n, sizeof (int), allocator);
+
+    for (int i = 0; i < n; i++)
+      if (a[i] != 0)
+	{
+	  __builtin_printf ("memory not zeroed at %i\n", i);
+	  __builtin_abort ();
+	}
+
+    #pragma omp parallel
+    for (int i = 0; i < n; i++)
+      a[i] = i;
+
+    for (int i = 0; i < n; i++)
+      if (a[i] != i)
+	{
+	  __builtin_printf ("data mismatch at %i\n", i);
+	  __builtin_abort ();
+	}
+
+    omp_free (a, allocator);
+  }
+}
+
+int
+main ()
+{
+  // Smaller than low-latency memory limit
+  test (10, omp_default_mem_alloc);
+  test (10, omp_large_cap_mem_alloc);
+  test (10, omp_const_mem_alloc);
+  test (10, omp_high_bw_mem_alloc);
+  test (10, omp_low_lat_mem_alloc);
+  test (10, omp_cgroup_mem_alloc);
+  test (10, omp_pteam_mem_alloc);
+  test (10, omp_thread_mem_alloc);
+
+  // Larger than low-latency memory limit
+  test (100000, omp_default_mem_alloc);
+  test (100000, omp_large_cap_mem_alloc);
+  test (100000, omp_const_mem_alloc);
+  test (100000, omp_high_bw_mem_alloc);
+  test (100000, omp_low_lat_mem_alloc);
+  test (100000, omp_cgroup_mem_alloc);
+  test (100000, omp_pteam_mem_alloc);
+  test (100000, omp_thread_mem_alloc);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-6.c b/libgomp/testsuite/libgomp.c/omp_alloc-6.c
new file mode 100644
index 00000000000..66bf69b0455
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-6.c
@@ -0,0 +1,117 @@
+/* { dg-do run } */
+
+/* Test that low-latency realloc and free chains are sound.  */
+
+#include <stddef.h>
+#include <omp.h>
+
+#pragma omp requires dynamic_allocators
+
+void
+check (int cond, const char *msg)
+{
+  if (!cond)
+    {
+      __builtin_printf ("%s\n", msg);
+      __builtin_abort ();
+    }
+}
+
+int
+main ()
+{
+  #pragma omp target
+  {
+    /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
+    omp_alloctrait_t traits[1]
+      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
+							1, traits);
+
+    int size = 16;
+
+    char *a = (char *) omp_alloc (size, lowlat);
+    char *b = (char *) omp_alloc (size, lowlat);
+    char *c = (char *) omp_alloc (size, lowlat);
+    char *d = (char *) omp_alloc (size, lowlat);
+
+    /* There are headers and padding to account for.  */
+    int size2 = size + (b-a);
+    int size3 = size + (c-a);
+    int size4 = size + (d-a) + 100; /* Random larger amount.  */
+
+    check (a != NULL && b != NULL && c != NULL && d != NULL,
+	   "omp_alloc returned NULL\n");
+
+    char *p = omp_realloc (b, size, lowlat, lowlat);
+    check (p == b, "realloc did not reuse same size chunk, no space after");
+
+    p = omp_realloc (b, size-8, lowlat, lowlat);
+    check (p == b, "realloc did not reuse smaller chunk, no space after");
+
+    p = omp_realloc (b, size, lowlat, lowlat);
+    check (p == b, "realloc did not reuse original size chunk, no space after");
+
+    /* Make space after b.  */
+    omp_free (c, lowlat);
+
+    p = omp_realloc (b, size, lowlat, lowlat);
+    check (p == b, "realloc did not reuse same size chunk");
+
+    p = omp_realloc (b, size-8, lowlat, lowlat);
+    check (p == b, "realloc did not reuse smaller chunk");
+
+    p = omp_realloc (b, size, lowlat, lowlat);
+    check (p == b, "realloc did not reuse original size chunk");
+
+    p = omp_realloc (b, size+8, lowlat, lowlat);
+    check (p == b, "realloc did not extend in place by a little");
+
+    p = omp_realloc (b, size2, lowlat, lowlat);
+    check (p == b, "realloc did not extend into whole next chunk");
+
+    p = omp_realloc (b, size3, lowlat, lowlat);
+    check (p != b, "realloc did not move b elsewhere");
+    omp_free (p, lowlat);
+
+
+    p = omp_realloc (a, size, lowlat, lowlat);
+    check (p == a, "realloc did not reuse same size chunk, first position");
+
+    p = omp_realloc (a, size-8, lowlat, lowlat);
+    check (p == a, "realloc did not reuse smaller chunk, first position");
+
+    p = omp_realloc (a, size, lowlat, lowlat);
+    check (p == a, "realloc did not reuse original size chunk, first position");
+
+    p = omp_realloc (a, size+8, lowlat, lowlat);
+    check (p == a, "realloc did not extend in place by a little, first position");
+
+    p = omp_realloc (a, size3, lowlat, lowlat);
+    check (p == a, "realloc did not extend into whole next chunk, first position");
+
+    p = omp_realloc (a, size4, lowlat, lowlat);
+    check (p != a, "realloc did not move a elsewhere, first position");
+    omp_free (p, lowlat);
+
+
+    p = omp_realloc (d, size, lowlat, lowlat);
+    check (p == d, "realloc did not reuse same size chunk, last position");
+
+    p = omp_realloc (d, size-8, lowlat, lowlat);
+    check (p == d, "realloc did not reuse smaller chunk, last position");
+
+    p = omp_realloc (d, size, lowlat, lowlat);
+    check (p == d, "realloc did not reuse original size chunk, last position");
+
+    p = omp_realloc (d, size+8, lowlat, lowlat);
+    check (p == d, "realloc did not extend in place by d little, last position");
+
+    /* Larger than low latency memory.  */
+    p = omp_realloc (d, 100000000, lowlat, lowlat);
+    check (p == NULL, "realloc did not fail on OOM");
+  }
+
+  return 0;
+}
+

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 2/3] openmp, nvptx: low-lat memory access traits
  2023-12-03  0:32 [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Andrew Stubbs
  2023-12-03  0:32 ` [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator Andrew Stubbs
@ 2023-12-03  0:32 ` Andrew Stubbs
  2023-12-03  0:32 ` [PATCH v3 3/3] amdgcn, libgomp: low-latency allocator Andrew Stubbs
  2023-12-04 15:34 ` [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Tobias Burnus
  3 siblings, 0 replies; 10+ messages in thread
From: Andrew Stubbs @ 2023-12-03  0:32 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1859 bytes --]


The NVPTX low latency memory is not accessible outside the team that allocates
it, and therefore should be unavailable for allocators with the access trait
"all".  This change means that the omp_low_lat_mem_alloc predefined
allocator no longer works (but omp_cgroup_mem_alloc still does).

libgomp/ChangeLog:

	* allocator.c (MEMSPACE_VALIDATE): New macro.
	(omp_init_allocator): Use MEMSPACE_VALIDATE.
	(omp_aligned_alloc): Use OMP_LOW_LAT_MEM_ALLOC_INVALID.
	(omp_aligned_calloc): Likewise.
	(omp_realloc): Likewise.
	* config/nvptx/allocator.c (nvptx_memspace_validate): New function.
	(MEMSPACE_VALIDATE): New macro.
	(OMP_LOW_LAT_MEM_ALLOC_INVALID): New define.
	* libgomp.texi: Document low-latency implementation details.
	* testsuite/libgomp.c/omp_alloc-1.c (main): Add gnu_lowlat.
	* testsuite/libgomp.c/omp_alloc-2.c (main): Add gnu_lowlat.
	* testsuite/libgomp.c/omp_alloc-3.c (main): Add gnu_lowlat.
	* testsuite/libgomp.c/omp_alloc-4.c (main): Add access trait.
	* testsuite/libgomp.c/omp_alloc-5.c (main): Add gnu_lowlat.
	* testsuite/libgomp.c/omp_alloc-6.c (main): Add access trait.
	* testsuite/libgomp.c/omp_alloc-traits.c: New test.
---
 libgomp/allocator.c                           | 20 ++++++
 libgomp/config/nvptx/allocator.c              | 21 ++++++
 libgomp/libgomp.texi                          | 18 +++++
 libgomp/testsuite/libgomp.c/omp_alloc-1.c     | 10 +++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c     |  8 +++
 libgomp/testsuite/libgomp.c/omp_alloc-3.c     |  7 ++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c     |  7 +-
 libgomp/testsuite/libgomp.c/omp_alloc-5.c     |  8 +++
 libgomp/testsuite/libgomp.c/omp_alloc-6.c     |  7 +-
 .../testsuite/libgomp.c/omp_alloc-traits.c    | 66 +++++++++++++++++++
 10 files changed, 166 insertions(+), 6 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-traits.c


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: v3-0002-openmp-nvptx-low-lat-memory-access-traits.patch --]
[-- Type: text/x-patch; name="v3-0002-openmp-nvptx-low-lat-memory-access-traits.patch", Size: 13319 bytes --]

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index fa398128368..a8a80f8028d 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -56,6 +56,10 @@
 #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
   free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
 #endif
+#ifndef MEMSPACE_VALIDATE
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
+  (((void)(MEMSPACE), (void)(ACCESS), 1))
+#endif
 
 /* Map the predefined allocators to the correct memory space.
    The index to this table is the omp_allocator_handle_t enum value.
@@ -439,6 +443,10 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
   if (data.pinned)
     return omp_null_allocator;
 
+  /* Reject unsupported memory spaces.  */
+  if (!MEMSPACE_VALIDATE (data.memspace, data.access))
+    return omp_null_allocator;
+
   ret = gomp_malloc (sizeof (struct omp_allocator_data));
   *ret = data;
 #ifndef HAVE_SYNC_BUILTINS
@@ -522,6 +530,10 @@ retry:
     new_size += new_alignment - sizeof (void *);
   if (__builtin_add_overflow (size, new_size, &new_size))
     goto fail;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+    goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
@@ -820,6 +832,10 @@ retry:
     goto fail;
   if (__builtin_add_overflow (size_temp, new_size, &new_size))
     goto fail;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+    goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
@@ -1054,6 +1070,10 @@ retry:
   if (__builtin_add_overflow (size, new_size, &new_size))
     goto fail;
   old_size = data->size;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+    goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c
index 6014fba177f..a3302411bcb 100644
--- a/libgomp/config/nvptx/allocator.c
+++ b/libgomp/config/nvptx/allocator.c
@@ -108,6 +108,21 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
     return realloc (addr, size);
 }
 
+static inline int
+nvptx_memspace_validate (omp_memspace_handle_t memspace, unsigned access)
+{
+#if __PTX_ISA_VERSION_MAJOR__ > 4 \
+    || (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MINOR >= 1)
+  /* Disallow use of low-latency memory when it must be accessible by
+     all threads.  */
+  return (memspace != omp_low_lat_mem_space
+	  || access != omp_atv_all);
+#else
+  /* Low-latency memory is not available before PTX 4.1.  */
+  return (memspace != omp_low_lat_mem_space);
+#endif
+}
+
 #define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
   nvptx_memspace_alloc (MEMSPACE, SIZE)
 #define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
@@ -116,5 +131,11 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
   nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
 #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
   nvptx_memspace_free (MEMSPACE, ADDR, SIZE)
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
+  nvptx_memspace_validate (MEMSPACE, ACCESS)
+
+/* The default low-latency memspace implies omp_atv_all, which is incompatible
+   with the .shared memory space.  */
+#define OMP_LOW_LAT_MEM_ALLOC_INVALID 1
 
 #include "../../allocator.c"
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 39d0749e7b3..7fdd6fe9410 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -5769,6 +5769,9 @@ Additional notes regarding the traits:
 @item The @code{sync_hint} trait has no effect.
 @end itemize
 
+See also:
+@ref{Offload-Target Specifics}
+
 @c ---------------------------------------------------------------------
 @c Offload-Target Specifics
 @c ---------------------------------------------------------------------
@@ -5902,6 +5905,21 @@ The implementation remark:
       directive for non-contiguous list items will use the 2D and 3D
       memory-copy functions of the CUDA library.  Higher dimensions will
       call those functions in a loop and are therefore supported.
+@item Low-latency memory (@code{omp_low_lat_mem_space}) is supported when the
+      the @code{access} trait is set to @code{cgroup}, the ISA is at least
+      @code{sm_53}, and the PTX version is at least 4.1.  The default pool size
+      is 8 kiB per team, but may be adjusted at runtime by setting environment
+      variable @code{GOMP_NVPTX_LOWLAT_POOL=@var{bytes}}.  The maximum value is
+      limited by the available hardware, and care should be taken that the
+      selected pool size does not unduly limit the number of teams that can
+      run simultaneously.
+@item @code{omp_low_lat_mem_alloc} cannot be used with true low-latency memory
+      because the definition implies the @code{omp_atv_all} trait; main
+      graphics memory is used instead.
+@item @code{omp_cgroup_mem_alloc}, @code{omp_pteam_mem_alloc}, and
+      @code{omp_thread_mem_alloc}, all use low-latency memory as first
+      preference, and fall back to main graphics memory when the low-latency
+      pool is exhausted.
 @end itemize
 
 
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-1.c b/libgomp/testsuite/libgomp.c/omp_alloc-1.c
index f4e594f1e98..7f7f440c12c 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-1.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-1.c
@@ -32,12 +32,21 @@ test (int n, omp_allocator_handle_t allocator)
 int
 main ()
 {
+  /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU.  */
+  omp_allocator_handle_t gpu_lowlat = 0;
+  #pragma omp target map(from:gpu_lowlat)
+    {
+      omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
+      gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
+    }
+
   // Smaller than low-latency memory limit
   test (10, omp_default_mem_alloc);
   test (10, omp_large_cap_mem_alloc);
   test (10, omp_const_mem_alloc);
   test (10, omp_high_bw_mem_alloc);
   test (10, omp_low_lat_mem_alloc);
+  test (10, gpu_lowlat);
   test (10, omp_cgroup_mem_alloc);
   test (10, omp_pteam_mem_alloc);
   test (10, omp_thread_mem_alloc);
@@ -48,6 +57,7 @@ main ()
   test (100000, omp_const_mem_alloc);
   test (100000, omp_high_bw_mem_alloc);
   test (100000, omp_low_lat_mem_alloc);
+  test (100000, gpu_lowlat);
   test (100000, omp_cgroup_mem_alloc);
   test (100000, omp_pteam_mem_alloc);
   test (100000, omp_thread_mem_alloc);
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-2.c b/libgomp/testsuite/libgomp.c/omp_alloc-2.c
index e9fd1602946..54523f1061e 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-2.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-2.c
@@ -40,12 +40,19 @@ test (int n, omp_allocator_handle_t allocator)
 int
 main ()
 {
+  /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU.  */
+  omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
+  omp_allocator_handle_t gpu_lowlat;
+  #pragma omp target map(from:gpu_lowlat)
+  gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
+
   // Smaller than low-latency memory limit
   test (10, omp_default_mem_alloc);
   test (10, omp_large_cap_mem_alloc);
   test (10, omp_const_mem_alloc);
   test (10, omp_high_bw_mem_alloc);
   test (10, omp_low_lat_mem_alloc);
+  test (10, gpu_lowlat);
   test (10, omp_cgroup_mem_alloc);
   test (10, omp_pteam_mem_alloc);
   test (10, omp_thread_mem_alloc);
@@ -56,6 +63,7 @@ main ()
   test (1000, omp_const_mem_alloc);
   test (1000, omp_high_bw_mem_alloc);
   test (1000, omp_low_lat_mem_alloc);
+  test (1000, gpu_lowlat);
   test (1000, omp_cgroup_mem_alloc);
   test (1000, omp_pteam_mem_alloc);
   test (1000, omp_thread_mem_alloc);
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-3.c b/libgomp/testsuite/libgomp.c/omp_alloc-3.c
index 792e2200f30..682d149d379 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-3.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-3.c
@@ -28,12 +28,19 @@ test (omp_allocator_handle_t allocator)
 int
 main ()
 {
+  /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU.  */
+  omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
+  omp_allocator_handle_t gpu_lowlat;
+  #pragma omp target map(from:gpu_lowlat)
+  gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
+
   // Smaller than low-latency memory limit
   test (omp_default_mem_alloc);
   test (omp_large_cap_mem_alloc);
   test (omp_const_mem_alloc);
   test (omp_high_bw_mem_alloc);
   test (omp_low_lat_mem_alloc);
+  test (gpu_lowlat);
   test (omp_cgroup_mem_alloc);
   test (omp_pteam_mem_alloc);
   test (omp_thread_mem_alloc);
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-4.c b/libgomp/testsuite/libgomp.c/omp_alloc-4.c
index 66e13c09234..dd8fcfbeeba 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-4.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-4.c
@@ -23,10 +23,11 @@ main ()
   #pragma omp target
   {
     /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
-    omp_alloctrait_t traits[1]
-      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_alloctrait_t traits[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+          { omp_atk_access, omp_atv_cgroup } };
     omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
-							1, traits);
+							2, traits);
 
     int size = 4;
 
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-5.c b/libgomp/testsuite/libgomp.c/omp_alloc-5.c
index 10805ded6d0..26bf38c1ca6 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-5.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-5.c
@@ -39,12 +39,19 @@ test (int n, omp_allocator_handle_t allocator)
 int
 main ()
 {
+  /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU.  */
+  omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
+  omp_allocator_handle_t gpu_lowlat;
+  #pragma omp target map(from:gpu_lowlat)
+  gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
+
   // Smaller than low-latency memory limit
   test (10, omp_default_mem_alloc);
   test (10, omp_large_cap_mem_alloc);
   test (10, omp_const_mem_alloc);
   test (10, omp_high_bw_mem_alloc);
   test (10, omp_low_lat_mem_alloc);
+  test (10, gpu_lowlat);
   test (10, omp_cgroup_mem_alloc);
   test (10, omp_pteam_mem_alloc);
   test (10, omp_thread_mem_alloc);
@@ -55,6 +62,7 @@ main ()
   test (100000, omp_const_mem_alloc);
   test (100000, omp_high_bw_mem_alloc);
   test (100000, omp_low_lat_mem_alloc);
+  test (100000, gpu_lowlat);
   test (100000, omp_cgroup_mem_alloc);
   test (100000, omp_pteam_mem_alloc);
   test (100000, omp_thread_mem_alloc);
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-6.c b/libgomp/testsuite/libgomp.c/omp_alloc-6.c
index 66bf69b0455..947a0ed23f8 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-6.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-6.c
@@ -23,10 +23,11 @@ main ()
   #pragma omp target
   {
     /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
-    omp_alloctrait_t traits[1]
-      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_alloctrait_t traits[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+          { omp_atk_access, omp_atv_cgroup } };
     omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
-							1, traits);
+							2, traits);
 
     int size = 16;
 
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-traits.c b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
new file mode 100644
index 00000000000..4ff0fca4986
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
@@ -0,0 +1,66 @@
+/* { dg-do run } */
+
+/* { dg-require-effective-target offload_device } */
+/* { dg-xfail-if "not implemented" { ! offload_target_nvptx } } */
+
+/* Test that GPU low-latency allocation is limited to team access.  */
+
+#include <stddef.h>
+#include <omp.h>
+
+#pragma omp requires dynamic_allocators
+
+int
+main ()
+{
+  #pragma omp target
+  {
+    /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
+    omp_alloctrait_t traits[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+	  { omp_atk_access, omp_atv_cgroup } };
+    omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
+							2, traits); // good
+
+    omp_alloctrait_t traits_all[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+	  { omp_atk_access, omp_atv_all } };
+    omp_allocator_handle_t lowlat_all
+      = omp_init_allocator (omp_low_lat_mem_space, 2, traits_all); // bad
+
+    omp_alloctrait_t traits_default[1]
+      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_allocator_handle_t lowlat_default
+      = omp_init_allocator (omp_low_lat_mem_space, 1, traits_default); // bad
+
+    if (lowlat_all != omp_null_allocator
+	|| lowlat_default != omp_null_allocator)
+      __builtin_abort ();
+
+    void *a = omp_alloc (1, lowlat); // good
+
+    if (!a)
+      __builtin_abort ();
+
+    omp_free (a, lowlat);
+
+
+    a = omp_calloc (1, 1, lowlat); // good
+
+    if (!a)
+      __builtin_abort ();
+
+    omp_free (a, lowlat);
+
+
+    a = omp_realloc (NULL, 1, lowlat, lowlat); // good
+
+    if (!a)
+      __builtin_abort ();
+
+    omp_free (a, lowlat);
+  }
+
+  return 0;
+}
+

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 3/3] amdgcn, libgomp: low-latency allocator
  2023-12-03  0:32 [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Andrew Stubbs
  2023-12-03  0:32 ` [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator Andrew Stubbs
  2023-12-03  0:32 ` [PATCH v3 2/3] openmp, nvptx: low-lat memory access traits Andrew Stubbs
@ 2023-12-03  0:32 ` Andrew Stubbs
  2023-12-04 15:34 ` [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Tobias Burnus
  3 siblings, 0 replies; 10+ messages in thread
From: Andrew Stubbs @ 2023-12-03  0:32 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2170 bytes --]


This implements the OpenMP low-latency memory allocator for AMD GCN using the
small per-team LDS memory (Local Data Store).

Since addresses can now refer to LDS space, the "Global" address space is
no-longer compatible.  This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories).  A future patch
will re-enable "global" instructions for cases where it is known to be safe
to do so.

gcc/ChangeLog:

	* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
	* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
	addressing.
	(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.

libgomp/ChangeLog:

	* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	(GCN_LOWLAT_HEAP): New.
	* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
	(__gcn_lowlat_init): New prototype.
	(gomp_gcn_enter_kernel): Initialize the low-latency heap.
	* libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	* plugin/plugin-gcn.c (lowlat_size): New variable.
	(print_kernel_dispatch): Label the group_segment_size purpose.
	(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
	(create_kernel_dispatch): Pass low-latency head allocation to kernel.
	(run_kernel): Use shadow; don't assume values.
	* testsuite/libgomp.c/omp_alloc-traits.c: Enable for amdgcn.
	* config/gcn/allocator.c: New file.
	* libgomp.texi: Document low-latency implementation details.
---
 gcc/config/gcn/gcn-builtins.def               |   2 +
 gcc/config/gcn/gcn.cc                         |  16 ++-
 libgomp/config/gcn/allocator.c                | 127 ++++++++++++++++++
 libgomp/config/gcn/libgomp-gcn.h              |   6 +
 libgomp/config/gcn/team.c                     |  12 ++
 libgomp/libgomp.h                             |   3 -
 libgomp/libgomp.texi                          |  13 ++
 libgomp/plugin/plugin-gcn.c                   |  35 ++++-
 .../testsuite/libgomp.c/omp_alloc-traits.c    |   2 +-
 9 files changed, 205 insertions(+), 11 deletions(-)
 create mode 100644 libgomp/config/gcn/allocator.c


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: v3-0003-amdgcn-libgomp-low-latency-allocator.patch --]
[-- Type: text/x-patch; name="v3-0003-amdgcn-libgomp-low-latency-allocator.patch", Size: 13334 bytes --]

diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def
index 636a8e7a1a9..471457d7c23 100644
--- a/gcc/config/gcn/gcn-builtins.def
+++ b/gcc/config/gcn/gcn-builtins.def
@@ -164,6 +164,8 @@ DEF_BUILTIN (FIRST_CALL_THIS_THREAD_P, -1, "first_call_this_thread_p", B_INSN,
 	     _A1 (GCN_BTI_BOOL), gcn_expand_builtin_1)
 DEF_BUILTIN (KERNARG_PTR, -1, "kernarg_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
 	     gcn_expand_builtin_1)
+DEF_BUILTIN (DISPATCH_PTR, -1, "dispatch_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
+	     gcn_expand_builtin_1)
 DEF_BUILTIN (GET_STACK_LIMIT, -1, "get_stack_limit", B_INSN,
 	     _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1)
 
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 22d2b6ebf6d..d70238820dd 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -110,7 +110,8 @@ gcn_init_machine_status (void)
 
   f = ggc_cleared_alloc<machine_function> ();
 
-  if (TARGET_GCN3)
+  // FIXME: re-enable global addressing with safety for LDS-flat addresses
+  //if (TARGET_GCN3)
     f->use_flat_addressing = true;
 
   return f;
@@ -4881,6 +4882,19 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ ,
 	  }
 	return ptr;
       }
+    case GCN_BUILTIN_DISPATCH_PTR:
+      {
+	rtx ptr;
+	if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0)
+	   ptr = gen_rtx_REG (DImode,
+			      cfun->machine->args.reg[DISPATCH_PTR_ARG]);
+	else
+	  {
+	    ptr = gen_reg_rtx (DImode);
+	    emit_move_insn (ptr, const0_rtx);
+	  }
+	return ptr;
+      }
     case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P:
       {
 	/* Stash a marker in the unused upper 16 bits of s[0:1] to indicate
diff --git a/libgomp/config/gcn/allocator.c b/libgomp/config/gcn/allocator.c
new file mode 100644
index 00000000000..e9a95d683f9
--- /dev/null
+++ b/libgomp/config/gcn/allocator.c
@@ -0,0 +1,127 @@
+/* Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The low-latency allocators use space reserved in LDS memory when the
+   kernel is launched.  The heap is initialized in gomp_gcn_enter_kernel and
+   all allocations are forgotten when the kernel exits.  Allocations to other
+   memory spaces all use the system malloc syscall.
+
+   The pointers returned are 64-bit "Flat" addresses indistinguishable from
+   regular pointers, but only compatible with the "flat_load/store"
+   instructions.  The compiler has been coded to assign default address
+   spaces accordingly.
+
+   LDS memory is not visible to other teams, and therefore may only be used
+   when the memspace access trait is set accordingly.  */
+
+#include "libgomp.h"
+#include <stdlib.h>
+
+#define BASIC_ALLOC_PREFIX __gcn_lowlat
+#define BASIC_ALLOC_YIELD asm ("s_sleep 1" ::: "memory")
+#include "../../basic-allocator.c"
+
+/* The low-latency heap is located in LDS memory, but we need the __flat
+   address space for compatibility reasons.  */
+#define FLAT_HEAP_PTR \
+  ((void *) (uintptr_t) (void __flat *) (void __lds *) GCN_LOWLAT_HEAP)
+
+static void *
+gcn_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      return __gcn_lowlat_alloc (shared_pool, size);
+    }
+  else
+    return malloc (size);
+}
+
+static void *
+gcn_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      return __gcn_lowlat_calloc (shared_pool, size);
+    }
+  else
+    return calloc (1, size);
+}
+
+static void
+gcn_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      __gcn_lowlat_free (shared_pool, addr, size);
+    }
+  else
+    free (addr);
+}
+
+static void *
+gcn_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
+			size_t oldsize, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      return __gcn_lowlat_realloc (shared_pool, addr, oldsize, size);
+    }
+  else
+    return realloc (addr, size);
+}
+
+static inline int
+gcn_memspace_validate (omp_memspace_handle_t memspace, unsigned access)
+{
+  /* Disallow use of low-latency memory when it must be accessible by
+     all threads.  */
+  return (memspace != omp_low_lat_mem_space
+	  || access != omp_atv_all);
+}
+
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+  gcn_memspace_alloc (MEMSPACE, SIZE)
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  gcn_memspace_calloc (MEMSPACE, SIZE)
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+  gcn_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+  gcn_memspace_free (MEMSPACE, ADDR, SIZE)
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
+  gcn_memspace_validate (MEMSPACE, ACCESS)
+
+/* The default low-latency memspace implies omp_atv_all, which is incompatible
+   with the LDS memory space.  */
+#define OMP_LOW_LAT_MEM_ALLOC_INVALID 1
+
+#include "../../allocator.c"
diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h
index f62b7dde0e7..05b6fb60cc9 100644
--- a/libgomp/config/gcn/libgomp-gcn.h
+++ b/libgomp/config/gcn/libgomp-gcn.h
@@ -33,6 +33,12 @@
 #define DEFAULT_GCN_STACK_SIZE (32*1024)
 #define DEFAULT_TEAM_ARENA_SIZE (64*1024)
 
+/* These define the LDS location of data needed by OpenMP.  */
+#define TEAM_ARENA_START 16  /* LDS offset of free pointer.  */
+#define TEAM_ARENA_FREE  24  /* LDS offset of free pointer.  */
+#define TEAM_ARENA_END   32  /* LDS offset of end pointer.  */
+#define GCN_LOWLAT_HEAP  40  /* LDS offset of the OpenMP low-latency heap.  */
+
 struct heap
 {
   int64_t size;
diff --git a/libgomp/config/gcn/team.c b/libgomp/config/gcn/team.c
index fb20cbbcf9f..7ee6115b666 100644
--- a/libgomp/config/gcn/team.c
+++ b/libgomp/config/gcn/team.c
@@ -29,6 +29,12 @@
 #include <stdlib.h>
 #include <string.h>
 
+#define LITTLEENDIAN_CPU
+#include "hsa.h"
+
+/* Defined in basic-allocator.c via config/amdgcn/allocator.c.  */
+void __gcn_lowlat_init (void *heap, size_t size);
+
 static void gomp_thread_start (struct gomp_thread_pool *);
 extern void build_indirect_map (void);
 
@@ -75,6 +81,12 @@ gomp_gcn_enter_kernel (void)
       *arena_free = team_arena;
       *arena_end = team_arena + kernargs->arena_size_per_team;
 
+      /* Initialize the low-latency heap.  The header is the size.  */
+      void __lds *lowlat = (void __lds *)GCN_LOWLAT_HEAP;
+      hsa_kernel_dispatch_packet_t *queue_ptr = __builtin_gcn_dispatch_ptr ();
+      __gcn_lowlat_init ((void*)(uintptr_t)(void __flat*)lowlat,
+			 queue_ptr->group_segment_size - GCN_LOWLAT_HEAP);
+
       /* Allocate and initialize the team-local-storage data.  */
       struct gomp_thread *thrs = team_malloc_cleared (sizeof (*thrs)
 						      * numthreads);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 15a767cf317..fa29f428976 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -114,9 +114,6 @@ extern void gomp_aligned_free (void *);
 #ifdef __AMDGCN__
 #include "libgomp-gcn.h"
 /* The arena is initialized in config/gcn/team.c.  */
-#define TEAM_ARENA_START 16  /* LDS offset of free pointer.  */
-#define TEAM_ARENA_FREE  24  /* LDS offset of free pointer.  */
-#define TEAM_ARENA_END   32  /* LDS offset of end pointer.  */
 
 static inline void * __attribute__((malloc))
 team_malloc (size_t size)
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 7fdd6fe9410..9d0aee72b33 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -5838,6 +5838,19 @@ The implementation remark:
       available devices (``host fallback'').
 @item The available stack size can be changed using the @code{GCN_STACK_SIZE}
       environment variable; the default is 32 kiB per thread.
+@item Low-latency memory (@code{omp_low_lat_mem_space}) is supported when the
+      the @code{access} trait is set to @code{cgroup}.  The default pool size
+      is automatically scaled to share the 64 kiB LDS memory between the number
+      of teams configured to run on each compute-unit, but may be adjusted at
+      runtime by setting environment variable
+      @code{GOMP_GCN_LOWLAT_POOL=@var{bytes}}.
+@item @code{omp_low_lat_mem_alloc} cannot be used with true low-latency memory
+      because the definition implies the @code{omp_atv_all} trait; main
+      graphics memory is used instead.
+@item @code{omp_cgroup_mem_alloc}, @code{omp_pteam_mem_alloc}, and
+      @code{omp_thread_mem_alloc}, all use low-latency memory as first
+      preference, and fall back to main graphics memory when the low-latency
+      pool is exhausted.
 @end itemize
 
 
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 8aabbd99881..7f8178c78b7 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -550,6 +550,7 @@ static size_t gcn_kernel_heap_size = DEFAULT_GCN_HEAP_SIZE;
 
 static int team_arena_size = DEFAULT_TEAM_ARENA_SIZE;
 static int stack_size = DEFAULT_GCN_STACK_SIZE;
+static int lowlat_size = -1;
 
 /* Flag to decide whether print to stderr information about what is going on.
    Set in init_debug depending on environment variables.  */
@@ -1016,8 +1017,8 @@ print_kernel_dispatch (struct kernel_dispatch *dispatch, unsigned indent)
   fprintf (stderr, "%*sobject: %lu\n", indent, "", dispatch->object);
   fprintf (stderr, "%*sprivate_segment_size: %u\n", indent, "",
 	   dispatch->private_segment_size);
-  fprintf (stderr, "%*sgroup_segment_size: %u\n", indent, "",
-	   dispatch->group_segment_size);
+  fprintf (stderr, "%*sgroup_segment_size: %u (low-latency pool)\n", indent,
+	   "", dispatch->group_segment_size);
   fprintf (stderr, "\n");
 }
 
@@ -1088,6 +1089,10 @@ init_environment_variables (void)
       if (tmp)
 	stack_size = tmp;;
     }
+
+  const char *lowlat = secure_getenv ("GOMP_GCN_LOWLAT_POOL");
+  if (lowlat)
+    lowlat_size = atoi (lowlat);
 }
 
 /* Return malloc'd string with name of SYMBOL.  */
@@ -1930,7 +1935,25 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
 
   shadow->signal = sync_signal.handle;
   shadow->private_segment_size = kernel->private_segment_size;
-  shadow->group_segment_size = kernel->group_segment_size;
+
+  if (lowlat_size < 0)
+    {
+      /* Divide the LDS between the number of running teams.
+	 Allocate not less than is defined in the kernel metadata.  */
+      int teams_per_cu = num_teams / get_cu_count (agent);
+      int LDS_per_team = (teams_per_cu ? 65536 / teams_per_cu : 65536);
+      shadow->group_segment_size
+	= (kernel->group_segment_size > LDS_per_team
+	   ? kernel->group_segment_size
+	   : LDS_per_team);;
+    }
+  else if (lowlat_size < GCN_LOWLAT_HEAP+8)
+    /* Ensure that there's space for the OpenMP libgomp data.  */
+    shadow->group_segment_size = GCN_LOWLAT_HEAP+8;
+  else
+    shadow->group_segment_size = (lowlat_size > 65536
+				  ? 65536
+				  : lowlat_size);
 
   /* We expect kernels to request a single pointer, explicitly, and the
      rest of struct kernargs, implicitly.  If they request anything else
@@ -2290,9 +2313,9 @@ run_kernel (struct kernel_info *kernel, void *vars,
       print_kernel_dispatch (shadow, 2);
     }
 
-  packet->private_segment_size = kernel->private_segment_size;
-  packet->group_segment_size = kernel->group_segment_size;
-  packet->kernel_object = kernel->object;
+  packet->private_segment_size = shadow->private_segment_size;
+  packet->group_segment_size = shadow->group_segment_size;
+  packet->kernel_object = shadow->object;
   packet->kernarg_address = shadow->kernarg_address;
   hsa_signal_t s;
   s.handle = shadow->signal;
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-traits.c b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
index 4ff0fca4986..e9acc8673a3 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
@@ -1,7 +1,7 @@
 /* { dg-do run } */
 
 /* { dg-require-effective-target offload_device } */
-/* { dg-xfail-if "not implemented" { ! offload_target_nvptx } } */
+/* { dg-xfail-if "not implemented" { ! { offload_target_nvptx || offload_target_amdgcn } } } */
 
 /* Test that GPU low-latency allocation is limited to team access.  */
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc
  2023-12-03  0:32 [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Andrew Stubbs
                   ` (2 preceding siblings ...)
  2023-12-03  0:32 ` [PATCH v3 3/3] amdgcn, libgomp: low-latency allocator Andrew Stubbs
@ 2023-12-04 15:34 ` Tobias Burnus
  2023-12-04 15:53   ` Tobias Burnus
  3 siblings, 1 reply; 10+ messages in thread
From: Tobias Burnus @ 2023-12-04 15:34 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches; +Cc: Sandra Loosemore

Hi Andrew,

On 03.12.23 01:32, Andrew Stubbs wrote:
> This patch series is a rework of the patch series posted in August.
> https://patchwork.sourceware.org/project/gcc/list/?series=23045&state=%2A&archive=both
>
> The series implements device-specific allocators and adds a low-latency
> allocator for both GPUs architectures.

As mentioned, can you also update libgomp/libgomp.texi?

I don't have a strong preference where in that file nor how it is
documented, but it seems to make sense to document as follows:

(A) Document the GCN/NVPX specifics on the respective pages below
https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html

(B) To make it possible to find it, @ref'er to that page from:

And https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

(May be just 'See also:' or 'For offload-device specifics to memory
allocation, see' or something like that.)

(C) Maybe, some wording should be added to OMP_ALLOCATOR that the
cgrouppteam/thread pre-defined allocators use (implementation choice)
the low-latency memory space; for instance, add a sentence under the
first table – or use the first table 'omp_low_lat_mem_space
(implementation choice)' or some other wordings which make clear what
GCC does but that the spec does not specify this.

https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html

Maybe we should add to OMP_ALLOCATOR also a @ref to "Offload Target
Specifics"?

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc
  2023-12-04 15:34 ` [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Tobias Burnus
@ 2023-12-04 15:53   ` Tobias Burnus
  0 siblings, 0 replies; 10+ messages in thread
From: Tobias Burnus @ 2023-12-04 15:53 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

I cannot "grep" – all three patches do contain .texi changes. I have a
comment to them, but I will comment individually on them.

Hence, scratch:

On 04.12.23 16:34, Tobias Burnus wrote:
> On 03.12.23 01:32, Andrew Stubbs wrote:
>> This patch series is a rework of the patch series posted in August.
>> https://patchwork.sourceware.org/project/gcc/list/?series=23045&state=%2A&archive=both
>>
>>
>> The series implements device-specific allocators and adds a low-latency
>> allocator for both GPUs architectures.
>
> As mentioned, can you also update libgomp/libgomp.texi?

Sorry for missing those changes.

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
  2023-12-03  0:32 ` [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator Andrew Stubbs
@ 2023-12-04 16:04   ` Tobias Burnus
  2023-12-05 15:39     ` Andrew Stubbs
  2023-12-05 11:25   ` Tobias Burnus
  1 sibling, 1 reply; 10+ messages in thread
From: Tobias Burnus @ 2023-12-04 16:04 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 03.12.23 01:32, Andrew Stubbs wrote:
> This patch adds support for allocating low-latency ".shared" memory on
> NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
> can be allocated, reallocated, and freed using a basic but fast algorithm,
> is thread safe and the size of the low-latency heap can be configured using
> the GOMP_NVPTX_LOWLAT_POOL environment variable.
>
> The use of the PTX dynamic_smem_size feature means that low-latency allocator
> will not work with the PTX 3.1 multilib.
>
> For now, the omp_low_lat_mem_alloc allocator also works, but that will change
> when I implement the access traits.

...

LGTM, however, I about the following:

> diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
> index e5fe7af76af..39d0749e7b3 100644
> --- a/libgomp/libgomp.texi
> +++ b/libgomp/libgomp.texi
> @@ -3012,11 +3012,14 @@ value.
>   @item omp_const_mem_alloc       @tab omp_const_mem_space
>   @item omp_high_bw_mem_alloc     @tab omp_high_bw_mem_space
>   @item omp_low_lat_mem_alloc     @tab omp_low_lat_mem_space
> -@item omp_cgroup_mem_alloc      @tab --
> -@item omp_pteam_mem_alloc       @tab --
> -@item omp_thread_mem_alloc      @tab --
> +@item omp_cgroup_mem_alloc      @tab omp_low_lat_mem_space (implementation defined)
> +@item omp_pteam_mem_alloc       @tab omp_low_lat_mem_space (implementation defined)
> +@item omp_thread_mem_alloc      @tab omp_low_lat_mem_space (implementation defined)
>   @end multitable
>
> +The @code{omp_low_lat_mem_space} is only available on supported devices.
> +See @ref{Offload-Target Specifics}.
> +

Whether it would be clearer to have this wording not here for the OMP_ALLOCATOR env, i.e.
https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html
but just a simple crossref like:

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3061,5 +3061,5 @@ OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest
  @item @emph{See also}:
  @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics}

  @item @emph{Reference}:


And add your wording to:
   https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

As this sections mentions that "omp_low_lat_mem_space maps to omp_default_mem_space" in general.
Hence, mentioning in this section in addition that  omp_low_lat_mem_space  is honored on devices
seems to be the better location.

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
  2023-12-03  0:32 ` [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator Andrew Stubbs
  2023-12-04 16:04   ` Tobias Burnus
@ 2023-12-05 11:25   ` Tobias Burnus
  1 sibling, 0 replies; 10+ messages in thread
From: Tobias Burnus @ 2023-12-05 11:25 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

Hi Andrew,

I now looked at the whole series - and the series LGTM,
except for some testcase issues, as outlined below.


First, I notice that there is no call to:
   omp_destroy_allocator (gpu_lowlat);

While it might make sense to leave some of the testcases
without that call for testing purpose, I think at least one
it not all (but one?) should call it for completeness/testing
purpose.

However, the real issue is:

On 03.12.23 01:32, Andrew Stubbs wrote:
>       * testsuite/libgomp.c/omp_alloc-4.c: New test.
>       * testsuite/libgomp.c/omp_alloc-6.c: New test.

In particular: If you run this without offloading
(not configured, -foffload=disable, or no hardware available)


Result: This will fail all over the place.
I am not sure whether some tests should remain with hostfallback.

If yes, I think you need a check whether
   omp_get_initial_device() == omp_get_default_device()
or something similar.

If not, you should expect it to fail unless
    { target offload_device }


In any case, I get with host fallback the following for the -4.c test:

62: allocate did not coalesce first two chunks
66: allocate did not split first chunk (1)
68: allocate did not split first chunk (2)
73: allocate did not coalesce middle two chunks
77: allocate did not split second chunk (1)
79: allocate did not split second chunk (2)
84: allocate did not coalesce first two chunks, reverse free
95: allocate did not coalesce second two chunks, reverse free
107: allocate did not coalesce first three chunks
111: allocate did not split first chunk (1)
115: allocate did not split first chunk (3)
121: allocate did not coalesce last three chunks
125: allocate did not split second chunk (1)
129: allocate did not split second chunk (3)
135: allocate did not coalesce first three chunks, reverse free
149: allocate did not coalesce second three chunks, reverse free
163: allocate did not coalesce first three chunks, mixed free
167: allocate did not split first chunk (1), mixed free
169: allocate did not split first chunk (2), mixed free
177: allocate did not coalesce second three chunks, mixed free
181: allocate did not split second chunk (1), mixed free
183: allocate did not split second chunk (2), mixed free
192: allocate did not coalesce all memory

And with ASAN already for:
49: allocate did not reuse first chunk
53: allocate did not reuse second chunk
57: allocate did not reuse third chunk
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1395708==ERROR: AddressSanitizer: SEGV on unknown address 0xfffffffffffffffb (pc 0x7ffa2f649025 bp 0xfffffffffffffffb sp 0x7ffe22a90910 T0)
...
     #4 0x401748 in main._omp_fn.0 libgomp/testsuite/libgomp.c/omp_alloc-4.c:59



And the -6 tests fails with:

72: realloc did not extend into whole next chunk
free(): invalid pointer
Segmentation fault (core dumped)

where the free() crash is at:
74          p = omp_realloc (b, size3, lowlat, lowlat);

And with ASAN it fails already with:

48: realloc did not reuse same size chunk, no space after
=================================================================
==1396453==ERROR: AddressSanitizer: heap-use-after-free on address 0x5040000000b0 at pc 0x7f83c3ef9406 bp 0x7fffc48d7e10 sp 0x7fffc48d75d0
...
READ of size 8 at 0x5040000000b0 thread T0
...
     #5 0x401358 in main libgomp/testsuite/libgomp.c/omp_alloc-6.c:23
freed by thread T0 here:
     #1 0x7f83c459af6a in omp_realloc ../../../repos/gcc/libgomp/config/linux/../../allocator.c:1219
previously allocated by thread T0 here:
     #1 0x7f83c459a13b in omp_aligned_alloc ../../../repos/gcc/libgomp/config/linux/../../allocator.c:626


Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
  2023-12-04 16:04   ` Tobias Burnus
@ 2023-12-05 15:39     ` Andrew Stubbs
  2023-12-05 17:13       ` Tobias Burnus
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Stubbs @ 2023-12-05 15:39 UTC (permalink / raw)
  To: Tobias Burnus, gcc-patches

On 04/12/2023 16:04, Tobias Burnus wrote:
> On 03.12.23 01:32, Andrew Stubbs wrote:
>> This patch adds support for allocating low-latency ".shared" memory on
>> NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The 
>> memory
>> can be allocated, reallocated, and freed using a basic but fast 
>> algorithm,
>> is thread safe and the size of the low-latency heap can be configured 
>> using
>> the GOMP_NVPTX_LOWLAT_POOL environment variable.
>>
>> The use of the PTX dynamic_smem_size feature means that low-latency 
>> allocator
>> will not work with the PTX 3.1 multilib.
>>
>> For now, the omp_low_lat_mem_alloc allocator also works, but that will 
>> change
>> when I implement the access traits.
> 
> ...
> 
> LGTM, however, I about the following:
> 
>> diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
>> index e5fe7af76af..39d0749e7b3 100644
>> --- a/libgomp/libgomp.texi
>> +++ b/libgomp/libgomp.texi
>> @@ -3012,11 +3012,14 @@ value.
>>   @item omp_const_mem_alloc       @tab omp_const_mem_space
>>   @item omp_high_bw_mem_alloc     @tab omp_high_bw_mem_space
>>   @item omp_low_lat_mem_alloc     @tab omp_low_lat_mem_space
>> -@item omp_cgroup_mem_alloc      @tab --
>> -@item omp_pteam_mem_alloc       @tab --
>> -@item omp_thread_mem_alloc      @tab --
>> +@item omp_cgroup_mem_alloc      @tab omp_low_lat_mem_space 
>> (implementation defined)
>> +@item omp_pteam_mem_alloc       @tab omp_low_lat_mem_space 
>> (implementation defined)
>> +@item omp_thread_mem_alloc      @tab omp_low_lat_mem_space 
>> (implementation defined)
>>   @end multitable
>>
>> +The @code{omp_low_lat_mem_space} is only available on supported devices.
>> +See @ref{Offload-Target Specifics}.
>> +
> 
> Whether it would be clearer to have this wording not here for the 
> OMP_ALLOCATOR env, i.e.
> https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html
> but just a simple crossref like:
> 
> --- a/libgomp/libgomp.texi
> +++ b/libgomp/libgomp.texi
> @@ -3061,5 +3061,5 @@ 
> OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest
>   @item @emph{See also}:
>   @ref{Memory allocation}, @ref{omp_get_default_allocator},
> -@ref{omp_set_default_allocator}
> +@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics}
> 
>   @item @emph{Reference}:
> 
> 
> And add your wording to:
>    https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html
> 
> As this sections mentions that "omp_low_lat_mem_space maps to 
> omp_default_mem_space" in general.
> Hence, mentioning in this section in addition that  
> omp_low_lat_mem_space  is honored on devices
> seems to be the better location.

How about this?

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,9 +3012,9 @@ value.
  @item omp_const_mem_alloc       @tab omp_const_mem_space
  @item omp_high_bw_mem_alloc     @tab omp_high_bw_mem_space
  @item omp_low_lat_mem_alloc     @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc      @tab --
-@item omp_pteam_mem_alloc       @tab --
-@item omp_thread_mem_alloc      @tab --
+@item omp_cgroup_mem_alloc      @tab omp_low_lat_mem_space 
(implementation defined)
+@item omp_pteam_mem_alloc       @tab omp_low_lat_mem_space 
(implementation defined)
+@item omp_thread_mem_alloc      @tab omp_low_lat_mem_space 
(implementation defined)
  @end multitable

  The predefined allocators use the default values for the traits,
@@ -3060,7 +3060,7 @@ 
OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest

  @item @emph{See also}:
  @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specific}

  @item @emph{Reference}:
  @uref{https://www.openmp.org, OpenMP specification v5.0}, Section 6.21
@@ -5710,7 +5710,8 @@ For the memory spaces, the following applies:
  @itemize
  @item @code{omp_default_mem_space} is supported
  @item @code{omp_const_mem_space} maps to @code{omp_default_mem_space}
-@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space}
+@item @code{omp_low_lat_mem_space} is only available on supported devices,
+      and maps to @code{omp_default_mem_space} otherwise.
  @item @code{omp_large_cap_mem_space} maps to @code{omp_default_mem_space},
        unless the memkind library is available
  @item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space},
@@ -5766,6 +5767,9 @@ Additional notes regarding the traits:
  @item The @code{sync_hint} trait has no effect.
  @end itemize

+See also:
+@ref{Offload-Target Specifics}
+
  @c ---------------------------------------------------------------------
  @c Offload-Target Specifics
  @c ---------------------------------------------------------------------

> 
> Tobias
> 
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 
> 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: 
> Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; 
> Registergericht München, HRB 106955
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
  2023-12-05 15:39     ` Andrew Stubbs
@ 2023-12-05 17:13       ` Tobias Burnus
  0 siblings, 0 replies; 10+ messages in thread
From: Tobias Burnus @ 2023-12-05 17:13 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

On 05.12.23 16:39, Andrew Stubbs wrote:
>> Hence, mentioning in this section in addition that
>> omp_low_lat_mem_space  is honored on devices
>> seems to be the better location.
>
> How about this?

LGTM – Thanks!

Tobias

> --- a/libgomp/libgomp.texi
> +++ b/libgomp/libgomp.texi
> @@ -3012,9 +3012,9 @@ value.
>  @item omp_const_mem_alloc       @tab omp_const_mem_space
>  @item omp_high_bw_mem_alloc     @tab omp_high_bw_mem_space
>  @item omp_low_lat_mem_alloc     @tab omp_low_lat_mem_space
> -@item omp_cgroup_mem_alloc      @tab --
> -@item omp_pteam_mem_alloc       @tab --
> -@item omp_thread_mem_alloc      @tab --
> +@item omp_cgroup_mem_alloc      @tab omp_low_lat_mem_space
> (implementation defined)
> +@item omp_pteam_mem_alloc       @tab omp_low_lat_mem_space
> (implementation defined)
> +@item omp_thread_mem_alloc      @tab omp_low_lat_mem_space
> (implementation defined)
>  @end multitable
>
>  The predefined allocators use the default values for the traits,
> @@ -3060,7 +3060,7 @@
> OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest
>
>  @item @emph{See also}:
>  @ref{Memory allocation}, @ref{omp_get_default_allocator},
> -@ref{omp_set_default_allocator}
> +@ref{omp_set_default_allocator}, @ref{Offload-Target Specific}
>
>  @item @emph{Reference}:
>  @uref{https://www.openmp.org/, OpenMP specification v5.0}, Section 6.21
> @@ -5710,7 +5710,8 @@ For the memory spaces, the following applies:
>  @itemize
>  @item @code{omp_default_mem_space} is supported
>  @item @code{omp_const_mem_space} maps to @code{omp_default_mem_space}
> -@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space}
> +@item @code{omp_low_lat_mem_space} is only available on supported
> devices,
> +      and maps to @code{omp_default_mem_space} otherwise.
>  @item @code{omp_large_cap_mem_space} maps to
> @code{omp_default_mem_space},
>        unless the memkind library is available
>  @item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space},
> @@ -5766,6 +5767,9 @@ Additional notes regarding the traits:
>  @item The @code{sync_hint} trait has no effect.
>  @end itemize
>
> +See also:
> +@ref{Offload-Target Specifics}
> +
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-12-05 17:13 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-03  0:32 [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Andrew Stubbs
2023-12-03  0:32 ` [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator Andrew Stubbs
2023-12-04 16:04   ` Tobias Burnus
2023-12-05 15:39     ` Andrew Stubbs
2023-12-05 17:13       ` Tobias Burnus
2023-12-05 11:25   ` Tobias Burnus
2023-12-03  0:32 ` [PATCH v3 2/3] openmp, nvptx: low-lat memory access traits Andrew Stubbs
2023-12-03  0:32 ` [PATCH v3 3/3] amdgcn, libgomp: low-latency allocator Andrew Stubbs
2023-12-04 15:34 ` [PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc Tobias Burnus
2023-12-04 15:53   ` Tobias Burnus

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).