public inbox for gcc-cvs@sourceware.org
help / color / mirror / Atom feed
* [gcc/devel/omp/gcc-12] amdgcn, libgomp: low-latency allocator
@ 2023-02-16 18:02 Andrew Stubbs
0 siblings, 0 replies; only message in thread
From: Andrew Stubbs @ 2023-02-16 18:02 UTC (permalink / raw)
To: gcc-cvs
https://gcc.gnu.org/g:c77c45a641fedc3fe770e909cc010fb1735bdbbd
commit c77c45a641fedc3fe770e909cc010fb1735bdbbd
Author: Andrew Stubbs <ams@codesourcery.com>
Date: Mon Jan 30 14:43:00 2023 +0000
amdgcn, libgomp: low-latency allocator
This implements the OpenMP low-latency memory allocator for AMD GCN using the
small per-team LDS memory (Local Data Store).
Since addresses can now refer to LDS space, the "Global" address space is
no-longer compatible. This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories). A future patch
will re-enable "global" instructions for cases where it is known to be safe
to do so.
gcc/ChangeLog:
* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
addressing.
(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.
libgomp/ChangeLog:
* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
(GCN_LOWLAT_HEAP): New.
* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
(__gcn_lowlat_init): New prototype.
(gomp_gcn_enter_kernel): Initialize the low-latency heap.
* libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
* plugin/plugin-gcn.c (lowlat_size): New variable.
(print_kernel_dispatch): Label the group_segment_size purpose.
(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
(create_kernel_dispatch): Pass low-latency head allocation to kernel.
(run_kernel): Use shadow; don't assume values.
* testsuite/libgomp.c/allocators-7.c: Enable for amdgcn.
* config/gcn/allocator.c: New file.
Diff:
---
gcc/ChangeLog.omp | 7 ++
gcc/config/gcn/gcn-builtins.def | 2 +
gcc/config/gcn/gcn.cc | 16 +++-
libgomp/ChangeLog.omp | 19 +++++
libgomp/config/gcn/allocator.c | 129 +++++++++++++++++++++++++++++
libgomp/config/gcn/libgomp-gcn.h | 6 ++
libgomp/config/gcn/team.c | 12 +++
libgomp/libgomp.h | 3 -
libgomp/plugin/plugin-gcn.c | 35 ++++++--
libgomp/testsuite/libgomp.c/allocators-7.c | 2 +-
10 files changed, 220 insertions(+), 11 deletions(-)
diff --git a/gcc/ChangeLog.omp b/gcc/ChangeLog.omp
index a76c54f7ddc..f362b297558 100644
--- a/gcc/ChangeLog.omp
+++ b/gcc/ChangeLog.omp
@@ -1,3 +1,10 @@
+2023-02-16 Andrew Stubbs <ams@codesourcery.com>
+
+ * config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
+ * config/gcn/gcn.cc (gcn_init_machine_status): Disable global
+ addressing.
+ (gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.
+
2023-02-09 Kwok Cheung Yeung <kcy@codesourcery.com>
* gimplify.cc (omp_notice_variable): Apply GOVD_MAP_ALLOC_ONLY flag
diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def
index f1cf30bbc94..3619cab4402 100644
--- a/gcc/config/gcn/gcn-builtins.def
+++ b/gcc/config/gcn/gcn-builtins.def
@@ -164,6 +164,8 @@ DEF_BUILTIN (FIRST_CALL_THIS_THREAD_P, -1, "first_call_this_thread_p", B_INSN,
_A1 (GCN_BTI_BOOL), gcn_expand_builtin_1)
DEF_BUILTIN (KERNARG_PTR, -1, "kernarg_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
gcn_expand_builtin_1)
+DEF_BUILTIN (DISPATCH_PTR, -1, "dispatch_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
+ gcn_expand_builtin_1)
DEF_BUILTIN (GET_STACK_LIMIT, -1, "get_stack_limit", B_INSN,
_A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1)
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 0b21dbd256e..8e487b94e95 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -114,7 +114,8 @@ gcn_init_machine_status (void)
f = ggc_cleared_alloc<machine_function> ();
- if (TARGET_GCN3)
+ // FIXME: re-enable global addressing with safety for LDS-flat addresses
+ //if (TARGET_GCN3)
f->use_flat_addressing = true;
return f;
@@ -4626,6 +4627,19 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ ,
}
return ptr;
}
+ case GCN_BUILTIN_DISPATCH_PTR:
+ {
+ rtx ptr;
+ if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0)
+ ptr = gen_rtx_REG (DImode,
+ cfun->machine->args.reg[DISPATCH_PTR_ARG]);
+ else
+ {
+ ptr = gen_reg_rtx (DImode);
+ emit_move_insn (ptr, const0_rtx);
+ }
+ return ptr;
+ }
case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P:
{
/* Stash a marker in the unused upper 16 bits of s[0:1] to indicate
diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index cfcd3ca1d58..2a20516cd09 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -13,6 +13,25 @@
* testsuite/libgomp.fortran/target-nowait-array-section.f90: Fix
comment typo and improve its wording.
+2023-02-16 Andrew Stubbs <ams@codesourcery.com>
+
+ * config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
+ (TEAM_ARENA_FREE): Likewise.
+ (TEAM_ARENA_END): Likewise.
+ (GCN_LOWLAT_HEAP): New.
+ * config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
+ (__gcn_lowlat_init): New prototype.
+ (gomp_gcn_enter_kernel): Initialize the low-latency heap.
+ * libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
+ (TEAM_ARENA_FREE): Likewise.
+ (TEAM_ARENA_END): Likewise.
+ * plugin/plugin-gcn.c (lowlat_size): New variable.
+ (print_kernel_dispatch): Label the group_segment_size purpose.
+ (init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
+ (create_kernel_dispatch): Pass low-latency head allocation to kernel.
+ (run_kernel): Use shadow; don't assume values.
+ * testsuite/libgomp.c/allocators-7.c: Enable for amdgcn.
+
2023-02-16 Andrew Stubbs <ams@codesourcery.com>
* config/nvptx/allocator.c (BASIC_ALLOC_PREFIX): New define, and
diff --git a/libgomp/config/gcn/allocator.c b/libgomp/config/gcn/allocator.c
new file mode 100644
index 00000000000..001de89ffe0
--- /dev/null
+++ b/libgomp/config/gcn/allocator.c
@@ -0,0 +1,129 @@
+/* Copyright (C) 2023 Free Software Foundation, Inc.
+
+ This file is part of the GNU Offloading and Multi Processing Library
+ (libgomp).
+
+ Libgomp is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 3, or (at your option)
+ any later version.
+
+ Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+ WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ Under Section 7 of GPL version 3, you are granted additional
+ permissions described in the GCC Runtime Library Exception, version
+ 3.1, as published by the Free Software Foundation.
+
+ You should have received a copy of the GNU General Public License and
+ a copy of the GCC Runtime Library Exception along with this program;
+ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
+ <http://www.gnu.org/licenses/>. */
+
+/* The low-latency allocators use space reserved in LDS memory when the
+ kernel is launched. The heap is initialized in gomp_gcn_enter_kernel and
+ all allocations are forgotten when the kernel exits. Allocations to other
+ memory spaces all use the system malloc syscall.
+
+ The pointers returned are 64-bit "Flat" addresses indistinguishable from
+ regular pointers, but only compatible with the "flat_load/store"
+ instructions. The compiler has been coded to assign default address
+ spaces accordingly.
+
+ LDS memory is not visible to other teams, and therefore may only be used
+ when the memspace access trait is set accordingly. */
+
+#include "libgomp.h"
+#include <stdlib.h>
+
+#define BASIC_ALLOC_PREFIX __gcn_lowlat
+#define BASIC_ALLOC_YIELD asm("s_sleep 1" ::: "memory")
+#include "../../basic-allocator.c"
+
+/* The low-latency heap is located in LDS memory, but we need the __flat
+ address space for compatibility reasons. */
+#define FLAT_HEAP_PTR \
+ ((void*)(uintptr_t)(void __flat*)(void __lds *)GCN_LOWLAT_HEAP)
+
+static void *
+gcn_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
+{
+ if (memspace == omp_low_lat_mem_space)
+ {
+ char *shared_pool = FLAT_HEAP_PTR;
+
+ return __gcn_lowlat_alloc (shared_pool, size);
+ }
+ else if (memspace == ompx_host_mem_space)
+ return NULL;
+ else
+ return malloc (size);
+}
+
+static void *
+gcn_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
+{
+ if (memspace == omp_low_lat_mem_space)
+ {
+ char *shared_pool = FLAT_HEAP_PTR;
+
+ return __gcn_lowlat_calloc (shared_pool, size);
+ }
+ else if (memspace == ompx_host_mem_space)
+ return NULL;
+ else
+ return calloc (1, size);
+}
+
+static void
+gcn_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size)
+{
+ if (memspace == omp_low_lat_mem_space)
+ {
+ char *shared_pool = FLAT_HEAP_PTR;
+
+ __gcn_lowlat_free (shared_pool, addr, size);
+ }
+ else
+ free (addr);
+}
+
+static void *
+gcn_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
+ size_t oldsize, size_t size)
+{
+ if (memspace == omp_low_lat_mem_space)
+ {
+ char *shared_pool = FLAT_HEAP_PTR;
+
+ return __gcn_lowlat_realloc (shared_pool, addr, oldsize, size);
+ }
+ else if (memspace == ompx_host_mem_space)
+ return NULL;
+ else
+ return realloc (addr, size);
+}
+
+static inline int
+gcn_memspace_validate (omp_memspace_handle_t memspace, unsigned access)
+{
+ /* Disallow use of low-latency memory when it must be accessible by
+ all threads. */
+ return (memspace != omp_low_lat_mem_space
+ || access != omp_atv_all);
+}
+
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \
+ gcn_memspace_alloc (MEMSPACE, SIZE)
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \
+ gcn_memspace_calloc (MEMSPACE, SIZE)
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \
+ gcn_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \
+ gcn_memspace_free (MEMSPACE, ADDR, SIZE)
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
+ gcn_memspace_validate (MEMSPACE, ACCESS)
+
+#include "../../allocator.c"
diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h
index 1521166baa3..3e8d7451453 100644
--- a/libgomp/config/gcn/libgomp-gcn.h
+++ b/libgomp/config/gcn/libgomp-gcn.h
@@ -33,6 +33,12 @@
#define DEFAULT_GCN_STACK_SIZE (32*1024)
#define DEFAULT_TEAM_ARENA_SIZE (64*1024)
+/* These define the LDS location of data needed by OpenMP. */
+#define TEAM_ARENA_START 16 /* LDS offset of free pointer. */
+#define TEAM_ARENA_FREE 24 /* LDS offset of free pointer. */
+#define TEAM_ARENA_END 32 /* LDS offset of end pointer. */
+#define GCN_LOWLAT_HEAP 40 /* LDS offset of the OpenMP low-latency heap. */
+
struct heap
{
int64_t size;
diff --git a/libgomp/config/gcn/team.c b/libgomp/config/gcn/team.c
index ffdc09b7f35..13641a4702c 100644
--- a/libgomp/config/gcn/team.c
+++ b/libgomp/config/gcn/team.c
@@ -29,6 +29,12 @@
#include <stdlib.h>
#include <string.h>
+#define LITTLEENDIAN_CPU
+#include "hsa.h"
+
+/* Defined in basic-allocator.c via config/amdgcn/allocator.c. */
+void __gcn_lowlat_init (void *heap, size_t size);
+
static void gomp_thread_start (struct gomp_thread_pool *);
/* This externally visible function handles target region entry. It
@@ -71,6 +77,12 @@ gomp_gcn_enter_kernel (void)
*arena_free = team_arena;
*arena_end = team_arena + kernargs->arena_size_per_team;
+ /* Initialize the low-latency heap. The header is the size. */
+ void __lds *lowlat = (void __lds *)GCN_LOWLAT_HEAP;
+ hsa_kernel_dispatch_packet_t *queue_ptr = __builtin_gcn_dispatch_ptr ();
+ __gcn_lowlat_init ((void*)(uintptr_t)(void __flat*)lowlat,
+ queue_ptr->group_segment_size - GCN_LOWLAT_HEAP);
+
/* Allocate and initialize the team-local-storage data. */
struct gomp_thread *thrs = team_malloc_cleared (sizeof (*thrs)
* numthreads);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a0af66e396b..d1e45cc584e 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -114,9 +114,6 @@ extern void gomp_aligned_free (void *);
#ifdef __AMDGCN__
#include "libgomp-gcn.h"
/* The arena is initialized in config/gcn/team.c. */
-#define TEAM_ARENA_START 16 /* LDS offset of free pointer. */
-#define TEAM_ARENA_FREE 24 /* LDS offset of free pointer. */
-#define TEAM_ARENA_END 32 /* LDS offset of end pointer. */
static inline void * __attribute__((malloc))
team_malloc (size_t size)
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 70a555a24a2..ca89ba658fd 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -563,6 +563,7 @@ static size_t gcn_kernel_heap_size = DEFAULT_GCN_HEAP_SIZE;
static int team_arena_size = DEFAULT_TEAM_ARENA_SIZE;
static int stack_size = DEFAULT_GCN_STACK_SIZE;
+static int lowlat_size = -1;
/* Flag to decide whether print to stderr information about what is going on.
Set in init_debug depending on environment variables. */
@@ -1047,8 +1048,8 @@ print_kernel_dispatch (struct kernel_dispatch *dispatch, unsigned indent)
fprintf (stderr, "%*sobject: %lu\n", indent, "", dispatch->object);
fprintf (stderr, "%*sprivate_segment_size: %u\n", indent, "",
dispatch->private_segment_size);
- fprintf (stderr, "%*sgroup_segment_size: %u\n", indent, "",
- dispatch->group_segment_size);
+ fprintf (stderr, "%*sgroup_segment_size: %u (low-latency pool)\n", indent,
+ "", dispatch->group_segment_size);
fprintf (stderr, "\n");
}
@@ -1119,6 +1120,10 @@ init_environment_variables (void)
if (tmp)
stack_size = tmp;;
}
+
+ const char *lowlat = secure_getenv ("GOMP_GCN_LOWLAT_POOL");
+ if (lowlat)
+ lowlat_size = atoi (lowlat);
}
/* Return malloc'd string with name of SYMBOL. */
@@ -1946,7 +1951,25 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
shadow->signal = sync_signal.handle;
shadow->private_segment_size = kernel->private_segment_size;
- shadow->group_segment_size = kernel->group_segment_size;
+
+ if (lowlat_size < 0)
+ {
+ /* Divide the LDS between the number of running teams.
+ Allocate not less than is defined in the kernel metadata. */
+ int teams_per_cu = num_teams / get_cu_count (agent);
+ int LDS_per_team = (teams_per_cu ? 65536 / teams_per_cu : 65536);
+ shadow->group_segment_size
+ = (kernel->group_segment_size > LDS_per_team
+ ? kernel->group_segment_size
+ : LDS_per_team);;
+ }
+ else if (lowlat_size < GCN_LOWLAT_HEAP+8)
+ /* Ensure that there's space for the OpenMP libgomp data. */
+ shadow->group_segment_size = GCN_LOWLAT_HEAP+8;
+ else
+ shadow->group_segment_size = (lowlat_size > 65536
+ ? 65536
+ : lowlat_size);
/* We expect kernels to request a single pointer, explicitly, and the
rest of struct kernargs, implicitly. If they request anything else
@@ -2305,9 +2328,9 @@ run_kernel (struct kernel_info *kernel, void *vars,
print_kernel_dispatch (shadow, 2);
}
- packet->private_segment_size = kernel->private_segment_size;
- packet->group_segment_size = kernel->group_segment_size;
- packet->kernel_object = kernel->object;
+ packet->private_segment_size = shadow->private_segment_size;
+ packet->group_segment_size = shadow->group_segment_size;
+ packet->kernel_object = shadow->object;
packet->kernarg_address = shadow->kernarg_address;
hsa_signal_t s;
s.handle = shadow->signal;
diff --git a/libgomp/testsuite/libgomp.c/allocators-7.c b/libgomp/testsuite/libgomp.c/allocators-7.c
index a0a738b1d1d..5ef0c5cb3e3 100644
--- a/libgomp/testsuite/libgomp.c/allocators-7.c
+++ b/libgomp/testsuite/libgomp.c/allocators-7.c
@@ -1,7 +1,7 @@
/* { dg-do run } */
/* { dg-require-effective-target offload_device } */
-/* { dg-xfail-if "not implemented" { ! offload_target_nvptx } } */
+/* { dg-xfail-if "not implemented" { ! { offload_target_nvptx || offload_target_amdgcn } } } */
/* Test that GPU low-latency allocation is limited to team access. */
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2023-02-16 18:02 UTC | newest]
Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-16 18:02 [gcc/devel/omp/gcc-12] amdgcn, libgomp: low-latency allocator Andrew Stubbs
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).