From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: by sourceware.org (Postfix, from userid 1534)
	id 87B50385742E; Fri, 9 Sep 2022 09:37:50 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 87B50385742E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1662716270;
	bh=WjMyi6ipACs72WYoY7Ilp54v6S2EvvBHDeRrFQsnu6Y=;
	h=From:To:Subject:Date:From;
	b=VZJcQeKqKVKuGMxm9Wa/JCyoIzO/NAhr118ErFHe2DMkK5CoCp7I6M0lwLo8D3UYC
	 TE6lIHqDWxFSk3iUS1aGFDoYWoAnIhvU7UScoWHjHn/x/XnHS4iW3gndsH9VPXMlBa
	 SEv8dXvMTnmyxSVB2/T4PalEAHzS59PT9V8nGcFk=
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: Tobias Burnus
To: gcc-cvs@gcc.gnu.org
Subject: [gcc/devel/omp/gcc-12] libgomp.texi: Document libmemkind + nvptx/gcn specifics
X-Act-Checkin: gcc
X-Git-Author: Tobias Burnus
X-Git-Refname: refs/heads/devel/omp/gcc-12
X-Git-Oldrev: db7940c9138683d2ee2e8c086000928bff47c127
X-Git-Newrev: 8b479946d9a32bcd47494c9cb6b121b83e185c40
Message-Id: <20220909093750.87B50385742E@sourceware.org>
Date: Fri, 9 Sep 2022 09:37:50 +0000 (GMT)
List-Id: 

https://gcc.gnu.org/g:8b479946d9a32bcd47494c9cb6b121b83e185c40

commit 8b479946d9a32bcd47494c9cb6b121b83e185c40
Author: Tobias Burnus
Date:   Fri Sep 9 11:28:51 2022 +0200

    libgomp.texi: Document libmemkind + nvptx/gcn specifics
    
    libgomp/ChangeLog:
    
            * libgomp.texi (OpenMP-Implementation Specifics): New; add libmemkind
            section; move OpenMP Context Selectors from ...
            (Offload-Target Specifics): ... here; add 'AMD Radeon (GCN)' and
            'nvptx' sections.
    
    (cherry picked from commit 4f05ff34d63b582557918189528531f35041ef0e)

Diff:
---
 libgomp/ChangeLog.omp |  10 ++++
 libgomp/libgomp.texi  | 131 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 135 insertions(+), 6 deletions(-)

diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 79574927b1f..62fbddbf677 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,3 +1,13 @@
+2022-09-09  Tobias Burnus
+
+	Backport from mainline:
+	2022-09-08  Tobias Burnus
+
+	* libgomp.texi (OpenMP-Implementation Specifics): New; add libmemkind
+	section; move OpenMP Context Selectors from ...
+	(Offload-Target Specifics): ... here; add 'AMD Radeon (GCN)' and
+	'nvptx' sections.
+
 2022-09-09  Tobias Burnus
 
 	Backport from mainline:
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 8b800985a48..10bfb90abb0 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -113,6 +113,8 @@ changed to GNU Offloading and Multi Processing Runtime Library.
 * OpenACC Library Interoperability:: OpenACC library interoperability with the
                                NVIDIA CUBLAS library.
 * OpenACC Profiling Interface::
+* OpenMP-Implementation Specifics:: Notes on specifics of this OpenMP
+                               implementation
 * Offload-Target Specifics::   Notes on offload-target specific internals
 * The libgomp ABI::            Notes on the external ABI presented by libgomp.
 * Reporting Bugs::             How to report bugs in the GNU Offloading and
@@ -4269,16 +4271,15 @@ offloading devices (it's not clear if they should be):
 @end itemize
 
 @c ---------------------------------------------------------------------
-@c Offload-Target Specifics
+@c OpenMP-Implementation Specifics
 @c ---------------------------------------------------------------------
 
-@node Offload-Target Specifics
-@chapter Offload-Target Specifics
-
-The following sections present notes on the offload-target specifics.
+@node OpenMP-Implementation Specifics
+@chapter OpenMP-Implementation Specifics
 
 @menu
 * OpenMP Context Selectors::
+* Memory allocation with libmemkind::
 @end menu
 
 @node OpenMP Context Selectors
@@ -4297,9 +4298,127 @@ The following sections present notes on the offload-target specifics.
   @tab See @code{-march=} in ``AMD GCN Options''
 @item @code{nvptx}
   @tab @code{gpu}
-  @tab See @code{-misa=} in ``Nvidia PTX Options''
+  @tab See @code{-march=} in ``Nvidia PTX Options''
 @end multitable
 
+@node Memory allocation with libmemkind
+@section Memory allocation with libmemkind
+
+On Linux systems, where the @uref{https://github.com/memkind/memkind, memkind
+library} (@code{libmemkind.so.0}) is available at runtime, it is used when
+creating memory allocators requesting:
+
+@itemize
+@item the memory space @code{omp_high_bw_mem_space}
+@item the memory space @code{omp_large_cap_mem_space}
+@item the partition trait @code{omp_atv_interleaved}
+@end itemize
+
+
+@c ---------------------------------------------------------------------
+@c Offload-Target Specifics
+@c ---------------------------------------------------------------------
+
+@node Offload-Target Specifics
+@chapter Offload-Target Specifics
+
+The following sections present notes on the offload-target specifics.
+
+@menu
+* AMD Radeon::
+* nvptx::
+@end menu
+
+@node AMD Radeon
+@section AMD Radeon (GCN)
+
+On the hardware side, the hierarchy is (fine to coarse):
+@itemize
+@item work item (thread)
+@item wavefront
+@item work group
+@item compute unit (CU)
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to work items (threads)
+@item OpenMP's threads (``parallel'') and OpenACC's workers map
+      to wavefronts
+@item OpenMP's teams and OpenACC's gangs use a thread pool whose size
+      equals the number of teams or gangs, respectively.
+@end itemize
+
+The sizes used are:
+@itemize
+@item Number of teams is the specified @code{num_teams} (OpenMP) or
+      @code{num_gangs} (OpenACC) or otherwise the number of CUs
+@item Number of wavefronts is 4 for gfx900 and 16 otherwise;
+      @code{num_threads} (OpenMP) and @code{num_workers} (OpenACC)
+      override this if smaller.
+@item The wavefront has 102 scalars and 64 vectors
+@item Number of work items is always 64
+@item The hardware permits at most 40 workgroups/CU and
+      16 wavefronts/workgroup, up to a limit of 40 wavefronts in total per CU.
+@item 80 scalar registers and 24 vector registers in non-kernel functions
+      (the chosen procedure-calling API).
+@item For the kernel itself: as many as register pressure demands (number of
+      teams and number of threads, scaled down if registers are exhausted)
+@end itemize
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+      using the C library @code{printf} functions and the Fortran
+      @code{print}/@code{write} statements.
+@end itemize
+
+
+
+@node nvptx
+@section nvptx
+
+On the hardware side, the hierarchy is (fine to coarse):
+@itemize
+@item thread
+@item warp
+@item thread block
+@item streaming multiprocessor
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to threads
+@item OpenMP's threads (``parallel'') and OpenACC's workers map to warps
+@item OpenMP's teams and OpenACC's gangs use a thread pool whose size
+      equals the number of teams or gangs, respectively.
+@end itemize
+
+The sizes used are:
+@itemize
+@item The @code{warp_size} is always 32
+@item CUDA kernel launched: @code{dim=@{#teams,1,1@}, blocks=@{#threads,warp_size,1@}}.
+@end itemize
+
+Additional information can be obtained by setting the environment variable
+@code{GOMP_DEBUG=1} (very verbose; grep for @code{kernel.*launch} for launch
+parameters).
+
+GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA;
+the JIT-compiled code is cached in the user's directory (see the CUDA
+documentation; this can be tuned by the environment variables
+@code{CUDA_CACHE_@{DISABLE,MAXSIZE,PATH@}}).
+
+Note: While the PTX ISA code is generic, the @code{-mptx=} and @code{-march=}
+command-line options still affect the PTX ISA code used and, thus, the
+requirements on CUDA version and hardware.
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+      using the C library @code{printf} functions.  Note that the Fortran
+      @code{print}/@code{write} statements are not yet supported.
+@end itemize
+
 @c ---------------------------------------------------------------------
 @c The libgomp ABI
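
For illustration only (not part of the patch above): a minimal C sketch of
the two allocator requests that the new "Memory allocation with libmemkind"
section documents (the omp_high_bw_mem_space memory space and the
omp_atv_interleaved partition trait), written with the standard OpenMP 5.x
allocator routines.  The array size and the choice of omp_default_mem_space
for the interleaved allocator are assumptions made for this sketch, not
statements taken from the patch.

/* Request memkind-backed allocators via the OpenMP allocator API.  */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  /* Allocator drawing from the high-bandwidth memory space; with
     libmemkind.so.0 present at runtime, libgomp services it via memkind.  */
  omp_allocator_handle_t hbw
    = omp_init_allocator (omp_high_bw_mem_space, 0, NULL);

  /* Allocator requesting the interleaved partition trait.  */
  omp_alloctrait_t traits[] = { { omp_atk_partition, omp_atv_interleaved } };
  omp_allocator_handle_t interleaved
    = omp_init_allocator (omp_default_mem_space, 1, traits);

  double *a = omp_alloc (1024 * sizeof (double), hbw);
  double *b = omp_alloc (1024 * sizeof (double), interleaved);
  if (a == NULL || b == NULL)
    return EXIT_FAILURE;

  for (int i = 0; i < 1024; i++)
    a[i] = b[i] = i;
  printf ("a[42] = %g, b[42] = %g\n", a[42], b[42]);

  omp_free (b, interleaved);
  omp_free (a, hbw);
  omp_destroy_allocator (interleaved);
  omp_destroy_allocator (hbw);
  return EXIT_SUCCESS;
}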
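
Similarly, a hypothetical example of how user code requests the OpenMP levels
described in the new offload-target sections: teams, the parallel region, and
simd lanes, which correspond to the team/gang pool, to wavefronts (GCN) or
warps (nvptx), and to work items (GCN) or threads (nvptx), respectively.  The
clause values (8 teams, 4 threads) are purely illustrative and not taken from
the patch.

/* Offloaded loop exercising all three OpenMP levels: teams, parallel, simd. */
#include <stdio.h>

int
main (void)
{
  int sum = 0;

  #pragma omp target teams distribute parallel for simd \
          num_teams (8) num_threads (4) map (tofrom: sum) reduction (+: sum)
  for (int i = 0; i < 1024; i++)
    sum += i;

  printf ("sum = %d (expected %d)\n", sum, 1023 * 1024 / 2);
  return 0;
}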