public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
@ 2022-09-27 13:15 Tobias Burnus
  2022-09-27 13:16 ` Tobias Burnus
  0 siblings, 1 reply; 10+ messages in thread
From: Tobias Burnus @ 2022-09-27 13:15 UTC (permalink / raw)
  To: gcc-patches, Andrew Stubbs; +Cc: Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1672 bytes --]

This patch adds support to handle reverse offload to libgomp's plugin-gcn.c and
to AMD GCN's libgomp target.c.

In theory, that's all whats needed for GCN – in practice there a known issue with
private stack variables which has to be addressed independently. Once this and
the target.c generic code is committed, omp requires reverse-offload
support can be claimed for the device (→ GOMP_OFFLOAD_get_num_devices).

Note: Contrary to nvptx, the code to handle reverse offload is already enabled
if there is 'omp requires reverse_offload' (+ target functions) in the code;
for nvptx, an actual reverse-offload-target region has to exist in the code.
This probably does not matter that much in practice.

The '#if 1' code block inside plugin-gcn.c has to be replaced by the
target.c/libgomp-plugins.h/libgomp.h/libgomp.map patch that is part
of the patch: https://gcc.gnu.org/pipermail/gcc-patches/2022-September/602321.html
"[Patch] libgomp/nvptx: Prepare for reverse-offload callback handling"

Andrew did suggest a while back to piggyback on the console_output handling,
avoiding another atomic access. - If this is still wanted, I like to have some
guidance regarding how to actually implement it.

Comments, suggestions?
If not, OK for mainline?*

Tobias

*Without the '#if 1' code and once the non-nvptx bits of the other patch
have been approved and committed.


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
  2022-09-27 13:15 [Patch] libgomp/gcn: Prepare for reverse-offload callback handling Tobias Burnus
@ 2022-09-27 13:16 ` Tobias Burnus
  2022-09-29 16:24   ` Andrew Stubbs
  0 siblings, 1 reply; 10+ messages in thread
From: Tobias Burnus @ 2022-09-27 13:16 UTC (permalink / raw)
  To: gcc-patches, Andrew Stubbs; +Cc: Jakub Jelinek


[-- Attachment #1.1: Type: text/plain, Size: 1796 bytes --]

For those without a working crystal ball, I have now also included the patch.

On 27.09.22 15:15, Tobias Burnus wrote:

This patch adds support to handle reverse offload to libgomp's plugin-gcn.c and
to AMD GCN's libgomp target.c.

In theory, that's all whats needed for GCN – in practice there a known issue with
private stack variables which has to be addressed independently. Once this and
the target.c generic code is committed, omp requires reverse-offload
support can be claimed for the device (→ GOMP_OFFLOAD_get_num_devices).

Note: Contrary to nvptx, the code to handle reverse offload is already enabled
if there is 'omp requires reverse_offload' (+ target functions) in the code;
for nvptx, an actual reverse-offload-target region has to exist in the code.
This probably does not matter that much in practice.

The '#if 1' code block inside plugin-gcn.c has to be replaced by the
target.c/libgomp-plugins.h/libgomp.h/libgomp.map patch that is part
of the patch: https://gcc.gnu.org/pipermail/gcc-patches/2022-September/602321.html
"[Patch] libgomp/nvptx: Prepare for reverse-offload callback handling"

Andrew did suggest a while back to piggyback on the console_output handling,
avoiding another atomic access. - If this is still wanted, I like to have some
guidance regarding how to actually implement it.

Comments, suggestions?
If not, OK for mainline?*

Tobias

*Without the '#if 1' code and once the non-nvptx bits of the other patch
have been approved and committed.


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: rev-offload-run-gcn.diff --]
[-- Type: text/x-patch, Size: 8841 bytes --]

libgomp/gcn: Prepare for reverse-offload callback handling

libgomp/ChangeLog:

	* config/gcn/libgomp-gcn.h: New file.
	* config/gcn/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare as extern var.
	(GOMP_target_ext): Handle reverse offload.
	* plugin/plugin-gcn.c (struct kernargs): Add 'int64_t rev_ptr' as
	6th argument and 'struct rev_offload rev_data'.
	(struct agent_info): Add has_reverse_offload; move prog_finalized
	up to reduce padding.
	(create_kernel_dispatch): Init kernargs' rev_ptr and rev_data.
	(reverse_offload): New.
	(run_kernel): Call it.
	(GOMP_OFFLOAD_init_device, GOMP_OFFLOAD_load_image): Set
	has_reverse_offload.

 libgomp/config/gcn/libgomp-gcn.h | 50 +++++++++++++++++++++++++++++++++++++
 libgomp/config/gcn/target.c      | 35 ++++++++++++++++++++------
 libgomp/plugin/plugin-gcn.c      | 54 +++++++++++++++++++++++++++++++++++++---
 3 files changed, 129 insertions(+), 10 deletions(-)

diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h
new file mode 100644
index 00000000000..884f0094d05
--- /dev/null
+++ b/libgomp/config/gcn/libgomp-gcn.h
@@ -0,0 +1,50 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
+   needef for this target.  */
+
+#ifndef LIBGOMP_GCN_H
+#define LIBGOMP_GCN_H 1
+
+
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_GCN_H */
diff --git a/libgomp/config/gcn/target.c b/libgomp/config/gcn/target.c
index c8484fa18d9..ecbf3f337d0 100644
--- a/libgomp/config/gcn/target.c
+++ b/libgomp/config/gcn/target.c
@@ -24,8 +24,11 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-gcn.h"
 #include <limits.h>
 
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
 	     unsigned int thread_limit, bool first)
@@ -75,16 +78,34 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
+  struct rev_offload *rev;
+
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK || fn == NULL)
+    return;
+
+  register void **kernargs asm ("s8");
+  rev = (struct rev_offload *) kernargs[5];
+
+  while (__sync_lock_test_and_set (&rev->lock, (uint8_t) 1))
+    /* spin */ ;
+
+  rev->mapnum = mapnum;
+  rev->addrs = hostaddrs;
+  rev->sizes = sizes;
+  rev->kinds = kinds;
+  rev->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* 'fn' must be last.  */
+  __atomic_store_n (&rev->fn, fn, __ATOMIC_RELEASE);
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+  while (__atomic_load_n (&rev->fn, __ATOMIC_ACQUIRE) != 0)
+    /* spin */ ;
+  __sync_lock_release (&rev->lock);
 }
 
 void
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 04b122f2a09..ebc9a55eb13 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -42,6 +42,7 @@
 #include <dlfcn.h>
 #include <signal.h>
 #include "libgomp-plugin.h"
+#include "config/gcn/libgomp-gcn.h"
 #include "gomp-constants.h"
 #include <elf.h>
 #include "oacc-plugin.h"
@@ -251,6 +252,9 @@ struct kernargs {
     Only needed for OpenMP.  */
   int64_t arena_ptr;
 
+  /* A pointer to reverse-offload. */
+  int64_t rev_ptr;
+
   /* Output data.  */
   struct output {
     int return_value;
@@ -267,6 +271,8 @@ struct kernargs {
     } queue[1024];
     unsigned int consumed;
   } output_data;
+
+  struct rev_offload rev_data;
 };
 
 /* A queue entry for a future asynchronous launch.  */
@@ -422,6 +428,12 @@ struct agent_info
      if it has been.  */
   bool initialized;
 
+  /* Flag whether the HSA program that consists of all the modules has been
+     finalized.  */
+  bool prog_finalized;
+  /* Flag whether the HSA OpenMP's requires_reverse_offload has been used.  */
+  bool has_reverse_offload;
+
   /* The instruction set architecture of the device. */
   gcn_isa device_isa;
   /* Name of the agent. */
@@ -456,9 +468,6 @@ struct agent_info
      thread should have locked agent->module_rwlock for reading before
      acquiring it.  */
   pthread_mutex_t prog_mutex;
-  /* Flag whether the HSA program that consists of all the modules has been
-     finalized.  */
-  bool prog_finalized;
   /* HSA executable - the finalized program that is used to locate kernels.  */
   hsa_executable_t executable;
 };
@@ -1915,6 +1924,9 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams)
        i++)
     kernargs->output_data.queue[i].written = 0;
   kernargs->output_data.consumed = 0;
+  kernargs->rev_ptr = (int64_t) &kernargs->rev_data;
+  kernargs->rev_data.lock = 0;
+  kernargs->rev_data.fn = 0;
 
   /* Pass in the heap location.  */
   kernargs->heap_ptr = (int64_t)kernel->module->heap;
@@ -1931,6 +1943,36 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams)
   return shadow;
 }
 
+#if 1
+/* This is part of the patch:
+   "libgomp/nvptx: Prepare for reverse-offload callback handling" */
+static void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *)
+{
+}
+#endif
+
+static void
+reverse_offload (struct kernargs *kernargs)
+{
+  uint64_t fn_ptr = __atomic_load_n (&kernargs->rev_data.fn, __ATOMIC_ACQUIRE);
+  if (fn_ptr == 0)
+    return;
+
+  uint64_t mapnum = kernargs->rev_data.mapnum;
+  uint64_t addr_ptr = kernargs->rev_data.addrs;
+  uint64_t sizes_ptr = kernargs->rev_data.sizes;
+  uint64_t kinds_ptr = kernargs->rev_data.kinds;
+  int dev_num = (int) kernargs->rev_data.dev_num;
+  GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr, kinds_ptr,
+			  dev_num, NULL, NULL);
+  __atomic_store_n (&kernargs->rev_data.fn, 0, __ATOMIC_RELEASE);
+}
+
 /* Output any data written to console output from the kernel.  It is expected
    that this function is polled during kernel execution.
 
@@ -2263,11 +2305,15 @@ run_kernel (struct kernel_info *kernel, void *vars,
 
   GCN_DEBUG ("Kernel dispatched, waiting for completion\n");
 
+  bool has_reverse_offload = kernel->agent->has_reverse_offload;
+
   /* Root signal waits with 1ms timeout.  */
   while (hsa_fns.hsa_signal_wait_acquire_fn (s, HSA_SIGNAL_CONDITION_LT, 1,
 					     1000 * 1000,
 					     HSA_WAIT_STATE_BLOCKED) != 0)
     {
+      if (has_reverse_offload)
+	reverse_offload (shadow->kernarg_address);
       console_output (kernel, shadow->kernarg_address, false);
     }
   console_output (kernel, shadow->kernarg_address, true);
@@ -3340,6 +3386,7 @@ GOMP_OFFLOAD_init_device (int n)
 
   GCN_DEBUG ("GCN agent %d initialized\n", n);
 
+  agent->has_reverse_offload = false;
   agent->initialized = true;
   return true;
 }
@@ -3547,6 +3594,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       GOMP_OFFLOAD_dev2host (agent->device_id, *rev_fn_table,
 			     (void*) fn_table_addr,
 			     kernel_count * sizeof (uint64_t));
+      agent->has_reverse_offload = true;
     }
 
   return kernel_count + var_count + other_count;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
  2022-09-27 13:16 ` Tobias Burnus
@ 2022-09-29 16:24   ` Andrew Stubbs
  2022-10-12 14:29     ` Tobias Burnus
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Stubbs @ 2022-09-29 16:24 UTC (permalink / raw)
  To: Tobias Burnus, gcc-patches; +Cc: Jakub Jelinek

On 27/09/2022 14:16, Tobias Burnus wrote:
> @@ -422,6 +428,12 @@ struct agent_info
>       if it has been.  */
>    bool initialized;
>  
> +  /* Flag whether the HSA program that consists of all the modules has been
> +     finalized.  */
> +  bool prog_finalized;
> +  /* Flag whether the HSA OpenMP's requires_reverse_offload has been used.  */
> +  bool has_reverse_offload;
> +
>    /* The instruction set architecture of the device. */
>    gcn_isa device_isa;
>    /* Name of the agent. */
> @@ -456,9 +468,6 @@ struct agent_info
>       thread should have locked agent->module_rwlock for reading before
>       acquiring it.  */
>    pthread_mutex_t prog_mutex;
> -  /* Flag whether the HSA program that consists of all the modules has been
> -     finalized.  */
> -  bool prog_finalized;
>    /* HSA executable - the finalized program that is used to locate kernels.  */
>    hsa_executable_t executable;
>  };

Why has prog_finalized been moved?

> Andrew did suggest a while back to piggyback on the console_output handling,
> avoiding another atomic access. - If this is still wanted, I like to have some
> guidance regarding how to actually implement it.

The console output ring buffer has the following type:

    struct output {
      int return_value;
      unsigned int next_output;
      struct printf_data {
        int written;
        char msg[128];
        int type;
        union {
          int64_t ivalue;
          double dvalue;
          char text[128];
        };
      } queue[1024];
      unsigned int consumed;
    } output_data;

That is, for each entry in the buffer there is a 128-byte message 
string, an integer argument-type identifier, and a 128-byte argument 
field.  Before we had printf we had functions that could print 
string+int (gomp_print_integer, type==0), string+double 
(gomp_print_double, type==1) and string+string (gomp_print_string, 
type==2). The string conversion could then be done on the host to keep 
the target code simple. These would still be useful functions if you 
want to dump debug quickly without affecting performance so much, but I 
don't think they ever got upstreamed because somebody (who should have 
known better!) created an unrelated function upstream with the same name 
(gomp_print_string) and we already had working printf by then so the 
effort to fix it wasn't worth it.

The current printf implementation (actually the write syscall), uses 
type==3 to print 256-bytes of output, per packet, with no implied newline.

The point is that you can use the "msg" and "text" fields for whatever 
data you want, as long as you invent a new value for "type".

The current loop has:

   switch (data->type)
     {
     case 0: printf ("%.128s%ld\n", data->msg, data->ivalue); break;
     case 1: printf ("%.128s%f\n", data->msg, data->dvalue); break;
     case 2: printf ("%.128s%.128s\n", data->msg, data->text); break;
     case 3: printf ("%.128s%.128s", data->msg, data->text); break;
     default: printf ("GCN print buffer error!\n"); break;
     }

You can make "case 4" do whatever you want. There are enough bytes for 4 
pointers, and you could use multiple packets (although it's not safe to 
assume they're contiguous or already arrived; maybe "case 4" for part 1, 
"case 5" for part 2). It's possible to change this structure, of course, 
but the target implementation is in newlib so versioning becomes a problem.

Reusing this would remove the need for has_reverse_offload, since the 
console output is scanned anyway, and also eliminate rev_ptr, rev_data, 
and means that, hypothetically, the device can queue up reverse offload 
requests asynchronously in the ring buffer (you'd need to ensure 
multi-part packets don't get interleaved though).

Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
  2022-09-29 16:24   ` Andrew Stubbs
@ 2022-10-12 14:29     ` Tobias Burnus
  2022-10-12 17:09       ` Andrew Stubbs
  2022-11-18 17:41       ` Tobias Burnus
  0 siblings, 2 replies; 10+ messages in thread
From: Tobias Burnus @ 2022-10-12 14:29 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches; +Cc: Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2306 bytes --]

On 29.09.22 18:24, Andrew Stubbs wrote:
> On 27/09/2022 14:16, Tobias Burnus wrote:
>> Andrew did suggest a while back to piggyback on the console_output
>> handling,
>> avoiding another atomic access. - If this is still wanted, I like to
>> have some
>> guidance regarding how to actually implement it.
> [...]
> The point is that you can use the "msg" and "text" fields for whatever
> data you want, as long as you invent a new value for "type".
> [....]
> You can make "case 4" do whatever you want. There are enough bytes for
> 4 pointers, and you could use multiple packets (although it's not safe
> to assume they're contiguous or already arrived; maybe "case 4" for
> part 1, "case 5" for part 2). It's possible to change this structure,
> of course, but the target implementation is in newlib so versioning
> becomes a problem.

I think  – also looking at the Newlib write.c implementation - that the
data is contiguous: there is an atomic add, where instead of passing '1'
for a single slot, I could also add '2' for two slots.

Attached is one variant – for the decl of the GOMP_OFFLOAD_target_rev,
it needs the generic parts of the sister nvptx patch.*

2*128 bytes were not enough, I need 3*128 bytes. (Or rather 5*64 + 32.)
As target_ext is blocking, I decided to use a stack local variable for
the remaining arguments and pass it along. Alternatively, I could also
use 2 slots - and process them together. This would avoid one
device->host memory copy but would make console_output less clear.

OK for mainline?

Tobias

* https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603354.html

PS: Currently, device stack variables are private and cannot be accessed
from the host; this will change in a separate patch. It not only affects
the "rest" part as used in this patch but also the actual arrays behind
addr, kinds, and sizes. And quite likely a lot of the map/firstprivate
variables passed to addr.

As num_devices() will return 0 or -1, this is for now a non-issue.
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: rev-offload-run-gcn-v2.diff --]
[-- Type: text/x-patch, Size: 7242 bytes --]

libgomp/gcn: Prepare for reverse-offload callback handling

libgomp/ChangeLog:

	* config/gcn/libgomp-gcn.h: New file; contains
	struct output, declared previously in plugin-gcn.c.
	* config/gcn/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare as extern var.
	(GOMP_target_ext): Handle reverse offload.
	* plugin/plugin-gcn.c: Include libgomp-gcn.h.
	(struct kernargs): Replace struct def by the one
	from libgomp-gcn.h for output_data.
	(process_reverse_offload): New.
	(console_output): Call it.

 libgomp/config/gcn/libgomp-gcn.h | 61 ++++++++++++++++++++++++++++++++++++++++
 libgomp/config/gcn/target.c      | 44 ++++++++++++++++++++++++-----
 libgomp/plugin/plugin-gcn.c      | 34 ++++++++++++----------
 3 files changed, 117 insertions(+), 22 deletions(-)

diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h
new file mode 100644
index 00000000000..91560be787f
--- /dev/null
+++ b/libgomp/config/gcn/libgomp-gcn.h
@@ -0,0 +1,61 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
+   needef for this target.  */
+
+#ifndef LIBGOMP_GCN_H
+#define LIBGOMP_GCN_H 1
+
+/* This struct is also used in Newlib's libc/sys/amdgcn/write.c.  */
+struct output
+{
+  int return_value;
+  unsigned int next_output;
+  struct printf_data {
+    int written;
+    union {
+      char msg[128];
+      uint64_t msg_u64[2];
+    };
+    int type;
+    union {
+      int64_t ivalue;
+      double dvalue;
+      char text[128];
+      uint64_t value_u64[2];
+    };
+  } queue[1024];
+  unsigned int consumed;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_GCN_H */
diff --git a/libgomp/config/gcn/target.c b/libgomp/config/gcn/target.c
index c8484fa18d9..f5a4bf64655 100644
--- a/libgomp/config/gcn/target.c
+++ b/libgomp/config/gcn/target.c
@@ -24,8 +24,11 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-gcn.h"
 #include <limits.h>
 
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
 	     unsigned int thread_limit, bool first)
@@ -75,16 +78,43 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK || fn == NULL)
+    return;
+
+  /* The output data is at ((void*) kernargs)[2].  */
+  register void **kernargs asm("s8");
+  struct output *data = (struct output *) kernargs[2];
+  /* Reserve one slot. */
+  unsigned int index = __atomic_fetch_add (&data->next_output, 1,
+					   __ATOMIC_ACQUIRE);
+
+  if ((unsigned int) (index + 1) < data->consumed)
+    abort ();  /* Overflow.  */
+
+  /* Spinlock while the host catches up.  */
+  if (index >= 1024)
+    while (__atomic_load_n (&data->consumed, __ATOMIC_ACQUIRE)
+	   <= (index - 1024))
+      asm ("s_sleep 64");
+
+  unsigned int slot = index % 1024;
+  uint64_t addrs_sizes_kind[3] = {(uint64_t) hostaddrs, (uint64_t) sizes,
+				  (uint64_t) kinds};
+  data->queue[slot].msg_u64[0] = (uint64_t) fn;
+  data->queue[slot].msg_u64[1] = (uint64_t) mapnum;
+  data->queue[slot].value_u64[0] = (uint64_t) &addrs_sizes_kind[0];
+  data->queue[slot].value_u64[1] = (uint64_t) GOMP_ADDITIONAL_ICVS.device_num;
+
+  data->queue[slot].type = 4; /* Reverse offload.  */
+  __atomic_store_n (&data->queue[slot].written, 1, __ATOMIC_RELEASE);
+
+  /* Spinlock while the host catches up.  */
+  while (__atomic_load_n (&data->queue[slot].written, __ATOMIC_ACQUIRE) != 0)
+    asm ("s_sleep 64");
 }
 
 void
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 04b122f2a09..ffe5cf5af2c 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -42,6 +42,7 @@
 #include <dlfcn.h>
 #include <signal.h>
 #include "libgomp-plugin.h"
+#include "config/gcn/libgomp-gcn.h"  /* For struct output.  */
 #include "gomp-constants.h"
 #include <elf.h>
 #include "oacc-plugin.h"
@@ -252,21 +253,7 @@ struct kernargs {
   int64_t arena_ptr;
 
   /* Output data.  */
-  struct output {
-    int return_value;
-    unsigned int next_output;
-    struct printf_data {
-      int written;
-      char msg[128];
-      int type;
-      union {
-	int64_t ivalue;
-	double dvalue;
-	char text[128];
-      };
-    } queue[1024];
-    unsigned int consumed;
-  } output_data;
+  struct output output_data;
 };
 
 /* A queue entry for a future asynchronous launch.  */
@@ -1931,6 +1918,19 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams)
   return shadow;
 }
 
+static void
+process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t rev_data,
+			 uint64_t dev_num64)
+{
+  int dev_num = dev_num64;
+  uint64_t addrs_sizes_kinds[3];
+  GOMP_OFFLOAD_host2dev (dev_num, &addrs_sizes_kinds, (void *) rev_data,
+			 sizeof (addrs_sizes_kinds));
+  GOMP_PLUGIN_target_rev (fn, mapnum, addrs_sizes_kinds[0],
+			  addrs_sizes_kinds[1], addrs_sizes_kinds[2],
+			  dev_num, NULL, NULL, NULL);
+}
+
 /* Output any data written to console output from the kernel.  It is expected
    that this function is polled during kernel execution.
 
@@ -1975,6 +1975,10 @@ console_output (struct kernel_info *kernel, struct kernargs *kernargs,
 	case 1: printf ("%.128s%f\n", data->msg, data->dvalue); break;
 	case 2: printf ("%.128s%.128s\n", data->msg, data->text); break;
 	case 3: printf ("%.128s%.128s", data->msg, data->text); break;
+	case 4:
+	  process_reverse_offload (data->msg_u64[0], data->msg_u64[1],
+				   data->value_u64[0],data->value_u64[1]);
+	  break;
 	default: printf ("GCN print buffer error!\n"); break;
 	}
       data->written = 0;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
  2022-10-12 14:29     ` Tobias Burnus
@ 2022-10-12 17:09       ` Andrew Stubbs
  2022-10-12 17:25         ` Tobias Burnus
  2022-11-18 17:41       ` Tobias Burnus
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Stubbs @ 2022-10-12 17:09 UTC (permalink / raw)
  To: Tobias Burnus, gcc-patches; +Cc: Jakub Jelinek

On 12/10/2022 15:29, Tobias Burnus wrote:
> On 29.09.22 18:24, Andrew Stubbs wrote:
>> On 27/09/2022 14:16, Tobias Burnus wrote:
>>> Andrew did suggest a while back to piggyback on the console_output 
>>> handling,
>>> avoiding another atomic access. - If this is still wanted, I like to 
>>> have some
>>> guidance regarding how to actually implement it.
>> [...]
>> The point is that you can use the "msg" and "text" fields for whatever 
>> data you want, as long as you invent a new value for "type".
>> [....]
>> You can make "case 4" do whatever you want. There are enough bytes for 
>> 4 pointers, and you could use multiple packets (although it's not safe 
>> to assume they're contiguous or already arrived; maybe "case 4" for 
>> part 1, "case 5" for part 2). It's possible to change this structure, 
>> of course, but the target implementation is in newlib so versioning 
>> becomes a problem.
> 
> I think  – also looking at the Newlib write.c implementation - that the 
> data is contiguous: there is an atomic add, where instead of passing '1' 
> for a single slot, I could also add '2' for two slots.

Right, sorry, the buffer is circular, but the counter is linear.  It 
simplified reservation that way, but it does mean that there's a limit 
to the number of times the buffer can cycle before the counter 
saturates. (You'd need to stream out gigabytes of data to hit the limit 
though.)

> Attached is one variant – for the decl of the GOMP_OFFLOAD_target_rev, 
> it needs the generic parts of the sister nvptx patch.*
> 
> 2*128 bytes were not enough, I need 3*128 bytes. (Or rather 5*64 + 32.) 
> As target_ext is blocking, I decided to use a stack local variable for 
> the remaining arguments and pass it along. Alternatively, I could also 
> use 2 slots - and process them together. This would avoid one 
> device->host memory copy but would make console_output less clear.

> PS: Currently, device stack variables are private and cannot be accessed 
> from the host; this will change in a separate patch. It not only affects 
> the "rest" part as used in this patch but also the actual arrays behind 
> addr, kinds, and sizes. And quite likely a lot of the map/firstprivate 
> variables passed to addr.
> 
> As num_devices() will return 0 or -1, this is for now a non-issue.

So, the patch, as is, is known to be non-functional? How can you have 
tested it? For the addrs_sizes_kind data to be accessible the asm("s8") 
has to be wrong.

I think the patch looks good, in principle. The use of the existing 
ring-buffer is the right way to do it, IMO.

Can we get the manually allocated stacks patch in first and then follow 
up with these patches when they actually work?

Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
  2022-10-12 17:09       ` Andrew Stubbs
@ 2022-10-12 17:25         ` Tobias Burnus
  0 siblings, 0 replies; 10+ messages in thread
From: Tobias Burnus @ 2022-10-12 17:25 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches; +Cc: Jakub Jelinek

On 12.10.22 19:09, Andrew Stubbs wrote:

> On 12/10/2022 15:29, Tobias Burnus wrote:
>
> Right, sorry, the buffer is circular, but the counter is linear. It
> simplified reservation that way, but it does mean that there's a limit
> to the number of times the buffer can cycle before the counter
> saturates. (You'd need to stream out gigabytes of data to hit the
> limit though.)
Or in other words, you can have 2^32 = 4,294,967,296 (write chunks +
reverse offloads) per kernel launch.
> ...
>> PS: Currently, device stack variables are private and cannot be
>> accessed from the host; this will change in a separate patch. [...]
> So, the patch, as is, is known to be non-functional? How can you have
> tested it? For the addrs_sizes_kind data to be accessible the
> asm("s8") has to be wrong.

I have tested the non-addrs_sizes_kind part only, which permits to run
reverse-offload functions just fine, but only if they do not use
firstprivate or map. — And I actually also tested with the
addrs_sizes_kind part but that unsurprisingly fails hard when trying to
copy the stack data.

> I think the patch looks good, in principle. The use of the existing
> ring-buffer is the right way to do it, IMO. Can we get the manually
> allocated stacks patch in first and then follow up with these patches
> when they actually work?

I stash this patch as: "OK – but ams still want to have a glance once
__builtin_gcn_kernarg_ptr is in".

I terms of having fewer *.diff files around, I of course would prefer to
just change one line in a follow-up commit instead of keeping a full
patch around, but holding off until __builtin_gcn_kernarg_ptr is ready +
the default has changed to non-private stack variables is also fine.

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
  2022-10-12 14:29     ` Tobias Burnus
  2022-10-12 17:09       ` Andrew Stubbs
@ 2022-11-18 17:41       ` Tobias Burnus
  2022-11-18 17:56         ` Andrew Stubbs
  1 sibling, 1 reply; 10+ messages in thread
From: Tobias Burnus @ 2022-11-18 17:41 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches; +Cc: Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 3390 bytes --]

Attached is the updated/rediffed version, which now uses the builtin
instead of the 'asm("s8").

The code in principle works; that is: If no private stack variables are
copied, it works.

Or in other words: reverse-offload target regions that don't use
firstprivate or mapping work, the rest would crash. That's avoided by
not accepting reverse offload inside GOMP_OFFLOAD_get_num_devices for now.

To get it working, the manual stack allocation patch + the trivial
update to that get_num_devices func is needed, but no change to the
attached patch.

In order to reduce local patches, I would love to have it on mainline –
otherwise, I have at least the current version in gcc-patches@.

Tobias

PS: Previous patch email quoted below. Note: there were two follow up
emails, one by Andrew and one by me; cf. your own mail archive (of this
thread) or
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603383.html + the
next two by thread messages.

On 12.10.22 16:29, Tobias Burnus wrote:
> On 29.09.22 18:24, Andrew Stubbs wrote:
>> On 27/09/2022 14:16, Tobias Burnus wrote:
>>> Andrew did suggest a while back to piggyback on the console_output
>>> handling,
>>> avoiding another atomic access. - If this is still wanted, I like to
>>> have some
>>> guidance regarding how to actually implement it.
>> [...]
>> The point is that you can use the "msg" and "text" fields for
>> whatever data you want, as long as you invent a new value for "type".
>> [....]
>> You can make "case 4" do whatever you want. There are enough bytes
>> for 4 pointers, and you could use multiple packets (although it's not
>> safe to assume they're contiguous or already arrived; maybe "case 4"
>> for part 1, "case 5" for part 2). It's possible to change this
>> structure, of course, but the target implementation is in newlib so
>> versioning becomes a problem.
>
> I think  – also looking at the Newlib write.c implementation - that
> the data is contiguous: there is an atomic add, where instead of
> passing '1' for a single slot, I could also add '2' for two slots.
>
> Attached is one variant – for the decl of the GOMP_OFFLOAD_target_rev,
> it needs the generic parts of the sister nvptx patch.*
>
> 2*128 bytes were not enough, I need 3*128 bytes. (Or rather 5*64 +
> 32.) As target_ext is blocking, I decided to use a stack local
> variable for the remaining arguments and pass it along. Alternatively,
> I could also use 2 slots - and process them together. This would avoid
> one device->host memory copy but would make console_output less clear.
>
> OK for mainline?
>
> Tobias
>
> * https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603354.html
>
> PS: Currently, device stack variables are private and cannot be
> accessed from the host; this will change in a separate patch. It not
> only affects the "rest" part as used in this patch but also the actual
> arrays behind addr, kinds, and sizes. And quite likely a lot of the
> map/firstprivate variables passed to addr.
>
> As num_devices() will return 0 or -1, this is for now a non-issue.
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: rev-offload-run-gcn-v3.diff --]
[-- Type: text/x-patch, Size: 7272 bytes --]

libgomp/gcn: Prepare for reverse-offload callback handling

libgomp/ChangeLog:

	* config/gcn/libgomp-gcn.h: New file; contains
	struct output, declared previously in plugin-gcn.c.
	* config/gcn/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare as extern var.
	(GOMP_target_ext): Handle reverse offload.
	* plugin/plugin-gcn.c: Include libgomp-gcn.h.
	(struct kernargs): Replace struct def by the one
	from libgomp-gcn.h for output_data.
	(process_reverse_offload): New.
	(console_output): Call it.

 libgomp/config/gcn/libgomp-gcn.h | 61 ++++++++++++++++++++++++++++++++++++++++
 libgomp/config/gcn/target.c      | 44 ++++++++++++++++++++++++-----
 libgomp/plugin/plugin-gcn.c      | 34 ++++++++++++----------
 3 files changed, 117 insertions(+), 22 deletions(-)

diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h
new file mode 100644
index 00000000000..91560be787f
--- /dev/null
+++ b/libgomp/config/gcn/libgomp-gcn.h
@@ -0,0 +1,61 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
+   needef for this target.  */
+
+#ifndef LIBGOMP_GCN_H
+#define LIBGOMP_GCN_H 1
+
+/* This struct is also used in Newlib's libc/sys/amdgcn/write.c.  */
+struct output
+{
+  int return_value;
+  unsigned int next_output;
+  struct printf_data {
+    int written;
+    union {
+      char msg[128];
+      uint64_t msg_u64[2];
+    };
+    int type;
+    union {
+      int64_t ivalue;
+      double dvalue;
+      char text[128];
+      uint64_t value_u64[2];
+    };
+  } queue[1024];
+  unsigned int consumed;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_GCN_H */
diff --git a/libgomp/config/gcn/target.c b/libgomp/config/gcn/target.c
index c8484fa18d9..27854565d40 100644
--- a/libgomp/config/gcn/target.c
+++ b/libgomp/config/gcn/target.c
@@ -24,8 +24,11 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-gcn.h"
 #include <limits.h>
 
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
 	     unsigned int thread_limit, bool first)
@@ -75,16 +78,43 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK || fn == NULL)
+    return;
+
+  /* The output data is at ((void*) kernargs)[2].  */
+  register void **kernargs = (void**) __builtin_gcn_kernarg_ptr ();
+  struct output *data = (struct output *) kernargs[2];
+  /* Reserve one slot. */
+  unsigned int index = __atomic_fetch_add (&data->next_output, 1,
+					   __ATOMIC_ACQUIRE);
+
+  if ((unsigned int) (index + 1) < data->consumed)
+    abort ();  /* Overflow.  */
+
+  /* Spinlock while the host catches up.  */
+  if (index >= 1024)
+    while (__atomic_load_n (&data->consumed, __ATOMIC_ACQUIRE)
+	   <= (index - 1024))
+      asm ("s_sleep 64");
+
+  unsigned int slot = index % 1024;
+  uint64_t addrs_sizes_kind[3] = {(uint64_t) hostaddrs, (uint64_t) sizes,
+				  (uint64_t) kinds};
+  data->queue[slot].msg_u64[0] = (uint64_t) fn;
+  data->queue[slot].msg_u64[1] = (uint64_t) mapnum;
+  data->queue[slot].value_u64[0] = (uint64_t) &addrs_sizes_kind[0];
+  data->queue[slot].value_u64[1] = (uint64_t) GOMP_ADDITIONAL_ICVS.device_num;
+
+  data->queue[slot].type = 4; /* Reverse offload.  */
+  __atomic_store_n (&data->queue[slot].written, 1, __ATOMIC_RELEASE);
+
+  /* Spinlock while the host catches up.  */
+  while (__atomic_load_n (&data->queue[slot].written, __ATOMIC_ACQUIRE) != 0)
+    asm ("s_sleep 64");
 }
 
 void
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 04b122f2a09..ffe5cf5af2c 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -42,6 +42,7 @@
 #include <dlfcn.h>
 #include <signal.h>
 #include "libgomp-plugin.h"
+#include "config/gcn/libgomp-gcn.h"  /* For struct output.  */
 #include "gomp-constants.h"
 #include <elf.h>
 #include "oacc-plugin.h"
@@ -252,21 +253,7 @@ struct kernargs {
   int64_t arena_ptr;
 
   /* Output data.  */
-  struct output {
-    int return_value;
-    unsigned int next_output;
-    struct printf_data {
-      int written;
-      char msg[128];
-      int type;
-      union {
-	int64_t ivalue;
-	double dvalue;
-	char text[128];
-      };
-    } queue[1024];
-    unsigned int consumed;
-  } output_data;
+  struct output output_data;
 };
 
 /* A queue entry for a future asynchronous launch.  */
@@ -1931,6 +1918,19 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams)
   return shadow;
 }
 
+static void
+process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t rev_data,
+			 uint64_t dev_num64)
+{
+  int dev_num = dev_num64;
+  uint64_t addrs_sizes_kinds[3];
+  GOMP_OFFLOAD_host2dev (dev_num, &addrs_sizes_kinds, (void *) rev_data,
+			 sizeof (addrs_sizes_kinds));
+  GOMP_PLUGIN_target_rev (fn, mapnum, addrs_sizes_kinds[0],
+			  addrs_sizes_kinds[1], addrs_sizes_kinds[2],
+			  dev_num, NULL, NULL, NULL);
+}
+
 /* Output any data written to console output from the kernel.  It is expected
    that this function is polled during kernel execution.
 
@@ -1975,6 +1975,10 @@ console_output (struct kernel_info *kernel, struct kernargs *kernargs,
 	case 1: printf ("%.128s%f\n", data->msg, data->dvalue); break;
 	case 2: printf ("%.128s%.128s\n", data->msg, data->text); break;
 	case 3: printf ("%.128s%.128s", data->msg, data->text); break;
+	case 4:
+	  process_reverse_offload (data->msg_u64[0], data->msg_u64[1],
+				   data->value_u64[0],data->value_u64[1]);
+	  break;
 	default: printf ("GCN print buffer error!\n"); break;
 	}
       data->written = 0;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling
  2022-11-18 17:41       ` Tobias Burnus
@ 2022-11-18 17:56         ` Andrew Stubbs
  2022-11-21 13:40           ` [Patch] libgomp/gcn: fix/improve struct output (was: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling) Tobias Burnus
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Stubbs @ 2022-11-18 17:56 UTC (permalink / raw)
  To: Tobias Burnus, gcc-patches; +Cc: Jakub Jelinek

On 18/11/2022 17:41, Tobias Burnus wrote:
> Attached is the updated/rediffed version, which now uses the builtin 
> instead of the 'asm("s8").
> 
> The code in principle works; that is: If no private stack variables are 
> copied, it works.
> 
> Or in other words: reverse-offload target regions that don't use 
> firstprivate or mapping work, the rest would crash. That's avoided by 
> not accepting reverse offload inside GOMP_OFFLOAD_get_num_devices for now.
> 
> To get it working, the manual stack allocation patch + the trivial 
> update to that get_num_devices func is needed, but no change to the 
> attached patch.
> 
> In order to reduce local patches, I would love to have it on mainline – 
> otherwise, I have at least the current version in gcc-patches@.

OK with me.

Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Patch] libgomp/gcn: fix/improve struct output (was: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling)
  2022-11-18 17:56         ` Andrew Stubbs
@ 2022-11-21 13:40           ` Tobias Burnus
  2022-11-21 14:19             ` Andrew Stubbs
  0 siblings, 1 reply; 10+ messages in thread
From: Tobias Burnus @ 2022-11-21 13:40 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 948 bytes --]

Working on the builtins, I realized that I mixed up (again) bits and byes.
While 'uint64_t var[2]' has a size of 128 bits, 'char var[128]' has a size of 128 bytes.
Thus, there is sufficient space for 16 pointer-size/uin64_t values but I only need 6.

This patch now makes use of the available space, avoiding one device-to-host memory copy;
additionally, it avoids a 32bit vs 64bit alignment issue which I somehow missed :-(

Tested with libgomp on gfx908 offloading and getting only the known fails:
(libgomp.c-c++-common/teams-2.c, libgomp.fortran/async_io_*.f90,
libgomp.oacc-c-c++-common/{deep-copy-10.c,static-variable-1.c,vprop.c})

OK for mainline?

Tobias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: gcn-fix-struct-output.diff --]
[-- Type: text/x-patch, Size: 4269 bytes --]

libgomp/gcn: fix/improve struct output

output.printf_data.(value union) contains text[128], which has the size
of 128 bytes, sufficient for 16 uint64_t variables; hence value_u64[2]
could be extended to value_u64[6] - sufficient for all required arguments
to gomp_target_rev.  Additionally, next_output.printf_data.(msg union)
contained msg_u64 which then is no longer needed and also caused 32bit
vs 64bit alignment issues.

libgomp/
	* config/gcn/libgomp-gcn.h (struct output):
	Remove 'msg_u64' from the union, change
	value_u64[2] to value_u64[6].
	* config/gcn/target.c (GOMP_target_ext): Update accordingly.
	* plugin/plugin-gcn.c (process_reverse_offload, console_output):
	Likewise.

 libgomp/config/gcn/libgomp-gcn.h |  7 ++-----
 libgomp/config/gcn/target.c      | 12 ++++++------
 libgomp/plugin/plugin-gcn.c      | 17 +++++++----------
 3 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h
index 91560be787f..3933e846a86 100644
--- a/libgomp/config/gcn/libgomp-gcn.h
+++ b/libgomp/config/gcn/libgomp-gcn.h
@@ -37,16 +37,13 @@ struct output
   unsigned int next_output;
   struct printf_data {
     int written;
-    union {
-      char msg[128];
-      uint64_t msg_u64[2];
-    };
+    char msg[128];
     int type;
     union {
       int64_t ivalue;
       double dvalue;
       char text[128];
-      uint64_t value_u64[2];
+      uint64_t value_u64[6];
     };
   } queue[1024];
   unsigned int consumed;
diff --git a/libgomp/config/gcn/target.c b/libgomp/config/gcn/target.c
index 27854565d40..11ae6ec9833 100644
--- a/libgomp/config/gcn/target.c
+++ b/libgomp/config/gcn/target.c
@@ -102,12 +102,12 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
       asm ("s_sleep 64");
 
   unsigned int slot = index % 1024;
-  uint64_t addrs_sizes_kind[3] = {(uint64_t) hostaddrs, (uint64_t) sizes,
-				  (uint64_t) kinds};
-  data->queue[slot].msg_u64[0] = (uint64_t) fn;
-  data->queue[slot].msg_u64[1] = (uint64_t) mapnum;
-  data->queue[slot].value_u64[0] = (uint64_t) &addrs_sizes_kind[0];
-  data->queue[slot].value_u64[1] = (uint64_t) GOMP_ADDITIONAL_ICVS.device_num;
+  data->queue[slot].value_u64[0] = (uint64_t) fn;
+  data->queue[slot].value_u64[1] = (uint64_t) mapnum;
+  data->queue[slot].value_u64[2] = (uint64_t) hostaddrs;
+  data->queue[slot].value_u64[3] = (uint64_t) sizes;
+  data->queue[slot].value_u64[4] = (uint64_t) kinds;
+  data->queue[slot].value_u64[5] = (uint64_t) GOMP_ADDITIONAL_ICVS.device_num;
 
   data->queue[slot].type = 4; /* Reverse offload.  */
   __atomic_store_n (&data->queue[slot].written, 1, __ATOMIC_RELEASE);
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index ffe5cf5af2c..388e87b7765 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1919,16 +1919,12 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams)
 }
 
 static void
-process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t rev_data,
-			 uint64_t dev_num64)
+process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t hostaddrs,
+			 uint64_t sizes, uint64_t kinds, uint64_t dev_num64)
 {
   int dev_num = dev_num64;
-  uint64_t addrs_sizes_kinds[3];
-  GOMP_OFFLOAD_host2dev (dev_num, &addrs_sizes_kinds, (void *) rev_data,
-			 sizeof (addrs_sizes_kinds));
-  GOMP_PLUGIN_target_rev (fn, mapnum, addrs_sizes_kinds[0],
-			  addrs_sizes_kinds[1], addrs_sizes_kinds[2],
-			  dev_num, NULL, NULL, NULL);
+  GOMP_PLUGIN_target_rev (fn, mapnum, hostaddrs, sizes, kinds, dev_num,
+			  NULL, NULL, NULL);
 }
 
 /* Output any data written to console output from the kernel.  It is expected
@@ -1976,8 +1972,9 @@ console_output (struct kernel_info *kernel, struct kernargs *kernargs,
 	case 2: printf ("%.128s%.128s\n", data->msg, data->text); break;
 	case 3: printf ("%.128s%.128s", data->msg, data->text); break;
 	case 4:
-	  process_reverse_offload (data->msg_u64[0], data->msg_u64[1],
-				   data->value_u64[0],data->value_u64[1]);
+	  process_reverse_offload (data->value_u64[0], data->value_u64[1],
+				   data->value_u64[2], data->value_u64[3],
+				   data->value_u64[4], data->value_u64[5]);
 	  break;
 	default: printf ("GCN print buffer error!\n"); break;
 	}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Patch] libgomp/gcn: fix/improve struct output (was: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling)
  2022-11-21 13:40           ` [Patch] libgomp/gcn: fix/improve struct output (was: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling) Tobias Burnus
@ 2022-11-21 14:19             ` Andrew Stubbs
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Stubbs @ 2022-11-21 14:19 UTC (permalink / raw)
  To: Tobias Burnus, Andrew Stubbs, gcc-patches

On 21/11/2022 13:40, Tobias Burnus wrote:
> Working on the builtins, I realized that I mixed up (again) bits and byes.
> While 'uint64_t var[2]' has a size of 128 bits, 'char var[128]' has a 
> size of 128 bytes.
> Thus, there is sufficient space for 16 pointer-size/uin64_t values but I 
> only need 6.
> 
> This patch now makes use of the available space, avoiding one 
> device-to-host memory copy;
> additionally, it avoids a 32bit vs 64bit alignment issue which I somehow 
> missed :-(
> 
> Tested with libgomp on gfx908 offloading and getting only the known fails:
> (libgomp.c-c++-common/teams-2.c, libgomp.fortran/async_io_*.f90,
> libgomp.oacc-c-c++-common/{deep-copy-10.c,static-variable-1.c,vprop.c})
> 
> OK for mainline?

OK, although why not set value64 to 16 entries, even though reverse 
offload only uses 6?

Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-11-21 14:19 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-27 13:15 [Patch] libgomp/gcn: Prepare for reverse-offload callback handling Tobias Burnus
2022-09-27 13:16 ` Tobias Burnus
2022-09-29 16:24   ` Andrew Stubbs
2022-10-12 14:29     ` Tobias Burnus
2022-10-12 17:09       ` Andrew Stubbs
2022-10-12 17:25         ` Tobias Burnus
2022-11-18 17:41       ` Tobias Burnus
2022-11-18 17:56         ` Andrew Stubbs
2022-11-21 13:40           ` [Patch] libgomp/gcn: fix/improve struct output (was: [Patch] libgomp/gcn: Prepare for reverse-offload callback handling) Tobias Burnus
2022-11-21 14:19             ` Andrew Stubbs

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).