public inbox for gcc-patches@gcc.gnu.org
From: Thomas Schwinge <tschwinge@baylibre.com>
To: Andrew Stubbs <ams@baylibre.com>,
	gcc-patches@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>
Cc: Richard Biener <rguenther@suse.de>, Tobias Burnus <tburnus@baylibre.com>
Subject: GCN, nvptx: Errors during device probing are fatal (was: Stabilizing flaky libgomp GCN target/offloading testing)
Date: Fri, 08 Mar 2024 11:34:33 +0100	[thread overview]
Message-ID: <871q8l6mh2.fsf@euler.schwinge.ddns.net> (raw)
In-Reply-To: <87il2ij8sm.fsf@euler.schwinge.ddns.net>

[-- Attachment #1: Type: text/plain, Size: 1145 bytes --]

Hi!

On 2024-02-21T13:34:01+0100, I wrote:
> On 2024-02-01T15:49:02+0100, Richard Biener <rguenther@suse.de> wrote:
>> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>>> [...] what I
>>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]
>
>>> [...] execution test FAILs.  Not all FAILs appear all the time [...]
>
> What disturbs the testing a lot is that the GPU may get into a bad
> state, upon which any use either fails with an
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or just hangs, deep in
> 'libhsa-runtime64.so.1'...

So, there's a "fun" aspect: if we run into
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (or other errors; and similar in the
libgomp nvptx plugin) during libgomp GCN plugin device probing, then it's
not fatal, but instead silently disables the libgomp plugin/device, thus
typically silently resorting to host-fallback execution.  That's not
helpful behavior in my opinion, so I propose the attached
"GCN, nvptx: Errors during device probing are fatal".  OK to push?

(That's also the behavior that's implemented in both the GCN and nvptx
target 'run' tools.)
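
For illustration (a sketch of mine, not part of the attached patch): after
such a silent disable, a configured offload device simply "disappears", and
a test case can only notice that indirectly, along these lines:

    /* Sketch, not part of the patch: if device probing has failed
       silently, omp_get_num_devices () reports 0, and the 'target'
       region below runs in host-fallback mode, without any diagnostic.  */
    #include <stdio.h>
    #include <omp.h>

    int
    main (void)
    {
      printf ("omp_get_num_devices: %d\n", omp_get_num_devices ());
      int on_host = -1;
    #pragma omp target map(from: on_host)
      on_host = omp_is_initial_device ();
      printf ("'target' region ran on %s\n",
              on_host ? "the host (fallback)" : "an offload device");
      return 0;
    }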


Regards
 Thomas



[-- Attachment #2: 0001-GCN-nvptx-Errors-during-device-probing-are-fatal.patch --]
[-- Type: text/x-diff, Size: 5022 bytes --]

From 0dc72089dccc10d3b55096ade5fc4d72de6cb96f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwinge@baylibre.com>
Date: Thu, 7 Mar 2024 14:42:07 +0100
Subject: [PATCH] GCN, nvptx: Errors during device probing are fatal

Currently, we silently disable the libgomp GCN and nvptx plugins/devices in
the presence of certain error conditions during device probing, thus typically
silently resorting to host-fallback execution.  Make such errors fatal, as they
are for any other device access later on, so that we notice early and reliably
when things go wrong.  (Keep just two cases non-fatal: (a) the libgomp GCN or
nvptx plugins are available but 'libhsa-runtime64.so.1' or 'libcuda.so.1' are
not, and (b) those are available, but the corresponding devices are not.)

This resolves the issue of execution test cases unexpectedly PASSing,
despite:

    libgomp: GCN fatal error: Run-time could not be initialized
    Runtime message: HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

..., and therefore they were not offloaded to the GCN device, but ran in
host-fallback execution mode.  What happened in that scenario is that in
'init_hsa_context', during the initial 'GOMP_OFFLOAD_get_num_devices', we ran
into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', which wasn't fatal, but just
silently disabled the libgomp plugin/device.

Especially "entertaining" were cases where such unintended host-fallback
execution happened during effective-target checks like
'offload_device_available' (host-fallback execution there meaning: no offload
device available), but actual test cases then were running with an offload
device available, and therefore mis-configured.
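
(For reference, an 'offload_device_available'-style check essentially boils
down to a small program along the following lines -- a sketch, not the exact
libgomp testsuite code.  Exit status 0 means an offload device executed the
'target' region; silent host-fallback execution makes such a check conclude
"no offload device available".)

    #include <omp.h>

    int
    main (void)
    {
      int on_host = 1;
    #pragma omp target map(tofrom: on_host)
      on_host = omp_is_initial_device ();
      /* Non-zero exit status: the region ran on the host (fallback).  */
      return on_host;
    }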

	include/
	* cuda/cuda.h (CUresult): Add 'CUDA_ERROR_NO_DEVICE'.
	libgomp/
	* plugin/plugin-gcn.c (init_hsa_context): Add and handle
	'bool probe' parameter.  Adjust all users; errors during device
	probing are fatal.
	* plugin/plugin-nvptx.c (nvptx_get_num_devices): Aside from
	'CUDA_ERROR_NO_DEVICE', errors during device probing are fatal.
---
 include/cuda/cuda.h           |  1 +
 libgomp/plugin/plugin-gcn.c   | 14 ++++++++------
 libgomp/plugin/plugin-nvptx.c |  4 +++-
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 114aba4e074..0dca4b3a5c0 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -57,6 +57,7 @@ typedef enum {
   CUDA_ERROR_OUT_OF_MEMORY = 2,
   CUDA_ERROR_NOT_INITIALIZED = 3,
   CUDA_ERROR_DEINITIALIZED = 4,
+  CUDA_ERROR_NO_DEVICE = 100,
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_INVALID_HANDLE = 400,
   CUDA_ERROR_NOT_FOUND = 500,
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 7e141a85f31..2bea9157e9d 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1511,10 +1511,12 @@ assign_agent_ids (hsa_agent_t agent, void *data)
 }
 
 /* Initialize hsa_context if it has not already been done.
-   Return TRUE on success.  */
+   If !PROBE: returns TRUE on success.
+   If PROBE: returns TRUE on success or if the plugin/device shall be silently
+   ignored, and otherwise emits an error and returns FALSE.  */
 
 static bool
-init_hsa_context (void)
+init_hsa_context (bool probe)
 {
   hsa_status_t status;
   int agent_index = 0;
@@ -1529,7 +1531,7 @@ init_hsa_context (void)
 	GOMP_PLUGIN_fatal ("%s\n", msg);
       else
 	GCN_WARNING ("%s\n", msg);
-      return false;
+      return probe ? true : false;
     }
   status = hsa_fns.hsa_init_fn ();
   if (status != HSA_STATUS_SUCCESS)
@@ -3321,8 +3323,8 @@ GOMP_OFFLOAD_version (void)
 int
 GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
-  if (!init_hsa_context ())
-    return 0;
+  if (!init_hsa_context (true))
+    exit (EXIT_FAILURE);
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
      devices were present.  */
   if (hsa_context.agent_count > 0
@@ -3339,7 +3341,7 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 bool
 GOMP_OFFLOAD_init_device (int n)
 {
-  if (!init_hsa_context ())
+  if (!init_hsa_context (false))
     return false;
   if (n >= hsa_context.agent_count)
     {
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 81b4a7f499a..ba92a3a48cb 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -622,12 +622,14 @@ nvptx_get_num_devices (void)
       CUresult r = CUDA_CALL_NOCHECK (cuInit, 0);
       /* This is not an error: e.g. we may have CUDA libraries installed but
          no devices available.  */
-      if (r != CUDA_SUCCESS)
+      if (r == CUDA_ERROR_NO_DEVICE)
 	{
 	  GOMP_PLUGIN_debug (0, "Disabling nvptx offloading; cuInit: %s\n",
 			     cuda_error (r));
 	  return 0;
 	}
+      else if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuInit error: %s", cuda_error (r));
     }
 
   CUDA_CALL_ASSERT (cuDeviceGetCount, &n);
-- 
2.34.1

