From: Thomas Schwinge <tschwinge@baylibre.com>
To: Tobias Burnus <tburnus@baylibre.com>
Cc: gcc-patches@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>,
Tom de Vries <tdevries@suse.de>
Subject: Re: [v2][patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
Date: Mon, 29 Jan 2024 16:53:30 +0100
Message-ID: <875xzcf85h.fsf@euler.schwinge.ddns.net>
In-Reply-To: <53a3c4e3-452c-4445-8d4a-be66dccc9e45@baylibre.com>
Hi Tobias!
On 2024-01-23T10:55:16+0100, Tobias Burnus <tburnus@baylibre.com> wrote:
> Slightly changed patch:
>
> nvptx_attach_host_thread_to_device now fails again with an error for
> CUDA_ERROR_DEINITIALIZED, except for GOMP_OFFLOAD_fini_device.
>
> I think it makes more sense that way.
Agreed.
> Tobias Burnus wrote:
>> Testing showed that libgomp.c/target-52.c failed with:
>>
>> libgomp: cuCtxGetDevice error: unknown cuda error
>>
>> libgomp: device finalization failed
>>
>> This testcase uses OMP_DISPLAY_ENV=true and
>> OMP_TARGET_OFFLOAD=mandatory, and those env vars matter, i.e. it only
>> fails if dg-set-target-env-var is honored.
>>
>> If both env vars are set, the device initialization occurs earlier,
>> because OMP_DEFAULT_DEVICE is displayed due to the display-env env var;
>> its value (when target-offload-var is 'mandatory') might be either
>> 'omp_invalid_device' or '0'.
>>
>> It turned out that this had an effect on device finalization, which
>> caused CUDA to stop earlier than expected. This patch now handles this
>> case gracefully. For details, see the commit log message in the
>> attached patch and/or the PR.
> plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
>
> The following issue was found when running libgomp.c/target-52.c with
> nvptx offloading and with dg-set-target-env-var honored.
Curious, I've never seen this failure mode in my several different
configurations. :-|
> The issue
> occurred both with -foffload=disable and with offloading configured and
> an Nvidia device available.
>
>
> At the end of the program, the offloading parts are shut down via two
> means: the callback registered via 'atexit (gomp_target_fini)' and - via
> code generated in mkoffload - the '__attribute__((destructor)) fini'
> function that calls GOMP_offload_unregister_ver.
>
> In normal processing, first gomp_target_fini is called - which then sets
> GOMP_DEVICE_FINALIZED for the device - and later GOMP_offload_unregister_ver,
> but the latter is then essentially a no-op because the state is
> GOMP_DEVICE_FINALIZED.
> If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
> the call to omp_display_env already invokes gomp_init_targets_once, i.e.
> device initialization occurs earlier than usual, triggered via the
> __attribute__((constructor)) initialize_env.
>
> For some unknown reasons, while this does not have an effect on the
> order of the called plugin functions for initialization, it changes the
> order of function calls for shutting down. Namely, when the two environment
> variables are set, GOMP_offload_unregister_ver is called now before
> gomp_target_fini.
Re "unknown reasons", isn't that indeed explained by the different
'atexit' function/'__attribute__((destructor))' sequencing, due to
different order of 'atexit'/'__attribute__((constructor))' calls?
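As a stand-alone illustration of that sequencing (a minimal sketch, not
libgomp code): exit-time handlers registered with 'atexit' run in
reverse order of registration, so whatever moves the registration point
of one of the two teardown hooks also flips the order in which they run
at exit:

    /* Sketch: per the C standard, 'atexit' handlers run in reverse
       order of registration; registering earlier means running
       later.  */
    #include <stdio.h>
    #include <stdlib.h>

    static void registered_first (void)  { puts ("runs last"); }
    static void registered_second (void) { puts ("runs first"); }

    int
    main (void)
    {
      atexit (registered_first);
      atexit (registered_second);
      return 0;  /* Prints "runs first", then "runs last".  */
    }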
I think I agree that, defensively, we should behave correctly in libgomp
finalization, no matter in which order these calls occur.
> And it seems as if CUDA regards a call to cuModuleUnload
> (or unloading the last module?) as indication that the device context should
> be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
> CUDA_ERROR_DEINITIALIZED.
However, this I don't understand -- but would like to. Are you saying
that for:
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -1556,8 +1556,16 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
     if (image->target_data == target_data)
       {
         *prev_p = image->next;
-        if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
+        CUresult r;
+        r = CUDA_CALL_NOCHECK (cuModuleUnload, image->module);
+        GOMP_PLUGIN_debug (0, "%s: cuModuleUnload: %s\n", __FUNCTION__, cuda_error (r));
+        if (r != CUDA_SUCCESS)
           ret = false;
+        CUdevice dev_;
+        r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev_);
+        GOMP_PLUGIN_debug (0, "%s: cuCtxGetDevice: %s\n", __FUNCTION__, cuda_error (r));
+        GOMP_PLUGIN_debug (0, "%s: dev_=%d, dev->dev=%d\n", __FUNCTION__, dev_, dev->dev);
+        assert (dev_ == dev->dev);
         free (image->fns);
         free (image);
         break;
..., you're seeing an error for 'libgomp.c/target-52.c' with
'env OMP_TARGET_OFFLOAD=mandatory OMP_DISPLAY_ENV=true'? I get:
GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0
Or, is something else happening in between the 'cuModuleUnload' and your
reportedly failing 'cuCtxGetDevice'?
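For anyone reproducing this without a GCC test tree, a minimal stand-in
along the lines of 'libgomp.c/target-52.c' (hypothetical; not the
verbatim testcase) would be:

    /* Build with -fopenmp, offloading configured; run with:
         env OMP_TARGET_OFFLOAD=mandatory OMP_DISPLAY_ENV=true ./a.out  */
    int
    main (void)
    {
      int x = 0;
    #pragma omp target map(tofrom: x)
      x = 42;
      return x != 42;
    }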
Re your PR113513 details, I don't see how your failure mode could be
related to (a) the PTX code ('--with-arch=sm_80') or (b) the GPU hardware
("NVIDIA RTX A1000 6GB") (..., unless the Nvidia Driver is doing "funny"
things, of course...), so could this possibly be due to a recent change
in the CUDA Driver/Nvidia Driver? You say "CUDA Version: 12.3", but
which Nvidia Driver version? The latest versions I've tested are:
Driver Version: 525.147.05 CUDA Version: 12.0
Driver Version: 535.154.05 CUDA Version: 12.2
I'll re-try with a more recent version.
> As the previous code in nvptx_attach_host_thread_to_device wasn't expecting
> that result, it called
> GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
> causing a fatal error of the program.
>
> This commit now handles CUDA_ERROR_DEINITIALIZED specially, such that
> GOMP_OFFLOAD_fini_device just works.
I'd like to please defer that one until we understand the actual origin
of the misbehavior.
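For the record, the shape of special-casing I read the patch as
proposing is roughly the following (a sketch only, with a hypothetical
flag parameter; not the actual patch):

    /* Sketch: tolerate CUDA_ERROR_DEINITIALIZED on the
       device-finalization path only; keep it a hard error
       everywhere else.  */
    static bool
    nvptx_attach_host_thread_to_device_1 (int n, bool allow_deinitialized)
    {
      CUdevice dev;
      CUresult r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
      if (r == CUDA_ERROR_DEINITIALIZED && allow_deinitialized)
        /* CUDA has already been shut down; there is no context left to
           attach to, so GOMP_OFFLOAD_fini_device has nothing to do.  */
        return false;
      if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
        {
          GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
          return false;
        }
      /* ... the existing attach logic (using 'n') would continue
         here ...  */
      return true;
    }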
> When reading the code, the following was observed in addition:
> When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
> to ensure that the queue is emptied. It seems to make sense to do
> likewise for GOMP_offload_unregister_ver, which this commit does in
> addition.
I don't understand (a) why offload image unregistration should trigger
'goacc_fini_asyncqueues', and (b) how that relates to PR113513.
Regards
Thomas