Hi Thomas,

Thomas Schwinge wrote:
> On 2024-01-23T10:55:16+0100, Tobias Burnus wrote:
>> plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
>>
>> The following issue was found when running libgomp.c/target-52.c with
>> nvptx offloading when the dg-set-target-env-var was honored.
>
> Curious, I've never seen this failure mode in my several different
> configurations. :-|

I think we recently fixed a surprisingly high number of issues that we did not see before but that clearly had been preexisting for quite a while. (Mostly for AMD GPUs, but still.) I concur, though, that this one is trickier.

>> For some unknown reasons, while this does not have an effect on the
>> order of the called plugin functions for initialization, it changes the
>> order of function calls for shutting down. Namely, when the two environment
>> variables are set, GOMP_offload_unregister_ver is called now before
>> gomp_target_fini.
>
> Re "unknown reasons", isn't that indeed explained by the different
> 'atexit' function/'__attribute__((destructor))' sequencing, due to
> different order of 'atexit'/'__attribute__((constructor))' calls?

Maybe, maybe not. First, it does not seem to occur elsewhere, but that may simply be because remote setting of environment variables does not work with DejaGNU, so most code was run without them. And secondly, I have no idea how 'atexit' and destructors are implemented internally.
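What the C standard does guarantee, though, is the LIFO rule: 'atexit' handlers run in reverse order of registration, so an exit hook registered from a constructor (i.e. before main) runs after one registered later. As a generic, self-contained illustration of that ordering dependence only (it does not model the actual libgomp/mkoffload registration, which I have not traced):

#include <stdio.h>
#include <stdlib.h>

/* Generic illustration only: atexit handlers run in reverse order of
   registration; a constructor runs before main, so the hook it registers
   runs last at process exit.  */

static void early_registered_hook (void) { puts ("early hook (runs last)"); }
static void late_registered_hook (void) { puts ("late hook (runs first)"); }

__attribute__((constructor))
static void
ctor (void)
{
  atexit (early_registered_hook);   /* Registered before main runs.  */
}

int
main (void)
{
  atexit (late_registered_hook);    /* Registered later, runs earlier.  */
  return 0;
}

Whether the unregister destructor and gomp_target_fini's exit handling really interleave like that in our case is exactly the part I have not verified.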
>> And it seems as if CUDA regards a call to cuModuleUnload
>> (or unloading the last module?) as indication that the device context should
>> be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
>> CUDA_ERROR_DEINITIALIZED.
>
> However, this I don't understand -- but would like to. Are you saying
> that for:
>
> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -1556,8 +1556,16 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
>        if (image->target_data == target_data)
>          {
>            *prev_p = image->next;
> -          if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
> +          CUresult r;
> +          r = CUDA_CALL_NOCHECK (cuModuleUnload, image->module);
> +          GOMP_PLUGIN_debug (0, "%s: cuModuleUnload: %s\n", __FUNCTION__, cuda_error (r));
> +          if (r != CUDA_SUCCESS)
>              ret = false;
> +          CUdevice dev_;
> +          r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev_);
> +          GOMP_PLUGIN_debug (0, "%s: cuCtxGetDevice: %s\n", __FUNCTION__, cuda_error (r));
> +          GOMP_PLUGIN_debug (0, "%s: dev_=%d, dev->dev=%d\n", __FUNCTION__, dev_, dev->dev);
> +          assert (dev_ == dev->dev);
>            free (image->fns);
>            free (image);
>            break;
>
> ..., you're seeing an error for 'libgomp.c/target-52.c' with
> 'env OMP_TARGET_OFFLOAD=mandatory OMP_DISPLAY_ENV=true'?  I get:
>
>     GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
>     GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
>     GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0
>
> Or, is something else happening in between the 'cuModuleUnload' and your
> reportedly failing 'cuCtxGetDevice'?

I had cluttered the plugin with "printf" debugging; hence, no other code is calling *into* the run-time library as far as I can see. But I have now tried it with a vanilla build and your patch applied.

Result for target-52.c with the two environment variables set:

    DEBUG: GOMP_offload_unregister_ver dev=0; state=1
    DEBUG: gomp_unload_image_from_device
    DEBUG GOMP_OFFLOAD_unload_image, 0, 196609
    GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
    GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
    GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0
    DEBUG: gomp_target_fini; dev=0, state=1
    DEBUG 0
    DEBUG: nvptx_attach_host_thread_to_device - 0
    DEBUG: ERROR nvptx_attach_host_thread_to_device - 0
    libgomp: cuCtxGetDevice error: unknown cuda error

Hence: calling cuCtxGetDevice immediately after unloading the image does not fail. But calling it a bit later, via gomp_target_fini → GOMP_OFFLOAD_fini_device → nvptx_attach_host_thread_to_device, does fail.

I have attached my printf patch for reference.

* * *

> Re your PR113513 details, I don't see how your failure mode could be
> related to (a) the PTX code ('--with-arch=sm_80'), or the GPU hardware
> ("NVIDIA RTX A1000 6GB") (..., unless the Nvidia Driver is doing "funny"
> things, of course...), so could this possibly be due to a recent change
> in the CUDA Driver/Nvidia Driver?  You say "CUDA Version: 12.3", but
> which Nvidia Driver version?  The latest I've now tested are:
>
>     Driver Version: 525.147.05   CUDA Version: 12.0
>     Driver Version: 535.154.05   CUDA Version: 12.2

My laptop has:

    NVIDIA-SMI 545.29.06   Driver Version: 545.29.06   CUDA Version: 12.3

> I'd like to please defer that one until we understand the actual origin
> of the misbehavior.

(I think that patch still makes sense, but finding out first what actually goes wrong is fine nonetheless.)

>> When reading the code, the following was observed in addition:
>> When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
>> to ensure that the queue is emptied. It seems to make sense to do
>> likewise for GOMP_offload_unregister_ver, which this commit does in
>> addition.
>
> I don't understand why offload image unregistration (a) should trigger
> 'goacc_fini_asyncqueues', and (b) how that relates to PR113513?

While there is no direct relation (and none to the testcase), this is affected by the ordering of GOMP_offload_unregister_ver vs. gomp_target_fini, which is the main issue above.

Assume that for some reason GOMP_offload_unregister_ver gets called before gomp_target_fini. In that case, the asynchronous queues can still be running when the variables are removed, and goacc_fini_asyncqueues is only invoked when gomp_target_fini is called later. Of course, when gomp_target_fini is called first, it will run goacc_fini_asyncqueues first – and a later GOMP_offload_unregister_ver is a no-op, as the device is already finalized.

Thus, this part of the patch adds a safeguard against a known issue in a closely related code path.

If we can guarantee that gomp_target_fini is always called first, I suggest removing GOMP_offload_unregister_ver for good, as it will then effectively be unreachable ... (Well, not the function itself, but it will never do any actual work.) If we cannot guarantee that, and there might be an ordering issue, I would very much like to see this safeguard go in; it is very inexpensive when no work remains to be completed.

Tobias
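PS: Purely as a possible cross-check, independent of the patch: a tiny standalone Driver API program along the following lines might tell us whether the driver tears down its context state before late exit handlers run even without any cuModuleUnload involved. This is only a sketch I have not run; the structure and the early atexit registration are my own idea, and the only interesting bit is whether cuCtxGetDevice still succeeds at that point.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* Report whether cuCtxGetDevice still works during process exit.  */
static void
at_exit_check (void)
{
  CUdevice dev;
  CUresult r = cuCtxGetDevice (&dev);
  const char *msg = NULL;
  if (cuGetErrorString (r, &msg) != CUDA_SUCCESS || msg == NULL)
    msg = "unknown error code";
  fprintf (stderr, "cuCtxGetDevice at exit: %s\n", msg);
}

int
main (void)
{
  CUdevice dev;
  CUcontext ctx;

  /* Register the check before cuInit so that, per the atexit LIFO rule,
     it runs after whatever exit handling the driver may register itself.  */
  atexit (at_exit_check);

  if (cuInit (0) != CUDA_SUCCESS
      || cuDeviceGet (&dev, 0) != CUDA_SUCCESS
      || cuCtxCreate (&ctx, 0, dev) != CUDA_SUCCESS)
    return 1;
  return 0;
}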