Hi,

I wrote a patch that called some function in the common libgomp code from
GOMP_OFFLOAD_fini_device, and found that it hung due to the fact that:
- gomp_target_fini locks devices[*].lock while calling
  GOMP_OFFLOAD_fini_device, and
- the function call that I added also locked that same lock, and
- that lock is not recursive.

Given that gomp_target_fini is called at exit, I decided to move the function
call to a separate function (let's call it pre_fini_device), and register it
with atexit.

[ Other ways to handle this problem would be:
- adding a new plugin function GOMP_OFFLOAD_pre_fini_device, or
- making the lock recursive. ]

Then I ran into the problem that pre_fini_device was called after
GOMP_OFFLOAD_fini_device, due to the fact that:
- atexit (gomp_target_fini) is called at the end of gomp_target_init, and
- atexit (pre_fini_device) happens on the first plugin call, which is the
  current_device.get_num_devices_func () call earlier in gomp_target_init.

I fixed this by moving the atexit to the start of gomp_target_init.

I tested this on nvptx, and found that some cuda cleanup is no longer needed
(or possible), presumably because the cuda runtime itself registers a cleanup
at exit, which is now called before gomp_target_fini instead of after.

This patch contains:
- the move of atexit (gomp_target_fini) from the end to the start of
  gomp_target_init, and
- handling of the new situation in plugin-nvptx.c.

I suspect the code can be simplified further by assuming that cuda_alive is
always false.

Tested on x86_64 with nvptx accelerator.

Is moving the atexit (gomp_target_fini) to the start of gomp_target_init a
good idea?  Any other comments?

Thanks,
- Tom
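
PS: To illustrate why the registration order matters, here is a small
standalone sketch (not libgomp code; target_fini and pre_fini_device below
are just stand-ins for gomp_target_fini and the new function).  atexit
handlers run in reverse order of registration, so the handler registered
first runs last:

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for gomp_target_fini.  */
static void
target_fini (void)
{
  printf ("target_fini\n");
}

/* Stand-in for the new pre_fini_device.  */
static void
pre_fini_device (void)
{
  printf ("pre_fini_device\n");
}

int
main (void)
{
  /* Registering target_fini first (as when the atexit call is at the
     start of gomp_target_init) makes it run last at exit, that is,
     after pre_fini_device, which is registered later by the first
     plugin call.  */
  atexit (target_fini);
  atexit (pre_fini_device);
  return 0;
}

This prints pre_fini_device before target_fini, which is the ordering the
patch relies on: registering gomp_target_fini at the start of
gomp_target_init guarantees it runs after anything registered later by the
plugins (or by the cuda runtime itself).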