Hi,

I wrote a patch that called some function in the common libgomp code from
GOMP_OFFLOAD_fini_device, and found that it hung due to the fact that:
- gomp_target_fini locks devices[*].lock while calling
  GOMP_OFFLOAD_fini_device, and
- the function call that I added also locked that same lock, and
- that lock is not recursive.

Given that gomp_target_fini is called at exit, I decided to move the function
call to a separate function (let's call it pre_fini_device), and register it
with atexit.

[ Other ways to handle this problem would be:
- adding a new plugin function GOMP_OFFLOAD_pre_fini_device, or
- making the lock recursive. ]

Then I ran into the problem that pre_fini_device was called after
GOMP_OFFLOAD_fini_device, due to the fact that:
- atexit (gomp_target_fini) is called at the end of gomp_target_init, and
- atexit (pre_fini_device) happens on the first plugin call, which is the
  current_device.get_num_devices_func () call earlier in gomp_target_init.

I fixed this by moving the atexit to the start of gomp_target_init.

I tested this on nvptx, and found that some cuda cleanup is no longer needed
(or possible), presumably because the cuda runtime itself registers a cleanup
at exit, which is now called before gomp_target_fini instead of after.

This patch contains:
- the move of atexit (gomp_target_fini) from the end to the start of
  gomp_target_init, and
- handling of the new situation in plugin-nvptx.c.

I suspect the code can be simplified further by assuming that cuda_alive is
always false.

Tested on x86_64 with nvptx accelerator.

Is moving the atexit (gomp_target_fini) to the start of gomp_target_init a
good idea?  Any other comments?

Thanks,
- Tom
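
PS: To illustrate why the registration order matters, here is a small
standalone sketch (not libgomp code; target_fini and pre_fini_device below
are just stand-ins for gomp_target_fini and the new function).  atexit
handlers run in reverse order of registration, so the handler registered
first runs last:

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for gomp_target_fini.  */
static void
target_fini (void)
{
  printf ("target_fini\n");
}

/* Stand-in for the new pre_fini_device.  */
static void
pre_fini_device (void)
{
  printf ("pre_fini_device\n");
}

int
main (void)
{
  /* Registering target_fini first (as when the atexit call is at the
     start of gomp_target_init) makes it run last at exit, that is,
     after pre_fini_device, which is registered later by the first
     plugin call.  */
  atexit (target_fini);
  atexit (pre_fini_device);
  return 0;
}

This prints pre_fini_device before target_fini, which is the ordering the
patch relies on: registering gomp_target_fini at the start of
gomp_target_init guarantees it runs after anything registered later by the
plugins (or by the cuda runtime itself).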