Hi Alexander,

On 21.09.22 22:06, Alexander Monakov wrote:
> It also goes
> against the good practice of accelerator programming, which requires queueing
> work on the accelerator and letting it run asynchronously with the CPU with high
> occupancy.
> (I know libgomp still waits for the GPU to finish in each GOMP_offload_run,
> but maybe it's better to improve *that* instead of piling on new slowness)

Doesn't OpenMP's 'nowait' permit this (plus a 'depend' clause, if needed)?
Sketch 1 at the end of this mail shows what I have in mind.

> On to the patch itself.
> And for non-USM code path you're relying on cudaMemcpy observing device-side
> atomics in the right order.
> Atomics aside, CUDA pinned memory would be a natural choice for such a tiny
> structure. Did you rule it out for some reason?

I did use pinned memory (cuMemAllocHost) – but somehow it escaped me that:

"All host memory allocated in all contexts using cuMemAllocHost() and
cuMemHostAlloc() is always directly accessible from all contexts on all
devices that support unified addressing."

I have now updated the patch (but using cuMemHostAlloc instead, passing a
flag in the hope that this choice is a tad faster); cf. sketch 2 below.

>> +++ b/libgomp/config/nvptx/target.c
>> ...
>> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
>
> Shouldn't this be in a header (needs to be in sync with the plugin).

I have now created one.

>> +
>> +#if (__SIZEOF_SHORT__ != 2 \
>> +     || __SIZEOF_SIZE_T__ != 8 \
>> +     || __SIZEOF_POINTER__ != 8)
>> +#error "Data-type conversion required for rev_offload"
>> +#endif
>
> Huh? This is not a requirement that is new for reverse offload, it has always
> been like that for offloading (all ABI rules regarding type sizes, struct
> layout, bitfield layout, endianness must match).

In theory, compiling with "-m32 -foffload-options=-m64",
"-m32 -foffload-options=-m32", or "-m64 -foffload-options=-m32" is
supported. In practice, -m64 everywhere is required. I just want to make
sure that the sizes are fine for this particular code, because here I am
sure it breaks otherwise. For other parts, I think the 64-bit assumption
is hard-coded as well, but I am not completely sure that that is really
the case.

>> +  if (device != GOMP_DEVICE_HOST_FALLBACK
>> +      || fn == NULL
>> +      || GOMP_REV_OFFLOAD_VAR == NULL)
>> +    return;
>
> Shouldn't this be an 'assert' instead?

This tries to mimic what was there before – doing nothing. In any case,
this code path is unspecified or implementation defined (I forgot which
of the two), but a user might still be able to construct such code. I
leave it to Jakub whether he would like an assert, an error/warning
message, or just the return here.

>> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
>> +                    GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);
>
> Looks like all these can be plain stores, you only need ...
>
>> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
>
> ... this to be atomic with 'release' semantics in the usual producer-consumer
> pattern.
>
>> +  if (ptx_dev->rev_data->fn != 0)
>
> Surely this needs to be an atomic load with 'acquire' semantics in has_usm
> case?
>
>> +      rev_data->fn = 0;
>
> Atomic store?

Done so – updated patch attached; sketch 3 below shows the resulting
handshake.

Thanks for the comments.

Tobias
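
Sketch 1 – asynchronous offload via 'nowait' + 'depend'. This is a minimal
example of the pattern I mean, not taken from the patch or from libgomp;
the function and variable names are made up:

void
f (double *a, double *b, int n)
{
  /* First kernel: queued as a deferred target task, so the host thread
     continues immediately instead of blocking in GOMP_offload_run.  */
  #pragma omp target map(to: a[0:n]) map(from: b[0:n]) nowait depend(out: b)
  #pragma omp teams distribute parallel for
  for (int i = 0; i < n; i++)
    b[i] = 2.0 * a[i];

  /* Second kernel: 'depend' orders it after the first while both remain
     asynchronous with respect to the host.  */
  #pragma omp target map(tofrom: b[0:n]) nowait depend(inout: b)
  #pragma omp teams distribute parallel for
  for (int i = 0; i < n; i++)
    b[i] += 1.0;

  /* ... unrelated host work can overlap both kernels here ... */

  #pragma omp taskwait  /* block only once the results are needed */
}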
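
Sketch 2 – the pinned-memory allocation now used. This is only roughly what
the updated plugin does: the struct layout is a trimmed-down placeholder,
error handling is omitted, an initialized CUDA context is assumed, and I
show CU_MEMHOSTALLOC_DEVICEMAP for illustration (the flag actually passed
in the patch may differ):

#include <stdint.h>
#include <stdbool.h>
#include <cuda.h>

/* Placeholder stand-in for the real reverse-offload descriptor.  */
struct rev_offload { uint64_t fn; uint64_t dev_num; };

static struct rev_offload *rev_data;

static bool
setup_rev_data (void)
{
  void *p;
  /* Page-locked host memory; with unified addressing the returned host
     pointer is directly usable from device code as well.  */
  if (cuMemHostAlloc (&p, sizeof (struct rev_offload),
                      CU_MEMHOSTALLOC_DEVICEMAP) != CUDA_SUCCESS)
    return false;
  rev_data = (struct rev_offload *) p;
  rev_data->fn = 0;  /* no pending request */
  return true;
}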
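
Sketch 3 – the producer-consumer handshake after applying your comments,
simplified to two self-contained functions; 'fn' and 'dev_num' are the
field names from the fragments quoted above, everything else is a
placeholder:

#include <stdint.h>

struct rev_offload { uint64_t fn; uint64_t dev_num; };  /* trimmed down */

/* Device side (conceptually libgomp/config/nvptx/target.c): the payload
   uses plain stores; only the publishing store of 'fn' is atomic, with
   release semantics.  */
static void
publish_request (struct rev_offload *v, uint64_t fn, uint64_t dev_num)
{
  v->dev_num = dev_num;                               /* plain store */
  __atomic_store_n (&v->fn, fn, __ATOMIC_RELEASE);    /* publish */
  /* Spin until the host has consumed and reset the request.  */
  while (__atomic_load_n (&v->fn, __ATOMIC_ACQUIRE) != 0)
    ;
}

/* Host side (conceptually the nvptx plugin, polled while the kernel
   runs): the acquire load pairs with the release store above, so all
   payload stores are visible once 'fn' is seen as nonzero.  */
static void
consume_request (struct rev_offload *v)
{
  uint64_t fn = __atomic_load_n (&v->fn, __ATOMIC_ACQUIRE);
  if (fn == 0)
    return;
  /* ... run the host function identified by 'fn' ... */
  __atomic_store_n (&v->fn, 0, __ATOMIC_RELEASE);     /* signal done */
}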