On Wed, 12 Oct 2022, Tobias Burnus wrote:

> On 11.10.22 13:12, Alexander Monakov wrote:
> > My understanding is such trickery should not be necessary with
> > the barrier-based approach, i.e. the sequence of PTX instructions
> >
> >    st    % plain store
> >    membar.sys
> >    st.volatile
> >
> > should be enough to guarantee that the former store is visible on the host
> > before the latter, and work all the way back to sm_20.
>
> If I understand it correctly, you mean:
>
>   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>
>   __sync_synchronize ();  /* membar.sys */
>   asm volatile ("st.volatile.global.u64 [%0], %1;"
>                 : : "r"(addr_struct_fn), "r" (fn) : "memory");
>
>
> And then directly followed by the busy wait:
>
>   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>     ;  /* spin */
>
> which GCC expands to:
>
>   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>      ld.u64 %r36,[%r64];
>      membar.sys;  */
>
> The such updated patch is attached.

I think the topic for which I was Cc'ed (memory space and access method for the
synchronization variable) has been resolved nicely.

I am not satisfied with some other points raised in the conversation, I hope
they are noted.

Alexander

> (This is the only change + removing the mkoffload.cc part is the only
> larger change. Otherwise, it only handles the minor comments by Jakub.
> The now removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
> until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef i.e.
> it is a really old left over!)
>
> Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
> sm_35 and sm_70.
>
> Tobias
>
> *With some added code; until GOMP_OFFLOAD_get_num_devices accepts
> GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
> gets passed a non-NULL for rev_fn_table, the current patch is a no op.
>
> Planned next is the related GCN patch – and the actual change
> in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)
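
For readers following the ordering argument, the handshake discussed above
(plain store of the payload, full barrier, store of the flag, then a busy wait
with acquire loads) can be sketched as a host-only analogy in C11. This is only
an illustration under assumptions, not the libgomp or nvptx code: `struct
request`, `host_thread` and `run_handshake` are hypothetical names, a second
pthread stands in for the host, and a sequentially consistent fence stands in
for `membar.sys`.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared request record; not the libgomp layout. */
struct request {
  int dev_num;          /* payload, written with a plain store */
  _Atomic uint64_t fn;  /* flag: nonzero means "request pending" */
};

static struct request req;
static int seen_dev_num;

/* "Host" side: wait for the flag, read the payload, then clear the flag. */
static void *host_thread(void *arg) {
  (void)arg;
  while (atomic_load_explicit(&req.fn, memory_order_acquire) == 0)
    ;  /* spin until a request is published */
  seen_dev_num = req.dev_num;  /* ordered after the acquire load above */
  atomic_store_explicit(&req.fn, 0, memory_order_release);
  return NULL;
}

/* "Device" side: plain store, full fence, flag store, then busy wait.
   Returns the dev_num the host observed, or -1 on error.
   fn must be nonzero, since zero means "no request".  */
int run_handshake(int dev_num, uint64_t fn) {
  pthread_t host;
  if (fn == 0)
    return -1;
  req.dev_num = 0;
  atomic_store(&req.fn, 0);
  seen_dev_num = -1;
  if (pthread_create(&host, NULL, host_thread, NULL) != 0)
    return -1;

  req.dev_num = dev_num;                      /* plain store */
  atomic_thread_fence(memory_order_seq_cst);  /* plays the role of membar.sys */
  atomic_store_explicit(&req.fn, fn, memory_order_relaxed);  /* flag store */

  while (atomic_load_explicit(&req.fn, memory_order_acquire) != 0)
    ;  /* spin until the host clears the flag */

  pthread_join(host, NULL);
  return seen_dev_num;
}
```

Calling `run_handshake (3, 0x1234)` should hand back the `dev_num` the other
thread saw. The point of the sketch is only that the fence placed between the
two stores is what lets the waiting side rely on the payload once it observes
the flag; it says nothing about PTX memory spaces or about visibility across
the PCIe bus.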