On 11.10.22 13:12, Alexander Monakov wrote:
> My understanding is such trickery should not be necessary with
> the barrier-based approach, i.e. the sequence of PTX instructions
>
>   st            % plain store
>   membar.sys
>   st.volatile
>
> should be enough to guarantee that the former store is visible on the host
> before the latter, and work all the way back to sm_20.

If I understand it correctly, you mean:

  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
  __sync_synchronize ();  /* membar.sys */
  asm volatile ("st.volatile.global.u64 [%0], %1;"
                : : "r"(addr_struct_fn), "r" (fn) : "memory");

And then directly followed by the busy wait:

  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
    ; /* spin */

which GCC expands to:

  /* ld.global.u64 %r64,[__gomp_rev_offload_var];
     ld.u64 %r36,[%r64];
     membar.sys;  */

The accordingly updated patch is attached. (This change and the removal of
the mkoffload.cc part are the only larger changes. Otherwise, the patch only
addresses the minor comments by Jakub. The now-removed
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was last used before commit
r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef, i.e. it is a really old
leftover!)

Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
sm_35 and sm_70.

Tobias

*With some added code; until GOMP_OFFLOAD_get_num_devices accepts
GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image gets passed
a non-NULL value for rev_fn_table, the current patch is a no-op.
Planned next are the related GCN patch and the actual change in
libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices).
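
For anyone who wants to play with the ordering argument without a GPU, here is a rough host-only analogy of the handshake described above (plain store, full barrier, flag store, then spin until the other side clears the flag). It is only a sketch: it uses C11 atomics and pthreads instead of PTX, the seq_cst fence merely plays the role of membar.sys, and all names (rev_offload_slot, host_thread, run_handshake) are made up for illustration and are not libgomp code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared reverse-offload variable. */
struct rev_offload_slot {
  int dev_num;            /* payload, written with a plain store */
  _Atomic uint64_t fn;    /* flag: nonzero means "request pending" */
};

static struct rev_offload_slot slot;
static int seen_dev_num;

/* Stands in for the host side: wait for a request, read the payload,
   then clear the flag to signal completion. */
static void *host_thread(void *arg)
{
  (void) arg;
  while (atomic_load_explicit(&slot.fn, memory_order_acquire) == 0)
    ; /* spin until the "device" publishes a request */
  seen_dev_num = slot.dev_num;   /* ordered after the acquire load */
  atomic_store_explicit(&slot.fn, 0, memory_order_release);
  return NULL;
}

/* Stands in for the device side. FN must be nonzero, otherwise the host
   thread never sees a request. Returns the dev_num the host observed,
   or -1 if the thread could not be created. */
int run_handshake(int dev_num, uint64_t fn)
{
  pthread_t host;

  seen_dev_num = -1;
  atomic_store(&slot.fn, 0);
  if (pthread_create(&host, NULL, host_thread, NULL) != 0)
    return -1;

  slot.dev_num = dev_num;                     /* plain store */
  atomic_thread_fence(memory_order_seq_cst);  /* role of membar.sys */
  atomic_store_explicit(&slot.fn, fn, memory_order_relaxed); /* flag store */

  while (atomic_load_explicit(&slot.fn, memory_order_acquire) != 0)
    ; /* spin until the host clears the flag */

  pthread_join(host, NULL);
  assert(atomic_load(&slot.fn) == 0);
  return seen_dev_num;
}
```

Calling run_handshake (7, 0x1234) should return 7, since the fence orders the payload store before the flag store and the host only reads the payload after seeing the flag. Build with -pthread. This says nothing about what the PTX memory model guarantees across the PCIe bus; it only mirrors the structure of the protocol.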