From: Thomas Schwinge
To: Tobias Burnus
CC: Alexander Monakov, Jakub Jelinek, Tom de Vries
Subject: Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
In-Reply-To: <57b3ae5e-8f15-8bea-fa09-39bccbaa2414@codesourcery.com>
Date: Tue, 4 Apr 2023 16:40:26 +0200
Message-ID: <871qkzpv6t.fsf@euler.schwinge.homeip.net>

Hi!

During GCC/OpenMP/nvptx reverse offload investigations, about how to
replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
something re:

On 2022-08-26T11:07:28+0200, Tobias Burnus wrote:
> Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

> On the host side, the last address is checked - if fn_addr != NULL,
> it passes all arguments on to the generic (target.c) gomp_target_rev
> to do the actual offloading.
>
> CUDA does lockup when trying to copy data from the currently running
> stream; hence, a new stream is generated to do the memory copying.

> Future work for nvptx:
> * Adjust 'sleep', possibly [...]
>   to do shorter sleeps than usleep(1)?

... this busy loop.

Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':

    [...]
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
      r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
                             32, threads, 1, 0, NULL, NULL, config);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
      if (reverse_offload)
        while (true)
          {
            r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
            if (r == CUDA_SUCCESS)
              break;
            if (r == CUDA_ERROR_LAUNCH_FAILED)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
                                 maybe_abort_msg);
            else if (r != CUDA_ERROR_NOT_READY)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));

            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
              {
                struct rev_offload *rev_data = ptx_dev->rev_data;
                GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
                                        rev_data->addrs, rev_data->sizes,
                                        rev_data->kinds, rev_data->dev_num,
                                        rev_off_dev_to_host_cpy,
                                        rev_off_host_to_dev_cpy, copy_stream);
                CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
                __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
              }
            usleep (1);
          }
      else
        r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
    [...]

Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
to use "Stream Memory Operations", that allow to "Wait on a memory
location", "until the given condition on the memory is satisfied"?

For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':

    [...]
      GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
      GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
      GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
      GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
      GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

      /* Set 'fn' to trigger processing on the host; wait for completion,
         which is flagged by setting 'fn' back to 0 on the host.  */
      uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
    #if __PTX_SM__ >= 700
      asm volatile ("st.global.release.sys.u64 [%0], %1;"
                    : : "r"(addr_struct_fn), "r" (fn) : "memory");
    #else
      __sync_synchronize ();  /* membar.sys */
      asm volatile ("st.volatile.global.u64 [%0], %1;"
                    : : "r"(addr_struct_fn), "r" (fn) : "memory");
    #endif

    #if __PTX_SM__ >= 700
      uint64_t fn2;
      do
        {
          asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
                        : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
        }
      while (fn2 != 0);
    #else
      /* ld.global.u64 %r64,[__gomp_rev_offload_var];
         ld.u64 %r36,[%r64];
         membar.sys;  */
      while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
        ;  /* spin  */
    #endif
    [...]


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
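
For readers following the thread, the 'fn' handshake shared by the two listings above (the device publishes the arguments, sets 'fn' with release semantics and spins; the host polls 'fn' with acquire semantics, handles the request and resets 'fn' to 0) can be sketched without any GPU. This is only a host-side simulation under stated assumptions: a pthread plays the role of the nvptx kernel, C11 atomics replace the PTX 'st.release'/'ld.acquire' instructions, 'sched_yield' replaces 'usleep (1)', and 'struct rev_request', 'struct sim_state', 'device_thread' and 'run_handshake' are hypothetical names that do not exist in libgomp.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct rev_offload': only 'fn' and a single
   payload field are modeled.  */
struct rev_request
{
  _Atomic uint64_t fn;  /* 0: idle; nonzero: request pending.  */
  uint64_t mapnum;      /* Payload, written before 'fn' is published.  */
};

struct sim_state
{
  struct rev_request req;
  atomic_bool kernel_done;  /* Plays the role of cuStreamQuery succeeding.  */
  int n_requests;
  uint64_t payload_sum;     /* Touched by the host side only.  */
};

/* "Device" side: publish the payload, set 'fn' (release), then spin until
   the host flags completion by resetting 'fn' to 0.  */
static void *
device_thread (void *arg)
{
  struct sim_state *s = arg;
  for (int i = 1; i <= s->n_requests; i++)
    {
      s->req.mapnum = (uint64_t) i;
      atomic_store_explicit (&s->req.fn, (uint64_t) i, memory_order_release);
      while (atomic_load_explicit (&s->req.fn, memory_order_acquire) != 0)
        ;  /* spin */
    }
  atomic_store_explicit (&s->kernel_done, true, memory_order_release);
  return NULL;
}

/* "Host" side: poll 'fn'; on a pending request, consume the payload and
   acknowledge it.  Returns the sum of all payloads seen, or UINT64_MAX if
   the thread could not be created.  */
uint64_t
run_handshake (int n_requests)
{
  struct sim_state s;
  atomic_init (&s.req.fn, 0);
  s.req.mapnum = 0;
  atomic_init (&s.kernel_done, false);
  s.n_requests = n_requests;
  s.payload_sum = 0;

  pthread_t dev;
  if (pthread_create (&dev, NULL, device_thread, &s) != 0)
    return UINT64_MAX;

  for (;;)
    {
      uint64_t fn = atomic_load_explicit (&s.req.fn, memory_order_acquire);
      if (fn != 0)
        {
          /* The acquire load makes the payload written before 'fn' visible.  */
          assert (s.req.mapnum == fn);
          s.payload_sum += s.req.mapnum;  /* Stand-in for the real callback.  */
          atomic_store_explicit (&s.req.fn, 0, memory_order_release);
        }
      else if (atomic_load_explicit (&s.kernel_done, memory_order_acquire))
        break;
      else
        sched_yield ();  /* Stand-in for 'usleep (1)'.  */
    }

  pthread_join (dev, NULL);
  return s.payload_sum;
}
```

Calling 'run_handshake (3)' issues three requests with payloads 1, 2 and 3 and returns their sum. The host loop still polls, which is exactly the part the question above proposes to replace with a wait on the memory location; the sketch only illustrates the ordering the protocol relies on.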