From: Thomas Schwinge
To: Tobias Burnus
CC: Alexander Monakov, Jakub Jelinek, Tom de Vries
Subject: Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
Date: Fri, 28 Apr 2023 11:23:28 +0200
Message-ID: <875y9gcqcv.fsf@euler.schwinge.homeip.net>

Hi Tobias!

On 2023-04-28T10:28:22+0200, Tobias Burnus wrote:
> maybe I misunderstood your suggestion, but

First, note that those CUDA "Stream Memory Operations" are something that
I found by chance, and don't have any actual experience with.  I can't
seem to find a lot of documentation/usage of this API?

By the way, a similar thing also exists for AMD GPUs:
'hipStreamWaitValue32', etc.

> "Wait on a memory location"
> assumes that there will be a change – but if a target region happens to
> have no reverse offload, the memory location will never change, but
> still the target region should return to the host.

Oh indeed.  ;-) Details...

> What we would need: Wait on memory location – and return if either the
> kernel stopped *or* the memory location changed.

Or, have a way to "cancel", from the host, the 'cuStreamWaitValue32',
'cuStreamWaitValue64', after the actual 'target' kernel completed?

> My impression is that "return if the kernel stopped" is not really
> guaranteed. Or did I miss some fine print?

No, you're right.  I suppose this is as designed: for example, generally,
there may be additional kernel launches, and the "wait" will then
eventually trigger.

Could we, after the actual 'target' kernel completed, issue a host-side
"write" ('cuStreamWriteValue32', 'cuStreamWriteValue64') to that memory
location, to signal end of processing for reverse offloads?

That is:

  - enqueue 'cuLaunchKernel'
  - enqueue 'cuStreamWriteValue' (to signal end of processing for reverse offloads)
  - loop on 'cuStreamWaitValue' (until end of processing for reverse offloads)


Regards
 Thomas


> On 04.04.23 16:40, Thomas Schwinge wrote:
>> Hi!
>>
>> During GCC/OpenMP/nvptx reverse offload investigations, about how to
>> replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
>> something re:
>>
>> On 2022-08-26T11:07:28+0200, Tobias Burnus wrote:
>>> Better suggestions are welcome for the busy loop in
>>> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
>>> its value.
>>
>>> On the host side, the last address is checked - if fn_addr != NULL,
>>> it passes all arguments on to the generic (target.c) gomp_target_rev
>>> to do the actual offloading.
>>>
>>> CUDA does lockup when trying to copy data from the currently running
>>> stream; hence, a new stream is generated to do the memory copying.
>>
>>> Future work for nvptx:
>>> * Adjust 'sleep', possibly [...]
>>>   to do shorter sleeps than usleep(1)?
>>
>> ... this busy loop.
>>
>> Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':
>>
>>     [...]
>>       if (reverse_offload)
>>         CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>>       r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>>                              32, threads, 1, 0, NULL, NULL, config);
>>       if (r != CUDA_SUCCESS)
>>         GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
>>       if (reverse_offload)
>>         while (true)
>>           {
>>             r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
>>             if (r == CUDA_SUCCESS)
>>               break;
>>             if (r == CUDA_ERROR_LAUNCH_FAILED)
>>               GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
>>                                  maybe_abort_msg);
>>             else if (r != CUDA_ERROR_NOT_READY)
>>               GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
>>
>>             if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>>               {
>>                 struct rev_offload *rev_data = ptx_dev->rev_data;
>>                 GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>>                                         rev_data->addrs, rev_data->sizes,
>>                                         rev_data->kinds, rev_data->dev_num,
>>                                         rev_off_dev_to_host_cpy,
>>                                         rev_off_host_to_dev_cpy, copy_stream);
>>                 CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
>>                 __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>>               }
>>             usleep (1);
>>           }
>>       else
>>         r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
>>       if (reverse_offload)
>>         CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>>     [...]
>>
>> Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
>> to use "Stream Memory Operations",
>> that allow to "Wait on a memory location", "until the given condition on
>> the memory is satisfied"?
>>
>> For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':
>>
>>     [...]
>>       GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
>>       GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
>>       GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
>>       GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
>>       GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>>
>>       /* Set 'fn' to trigger processing on the host; wait for completion,
>>          which is flagged by setting 'fn' back to 0 on the host.  */
>>       uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
>>     #if __PTX_SM__ >= 700
>>       asm volatile ("st.global.release.sys.u64 [%0], %1;"
>>                     : : "r"(addr_struct_fn), "r" (fn) : "memory");
>>     #else
>>       __sync_synchronize ();  /* membar.sys */
>>       asm volatile ("st.volatile.global.u64 [%0], %1;"
>>                     : : "r"(addr_struct_fn), "r" (fn) : "memory");
>>     #endif
>>
>>     #if __PTX_SM__ >= 700
>>       uint64_t fn2;
>>       do
>>         {
>>           asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
>>                         : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
>>         }
>>       while (fn2 != 0);
>>     #else
>>       /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>>          ld.u64 %r36,[%r64];
>>          membar.sys;  */
>>       while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>>         ;  /* spin */
>>     #endif
>>     [...]
>>
>>
>> Regards
>>  Thomas
>> -----------------
>> Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; Limited liability company; Managing directors: Thomas Heurung, Frank Thürauf; Registered office: München; Commercial register: München, HRB 106955
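
The sequence proposed in the reply (launch the kernel, enqueue an end-of-processing write, loop on a wait) can be modelled on the host alone. The following is a minimal sketch under stated assumptions, not libgomp code and not a use of the CUDA API: a POSIX thread stands in for the 'target' kernel, a C11 atomic stands in for 'rev_data->fn', and the names 'REV_DONE', 'struct model', 'kernel_thread' and 'run_model' are hypothetical. It only illustrates that a final sentinel write lets the host loop terminate even when no reverse offload ever happens.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sentinel: a value that is never a valid function address,
   written once the modelled kernel has finished.  */
#define REV_DONE UINT64_MAX

struct model
{
  _Atomic uint64_t fn;  /* stands in for 'rev_data->fn' */
  int n_requests;       /* reverse offloads issued by the modelled kernel */
};

/* Models the device side: post a request by setting 'fn', then spin until
   the host flags completion by setting 'fn' back to 0.  */
static void *
kernel_thread (void *arg)
{
  struct model *m = arg;
  for (int i = 1; i <= m->n_requests; i++)
    {
      atomic_store_explicit (&m->fn, (uint64_t) i, memory_order_release);
      while (atomic_load_explicit (&m->fn, memory_order_acquire) != 0)
        sched_yield ();
    }
  /* Models the write that would be enqueued after the kernel launch
     (the 'cuStreamWriteValue' step of the proposal).  */
  atomic_store_explicit (&m->fn, REV_DONE, memory_order_release);
  return NULL;
}

/* Models the host side; returns the number of serviced requests,
   or -1 if the thread could not be started.  */
int
run_model (int n_requests)
{
  struct model m = { .n_requests = n_requests };
  pthread_t thread;
  int serviced = 0;

  atomic_init (&m.fn, 0);
  if (pthread_create (&thread, NULL, kernel_thread, &m) != 0)
    return -1;

  for (;;)
    {
      /* Stands in for waiting until the memory location is non-zero.  */
      uint64_t fn = atomic_load_explicit (&m.fn, memory_order_acquire);
      if (fn == 0)
        {
          sched_yield ();
          continue;
        }
      if (fn == REV_DONE)
        break;  /* end of processing for reverse offloads */
      serviced++;  /* the plugin would call 'GOMP_PLUGIN_target_rev' here */
      atomic_store_explicit (&m.fn, 0, memory_order_release);
    }

  int rc = pthread_join (thread, NULL);
  assert (rc == 0);
  (void) rc;
  return serviced;
}
```

Calling 'run_model (0)' models a target region without any reverse offload: the wait still returns, because the sentinel is the only value ever written. Whether a real 'cuStreamWaitValue' condition can express "non-zero" this way is not established by this model and would need checking against the CUDA documentation.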