From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=g9/B=AT=mentor.com=Thomas_Schwinge@sourceware.org>
Received: from esa3.mentor.iphmx.com (esa3.mentor.iphmx.com [68.232.137.180])
	by sourceware.org (Postfix) with ESMTPS id DB0A63858CDA
	for <gcc-patches@gcc.gnu.org>; Fri, 28 Apr 2023 09:23:38 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DB0A63858CDA
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-IronPort-AV: E=Sophos;i="5.99,234,1677571200"; 
   d="scan'208";a="3758872"
Received: from orw-gwy-01-in.mentorg.com ([192.94.38.165])
  by esa3.mentor.iphmx.com with ESMTP; 28 Apr 2023 01:23:35 -0800
IronPort-SDR: 7on0qgYP6Jea7HhPo5x566rc0bFBFQrVBSaJgAHQABUjxtQ3sFR0bv8bT7Z2pHQLbR5VdwNL07
 KeF6OH6MMojAFaastBtEgPws0b/B7OCljeh8XraAaJQTTJsLkCXHqQfMmUQAx/psuuHVR7RO9k
 QJWXGGd0rRX+XIoIqz5M1yJ+A85ic4uXmCKyaHrRb/DrIl2y2JJ7nY6nOxYJT+9ZH4LMGOgRIZ
 wT1CtLU9aonXTaJk8K23jrRvhoguGuMWUSWVPotb5XpnnNBm4xwyW6ZNf9U3T2wNcfHJMaPlmX
 1Rk=
From: Thomas Schwinge <thomas@codesourcery.com>
To: Tobias Burnus <tobias@codesourcery.com>
CC: Alexander Monakov <amonakov@ispras.ru>, Jakub Jelinek <jakub@redhat.com>,
	Tom de Vries <tdevries@suse.de>, <gcc-patches@gcc.gnu.org>
Subject: Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback
 handling
In-Reply-To: <5915414b-cd34-f1b0-cb2d-aea0560b1f3c@codesourcery.com>
References: <57b3ae5e-8f15-8bea-fa09-39bccbaa2414@codesourcery.com>
 <871qkzpv6t.fsf@euler.schwinge.homeip.net>
 <5915414b-cd34-f1b0-cb2d-aea0560b1f3c@codesourcery.com>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/28.2
 (x86_64-pc-linux-gnu)
Date: Fri, 28 Apr 2023 11:23:28 +0200
Message-ID: <875y9gcqcv.fsf@euler.schwinge.homeip.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-13.mgc.mentorg.com (139.181.222.13) To
 svr-ies-mbx-10.mgc.mentorg.com (139.181.222.10)
X-Spam-Status: No, score=-5.9 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Hi Tobias!

On 2023-04-28T10:28:22+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> maybe I misunderstood your suggestion, but

First, note that those CUDA "Stream Memory Operations" are something that
I found by chance, and don't have any actual experience with.  I can't
seem to find a lot of documentation/usage of this API?

By the way, a similar thing also exists for AMD GPUs:
'hipStreamWaitValue32', etc.


> "Wait on a memory location"
> assumes that there will be a change – but if a target region happens to
> have no reverse offload, the memory location will never change, but
> still the target region should return to the host.

Oh indeed.  ;-) Details...

> What we would need: Wait on memory location – and return if either the
> kernel stopped *or* the memory location changed.

Or, have a way to "cancel", from the host, the 'cuStreamWaitValue32',
'cuStreamWaitValue64', after the actual 'target' kernel completed?

> My impression is that "return if the kernel stopped" is not really
> guaranteed. Or did I miss some fine print?

No, you're right.  I suppose this is as designed: for example, generally,
there may be additional kernel launches, and the "wait" will then
eventually trigger.

Could we, after the actual 'target' kernel completed, issue a host-side
"write" ('cuStreamWriteValue32', 'cuStreamWriteValue64') to that memory
location, to signal end of processing for reverse offloads?

That is:

  - enqueue 'cuLaunchKernel'
  - enqueue 'cuStreamWriteValue' (to signal end of processing for reverse offloads)
  - loop on 'cuStreamWaitValue' (until end of processing for reverse offloads)
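
The ordering above can be sketched as a host-only simulation.  This is
only an illustration under stated assumptions, not libgomp code and not a
use of the CUDA API: a pthread stands in for the stream (the kernel,
followed by the enqueued write), a C11 atomic stands in for
'rev_data->fn', and the names 'REV_OFFLOAD_DONE', 'stream_thread' and
'run_target_region' are hypothetical.  It also assumes that the sentinel
value can never be a valid 'fn' address.

```c
/* Host-only simulation of the proposed ordering; no CUDA calls are made.  */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sentinel: "kernel completed, no further requests".  */
#define REV_OFFLOAD_DONE UINT64_MAX

/* Stands in for 'rev_data->fn'.  */
static _Atomic uint64_t rev_fn;

/* Stands in for the stream: first the 'target' kernel, then the write
   that was enqueued after it (the 'cuStreamWriteValue' step).  */
static void *
stream_thread (void *arg)
{
  int n_requests = *(int *) arg;

  for (int i = 1; i <= n_requests; i++)
    {
      /* Device side: post a request, then wait until the host clears it.  */
      atomic_store_explicit (&rev_fn, (uint64_t) i, memory_order_release);
      while (atomic_load_explicit (&rev_fn, memory_order_acquire) != 0)
        sched_yield ();
    }

  /* Reached only after the "kernel" part above has completed.  */
  atomic_store_explicit (&rev_fn, REV_OFFLOAD_DONE, memory_order_release);
  return NULL;
}

/* Host side: returns the number of reverse-offload requests served,
   or -1 if the simulated stream could not be started.  */
int
run_target_region (int n_requests)
{
  pthread_t stream;
  int served = 0;

  atomic_store (&rev_fn, 0);
  if (pthread_create (&stream, NULL, stream_thread, &n_requests) != 0)
    return -1;

  for (;;)
    {
      /* Stands in for waiting until the memory location is nonzero.  */
      uint64_t fn;
      while ((fn = atomic_load_explicit (&rev_fn, memory_order_acquire)) == 0)
        sched_yield ();

      if (fn == REV_OFFLOAD_DONE)
        break;  /* End of processing for reverse offloads.  */

      /* The real code would call 'GOMP_PLUGIN_target_rev' here.  */
      served++;
      atomic_store_explicit (&rev_fn, 0, memory_order_release);
    }

  pthread_join (stream, NULL);
  return served;
}
```

With 'run_target_region (0)', the case of a target region without any
reverse offload, the wait still terminates, because the write of the
sentinel happens after the kernel part regardless.  Whether the actual
stream operations can be arranged to give the same ordering is not
something this sketch can show.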


Grüße
 Thomas


> On 04.04.23 16:40, Thomas Schwinge wrote:
>> Hi!
>>
>> During GCC/OpenMP/nvptx reverse offload investigations, about how to
>> replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
>> something re:
>>
>> On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
>>> Better suggestions are welcome for the busy loop in
>>> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
>>> its value.
>>> On the host side, the last address is checked - if fn_addr != NULL,
>>> it passes all arguments on to the generic (target.c) gomp_target_rev
>>> to do the actual offloading.
>>>
>>> CUDA does lockup when trying to copy data from the currently running
>>> stream; hence, a new stream is generated to do the memory copying.
>>> Future work for nvptx:
>>> * Adjust 'sleep', possibly [...]
>>>    to do shorter sleeps than usleep(1)?
>> ... this busy loop.
>>
>> Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':
>>
>>      [...]
>>        if (reverse_offload)
>>          CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>>        r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>>                               32, threads, 1, 0, NULL, NULL, config);
>>        if (r != CUDA_SUCCESS)
>>          GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
>>        if (reverse_offload)
>>          while (true)
>>            {
>>              r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
>>              if (r == CUDA_SUCCESS)
>>                break;
>>              if (r == CUDA_ERROR_LAUNCH_FAILED)
>>                GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
>>                                   maybe_abort_msg);
>>              else if (r != CUDA_ERROR_NOT_READY)
>>                GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
>>
>>              if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>>                {
>>                  struct rev_offload *rev_data = ptx_dev->rev_data;
>>                  GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>>                                          rev_data->addrs, rev_data->sizes,
>>                                          rev_data->kinds, rev_data->dev_num,
>>                                          rev_off_dev_to_host_cpy,
>>                                          rev_off_host_to_dev_cpy, copy_stream);
>>                  CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
>>                  __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>>                }
>>              usleep (1);
>>            }
>>        else
>>          r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
>>        if (reverse_offload)
>>          CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>>      [...]
>>
>> Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
>> to use "Stream Memory Operations",
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
>> that allow one to "Wait on a memory location", "until the given condition on
>> the memory is satisfied"?
>>
>> For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':
>>
>>      [...]
>>        GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
>>        GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
>>        GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
>>        GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
>>        GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>>
>>        /* Set 'fn' to trigger processing on the host; wait for completion,
>>           which is flagged by setting 'fn' back to 0 on the host.  */
>>        uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
>>      #if __PTX_SM__ >= 700
>>        asm volatile ("st.global.release.sys.u64 [%0], %1;"
>>                      : : "r"(addr_struct_fn), "r" (fn) : "memory");
>>      #else
>>        __sync_synchronize ();  /* membar.sys */
>>        asm volatile ("st.volatile.global.u64 [%0], %1;"
>>                      : : "r"(addr_struct_fn), "r" (fn) : "memory");
>>      #endif
>>
>>      #if __PTX_SM__ >= 700
>>        uint64_t fn2;
>>        do
>>          {
>>            asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
>>                          : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
>>          }
>>        while (fn2 != 0);
>>      #else
>>        /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>>           ld.u64 %r36,[%r64];
>>           membar.sys;  */
>>        while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>>          ;  /* spin  */
>>      #endif
>>      [...]
>>
>>
>> Grüße
>>   Thomas
>> -----------------
>> Siemens Electronic Design Automation GmbH; address: Arnulfstraße 201, 80634 München; limited liability company; managing directors: Thomas Heurung, Frank Thürauf; registered office: München; commercial register: München, HRB 106955