From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=J2mt=DP=mentor.com=Tobias_Burnus@sourceware.org>
Received: from esa3.mentor.iphmx.com (esa3.mentor.iphmx.com [68.232.137.180])
	by sourceware.org (Postfix) with ESMTPS id 7C0A33858C41
	for <gcc-patches@gcc.gnu.org>; Sat, 29 Jul 2023 11:27:36 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 7C0A33858C41
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-IronPort-AV: E=Sophos;i="6.01,240,1684828800"; 
   d="scan'208";a="13033713"
Received: from orw-gwy-02-in.mentorg.com ([192.94.38.167])
  by esa3.mentor.iphmx.com with ESMTP; 29 Jul 2023 03:27:34 -0800
IronPort-SDR: i6gh0z05V6klH/tCXpfYhrkJSGf53u/Fc8hOKoowDu6nxv94IHC+7IcFsq2GtYSsi1HnQLtdpY
 CkZ+VqJRWkS7r04evvyIOsQyDXmY6en6rHpKrMnbaRrtdyqPwSyRgK0udmPWdmWmfh+YejQzt8
 8TNZIMt2XLQNuuZOQEKoSJ+77Etp/zKRuhhwTSjarQgVWDTS+5Fq8iq0puI34MBP4tEauhzuoR
 3KLrVCgzpcbYDD/3evA+XSGAVzNUvKrh81OjKHawxjj618e2JTIT7r+Wiy2vknfQS4Scsee8h3
 JQA=
Message-ID: <77cb0e02-e02c-3e36-2681-2bb3b0f67eb2@codesourcery.com>
Date: Sat, 29 Jul 2023 13:27:28 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.13.1
Subject: Re: [patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was:
 [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for
 omp_target_memcpy_rect)
Content-Language: en-US
From: Tobias Burnus <tobias@codesourcery.com>
To: Thomas Schwinge <thomas@codesourcery.com>, <gcc-patches@gcc.gnu.org>
References: <3bd8a5c7-3256-0e63-b7c4-1e969f6bba86@codesourcery.com>
 <87cz0ddq8d.fsf@euler.schwinge.homeip.net>
 <12ebc63d-9f48-a292-ae17-4ac70254c500@codesourcery.com>
In-Reply-To: <12ebc63d-9f48-a292-ae17-4ac70254c500@codesourcery.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: quoted-printable
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-13.mgc.mentorg.com (139.181.222.13) To
 svr-ies-mbx-12.mgc.mentorg.com (139.181.222.12)
X-Spam-Status: No, score=-5.4 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,KAM_SHORT,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Now committed as r14-2865-g8b9e559fe7ca5715c74115322af99dbf9137a399

Tobias

On 28.07.23 13:51, Tobias Burnus wrote:
> thanks for proof reading and the suggestions! =E2=80=93 Do have comments =
to the
> attached patch?
>
> * * *
>
> Crossref: For further optimizations, see also
>
> https://gcc.gnu.org/PR101581 =E2=80=94 [OpenMP] omp_target_memcpy =E2=80=
=93 support
> inter-device memcpy
> https://gcc.gnu.org/PR110813 =E2=80=94 [OpenMP] omp_target_memcpy_rect (+
> strided 'target update'): Improve GCN performance and contiguous
> subranges
>
> and just added based on Thomas' comment:
>
> https://gcc.gnu.org/PR107424 =E2=80=94 [OpenMP] Check whether device lock=
ing is
> really needed for bare memcopy to/from devices (omp_target_memcpy...)
>
> * * *
>
> On 27.07.23 23:00, Thomas Schwinge wrote:
>>> +++ b/include/cuda/cuda.h
>> I note that you're not actually using everything you're adding here.
>> (..., but I understand you're simply adding everying that relates to
>> these 'cuMemcpy[...]' routines -- OK as far as I'm concerned.)
>
> Yes. That was on purpose to make it easier to pick something when needed
> =E2=80=93 especially as we might want to use some of those later on.
>
> For symmetry, I now also added cuMemcpyPeer + ...Async, which also
> remain unused. (But could be used as part of the PRs linked above.)
>
>>> +  const void *dstHost;
>> That last one isn't 'const'.  ;-)
> Fixed - three times.
>> A 'cuda.h' that I looked at calls that last one 'reserved0', with
>> comment
>> "Must be NULL".
> Seems to be unused in real world code and in the documentation. But
> let's use this name as it might be exposed in the wild.
>>> --- a/libgomp/libgomp-plugin.h
>>> +++ b/libgomp/libgomp-plugin.h
>>> +extern int GOMP_OFFLOAD_memcpy2d (int, int, size_t, size_t,
>>> +                               void*, size_t, size_t, size_t,
>>> +                               const void*, size_t, size_t, size_t);
>>> +extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, size_t, size_t,
>>> void *,
>>> +                               size_t, size_t, size_t, size_t, size_t,
>>> +                               const void *, size_t, size_t,
>>> size_t, size_t,
>>> +                               size_t);
>> Oh, wow.  ;-)
>
> Maybe this is not the best ABI. We can consider to modify it before the
> GCC 14 release. (And in principle also afterwards, given that libgomp
> and its plugins should=E2=84=A2 be compiled and installed alongside.)
>
> I think once we know how to implement GCN, we will see whether it was
> done smartly or whether other arguments should be used or whether the
> two functions should be combined.
>
> [Regarding the reserve0/reserve1 values for cuMemcpy3D and whether it
> should be NULL or not; quoting the usage in plugin-nvptx.c:]
>
>> I note that this doesn't adhere to the two "Must be NULL" remarks from
>> above -- but I'm confused, because, for example, on
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#grou=
p__CUDA__MEM_1g4b5238975579f002c0199a3800ca44df
>>
>> "cuMemcpy3D", there also are no such remarks.  (... in contast to:
>> "srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set
>> to 0",
>> which you do set accordingly.)
>>
>> Maybe just 'memset' the whole thing to '0', for good measure?
> OK - but then we can remove the LOD init.
>>> +      else if (src_devicep !=3D NULL
>>> +            && (dst_devicep =3D=3D NULL
>>> +                || (dst_devicep->capabilities
>>> +                    & GOMP_OFFLOAD_CAP_SHARED_MEM)))
>> Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that
>> 'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears
>> out
>> the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'?
>
> I have now undone this change =E2=80=93 I did not dig deep enough into th=
e
> function calls.
>
>
>>> +      else if (dst_devicep =3D=3D NULL && src_devicep =3D=3D NULL)
>>> +     {
>>> +       memcpy ((char *) dst + dst_off, (const char *) src + src_off,
>>> +               length);
>>> +       ret =3D 1;
>>> +     }
>>>         else if (src_devicep =3D=3D dst_devicep)
>>>        ret =3D src_devicep->dev2dev_func (src_devicep->target_id,
>>>                                         (char *) dst + dst_off,
>>>                                         (const char *) src + src_off,
>>>                                         length);
>> ..., but also left the intra-device case here -- which should now be
>> dead
>> code here?
>
> Why? Unless I missed something, the old, the current, and the proposed
> (=3D old) code do still run this code.
>
> I have not added an assert to confirm, but in any case, it is tested for
> in my recently added testcase - thus, we could add a 'printf' to confirm.
>
>>> +       else if (*tmp_size < length)
>>> +         {
>>> +           *tmp_size =3D length;
>>> +           *tmp =3D realloc (*tmp, length);
>>> +           if (*tmp =3D=3D NULL)
>>> +             return ENOMEM;
>> If 'realloc' returns 'NULL', we should 'free' the original '*tmp'?
>>
>> Do we really need here the property here that if the re-allocation can't
>> be done in-place, 'realloc' copies the original content to the new?  In
>> other words, should we just unconditionally 'free' and re-'malloc' here,
>> instead of 'realloc'?
> I have now done so =E2=80=93 but I am not really sure what's faster on av=
erage.
> If it can be enlarged, 'realloc' is faster, if it cannot free+malloc is
> better.
>> I haven't looked whether the re-use of 'tmp' for multiple calls to this
>> is then actually useful, or whether we should just always 'malloc', use,
>> 'free' the buffer here?
>
> Well, it can run in a hot loop =E2=80=93 assume a C-array array[1024][102=
4][2]
> and copying array[:1024,:1024,0:1] (using OpenMP syntax) =E2=80=93 i.e. 1=
048576
> times every other element. And therefore I would like to avoid repeated
> malloc/free in such a case. (But in general, interdevice copying should
> be very rare.)
>
> Actually, I think the realloc case is unreachable: for rectangular
> copies, as implied both by 'target update' with strided access and by
> 'omp_target_memcpy_rect', the size should be constant.
>
>> (For later...)  Is that what
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#grou=
p__CUDA__MEM_1ge1f5c7771544fee150ada8853c7cbf4a
>>
>> "cuMemcpyPeer" would be used for?
>
> I concur that we could/should use cuMemcpyPeer and cuMemcpy3DPeer for
> omp_target and ..._rect. I added a comment to the respective PRs (linked
> at the top).
>
>>> +  lock_src =3D (src_devicep
>>> +           && (!dst_devicep
>>> +               || src_devicep =3D=3D dst_devicep
>>> +               || !(src_devicep->capabilities
>>> +                    & GOMP_OFFLOAD_CAP_SHARED_MEM)));
>> Similar doubt than above re "'GOMP_OFFLOAD_CAP_SHARED_MEM' actually
>> reachable"?
> Updated in the attach patch + filed a PR about the need of the lock (see
> link at the top).
>>> +  DLSYM (memcpy2d);
>>> +  DLSYM (memcpy3d);
>> With 'DLSYM' used here, won't that fail if these symbols don't actually
>> exist (like for 'libgomp/plugin/plugin-gcn.c')?
>
> Hmm, that should be: DLSYM_OPT, but somehow testing on GCN did not show
> any fails. (Neither here nor in the testsuite; odd.)
>
> I intent to commit the attached follow-up patch very soon.
>
> [It has been tested without offloading support compiled in, with AMD GCN
> offloading (single GPU, no memcpy[23]d available), and on a 2-GPU system
> with nvptx offloading.]
>
> Tobias
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstra=C3=9Fe
> 201, 80634 M=C3=BCnchen; Gesellschaft mit beschr=C3=A4nkter Haftung;
> Gesch=C3=A4ftsf=C3=BChrer: Thomas Heurung, Frank Th=C3=BCrauf; Sitz der G=
esellschaft:
> M=C3=BCnchen; Registergericht M=C3=BCnchen, HRB 106955
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstra=C3=9Fe 201=
, 80634 M=C3=BCnchen; Gesellschaft mit beschr=C3=A4nkter Haftung; Gesch=C3=
=A4ftsf=C3=BChrer: Thomas Heurung, Frank Th=C3=BCrauf; Sitz der Gesellschaf=
t: M=C3=BCnchen; Registergericht M=C3=BCnchen, HRB 106955