From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from esa3.mentor.iphmx.com (esa3.mentor.iphmx.com [68.232.137.180]) by sourceware.org (Postfix) with ESMTPS id 7C0A33858C41 for ; Sat, 29 Jul 2023 11:27:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 7C0A33858C41 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com X-IronPort-AV: E=Sophos;i="6.01,240,1684828800"; d="scan'208";a="13033713" Received: from orw-gwy-02-in.mentorg.com ([192.94.38.167]) by esa3.mentor.iphmx.com with ESMTP; 29 Jul 2023 03:27:34 -0800 IronPort-SDR: i6gh0z05V6klH/tCXpfYhrkJSGf53u/Fc8hOKoowDu6nxv94IHC+7IcFsq2GtYSsi1HnQLtdpY CkZ+VqJRWkS7r04evvyIOsQyDXmY6en6rHpKrMnbaRrtdyqPwSyRgK0udmPWdmWmfh+YejQzt8 8TNZIMt2XLQNuuZOQEKoSJ+77Etp/zKRuhhwTSjarQgVWDTS+5Fq8iq0puI34MBP4tEauhzuoR 3KLrVCgzpcbYDD/3evA+XSGAVzNUvKrh81OjKHawxjj618e2JTIT7r+Wiy2vknfQS4Scsee8h3 JQA= Message-ID: <77cb0e02-e02c-3e36-2681-2bb3b0f67eb2@codesourcery.com> Date: Sat, 29 Jul 2023 13:27:28 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.1 Subject: Re: [patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect) Content-Language: en-US From: Tobias Burnus To: Thomas Schwinge , References: <3bd8a5c7-3256-0e63-b7c4-1e969f6bba86@codesourcery.com> <87cz0ddq8d.fsf@euler.schwinge.homeip.net> <12ebc63d-9f48-a292-ae17-4ac70254c500@codesourcery.com> In-Reply-To: <12ebc63d-9f48-a292-ae17-4ac70254c500@codesourcery.com> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: quoted-printable X-Originating-IP: [137.202.0.90] X-ClientProxiedBy: svr-ies-mbx-13.mgc.mentorg.com (139.181.222.13) To svr-ies-mbx-12.mgc.mentorg.com (139.181.222.12) X-Spam-Status: No, score=-5.4 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,KAM_SHORT,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Now committed as r14-2865-g8b9e559fe7ca5715c74115322af99dbf9137a399 Tobias On 28.07.23 13:51, Tobias Burnus wrote: > thanks for proof reading and the suggestions! =E2=80=93 Do have comments = to the > attached patch? > > * * * > > Crossref: For further optimizations, see also > > https://gcc.gnu.org/PR101581 =E2=80=94 [OpenMP] omp_target_memcpy =E2=80= =93 support > inter-device memcpy > https://gcc.gnu.org/PR110813 =E2=80=94 [OpenMP] omp_target_memcpy_rect (+ > strided 'target update'): Improve GCN performance and contiguous > subranges > > and just added based on Thomas' comment: > > https://gcc.gnu.org/PR107424 =E2=80=94 [OpenMP] Check whether device lock= ing is > really needed for bare memcopy to/from devices (omp_target_memcpy...) > > * * * > > On 27.07.23 23:00, Thomas Schwinge wrote: >>> +++ b/include/cuda/cuda.h >> I note that you're not actually using everything you're adding here. >> (..., but I understand you're simply adding everying that relates to >> these 'cuMemcpy[...]' routines -- OK as far as I'm concerned.) > > Yes. That was on purpose to make it easier to pick something when needed > =E2=80=93 especially as we might want to use some of those later on. > > For symmetry, I now also added cuMemcpyPeer + ...Async, which also > remain unused. (But could be used as part of the PRs linked above.) > >>> + const void *dstHost; >> That last one isn't 'const'. ;-) > Fixed - three times. >> A 'cuda.h' that I looked at calls that last one 'reserved0', with >> comment >> "Must be NULL". > Seems to be unused in real world code and in the documentation. But > let's use this name as it might be exposed in the wild. >>> --- a/libgomp/libgomp-plugin.h >>> +++ b/libgomp/libgomp-plugin.h >>> +extern int GOMP_OFFLOAD_memcpy2d (int, int, size_t, size_t, >>> + void*, size_t, size_t, size_t, >>> + const void*, size_t, size_t, size_t); >>> +extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, size_t, size_t, >>> void *, >>> + size_t, size_t, size_t, size_t, size_t, >>> + const void *, size_t, size_t, >>> size_t, size_t, >>> + size_t); >> Oh, wow. ;-) > > Maybe this is not the best ABI. We can consider to modify it before the > GCC 14 release. (And in principle also afterwards, given that libgomp > and its plugins should=E2=84=A2 be compiled and installed alongside.) > > I think once we know how to implement GCN, we will see whether it was > done smartly or whether other arguments should be used or whether the > two functions should be combined. > > [Regarding the reserve0/reserve1 values for cuMemcpy3D and whether it > should be NULL or not; quoting the usage in plugin-nvptx.c:] > >> I note that this doesn't adhere to the two "Must be NULL" remarks from >> above -- but I'm confused, because, for example, on >> > >> "cuMemcpy3D", there also are no such remarks. (... in contast to: >> "srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set >> to 0", >> which you do set accordingly.) >> >> Maybe just 'memset' the whole thing to '0', for good measure? > OK - but then we can remove the LOD init. >>> + else if (src_devicep !=3D NULL >>> + && (dst_devicep =3D=3D NULL >>> + || (dst_devicep->capabilities >>> + & GOMP_OFFLOAD_CAP_SHARED_MEM))) >> Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that >> 'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears >> out >> the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'? > > I have now undone this change =E2=80=93 I did not dig deep enough into th= e > function calls. > > >>> + else if (dst_devicep =3D=3D NULL && src_devicep =3D=3D NULL) >>> + { >>> + memcpy ((char *) dst + dst_off, (const char *) src + src_off, >>> + length); >>> + ret =3D 1; >>> + } >>> else if (src_devicep =3D=3D dst_devicep) >>> ret =3D src_devicep->dev2dev_func (src_devicep->target_id, >>> (char *) dst + dst_off, >>> (const char *) src + src_off, >>> length); >> ..., but also left the intra-device case here -- which should now be >> dead >> code here? > > Why? Unless I missed something, the old, the current, and the proposed > (=3D old) code do still run this code. > > I have not added an assert to confirm, but in any case, it is tested for > in my recently added testcase - thus, we could add a 'printf' to confirm. > >>> + else if (*tmp_size < length) >>> + { >>> + *tmp_size =3D length; >>> + *tmp =3D realloc (*tmp, length); >>> + if (*tmp =3D=3D NULL) >>> + return ENOMEM; >> If 'realloc' returns 'NULL', we should 'free' the original '*tmp'? >> >> Do we really need here the property here that if the re-allocation can't >> be done in-place, 'realloc' copies the original content to the new? In >> other words, should we just unconditionally 'free' and re-'malloc' here, >> instead of 'realloc'? > I have now done so =E2=80=93 but I am not really sure what's faster on av= erage. > If it can be enlarged, 'realloc' is faster, if it cannot free+malloc is > better. >> I haven't looked whether the re-use of 'tmp' for multiple calls to this >> is then actually useful, or whether we should just always 'malloc', use, >> 'free' the buffer here? > > Well, it can run in a hot loop =E2=80=93 assume a C-array array[1024][102= 4][2] > and copying array[:1024,:1024,0:1] (using OpenMP syntax) =E2=80=93 i.e. 1= 048576 > times every other element. And therefore I would like to avoid repeated > malloc/free in such a case. (But in general, interdevice copying should > be very rare.) > > Actually, I think the realloc case is unreachable: for rectangular > copies, as implied both by 'target update' with strided access and by > 'omp_target_memcpy_rect', the size should be constant. > >> (For later...) Is that what >> > >> "cuMemcpyPeer" would be used for? > > I concur that we could/should use cuMemcpyPeer and cuMemcpy3DPeer for > omp_target and ..._rect. I added a comment to the respective PRs (linked > at the top). > >>> + lock_src =3D (src_devicep >>> + && (!dst_devicep >>> + || src_devicep =3D=3D dst_devicep >>> + || !(src_devicep->capabilities >>> + & GOMP_OFFLOAD_CAP_SHARED_MEM))); >> Similar doubt than above re "'GOMP_OFFLOAD_CAP_SHARED_MEM' actually >> reachable"? > Updated in the attach patch + filed a PR about the need of the lock (see > link at the top). >>> + DLSYM (memcpy2d); >>> + DLSYM (memcpy3d); >> With 'DLSYM' used here, won't that fail if these symbols don't actually >> exist (like for 'libgomp/plugin/plugin-gcn.c')? > > Hmm, that should be: DLSYM_OPT, but somehow testing on GCN did not show > any fails. (Neither here nor in the testsuite; odd.) > > I intent to commit the attached follow-up patch very soon. > > [It has been tested without offloading support compiled in, with AMD GCN > offloading (single GPU, no memcpy[23]d available), and on a 2-GPU system > with nvptx offloading.] > > Tobias > ----------------- > Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstra=C3=9Fe > 201, 80634 M=C3=BCnchen; Gesellschaft mit beschr=C3=A4nkter Haftung; > Gesch=C3=A4ftsf=C3=BChrer: Thomas Heurung, Frank Th=C3=BCrauf; Sitz der G= esellschaft: > M=C3=BCnchen; Registergericht M=C3=BCnchen, HRB 106955 ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstra=C3=9Fe 201= , 80634 M=C3=BCnchen; Gesellschaft mit beschr=C3=A4nkter Haftung; Gesch=C3= =A4ftsf=C3=BChrer: Thomas Heurung, Frank Th=C3=BCrauf; Sitz der Gesellschaf= t: M=C3=BCnchen; Registergericht M=C3=BCnchen, HRB 106955