From: Tobias Burnus
To: Andrew Stubbs, Jakub Jelinek, Thomas Schwinge
Subject: Re: [PATCH 02/17] libgomp: pinned memory
Date: Thu, 8 Dec 2022 15:02:17 +0100

On 08.12.22 13:51, Andrew Stubbs wrote:
> On 08/12/2022 12:11, Jakub Jelinek wrote:
>> On Thu, Jul 07, 2022 at 11:34:33AM +0100, Andrew Stubbs wrote:
>>> Implement the OpenMP pinned memory trait on Linux hosts using the mlock
>>> syscall. Pinned allocations are performed using mmap, not malloc,
>>> to ensure that they can be unpinned safely when freed.
>> As I said before, I think the pinned memory is too precious to waste it
>> this way; we should handle the -> pinned case through memkind_create_fixed
>> on an mmap + mlock area, that way we can create even quite small pinned
>> allocations.
>
> This has been delayed due to other priorities, but our current plan is
> to switch to using cudaHostAlloc, when available, but we can certainly
> use memkind_create_fixed for the fallback case (including amdgcn).

With "when available", I assume that nvptx is an 'available device' (per
the OpenMP definition, finally added in TR11), i.e. there is an image for
nvptx and - after omp_requires filtering - at least one nvptx device
remains.

* * *

For completeness, I want to note that OpenMP TR11 adds support for
creating memory spaces that are accessible from multiple devices, e.g.
the host plus one or all devices, and it adds some convenience functions
for the latter (all devices, host plus a specific device, etc.):

→ https://openmp.org/specifications/ TR11 (see Appendix B.2 for the
release notes, esp. for Section 6.2).

I think it makes sense to keep those additions in mind when doing the
actual implementation, to avoid incompatibilities.
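
To make that a bit more concrete, below is a minimal sketch of how such a
multi-device allocation could look with the TR11 convenience functions.
The omp_get_devices_all_allocator name is the one from TR11; the exact
prototype is my assumption, and none of this exists in libgomp yet:

    #include <omp.h>
    #include <stdlib.h>

    int
    main (void)
    {
      /* Sketch only: per TR11 this should return an allocator whose
         allocations are accessible from the host and from all available
         devices; the prototype is assumed and may differ in the final
         specification.  */
      omp_allocator_handle_t al
        = omp_get_devices_all_allocator (omp_default_mem_space);
      if (al == omp_null_allocator)
        return 1;

      double *p = (double *) omp_alloc (1024 * sizeof (double), al);
      if (p == NULL)
        return 1;

      p[0] = 42.0;  /* Host access; devices should be able to access it, too.  */

      omp_free (p, al);
      return 0;
    }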
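
And, going back to Jakub's memkind_create_fixed suggestion further above,
a rough sketch of serving many small pinned allocations from one mmap +
mlock region (not the proposed libgomp code; error handling trimmed, and
the memkind_create_fixed interface of recent memkind releases is assumed;
link with -lmemkind):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <memkind.h>

    int
    main (void)
    {
      /* One large region that is mmap'ed and pinned once ...  */
      size_t pool_size = 64 * 1024 * 1024;
      void *pool = mmap (NULL, pool_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (pool == MAP_FAILED || mlock (pool, pool_size) != 0)
        {
          perror ("mmap/mlock");
          return 1;
        }

      /* ... and a memkind "fixed" kind on top of it, so that many small
         pinned allocations can be carved out of the same locked area.  */
      memkind_t pinned_kind;
      if (memkind_create_fixed (pool, pool_size, &pinned_kind)
          != MEMKIND_SUCCESS)
        return 1;

      /* Small allocations now come from the pinned pool.  */
      char *p = memkind_malloc (pinned_kind, 256);
      memset (p, 0, 256);
      memkind_free (pinned_kind, p);

      memkind_destroy_kind (pinned_kind);
      munlock (pool, pool_size);
      munmap (pool, pool_size);
      return 0;
    }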
Side note regarding the ompx_ additions proposed in
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597979.html (adds
ompx_pinned_mem_alloc) and in
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597983.html
(ompx_unified_shared_mem_alloc and ompx_host_mem_alloc;
ompx_unified_shared_mem_space and ompx_host_mem_space):
while TR11 does not add any predefined allocators or new memory spaces,
calling e.g. omp_get_devices_all_allocator(memspace) returns a
unified-shared-memory allocator. I note that LLVM does not seem to have
any ompx_ in this regard (yet?). (It has some ompx_ - but related to
assumptions.)

> Using Cuda might be trickier to implement because there's a layering
> violation inherent in routing target independent allocations through
> the nvptx plugin, but benchmarking shows that that's the only way to
> get the faster path through the Cuda black box; being pinned is good
> because it avoids page faults, but apparently if Cuda *knows* it is
> pinned then you get a speed boost even when there would be *no* faults
> (i.e. on a quiet machine). Additionally, Cuda somehow ignores the
> OS-defined limits.

I wonder whether, for a NUMA machine (and non-offloading access), using
memkind_create_fixed will have an advantage over cuMemHostAlloc or not.
(BTW, I find cuMemHostAlloc vs. cuMemAllocHost confusing.) And, if so,
whether we should provide a means (a GOMP_... environment variable?) to
toggle the preference.

My feeling is that, on most systems, it does not matter - except
(a) possibly for large NUMA systems, where the memkind tuning will
probably make a difference, and (b) we know that CUDA's
cuMem(HostAlloc/AllocHost) is faster with nvptx offloading.

(cuMem(HostAlloc/AllocHost) also permits DMA from the device - provided
unified addressing is supported, but that is the case [cf. comment +
assert in plugin-nvptx.c].)

Tobias
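
PS: For illustration only, the cudaHostAlloc path that Andrew mentioned
looks roughly like this at the CUDA runtime level (the driver-API
counterpart is cuMemHostAlloc); this is just a sketch of the API, not the
planned libgomp/plugin code:

    #include <stdio.h>
    #include <string.h>
    #include <cuda_runtime.h>

    int
    main (void)
    {
      /* Page-locked (pinned) host memory, portable across CUDA contexts;
         this is the kind of allocation CUDA can DMA to/from directly.  */
      void *buf = NULL;
      cudaError_t err = cudaHostAlloc (&buf, 4 * 1024 * 1024,
                                       cudaHostAllocPortable);
      if (err != cudaSuccess)
        {
          fprintf (stderr, "cudaHostAlloc: %s\n", cudaGetErrorString (err));
          return 1;
        }

      memset (buf, 0, 4 * 1024 * 1024);
      /* ... use the buffer for host <-> device transfers ...  */

      cudaFreeHost (buf);
      return 0;
    }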