From: Tobias Burnus
To: Andrew Stubbs, Jakub Jelinek, Thomas Schwinge
Subject: Re: [PATCH 02/17] libgomp: pinned memory
Date: Thu, 8 Dec 2022 15:02:17 +0100

On 08.12.22 13:51, Andrew Stubbs wrote:
> On 08/12/2022 12:11, Jakub Jelinek wrote:
>> On Thu, Jul 07, 2022 at 11:34:33AM +0100, Andrew Stubbs wrote:
>>> Implement the OpenMP pinned memory trait on Linux hosts using the mlock
>>> syscall. Pinned allocations are performed using mmap, not malloc,
>>> to ensure that they can be unpinned safely when freed.
>> As I said before, I think the pinned memory is too precious to waste it
>> this way; we should handle the -> pinned case through memkind_create_fixed
>> on an mmap + mlock area, that way we can create even quite small pinned
>> allocations.
>
> This has been delayed due to other priorities, but our current plan is
> to switch to using cudaHostAlloc, when available, but we can certainly
> use memkind_create_fixed for the fallback case (including amdgcn).

With "when available", I assume that nvptx is an 'available device' (per
the OpenMP definition, finally added in TR11), i.e. there is an image for
nvptx and - after omp_requires filtering - at least one nvptx device
remains.

* * *

For completeness, I want to note that OpenMP TR11 adds support for
creating memory spaces that are accessible from multiple devices, e.g.
the host plus one or all devices, and it adds some convenience functions
for the latter (all devices, host plus a specific device, etc.):

→ https://openmp.org/specifications/ TR11 (see Appendix B.2 for the
release notes, esp. for Section 6.2).

I think it makes sense to keep those additions in mind when doing the
actual implementation, to avoid incompatibilities.
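
To make that a bit more concrete, below is a minimal sketch of how such a
multi-device allocation could look with the TR11 convenience functions.
The omp_get_devices_all_allocator name is the one from TR11; the exact
prototype is my assumption, and none of this exists in libgomp yet:

    #include <omp.h>
    #include <stdlib.h>

    int
    main (void)
    {
      /* Sketch only: per TR11 this should return an allocator whose
         allocations are accessible from the host and from all available
         devices; the prototype is assumed and may differ in the final
         specification.  */
      omp_allocator_handle_t al
        = omp_get_devices_all_allocator (omp_default_mem_space);
      if (al == omp_null_allocator)
        return 1;

      double *p = (double *) omp_alloc (1024 * sizeof (double), al);
      if (p == NULL)
        return 1;

      p[0] = 42.0;  /* Host access; devices should be able to access it, too.  */

      omp_free (p, al);
      return 0;
    }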
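
And, going back to Jakub's memkind_create_fixed suggestion further above,
a rough sketch of serving many small pinned allocations from one mmap +
mlock region (not the proposed libgomp code; error handling trimmed, and
the memkind_create_fixed interface of recent memkind releases is assumed;
link with -lmemkind):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <memkind.h>

    int
    main (void)
    {
      /* One large region that is mmap'ed and pinned once ...  */
      size_t pool_size = 64 * 1024 * 1024;
      void *pool = mmap (NULL, pool_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (pool == MAP_FAILED || mlock (pool, pool_size) != 0)
        {
          perror ("mmap/mlock");
          return 1;
        }

      /* ... and a memkind "fixed" kind on top of it, so that many small
         pinned allocations can be carved out of the same locked area.  */
      memkind_t pinned_kind;
      if (memkind_create_fixed (pool, pool_size, &pinned_kind)
          != MEMKIND_SUCCESS)
        return 1;

      /* Small allocations now come from the pinned pool.  */
      char *p = memkind_malloc (pinned_kind, 256);
      memset (p, 0, 256);
      memkind_free (pinned_kind, p);

      memkind_destroy_kind (pinned_kind);
      munlock (pool, pool_size);
      munmap (pool, pool_size);
      return 0;
    }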
Side note regarding the ompx_ additions proposed in
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597979.html (adds
ompx_pinned_mem_alloc) and in
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597983.html
(ompx_unified_shared_mem_alloc and ompx_host_mem_alloc;
ompx_unified_shared_mem_space and ompx_host_mem_space):
while TR11 does not add any predefined allocators or new memory spaces,
calling e.g. omp_get_devices_all_allocator(memspace) returns a
unified-shared-memory allocator. I note that LLVM does not seem to have
any ompx_ in this regard (yet?). (It has some ompx_ - but related to
assumptions.)

> Using Cuda might be trickier to implement because there's a layering
> violation inherent in routing target independent allocations through
> the nvptx plugin, but benchmarking shows that that's the only way to
> get the faster path through the Cuda black box; being pinned is good
> because it avoids page faults, but apparently if Cuda *knows* it is
> pinned then you get a speed boost even when there would be *no* faults
> (i.e. on a quiet machine). Additionally, Cuda somehow ignores the
> OS-defined limits.

I wonder whether, for a NUMA machine (and non-offloading access), using
memkind_create_fixed will have an advantage over cuMemHostAlloc or not.
(BTW, I find cuMemHostAlloc vs. cuMemAllocHost confusing.) And, if so,
whether we should provide a means (a GOMP_... environment variable?) to
toggle the preference.

My feeling is that, on most systems, it does not matter - except
(a) possibly for large NUMA systems, where the memkind tuning will
probably make a difference, and (b) we know that CUDA's
cuMem(HostAlloc/AllocHost) is faster with nvptx offloading.

(cuMem(HostAlloc/AllocHost) also permits DMA from the device - provided
unified addressing is supported, but that is the case [cf. comment +
assert in plugin-nvptx.c].)

Tobias
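
PS: For illustration only, the cudaHostAlloc path that Andrew mentioned
looks roughly like this at the CUDA runtime level (the driver-API
counterpart is cuMemHostAlloc); this is just a sketch of the API, not the
planned libgomp/plugin code:

    #include <stdio.h>
    #include <string.h>
    #include <cuda_runtime.h>

    int
    main (void)
    {
      /* Page-locked (pinned) host memory, portable across CUDA contexts;
         this is the kind of allocation CUDA can DMA to/from directly.  */
      void *buf = NULL;
      cudaError_t err = cudaHostAlloc (&buf, 4 * 1024 * 1024,
                                       cudaHostAllocPortable);
      if (err != cudaSuccess)
        {
          fprintf (stderr, "cudaHostAlloc: %s\n", cudaGetErrorString (err));
          return 1;
        }

      memset (buf, 0, 4 * 1024 * 1024);
      /* ... use the buffer for host <-> device transfers ...  */

      cudaFreeHost (buf);
      return 0;
    }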