From: Thomas Schwinge
To: Andrew Stubbs, Jakub Jelinek
Subject: Re: [PATCH] libgomp, openmp: pinned memory
Date: Thu, 9 Jun 2022 11:38:22 +0200
Message-ID: <87edzy5g8h.fsf@euler.schwinge.homeip.net>

Hi!

I'm not all too familiar with the "newish" CUDA Driver API, but maybe the
following is useful still:

On 2022-06-07T13:28:33+0100, Andrew Stubbs wrote:
> On 07/06/2022 13:10, Jakub Jelinek wrote:
>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>> Following some feedback from users of the OG11 branch I think I need to
>>> withdraw this patch, for now.
>>>
>>> The memory pinned via the mlock call does not give the expected
>>> performance boost.  I had not expected that it would do much in my test
>>> setup, given that the machine has a lot of RAM and my benchmarks are
>>> small, but others have tried more and on varying machines and
>>> architectures.
>>
>> I don't understand why there should be any expected performance boost (at
>> least not unless the machine starts swapping out pages),
>> { omp_atk_pinned, true } is solely about the requirement that the memory
>> can't be swapped out.
>
> It seems like it takes a faster path through the NVidia drivers.  This is
> a black box, for me, but that seems like a plausible explanation.  The
> results are different on x86_64 and powerpc hosts (such as the Summit
> supercomputer).

For example, it's documented that 'cuMemHostAlloc' "Allocates page-locked
host memory".
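For concreteness, here is a minimal standalone sketch (mine, not taken from
the patch; buffer size and build command are just placeholders) of allocating
page-locked host memory through the Driver API instead of malloc plus mlock;
the cuInit/cuCtxCreate boilerplate is only there to make it self-contained:

    /* Hypothetical example, not libgomp code.  Build with something like:
         gcc test-pinned.c -lcuda  */
    #include <cuda.h>

    int
    main (void)
    {
      CUdevice dev;
      CUcontext ctx;
      void *buf = NULL;

      /* The cuMem* calls below need a current CUDA context.  */
      if (cuInit (0) != CUDA_SUCCESS
	  || cuDeviceGet (&dev, 0) != CUDA_SUCCESS
	  || cuCtxCreate (&ctx, 0, dev) != CUDA_SUCCESS)
	return 1;

      /* Page-locked allocation; the driver tracks this range, so
	 cuMemcpyHtoD/cuMemcpyDtoH on it can take the fast DMA path.  */
      if (cuMemHostAlloc (&buf, 1 << 20, CU_MEMHOSTALLOC_PORTABLE)
	  != CUDA_SUCCESS)
	return 1;

      /* ... use 'buf' as a staging area for host <-> device copies ...  */

      cuMemFreeHost (buf);
      cuCtxDestroy (ctx);
      return 0;
    }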
The crucial thing, though, that makes this different from 'malloc' plus
'mlock' is that "The driver tracks the virtual memory ranges allocated with
this function and automatically accelerates calls to functions such as
cuMemcpyHtoD().  Since the memory can be accessed directly by the device, it
can be read or written with much higher bandwidth than pageable memory
obtained with functions such as malloc()".  Similar, for example, for
'cuMemAllocHost'.

This, to me, would explain why "the mlock call does not give the expected
performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
with 'mlock' you're missing the "tracks the virtual memory ranges" aspect.

Also, by means of the Nvidia Driver allocating the memory, I suppose using
this interface likely circumvents any "annoying" 'ulimit' limitations?  I
get this impression because the documentation continues: "Allocating
excessive amounts of memory with cuMemAllocHost() may degrade system
performance, since it reduces the amount of memory available to the system
for paging.  As a result, this function is best used sparingly to allocate
staging areas for data exchange between host and device".

>>> It seems that it isn't enough for the memory to be pinned, it has to be
>>> pinned using the Cuda API to get the performance boost.
>>
>> For performance boost of what kind of code?
>> I don't understand how Cuda API could be useful (or can be used at all) if
>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
>> in any way related to NVPTX offloading (unless it is in NVPTX target region
>> obviously, but then mlock isn't available, so sure, if there is something
>> CUDA can provide for that case, nice).
>
> This is specifically for NVPTX offload, of course, but then that's what
> our customer is paying for.
>
> The expectation, from users, is that memory pinning will give the
> benefits specific to the active device.  We can certainly make that
> happen when there is only one (flavour of) offload device present.  I had
> hoped it could be one way for all, but it looks like not.

Aren't there CUDA Driver interfaces for that?  That is:

>>> I had not done this because it was difficult to resolve the code
>>> abstraction difficulties and anyway the implementation was supposed to
>>> be device independent, but it seems we need a specific pinning mechanism
>>> for each device.

If not directly *allocating and registering* such memory via
'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
*register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister':
"Page-locks the memory range specified [...] and maps it for the device(s)
[...].  This memory range also is added to the same tracking mechanism as
cuMemHostAlloc to automatically accelerate [...]"?  (No manual 'mlock'ing
involved in that case either; presumably this interface again likely
circumvents any "annoying" 'ulimit' limitations?)

Such a *register* abstraction can then be implemented by all the libgomp
offloading plugins: they just call the respective CUDA/HSA/etc. functions
to register such (existing, 'malloc'ed, etc.) memory; see the sketch at the
end of this mail.

..., but maybe I'm missing some crucial "detail" here?
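For illustration, a rough sketch (function names invented by me, error
handling elided, not actual plugin code) of what such a *register* hook
could look like on the nvptx side, assuming a CUDA context is current:

    /* Hypothetical sketch only.  Pin memory that was obtained with plain
       malloc (or any other host allocator) by handing it to the CUDA
       Driver.  */
    #include <stdbool.h>
    #include <stddef.h>
    #include <cuda.h>

    static bool
    nvptx_host_register (void *ptr, size_t size)
    {
      /* Page-locks [ptr, ptr + size) and adds it to the same tracking
	 mechanism as cuMemHostAlloc, so later cuMemcpyHtoD/DtoH on this
	 range take the accelerated path.  */
      return cuMemHostRegister (ptr, size, CU_MEMHOSTREGISTER_PORTABLE)
	     == CUDA_SUCCESS;
    }

    static bool
    nvptx_host_unregister (void *ptr)
    {
      return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
    }

An HSA-based plugin would call the corresponding ROCr routine, and a
host-only fallback could keep using plain 'mlock'.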
Grüße
 Thomas