From: Thomas Schwinge <thomas@codesourcery.com>
To: Andrew Stubbs <ams@codesourcery.com>, Jakub Jelinek <jakub@redhat.com>
Cc: <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] libgomp, openmp: pinned memory
Date: Thu, 9 Jun 2022 11:38:22 +0200
Message-ID: <87edzy5g8h.fsf@euler.schwinge.homeip.net>
In-Reply-To: <e8fc4b30-768a-2a02-1fc9-208ab9bf8a5d@codesourcery.com>

Hi!

I'm not overly familiar with the "newish" CUDA Driver API, but maybe the
following is still useful:

On 2022-06-07T13:28:33+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> On 07/06/2022 13:10, Jakub Jelinek wrote:
>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>> Following some feedback from users of the OG11 branch I think I need to
>>> withdraw this patch, for now.
>>>
>>> The memory pinned via the mlock call does not give the expected performance
>>> boost. I had not expected that it would do much in my test setup, given that
>>> the machine has a lot of RAM and my benchmarks are small, but others have
>>> tried more and on varying machines and architectures.
>>
>> I don't understand why there should be any expected performance boost (at
>> least not unless the machine starts swapping out pages),
>> { omp_atk_pinned, true } is solely about the requirement that the memory
>> can't be swapped out.
>
> It seems like it takes a faster path through the NVidia drivers. This is
> a black box, for me, but that seems like a plausible explanation. The
> results are different on x86_64 and powerpc hosts (such as the Summit
> supercomputer).

For example, it's documented that 'cuMemHostAlloc',
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
"Allocates page-locked host memory".  The crucial thing, though, what
makes this different from 'malloc' plus 'mlock' is, that "The driver
tracks the virtual memory ranges allocated with this function and
automatically accelerates calls to functions such as cuMemcpyHtoD().
Since the memory can be accessed directly by the device, it can be read
or written with much higher bandwidth than pageable memory obtained with
functions such as malloc()".

Similarly, for example, for 'cuMemAllocHost',
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.

This, to me, would explain why "the mlock call does not give the expected
performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
with 'mlock' you're missing the "tracks the virtual memory ranges"
aspect.
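
To illustrate the difference in code (an untested sketch only, assuming
device 0 and reducing error handling to 'assert's; all calls are plain
CUDA Driver API):

    /* Untested sketch contrasting the two routes; error handling is
       reduced to asserts, device/context setup is the usual boilerplate.  */
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <cuda.h>

    #define N (4 * 1024 * 1024)

    int
    main (void)
    {
      CUdevice dev;
      CUcontext ctx;
      CUdeviceptr d_buf;
      assert (cuInit (0) == CUDA_SUCCESS);
      assert (cuDeviceGet (&dev, 0) == CUDA_SUCCESS);
      assert (cuCtxCreate (&ctx, 0, dev) == CUDA_SUCCESS);
      assert (cuMemAlloc (&d_buf, N) == CUDA_SUCCESS);

      /* Route 1: 'malloc' plus 'mlock'.  The pages are locked, but the
         driver doesn't know about them, so 'cuMemcpyHtoD' still treats
         the source as ordinary pageable memory.  */
      void *host_a = malloc (N);
      assert (host_a != NULL && mlock (host_a, N) == 0);
      memset (host_a, 1, N);
      assert (cuMemcpyHtoD (d_buf, host_a, N) == CUDA_SUCCESS);
      munlock (host_a, N);
      free (host_a);

      /* Route 2: 'cuMemHostAlloc'.  The driver page-locks the range *and*
         adds it to its tracking, so the very same 'cuMemcpyHtoD' call can
         take the accelerated path.  */
      void *host_b;
      assert (cuMemHostAlloc (&host_b, N, 0) == CUDA_SUCCESS);
      memset (host_b, 1, N);
      assert (cuMemcpyHtoD (d_buf, host_b, N) == CUDA_SUCCESS);
      assert (cuMemFreeHost (host_b) == CUDA_SUCCESS);

      cuMemFree (d_buf);
      cuCtxDestroy (ctx);
      return 0;
    }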

Also, given that it's the Nvidia driver that allocates the memory, I
suppose using this interface likely circumvents any "annoying" 'ulimit'
limitations?  I get this impression because the documentation continues,
stating that "Allocating excessive amounts of memory with
cuMemAllocHost() may degrade system performance, since it reduces the
amount of memory available to the system for paging.  As a result, this
function is best used sparingly to allocate staging areas for data
exchange between host and device".
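
(For reference, the limit that plain 'mlock' is subject to, 'ulimit -l',
can be queried as below; whether the driver's own pinning honors it is
exactly the question I'm raising:)

    /* Query the locked-memory limit that constrains plain 'mlock'.  */
    #include <stdio.h>
    #include <sys/resource.h>

    int
    main (void)
    {
      struct rlimit rl;
      if (getrlimit (RLIMIT_MEMLOCK, &rl) == 0)
        printf ("RLIMIT_MEMLOCK: soft %llu, hard %llu bytes\n",
                (unsigned long long) rl.rlim_cur,
                (unsigned long long) rl.rlim_max);
      return 0;
    }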

>>> It seems that it isn't enough for the memory to be pinned, it has to be
>>> pinned using the Cuda API to get the performance boost.
>>
>> For performance boost of what kind of code?
>> I don't understand how Cuda API could be useful (or can be used at all) if
>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
>> in any way related to NVPTX offloading (unless it is in NVPTX target region
>> obviously, but then mlock isn't available, so sure, if there is something
>> CUDA can provide for that case, nice).
>
> This is specifically for NVPTX offload, of course, but then that's what
> our customer is paying for.
>
> The expectation, from users, is that memory pinning will give the
> benefits specific to the active device. We can certainly make that
> happen when there is only one (flavour of) offload device present. I had
> hoped it could be one way for all, but it looks like not.

Aren't there CUDA Driver interfaces for that?  That is:

>>> I had not done this
>>> because it was difficult to resolve the code abstraction
>>> difficulties and anyway the implementation was supposed to be device
>>> independent, but it seems we need a specific pinning mechanism for each
>>> device.

If not directly *allocating and registering* such memory via
'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
*register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
"Page-locks the memory range specified [...] and maps it for the
device(s) [...].  This memory range also is added to the same tracking
mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
manual 'mlock'ing is involved in that case either; presumably using this
interface again likely circumvents any "annoying" 'ulimit' limitations?)
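
In code, roughly (again an untested sketch; 'size' stands for whatever
the allocator was asked for):

    /* Untested sketch: page-lock and register an existing, ordinarily
       'malloc'ed buffer, instead of having the driver allocate it.  */
    void *buf = malloc (size);
    assert (buf != NULL);
    assert (cuMemHostRegister (buf, size, 0) == CUDA_SUCCESS);
    /* ... 'buf' should now be eligible for accelerated transfers, just
       like 'cuMemHostAlloc'ed memory ...  */
    assert (cuMemHostUnregister (buf) == CUDA_SUCCESS);
    free (buf);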

Such a *register* abstraction can then be implemented by all the libgomp
offloading plugins: they just call the respective
CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
memory.
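
Just to sketch the shape of that (the hook names here are invented for
illustration only; they're not existing libgomp plugin interface entry
points), the nvptx plugin might provide something like:

    /* Hypothetical plugin entry points; names invented for illustration.
       Other plugins would implement these via their respective runtime's
       equivalent (HSA etc.), or return 'false' if unsupported.  */
    bool
    GOMP_OFFLOAD_register_page_locked (void *ptr, size_t size)
    {
      return cuMemHostRegister (ptr, size, 0) == CUDA_SUCCESS;
    }

    bool
    GOMP_OFFLOAD_unregister_page_locked (void *ptr)
    {
      return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
    }

libgomp's 'omp_atk_pinned' allocator path would then obtain the memory in
the usual way ('malloc' etc.) and call the register hook of the active
device plugin(s).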

..., but maybe I'm missing some crucial "detail" here?


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
