Hi!

On 2022-06-09T11:38:22+0200, I wrote:
> On 2022-06-07T13:28:33+0100, Andrew Stubbs wrote:
>> On 07/06/2022 13:10, Jakub Jelinek wrote:
>>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>>> Following some feedback from users of the OG11 branch I think I need
>>>> to withdraw this patch, for now.
>>>>
>>>> The memory pinned via the mlock call does not give the expected
>>>> performance boost.  I had not expected that it would do much in my
>>>> test setup, given that the machine has a lot of RAM and my benchmarks
>>>> are small, but others have tried more and on varying machines and
>>>> architectures.
>>>
>>> I don't understand why there should be any expected performance boost
>>> (at least not unless the machine starts swapping out pages);
>>> { omp_atk_pinned, true } is solely about the requirement that the
>>> memory can't be swapped out.
>>
>> It seems like it takes a faster path through the NVidia drivers.  This
>> is a black box for me, but that seems like a plausible explanation.
>> The results are different on x86_64 and powerpc hosts (such as the
>> Summit supercomputer).
>
> For example, it's documented that 'cuMemHostAlloc' "Allocates
> page-locked host memory".  The crucial thing, though, that makes this
> different from 'malloc' plus 'mlock' is that "The driver tracks the
> virtual memory ranges allocated with this function and automatically
> accelerates calls to functions such as cuMemcpyHtoD().  Since the
> memory can be accessed directly by the device, it can be read or
> written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".
>
> Similarly, for example, for 'cuMemAllocHost'.
>
> This, to me, would explain why "the mlock call does not give the
> expected performance boost" in comparison with
> 'cuMemAllocHost'/'cuMemHostAlloc': with 'mlock' you're missing the
> "tracks the virtual memory ranges" aspect.
>
> Also, by means of the Nvidia Driver allocating the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?  I get this impression because the documentation
> continues, stating that "Allocating excessive amounts of memory with
> cuMemAllocHost() may degrade system performance, since it reduces the
> amount of memory available to the system for paging.  As a result,
> this function is best used sparingly to allocate staging areas for
> data exchange between host and device".
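To make that contrast concrete, here is a minimal, untested sketch (my
illustration only, not anything from the patches; error handling largely
elided, and the Driver API path assumes 'cuInit' plus a current CUDA
context) of the two ways of obtaining pinned host memory discussed above:

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <cuda.h>

    /* Pinned via the OS only: the pages can't be swapped out, but the
       CUDA Driver doesn't know about them, so cuMemcpyHtoD() etc. still
       treat the region as ordinary pageable memory.  Subject to
       RLIMIT_MEMLOCK ('ulimit -l').  */
    void *
    alloc_pinned_mlock (size_t size)
    {
      void *ptr = malloc (size);
      if (ptr != NULL && mlock (ptr, size) != 0)
        {
          free (ptr);
          ptr = NULL;
        }
      return ptr;
    }

    /* Pinned via the CUDA Driver: the driver "tracks the virtual memory
       ranges allocated with this function and automatically accelerates
       calls to functions such as cuMemcpyHtoD()".  Free with
       cuMemFreeHost.  */
    void *
    alloc_pinned_cuda (size_t size)
    {
      void *ptr;
      if (cuMemHostAlloc (&ptr, size, 0) != CUDA_SUCCESS)
        return NULL;
      return ptr;
    }

(The flags argument of 'cuMemHostAlloc' can additionally request
CU_MEMHOSTALLOC_PORTABLE or CU_MEMHOSTALLOC_DEVICEMAP behavior; 0 is the
plain page-locked case.)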
>>>> It seems that it isn't enough for the memory to be pinned, it has to
>>>> be pinned using the Cuda API to get the performance boost.
>>>
>>> For performance boost of what kind of code?
>>> I don't understand how Cuda API could be useful (or can be used at
>>> all) if offloading to NVPTX isn't involved.  The fact that somebody
>>> asks for host memory allocation with omp_atk_pinned set to true
>>> doesn't mean it will be in any way related to NVPTX offloading
>>> (unless it is in an NVPTX target region obviously, but then mlock
>>> isn't available, so sure, if there is something CUDA can provide for
>>> that case, nice).
>>
>> This is specifically for NVPTX offload, of course, but then that's
>> what our customer is paying for.
>>
>> The expectation, from users, is that memory pinning will give the
>> benefits specific to the active device.  We can certainly make that
>> happen when there is only one (flavour of) offload device present.  I
>> had hoped it could be one way for all, but it looks like not.
>
> Aren't there CUDA Driver interfaces for that?  That is:
>
>>>> I had not done this because it was difficult to resolve the code
>>>> abstraction difficulties and anyway the implementation was supposed
>>>> to be device independent, but it seems we need a specific pinning
>>>> mechanism for each device.
>
> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister':
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing involved in that case either; presumably, again,
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?)
>
> Such a *register* abstraction can then be implemented by all the
> libgomp offloading plugins: they just call the respective CUDA/HSA/etc.
> functions to register such (existing, 'malloc'ed, etc.) memory.
>
> ..., but maybe I'm missing some crucial "detail" here?

Indeed, this does appear to work; see the attached "[WIP] Attempt to
register OpenMP pinned memory using a device instead of 'mlock'".  Any
comments (aside from the TODOs that I'm still working on)?

Regards,
Thomas
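PS: Not taken from the attached WIP patch, but to sketch the *register*
idea in code (hypothetical hook names; error handling and the plugin's
usual CUDA call wrappers elided), the nvptx side of such an abstraction
would boil down to something like:

    #include <stdbool.h>
    #include <stddef.h>
    #include <cuda.h>

    /* Hypothetical libgomp plugin hook: page-lock an existing,
       host-allocated region and add it to the CUDA Driver's tracking so
       that cuMemcpyHtoD() etc. are accelerated.  Assumes an initialized
       CUDA context.  */
    bool
    GOMP_OFFLOAD_register_page_locked (void *ptr, size_t size)
    {
      return cuMemHostRegister (ptr, size, CU_MEMHOSTREGISTER_PORTABLE)
             == CUDA_SUCCESS;
    }

    /* Hypothetical counterpart, to be called before the memory is
       returned to the host-side allocator.  */
    bool
    GOMP_OFFLOAD_unregister_page_locked (void *ptr)
    {
      return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
    }

The GCN/HSA plugin could presumably implement the same hook via
'hsa_amd_memory_lock'/'hsa_amd_memory_unlock'.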