Hi!

On 2022-06-09T11:38:22+0200, I wrote:
> On 2022-06-07T13:28:33+0100, Andrew Stubbs wrote:
>> On 07/06/2022 13:10, Jakub Jelinek wrote:
>>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>>> Following some feedback from users of the OG11 branch I think I need
>>>> to withdraw this patch, for now.
>>>>
>>>> The memory pinned via the mlock call does not give the expected
>>>> performance boost.  I had not expected that it would do much in my
>>>> test setup, given that the machine has a lot of RAM and my benchmarks
>>>> are small, but others have tried more and on varying machines and
>>>> architectures.
>>>
>>> I don't understand why there should be any expected performance boost
>>> (at least not unless the machine starts swapping out pages);
>>> { omp_atk_pinned, true } is solely about the requirement that the
>>> memory can't be swapped out.
>>
>> It seems like it takes a faster path through the NVidia drivers.  This
>> is a black box for me, but that seems like a plausible explanation.
>> The results are different on x86_64 and powerpc hosts (such as the
>> Summit supercomputer).
>
> For example, it's documented that 'cuMemHostAlloc' "Allocates
> page-locked host memory".  The crucial thing, though, that makes this
> different from 'malloc' plus 'mlock' is that "The driver tracks the
> virtual memory ranges allocated with this function and automatically
> accelerates calls to functions such as cuMemcpyHtoD().  Since the
> memory can be accessed directly by the device, it can be read or
> written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".
>
> Similarly, for example, for 'cuMemAllocHost'.
>
> This, to me, would explain why "the mlock call does not give the
> expected performance boost" in comparison with
> 'cuMemAllocHost'/'cuMemHostAlloc': with 'mlock' you're missing the
> "tracks the virtual memory ranges" aspect.
>
> Also, by means of the Nvidia Driver allocating the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?  I get this impression because the documentation
> continues, stating that "Allocating excessive amounts of memory with
> cuMemAllocHost() may degrade system performance, since it reduces the
> amount of memory available to the system for paging.  As a result,
> this function is best used sparingly to allocate staging areas for
> data exchange between host and device".
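To make that contrast concrete, here is a minimal, untested sketch (my
illustration only, not anything from the patches; error handling largely
elided, and the Driver API path assumes 'cuInit' plus a current CUDA
context) of the two ways of obtaining pinned host memory discussed above:

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <cuda.h>

    /* Pinned via the OS only: the pages can't be swapped out, but the
       CUDA Driver doesn't know about them, so cuMemcpyHtoD() etc. still
       treat the region as ordinary pageable memory.  Subject to
       RLIMIT_MEMLOCK ('ulimit -l').  */
    void *
    alloc_pinned_mlock (size_t size)
    {
      void *ptr = malloc (size);
      if (ptr != NULL && mlock (ptr, size) != 0)
        {
          free (ptr);
          ptr = NULL;
        }
      return ptr;
    }

    /* Pinned via the CUDA Driver: the driver "tracks the virtual memory
       ranges allocated with this function and automatically accelerates
       calls to functions such as cuMemcpyHtoD()".  Free with
       cuMemFreeHost.  */
    void *
    alloc_pinned_cuda (size_t size)
    {
      void *ptr;
      if (cuMemHostAlloc (&ptr, size, 0) != CUDA_SUCCESS)
        return NULL;
      return ptr;
    }

(The flags argument of 'cuMemHostAlloc' can additionally request
CU_MEMHOSTALLOC_PORTABLE or CU_MEMHOSTALLOC_DEVICEMAP behavior; 0 is the
plain page-locked case.)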
>>>> It seems that it isn't enough for the memory to be pinned, it has to
>>>> be pinned using the Cuda API to get the performance boost.
>>>
>>> For performance boost of what kind of code?
>>> I don't understand how Cuda API could be useful (or can be used at
>>> all) if offloading to NVPTX isn't involved.  The fact that somebody
>>> asks for host memory allocation with omp_atk_pinned set to true
>>> doesn't mean it will be in any way related to NVPTX offloading
>>> (unless it is in an NVPTX target region obviously, but then mlock
>>> isn't available, so sure, if there is something CUDA can provide for
>>> that case, nice).
>>
>> This is specifically for NVPTX offload, of course, but then that's
>> what our customer is paying for.
>>
>> The expectation, from users, is that memory pinning will give the
>> benefits specific to the active device.  We can certainly make that
>> happen when there is only one (flavour of) offload device present.  I
>> had hoped it could be one way for all, but it looks like not.
>
> Aren't there CUDA Driver interfaces for that?  That is:
>
>>>> I had not done this because it was difficult to resolve the code
>>>> abstraction difficulties and anyway the implementation was supposed
>>>> to be device independent, but it seems we need a specific pinning
>>>> mechanism for each device.
>
> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister':
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing involved in that case either; presumably, again,
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?)
>
> Such a *register* abstraction can then be implemented by all the
> libgomp offloading plugins: they just call the respective CUDA/HSA/etc.
> functions to register such (existing, 'malloc'ed, etc.) memory.
>
> ..., but maybe I'm missing some crucial "detail" here?

Indeed, this does appear to work; see the attached "[WIP] Attempt to
register OpenMP pinned memory using a device instead of 'mlock'".  Any
comments (aside from the TODOs that I'm still working on)?

Regards,
Thomas
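PS: Not taken from the attached WIP patch, but to sketch the *register*
idea in code (hypothetical hook names; error handling and the plugin's
usual CUDA call wrappers elided), the nvptx side of such an abstraction
would boil down to something like:

    #include <stdbool.h>
    #include <stddef.h>
    #include <cuda.h>

    /* Hypothetical libgomp plugin hook: page-lock an existing,
       host-allocated region and add it to the CUDA Driver's tracking so
       that cuMemcpyHtoD() etc. are accelerated.  Assumes an initialized
       CUDA context.  */
    bool
    GOMP_OFFLOAD_register_page_locked (void *ptr, size_t size)
    {
      return cuMemHostRegister (ptr, size, CU_MEMHOSTREGISTER_PORTABLE)
             == CUDA_SUCCESS;
    }

    /* Hypothetical counterpart, to be called before the memory is
       returned to the host-side allocator.  */
    bool
    GOMP_OFFLOAD_unregister_page_locked (void *ptr)
    {
      return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
    }

The GCN/HSA plugin could presumably implement the same hook via
'hsa_amd_memory_lock'/'hsa_amd_memory_unlock'.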