public inbox for gcc-patches@gcc.gnu.org
From: "Stubbs, Andrew" <andrew.stubbs@siemens.com>
To: Thomas Schwinge <thomas@codesourcery.com>,
	Andrew Stubbs <ams@codesourcery.com>,
	Jakub Jelinek <jakub@redhat.com>,
	Tobias Burnus <tobias@codesourcery.com>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: RE: Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
Date: Thu, 16 Feb 2023 16:17:32 +0000	[thread overview]
Message-ID: <DB8PR10MB3676D898B971F9E6AB2C5AFFFBA09@DB8PR10MB3676.EURPRD10.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <87cz69tyla.fsf@dem-tschwing-1.ger.mentorg.com>

> -----Original Message-----
> From: Thomas Schwinge <thomas@codesourcery.com>
> Sent: 16 February 2023 15:33
> To: Andrew Stubbs <ams@codesourcery.com>; Jakub Jelinek <jakub@redhat.com>;
> Tobias Burnus <tobias@codesourcery.com>; gcc-patches@gcc.gnu.org
> Subject: Attempt to register OpenMP pinned memory using a device instead of
> 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
> 
> Hi!
> 
> On 2022-06-09T11:38:22+0200, I wrote:
> > On 2022-06-07T13:28:33+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> >> On 07/06/2022 13:10, Jakub Jelinek wrote:
> >>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
> >>>> Following some feedback from users of the OG11 branch I think I need to
> >>>> withdraw this patch, for now.
> >>>>
> >>>> The memory pinned via the mlock call does not give the expected
> >>>> performance boost. I had not expected that it would do much in my
> >>>> test setup, given that the machine has a lot of RAM and my benchmarks
> >>>> are small, but others have tried more and on varying machines and
> >>>> architectures.
> >>>
> >>> I don't understand why there should be any expected performance boost
> >>> (at least not unless the machine starts swapping out pages),
> >>> { omp_atk_pinned, true } is solely about the requirement that the
> >>> memory can't be swapped out.
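
For reference, requesting that trait through the OpenMP allocator API looks
roughly like this (a minimal sketch; error handling kept to the essentials):

#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  /* Ask for host memory that must not be swapped out.  */
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t al
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  double *p = omp_alloc (1024 * sizeof (double), al);
  int ok = (p != NULL);
  /* ... use 'p', e.g. as source/destination of 'omp target' transfers ... */
  omp_free (p, al);
  omp_destroy_allocator (al);
  return ok ? 0 : 1;
}
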
> >>
> >> It seems like it takes a faster path through the NVidia drivers. This is
> >> a black box, for me, but that seems like a plausible explanation. The
> >> results are different on x86_64 and powerpc hosts (such as the Summit
> >> supercomputer).
> >
> > For example, it's documented that 'cuMemHostAlloc',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> > "Allocates page-locked host memory".  The crucial thing, though, what
> > makes this different from 'malloc' plus 'mlock' is, that "The driver
> > tracks the virtual memory ranges allocated with this function and
> > automatically accelerates calls to functions such as cuMemcpyHtoD().
> > Since the memory can be accessed directly by the device, it can be read
> > or written with much higher bandwidth than pageable memory obtained with
> > functions such as malloc()".
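
Concretely, the two paths being compared look roughly like this (a minimal
CUDA Driver API sketch, assuming 'cuInit' has been called and a context is
current; the function names are illustrative only):

#include <cuda.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Driver-tracked, page-locked host memory: 'cuMemcpyHtoD' etc. from
   this range take the accelerated path.  */
void *
alloc_pinned_via_cuda (size_t size)
{
  void *p;
  if (cuMemHostAlloc (&p, size, 0) != CUDA_SUCCESS)
    return NULL;
  return p;
}

/* Page-locked via 'mlock': the CUDA driver does not know about this
   range, so copies are treated like ordinary pageable memory.  */
void *
alloc_pinned_via_mlock (size_t size)
{
  void *p = malloc (size);
  if (p != NULL && mlock (p, size) != 0)
    {
      free (p);
      return NULL;
    }
  return p;
}
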
> >
> > Similar, for example, for 'cuMemAllocHost',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.
> >
> > This, to me, would explain why "the mlock call does not give the
> > expected performance boost", in comparison with
> > 'cuMemAllocHost'/'cuMemHostAlloc';
> > with 'mlock' you're missing the "tracks the virtual memory ranges"
> > aspect.
> >
> > Also, by means of the Nvidia Driver allocating the memory, I suppose
> > using this interface likely circumvents any "annoying" 'ulimit'
> > limitations?  I get this impression because the documentation goes on
> > to state that "Allocating excessive amounts of memory with
> > cuMemAllocHost() may degrade system performance, since it reduces the
> > amount of memory available to the system for paging.  As a result, this
> > function is best used sparingly to allocate staging areas for data
> > exchange between host and device".
> >
> >>>> It seems that it isn't enough for the memory to be pinned, it has to be
> >>>> pinned using the Cuda API to get the performance boost.
> >>>
> >>> For performance boost of what kind of code?
> >>> I don't understand how Cuda API could be useful (or can be used at
> >>> all) if offloading to NVPTX isn't involved.  The fact that somebody
> >>> asks for host memory allocation with omp_atk_pinned set to true
> >>> doesn't mean it will be in any way related to NVPTX offloading (unless
> >>> it is in NVPTX target region obviously, but then mlock isn't
> >>> available, so sure, if there is something CUDA can provide for that
> >>> case, nice).
> >>
> >> This is specifically for NVPTX offload, of course, but then that's what
> >> our customer is paying for.
> >>
> >> The expectation, from users, is that memory pinning will give the
> >> benefits specific to the active device. We can certainly make that
> >> happen when there is only one (flavour of) offload device present. I had
> >> hoped it could be one way for all, but it looks like not.
> >
> > Aren't there CUDA Driver interfaces for that?  That is:
> >
> >>>> I had not done this because it was difficult to resolve the code
> >>>> abstraction difficulties and anyway the implementation was supposed
> >>>> to be device independent, but it seems we need a specific pinning
> >>>> mechanism for each device.
> >
> > If not directly *allocating and registering* such memory via
> > 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> > *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> > "Page-locks the memory range specified [...] and maps it for the
> > device(s) [...].  This memory range also is added to the same tracking
> > mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> > manual 'mlock'ing involved in that case either; presumably using this
> > interface also circumvents any "annoying" 'ulimit' limitations?)
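
In other words, pinning already-allocated memory would look roughly like
this (a sketch, assuming an active CUDA context; the driver may require the
registered range to be page-aligned, hence the aligned allocation, and
'cuMemHostUnregister' would undo the registration before free):

#include <cuda.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate ordinary host memory, then hand the range to the CUDA
   driver's tracking via cuMemHostRegister.  */
void *
pinned_malloc (size_t size)
{
  size_t pagesize = (size_t) sysconf (_SC_PAGESIZE);
  void *p = NULL;
  /* Keep the range page-aligned, in case the driver requires it.  */
  if (posix_memalign (&p, pagesize, size) != 0)
    return NULL;
  if (cuMemHostRegister (p, size, 0) != CUDA_SUCCESS)
    {
      free (p);
      return NULL;
    }
  return p;
}
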
> >
> > Such a *register* abstraction can then be implemented by all the libgomp
> > offloading plugins: they just call the respective
> > CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> > memory.
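
The per-plugin implementations might then be as thin as this (a
hypothetical sketch; the hook name 'GOMP_OFFLOAD_register_page_locked' is
illustrative, not the actual libgomp plugin interface):

#include <stdbool.h>
#include <stddef.h>

/* In the nvptx plugin (assuming <cuda.h>): */
bool
GOMP_OFFLOAD_register_page_locked (void *ptr, size_t size)
{
  return cuMemHostRegister (ptr, size, 0) == CUDA_SUCCESS;
}

/* In the gcn plugin (assuming <hsa_ext_amd.h>), the analogous call
   locks the range for all agents: */
bool
GOMP_OFFLOAD_register_page_locked (void *ptr, size_t size)
{
  void *agent_ptr;
  return hsa_amd_memory_lock (ptr, size, NULL, 0, &agent_ptr)
	 == HSA_STATUS_SUCCESS;
}
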
> >
> > ..., but maybe I'm missing some crucial "detail" here?
> 
> Indeed this does appear to work; see attached
> "[WIP] Attempt to register OpenMP pinned memory using a device instead of
> 'mlock'".
> Any comments (aside from the TODOs that I'm still working on)?

The mmap implementation was not optimized for a lot of small allocations
(each allocation occupies at least one whole page), and I can't see that
issue changing here, so I don't know whether this can be used as a
replacement for mlockall.

I had assumed that using the Cuda allocator would fix that limitation.

Andrew


Thread overview: 28+ messages
2022-01-04 15:32 [PATCH] libgomp, openmp: pinned memory Andrew Stubbs
2022-01-04 15:55 ` Jakub Jelinek
2022-01-04 16:58   ` Andrew Stubbs
2022-01-04 18:28     ` Jakub Jelinek
2022-01-04 18:47       ` Jakub Jelinek
2022-01-05 17:07         ` Andrew Stubbs
2022-01-13 13:53           ` Andrew Stubbs
2022-06-07 11:05             ` Andrew Stubbs
2022-06-07 12:10               ` Jakub Jelinek
2022-06-07 12:28                 ` Andrew Stubbs
2022-06-07 12:40                   ` Jakub Jelinek
2022-06-09  9:38                   ` Thomas Schwinge
2022-06-09 10:09                     ` Tobias Burnus
2022-06-09 10:22                       ` Stubbs, Andrew
2022-06-09 10:31                     ` Stubbs, Andrew
2023-02-16 15:32                     ` Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory) Thomas Schwinge
2023-02-16 16:17                       ` Stubbs, Andrew [this message]
2023-02-16 22:06                         ` [og12] " Thomas Schwinge
2023-02-17  8:12                           ` Thomas Schwinge
2023-02-20  9:48                             ` Andrew Stubbs
2023-02-20 13:53                               ` [og12] Attempt to not just register but allocate OpenMP pinned memory using a device (was: [og12] Attempt to register OpenMP pinned memory using a device instead of 'mlock') Thomas Schwinge
2023-02-10 15:11             ` [PATCH] libgomp, openmp: pinned memory Thomas Schwinge
2023-02-10 15:55               ` Andrew Stubbs
2023-02-16 21:39             ` [og12] Clarify/verify OpenMP 'omp_calloc' zero-initialization for pinned memory (was: [PATCH] libgomp, openmp: pinned memory) Thomas Schwinge
2023-03-24 15:49 ` [og12] libgomp: Document OpenMP 'pinned' memory (was: [PATCH] libgomp, openmp: pinned memory) Thomas Schwinge
2023-03-27  9:27   ` Stubbs, Andrew
2023-03-27 11:26     ` [og12] libgomp: Document OpenMP 'pinned' memory (was: [PATCH] libgomp, openmp: pinned memory) Thomas Schwinge
2023-03-27 12:01       ` Andrew Stubbs
