From: Thomas Schwinge
To: Andrew Stubbs, Jakub Jelinek
Subject: Re: [PATCH] libgomp, openmp: pinned memory
Date: Thu, 9 Jun 2022 11:38:22 +0200
Message-ID: <87edzy5g8h.fsf@euler.schwinge.homeip.net>

Hi!

I'm not all too familiar with the "newish" CUDA Driver API, but maybe the
following is useful still:

On 2022-06-07T13:28:33+0100, Andrew Stubbs wrote:
> On 07/06/2022 13:10, Jakub Jelinek wrote:
>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>> Following some feedback from users of the OG11 branch I think I need to
>>> withdraw this patch, for now.
>>>
>>> The memory pinned via the mlock call does not give the expected
>>> performance boost.  I had not expected that it would do much in my test
>>> setup, given that the machine has a lot of RAM and my benchmarks are
>>> small, but others have tried more and on varying machines and
>>> architectures.
>>
>> I don't understand why there should be any expected performance boost (at
>> least not unless the machine starts swapping out pages),
>> { omp_atk_pinned, true } is solely about the requirement that the memory
>> can't be swapped out.
>
> It seems like it takes a faster path through the NVidia drivers.  This is
> a black box, for me, but that seems like a plausible explanation.  The
> results are different on x86_64 and powerpc hosts (such as the Summit
> supercomputer).

For example, it's documented that 'cuMemHostAlloc' "Allocates page-locked
host memory".
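For concreteness, here is a minimal standalone sketch (mine, not taken from
the patch; buffer size and build command are just placeholders) of allocating
page-locked host memory through the Driver API instead of malloc plus mlock;
the cuInit/cuCtxCreate boilerplate is only there to make it self-contained:

    /* Hypothetical example, not libgomp code.  Build with something like:
         gcc test-pinned.c -lcuda  */
    #include <cuda.h>

    int
    main (void)
    {
      CUdevice dev;
      CUcontext ctx;
      void *buf = NULL;

      /* The cuMem* calls below need a current CUDA context.  */
      if (cuInit (0) != CUDA_SUCCESS
	  || cuDeviceGet (&dev, 0) != CUDA_SUCCESS
	  || cuCtxCreate (&ctx, 0, dev) != CUDA_SUCCESS)
	return 1;

      /* Page-locked allocation; the driver tracks this range, so
	 cuMemcpyHtoD/cuMemcpyDtoH on it can take the fast DMA path.  */
      if (cuMemHostAlloc (&buf, 1 << 20, CU_MEMHOSTALLOC_PORTABLE)
	  != CUDA_SUCCESS)
	return 1;

      /* ... use 'buf' as a staging area for host <-> device copies ...  */

      cuMemFreeHost (buf);
      cuCtxDestroy (ctx);
      return 0;
    }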
The crucial thing, though, that makes this different from 'malloc' plus
'mlock' is that "The driver tracks the virtual memory ranges allocated with
this function and automatically accelerates calls to functions such as
cuMemcpyHtoD().  Since the memory can be accessed directly by the device, it
can be read or written with much higher bandwidth than pageable memory
obtained with functions such as malloc()".  Similar, for example, for
'cuMemAllocHost'.

This, to me, would explain why "the mlock call does not give the expected
performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
with 'mlock' you're missing the "tracks the virtual memory ranges" aspect.

Also, by means of the Nvidia Driver allocating the memory, I suppose using
this interface likely circumvents any "annoying" 'ulimit' limitations?  I
get this impression because the documentation continues: "Allocating
excessive amounts of memory with cuMemAllocHost() may degrade system
performance, since it reduces the amount of memory available to the system
for paging.  As a result, this function is best used sparingly to allocate
staging areas for data exchange between host and device".

>>> It seems that it isn't enough for the memory to be pinned, it has to be
>>> pinned using the Cuda API to get the performance boost.
>>
>> For performance boost of what kind of code?
>> I don't understand how Cuda API could be useful (or can be used at all) if
>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
>> in any way related to NVPTX offloading (unless it is in NVPTX target region
>> obviously, but then mlock isn't available, so sure, if there is something
>> CUDA can provide for that case, nice).
>
> This is specifically for NVPTX offload, of course, but then that's what
> our customer is paying for.
>
> The expectation, from users, is that memory pinning will give the
> benefits specific to the active device.  We can certainly make that
> happen when there is only one (flavour of) offload device present.  I had
> hoped it could be one way for all, but it looks like not.

Aren't there CUDA Driver interfaces for that?  That is:

>>> I had not done this because it was difficult to resolve the code
>>> abstraction difficulties and anyway the implementation was supposed to
>>> be device independent, but it seems we need a specific pinning mechanism
>>> for each device.

If not directly *allocating and registering* such memory via
'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
*register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister':
"Page-locks the memory range specified [...] and maps it for the device(s)
[...].  This memory range also is added to the same tracking mechanism as
cuMemHostAlloc to automatically accelerate [...]"?  (No manual 'mlock'ing
involved in that case either; presumably this interface again likely
circumvents any "annoying" 'ulimit' limitations?)

Such a *register* abstraction can then be implemented by all the libgomp
offloading plugins: they just call the respective CUDA/HSA/etc. functions
to register such (existing, 'malloc'ed, etc.) memory; see the sketch at the
end of this mail.

..., but maybe I'm missing some crucial "detail" here?
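For illustration, a rough sketch (function names invented by me, error
handling elided, not actual plugin code) of what such a *register* hook
could look like on the nvptx side, assuming a CUDA context is current:

    /* Hypothetical sketch only.  Pin memory that was obtained with plain
       malloc (or any other host allocator) by handing it to the CUDA
       Driver.  */
    #include <stdbool.h>
    #include <stddef.h>
    #include <cuda.h>

    static bool
    nvptx_host_register (void *ptr, size_t size)
    {
      /* Page-locks [ptr, ptr + size) and adds it to the same tracking
	 mechanism as cuMemHostAlloc, so later cuMemcpyHtoD/DtoH on this
	 range take the accelerated path.  */
      return cuMemHostRegister (ptr, size, CU_MEMHOSTREGISTER_PORTABLE)
	     == CUDA_SUCCESS;
    }

    static bool
    nvptx_host_unregister (void *ptr)
    {
      return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
    }

An HSA-based plugin would call the corresponding ROCr routine, and a
host-only fallback could keep using plain 'mlock'.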
Grüße
 Thomas