From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Stubbs, Andrew"
To: Thomas Schwinge, Andrew Stubbs, Jakub Jelinek, Tobias Burnus, "gcc-patches@gcc.gnu.org"
Subject: RE: Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
Date: Thu, 16 Feb 2023 16:17:32 +0000
In-Reply-To: <87cz69tyla.fsf@dem-tschwing-1.ger.mentorg.com>
References: <20220104155558.GG2646553@tucnak> <48ee767a-0d90-53b4-ea54-9deba9edd805@codesourcery.com> <20220104182829.GK2646553@tucnak> <20220104184740.GL2646553@tucnak> <87edzy5g8h.fsf@euler.schwinge.homeip.net> <87cz69tyla.fsf@dem-tschwing-1.ger.mentorg.com>
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0

> -----Original Message-----
> From: Thomas Schwinge
> Sent: 16 February 2023 15:33
> To: Andrew Stubbs; Jakub Jelinek; Tobias Burnus; gcc-patches@gcc.gnu.org
> Subject:
>   Attempt to register OpenMP pinned memory using a device instead of
>   'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
>
> Hi!
>
> On 2022-06-09T11:38:22+0200, I wrote:
> > On 2022-06-07T13:28:33+0100, Andrew Stubbs wrote:
> >> On 07/06/2022 13:10, Jakub Jelinek wrote:
> >>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
> >>>> Following some feedback from users of the OG11 branch I think I need to
> >>>> withdraw this patch, for now.
> >>>>
> >>>> The memory pinned via the mlock call does not give the expected performance
> >>>> boost. I had not expected that it would do much in my test setup, given that
> >>>> the machine has a lot of RAM and my benchmarks are small, but others have
> >>>> tried more and on varying machines and architectures.
> >>>
> >>> I don't understand why there should be any expected performance boost (at
> >>> least not unless the machine starts swapping out pages),
> >>> { omp_atk_pinned, true } is solely about the requirement that the memory
> >>> can't be swapped out.
> >>
> >> It seems like it takes a faster path through the NVidia drivers. This is
> >> a black box, for me, but that seems like a plausible explanation. The
> >> results are different on x86_64 and powerpc hosts (such as the Summit
> >> supercomputer).
> >
> > For example, it's documented that 'cuMemHostAlloc',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> > "Allocates page-locked host memory".
> > The crucial thing, though, what makes this different from 'malloc'
> > plus 'mlock' is, that "The driver tracks the virtual memory ranges
> > allocated with this function and automatically accelerates calls to
> > functions such as cuMemcpyHtoD(). Since the memory can be accessed
> > directly by the device, it can be read or written with much higher
> > bandwidth than pageable memory obtained with functions such as
> > malloc()".
> >
> > Similar, for example, for 'cuMemAllocHost',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.
> >
> > This, to me, would explain why "the mlock call does not give the expected
> > performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
> > with 'mlock' you're missing the "tracks the virtual memory ranges"
> > aspect.
> >
> > Also, by means of the Nvidia Driver allocating the memory, I suppose
> > using this interface likely circumvents any "annoying" 'ulimit'
> > limitations? I get this impression, because documentation continues
> > stating that "Allocating excessive amounts of memory with
> > cuMemAllocHost() may degrade system performance, since it reduces the
> > amount of memory available to the system for paging. As a result, this
> > function is best used sparingly to allocate staging areas for data
> > exchange between host and device".
> >
> >>>> It seems that it isn't enough for the memory to be pinned, it has to be
> >>>> pinned using the Cuda API to get the performance boost.
> >>>
> >>> For performance boost of what kind of code?
> >>> I don't understand how Cuda API could be useful (or can be used at all) if
> >>> offloading to NVPTX isn't involved. The fact that somebody asks for host
> >>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
> >>> in any way related to NVPTX offloading (unless it is in NVPTX target region
> >>> obviously, but then mlock isn't available, so sure, if there is something
> >>> CUDA can provide for that case, nice).
> >>
> >> This is specifically for NVPTX offload, of course, but then that's what
> >> our customer is paying for.
> >>
> >> The expectation, from users, is that memory pinning will give the
> >> benefits specific to the active device. We can certainly make that
> >> happen when there is only one (flavour of) offload device present. I had
> >> hoped it could be one way for all, but it looks like not.
> >
> > Aren't there CUDA Driver interfaces for that? That is:
> >
> >>>> I had not done this because it was difficult to resolve the code
> >>>> abstraction difficulties and anyway the implementation was supposed
> >>>> to be device independent, but it seems we need a specific pinning
> >>>> mechanism for each device.
> >
> > If not directly *allocating and registering* such memory via
> > 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> > *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> > "Page-locks the memory range specified [...] and maps it for the
> > device(s) [...].
This memory range also is added to the same tracking > > mechanism as cuMemHostAlloc to automatically accelerate [...]"? (No > > manual 'mlock'ing involved in that case, too; presumably again using th= is > > interface likely circumvents any "annoying" 'ulimit' limitations?) > > > > Such a *register* abstraction can then be implemented by all the libgom= p > > offloading plugins: they just call the respective > > CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.) > > memory. > > > > ..., but maybe I'm missing some crucial "detail" here? >=20 > Indeed this does appear to work; see attached > "[WIP] Attempt to register OpenMP pinned memory using a device instead of > 'mlock'". > Any comments (aside from the TODOs that I'm still working on)? The mmap implementation was not optimized for a lot of small allocations, a= nd I can't see that issue changing here, so I don't know if this can be use= d for mlockall replacement. I had assumed that using the Cuda allocator would fix that limitation. Andrew