From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/alternative; boundary="------------O0J5Mx0GXX3TLmhABWC8W12n"
Message-ID: <1167cf2e-0945-452b-94ed-6a796b9399cf@baylibre.com>
Date: Wed, 5 Jun 2024 11:41:12 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [patch] libgomp: Enable USM for some nvptx devices
From: Tobias Burnus
To: Andrew Stubbs, gcc-patches, Jakub Jelinek
Cc: Thomas Schwinge
References: <8b1aa301-dd3d-4ad1-bc90-e3b301d16c67@baylibre.com>
 <19c7375c-bdf3-48de-af66-729a14f64696@baylibre.com>
 <57fa8e9b-7e6f-42d4-85cc-c2af21a44b11@baylibre.com>
 <866924d4-6b52-4b1b-b707-33957ba80114@baylibre.com>
 <91c01d9e-2d7d-4027-baaa-0c2f7873f124@baylibre.com>
Content-Language: en-US

This is a multi-part message in MIME format.
--------------O0J5Mx0GXX3TLmhABWC8W12n
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Hi Andrew, hello world,

Now with AMD Instinct MI200 data - see below. And a better look at the numbers.

In terms of USM, there does not seem to be a clear winner between the two approaches. To draw conclusions, more runs are definitely needed (statistics): the runs below show that the differences between runs can be larger than the effect of mapping vs. USM.
Also, OG13's USM being ~40% slower on MI210 (compared with mainline or OG13 'map'), while mainline's USM is about as fast as 'map' (OG13 or mainline), is not consistent with the MI250X result, where both USM variants are slower than 'map': mainline's USM by ~30%, OG13's by only 12%.

Tobias Burnus wrote:
> I have now tried it on my laptop with BabelStream, https://github.com/UoB-HPC/BabelStream
>
> Compiling with:
> echo "#pragma omp requires unified_shared_memory" > omp-usm.h
> cmake -DMODEL=omp -DCMAKE_CXX_COMPILER=$HOME/projects/gcc-trunk-offload/bin/g++ \
>       -DCXX_EXTRA_FLAGS="-g -include ../omp-usm.h -foffload=nvptx-none -fopenmp" -DOFFLOAD=ON ..
>
> (and the variants: no -include (→ map) + -DOFFLOAD=OFF (= host), and with hostfallback,
> via env var (or, for usm-14, due to lacking support).)
>
> For mainline, I get (either with libgomp.so of mainline or GCC 14, i.e. w/o USM support):
> host-14.log                195.84user 0.94system 0:11.20elapsed 1755%CPU (0avgtext+0avgdata 1583268maxresident)k
> host-mainline.log          200.16user 1.00system 0:11.89elapsed 1691%CPU (0avgtext+0avgdata 1583272maxresident)k
> hostfallback-mainline.log  288.99user 4.57system 0:19.39elapsed 1513%CPU (0avgtext+0avgdata 1583972maxresident)k
> usm-14.log                 279.91user 5.38system 0:19.57elapsed 1457%CPU (0avgtext+0avgdata 1590168maxresident)k
> map-14.log                 4.17user 0.45system 0:03.58elapsed 129%CPU (0avgtext+0avgdata 1691152maxresident)k
> map-mainline.log           4.15user 0.44system 0:03.58elapsed 128%CPU (0avgtext+0avgdata 1691260maxresident)k
> usm-mainline.log           3.63user 1.96system 0:03.88elapsed 144%CPU (0avgtext+0avgdata 1692068maxresident)k
>
> Thus: GPU is faster than host, host fallback takes 40% longer than doing host compilation.
> USM is 15% faster than mapping.

Correction: I shouldn't look at user time but at elapsed time.
For the latter, USM is 8% slower on mainline; hostfallback is ~70% slower than host execution.

> With OG13, the pattern is similar, except that USM is only 3% faster.

Here, USM (elapsed) is 2.5% slower. The results are a bit hard to compare because OG13 is faster for mapping and USM, which makes it difficult to separate the OG13-vs-mainline performance difference from the effect of the two different USM approaches.

> host-og13.log              191.51user 0.70system 0:09.80elapsed 1960%CPU (0avgtext+0avgdata 1583280maxresident)k
> map-hostfallback-og13.log  205.12user 1.09system 0:10.82elapsed 1905%CPU (0avgtext+0avgdata 1585092maxresident)k
> usm-hostfallback-og13.log  338.82user 4.60system 0:19.34elapsed 1775%CPU (0avgtext+0avgdata 1584580maxresident)k
> map-og13.log               4.43user 0.42system 0:03.59elapsed 135%CPU (0avgtext+0avgdata 1692692maxresident)k
> usm-og13.log               4.31user 1.18system 0:03.68elapsed 149%CPU (0avgtext+0avgdata 1686256maxresident)k
>

* * *

As IT issues are now solved:

(A) On AMD Instinct MI210 (gfx90a)

The host fallback is very slow here, with an elapsed time of 24s vs. 1.6s for host execution.

map and USM seem to be in the same ballpark. Between the two 'map' runs, I see a difference of 8%; the USM times lie between those map results.

I see similar results for OG13 as for mainline, except for OG13's USM, which is ~40% slower (elapsed time) than map (OG13 or mainline) and than mainline's USM.
host-mainline-2.log          194.00user 7.21system 0:01.44elapsed 13954%CPU (0avgtext+0avgdata 1320960maxresident)k
host-mainline.log            221.53user 5.58system 0:01.78elapsed 12716%CPU (0avgtext+0avgdata 1318912maxresident)k
hostfallback-mainline-1.log  3073.35user 146.22system 0:24.25elapsed 13272%CPU (0avgtext+0avgdata 1644544maxresident)k
hostfallback-mainline-2.log  2268.62user 146.13system 0:23.39elapsed 10320%CPU (0avgtext+0avgdata 1650544maxresident)k
map-mainline-1.log           5.38user 16.16system 0:03.00elapsed 716%CPU (0avgtext+0avgdata 1714936maxresident)k
map-mainline-2.log           5.12user 15.93system 0:02.74elapsed 768%CPU (0avgtext+0avgdata 1714932maxresident)k
usm-mainline-1.log           7.61user 2.30system 0:02.89elapsed 342%CPU (0avgtext+0avgdata 1716984maxresident)k
usm-mainline-2.log           7.75user 2.92system 0:02.89elapsed 369%CPU (0avgtext+0avgdata 1716980maxresident)k
host-og13-1.log              213.69user 6.37system 0:01.56elapsed 14026%CPU (0avgtext+0avgdata 1316864maxresident)k
hostfallback-map-og13-1.log  3026.68user 123.77system 0:23.69elapsed 13295%CPU (0avgtext+0avgdata 1642496maxresident)k
hostfallback-map-og13-2.log  3118.71user 123.81system 0:24.49elapsed 13235%CPU (0avgtext+0avgdata 1628160maxresident)k
hostfallback-usm-og13-1.log  3070.33user 116.23system 0:23.86elapsed 13354%CPU (0avgtext+0avgdata 1648632maxresident)k
hostfallback-usm-og13-2.log  3112.34user 125.54system 0:24.39elapsed 13273%CPU (0avgtext+0avgdata 1622012maxresident)k
map-og13-1.log               5.61user 7.13system 0:02.69elapsed 472%CPU (0avgtext+0avgdata 1716984maxresident)k
map-og13-2.log               5.39user 16.25system 0:02.83elapsed 764%CPU (0avgtext+0avgdata 1716984maxresident)k
usm-og13-1.log               7.23user 3.13system 0:04.37elapsed 237%CPU (0avgtext+0avgdata 1716964maxresident)k
usm-og13-2.log               7.31user 3.15system 0:03.98elapsed 262%CPU (0avgtext+0avgdata 1716964maxresident)k

* * *

Running it on MI250X:

USM is in the same ballpark as map – but here USM is actually slower than map, by ~30% (mainline) or 12% (OG13).
omp-stream-mainline-map  7.24user 0.71system 0:01.18elapsed 672%CPU (0avgtext+0avgdata 1728852maxresident)k
omp-stream-mainline-usm  2.48user 1.07system 0:01.44elapsed 247%CPU (0avgtext+0avgdata 1728916maxresident)k
omp-stream-og13-map      7.14user 0.72system 0:01.10elapsed 712%CPU (0avgtext+0avgdata 1728708maxresident)k
omp-stream-og13-usm      2.32user 0.91system 0:01.23elapsed 262%CPU (0avgtext+0avgdata 1991180maxresident)k

Tobias

--------------O0J5Mx0GXX3TLmhABWC8W12n--
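
A minimal sketch of the kind of comparison discussed in this mail (it is not the script used to produce the numbers above): the Python snippet below extracts the elapsed field from GNU time's default output and puts the map-vs-USM difference next to the run-to-run spread. The helper names (elapsed_seconds, percent_slower, spread_percent) are hypothetical and introduced only for illustration. Measuring the spread relative to the fastest run is an assumption; other definitions are equally reasonable. The sample values are the MI210 mainline elapsed times quoted above.

```python
import re
from statistics import mean

# Elapsed field of GNU time's default output, e.g. "0:03.00elapsed"
# ([hours:]minutes:seconds).
ELAPSED_RE = re.compile(r"(?:(\d+):)?(\d+):(\d+(?:\.\d+)?)elapsed")


def elapsed_seconds(time_line: str) -> float:
    """Return the wall-clock seconds found in one line of GNU time output."""
    match = ELAPSED_RE.search(time_line)
    if match is None:
        raise ValueError(f"no elapsed field in: {time_line!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)


def percent_slower(candidate: float, baseline: float) -> float:
    """How much longer `candidate` took than `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def spread_percent(samples: list[float]) -> float:
    """Run-to-run spread: slowest run relative to the fastest run, in percent.

    Assumption: this is just one simple way to quantify the variation.
    """
    return percent_slower(max(samples), min(samples))


if __name__ == "__main__":
    # Elapsed fields of the MI210 mainline runs quoted in this mail.
    map_runs = [elapsed_seconds(s) for s in ("0:03.00elapsed", "0:02.74elapsed")]
    usm_runs = [elapsed_seconds(s) for s in ("0:02.89elapsed", "0:02.89elapsed")]

    print(f"map run-to-run spread: {spread_percent(map_runs):.1f}%")
    print(f"USM vs. map (means):   "
          f"{percent_slower(mean(usm_runs), mean(map_runs)):+.1f}%")
```

With only two runs per configuration this merely shows whether the map-vs-USM difference is larger than the spread between runs; a proper comparison would need more runs per configuration.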