From: Thomas Schwinge
To: Jakub Jelinek, Alexander Monakov, Tom de Vries
CC: Chung-Lin Tang
Subject: Re: Nvidia GPU Volta+ (sm_70+) Independent Thread Scheduling
Date: Thu, 15 Jul 2021 13:24:09 +0200
Message-ID: <87y2a7vmzq.fsf@euler.schwinge.homeip.net>
In-Reply-To: <20210713155943.GN2380545@tucnak>
References: <4d5c0b62-15ae-13d6-e6f5-bdf6693bac7e@mentor.com> <8735siw6xo.fsf@euler.schwinge.homeip.net> <20210713155943.GN2380545@tucnak>
List-Id: Gcc mailing list <gcc@gcc.gnu.org>
Hi!

On 2021-07-13T17:59:43+0200, Jakub Jelinek wrote:
> On Tue, Jul 13, 2021 at 05:48:51PM +0200, Thomas Schwinge wrote:
>> Starting with the Volta family (sm_70+), Nvidia GPUs introduced
>> Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
>> constitute a warp, which means "execution state per thread, including a
>> program counter", succeeding the previous "warp-synchronous" abstraction
>> where "warps used a single program counter shared amongst all 32 threads
>> in the warp together with an active mask specifying the active threads of
>> the warp".  See , , , etc.
>>
>> Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
>> "[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
>> necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
>> hard-coding a 'membermask' of '0xffffffff' (all threads participate).
>> This I understand to be the direct translation, avoiding the
>> deprecated/removed non-'.sync' variants of these instructions, but
>> otherwise maintaining the pre-Independent Thread Scheduling semantics (as
>> well as performance level, supposedly).  (Unless there are further
>> changes relevant to GCC/nvptx that I'm not currently seeing?)  This means
>> that we now comply with the sm_70+ Independent Thread Scheduling
>> requirements -- but don't actually use its capabilities.
>>
>> Now, I haven't spent much thought on it yet, but it would seem to me (gut
>> feeling?) that continuing to maintain "warp-synchronicity" (that is,
>> avoid using Independent Thread Scheduling) should still yield best
>> performance?  Or, given the GCC/nvptx offloading context via
>> OpenACC/OpenMP, has anyone already made any thoughts about how actually
>> using Independent Thread Scheduling would be beneficial?  Can it be
>> exploited via OpenACC/OpenMP directly?  Can it somehow be used to
>> increase performance?  Can it be used to simplify parts of the GCC/nvptx
>> back end implementation (without sacrificing performance -- is it a
>> zero-cost abstraction, compared to "warp-synchronicity")?
>
> Is it something that is always enabled on sm_70 and later hw or does a PTX
> program ask for independent thread scheduling?

As I understand it: always enabled; basically kind of a "hardware change".
In quotes, because: for the time being, you might avoid it by not compiling
for sm_7x (for example, compile for sm_6x, which does load on sm_7x
hardware), but that will also prohibit you from using other sm_7x features
(not relevant right now, but eventually), and eventually support for sm_6x
and earlier will be removed.  So we'll have to get this addressed at some
point.

> If threads in the warp no longer execute in lockstep, then I think it is not
> compliant to use the model we have for OpenMP with warps being OpenMP
> threads and threads in warp being SIMD lanes and we'd need to switch to
> have each thread in a warp being an OpenMP thread (so have 32 times more
> threads than before) and only a single SIMD lane in each thread (i.e. SIMD
> not used).

Maybe I do understand your concern -- or maybe don't.  Will you please
provide an example?
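To illustrate the kind of code I'll be discussing, here is a hypothetical
PTX-level sketch (invented register/symbol names; not actual nvptx back
end output): lane 0 publishes a value to the other lanes of its warp via
CTA shared memory, without any implicitly synchronizing communication
instruction:

    // Hypothetical sketch; %r1..%r3, %p0, 'scratch' are invented names.
    .shared .u32 scratch;
    mov.u32 %r1, %laneid;               // lane index within the warp
    setp.eq.u32 %p0, %r1, 0;
    @%p0 st.shared.u32 [scratch], %r2;  // lane 0 writes
    // Pre-sm_70, implicit "warp-synchronicity" makes the following read
    // safe; with sm_70+ Independent Thread Scheduling it is a data race,
    // unless an explicit warp barrier is inserted here:
    bar.warp.sync 0xffffffff;
    ld.shared.u32 %r3, [scratch];       // all lanes read lane 0's value

(For pre-sm_70 targets, the PTX JIT presumably optimizes such a
'bar.warp.sync' away, again due to implicit "warp-synchronicity".)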
If there is direct PTX thread-level communication (for example, "shuffle"
instructions; PTX old: 'shfl', new: 'shfl.sync'), that already includes --
and continues to include -- corresponding synchronization, implicitly (old:
all threads of a warp; new: 'membermask' to specify the participating
threads).  So that continues to work as before, with unchanged user-visible
semantics, and a 'membermask' of '0xffffffff' simply prohibits Independent
Thread Scheduling (again, at least as far as user-visible, via
communication instructions).

The concern I'm having is whether there are bits in the nvptx back end
where we use communication *without* the implicitly synchronizing PTX
instructions -- via device global or CTA shared memory -- and currently
rely on "warp-synchronicity" (that means: there may be divergent control
flow, but it has been guaranteed that individual PTX threads don't advance
their PCs individually).  This would then run into erroneous behavior with
sm_70+, and we'd need to insert explicit PTX synchronization instructions
(I suppose: 'bar.warp.sync': "Barrier synchronization for threads in a
warp") (... which the PTX JIT would optimize out for pre-sm_70, due to
implicit "warp-synchronicity").  For example, see section "Warp
Synchronization" in , or code pattern 2. in .  (CUDA '__syncwarp' maps to
PTX 'bar.warp.sync'.)

So this concern would mostly (only?) relate to avoiding "Implicit
Warp-Synchronous Programming" (see above, Google, etc.) in the PTX code
synthesized by the nvptx back end, via RTL or PTX code templates.


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201,
80634 Munich, Germany; limited liability company; Managing Directors:
Thomas Heurung, Frank Thürauf; Registered office: Munich; Commercial
register: Munich, HRB 106955