From: Thomas Schwinge
To: Jakub Jelinek, Alexander Monakov, Tom de Vries
CC: Chung-Lin Tang
Subject: Re: Nvidia GPU Volta+ (sm_70+) Independent Thread Scheduling
Date: Thu, 15 Jul 2021 13:24:09 +0200
Message-ID: <87y2a7vmzq.fsf@euler.schwinge.homeip.net>
In-Reply-To: <20210713155943.GN2380545@tucnak>
References: <4d5c0b62-15ae-13d6-e6f5-bdf6693bac7e@mentor.com> <8735siw6xo.fsf@euler.schwinge.homeip.net> <20210713155943.GN2380545@tucnak>
List-Id: Gcc mailing list <gcc@gcc.gnu.org>
Hi!

On 2021-07-13T17:59:43+0200, Jakub Jelinek wrote:
> On Tue, Jul 13, 2021 at 05:48:51PM +0200, Thomas Schwinge wrote:
>> Starting with the Volta family (sm_70+), Nvidia GPUs introduced
>> Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
>> constitute a warp, which means "execution state per thread, including a
>> program counter", succeeding the previous "warp-synchronous" abstraction
>> where "warps used a single program counter shared amongst all 32 threads
>> in the warp together with an active mask specifying the active threads of
>> the warp".  See , , , etc.
>>
>> Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
>> "[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
>> necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
>> hard-coding a 'membermask' of '0xffffffff' (all threads participate).
>> This I understand to be the direct translation, avoiding the
>> deprecated/removed non-'.sync' variants of these instructions, but
>> otherwise maintaining the pre-Independent Thread Scheduling semantics (as
>> well as performance level, supposedly).  (Unless there are further
>> changes relevant to GCC/nvptx that I'm not currently seeing?)  This means
>> that we now comply with the sm_70+ Independent Thread Scheduling
>> requirements -- but don't actually use its capabilities.
>>
>> Now, I haven't spent much thought on it yet, but it would seem to me (gut
>> feeling?) that continuing to maintain "warp-synchronicity" (that is,
>> avoid using Independent Thread Scheduling) should still yield best
>> performance?  Or, given the GCC/nvptx offloading context via
>> OpenACC/OpenMP, has anyone already made any thoughts about how actually
>> using Independent Thread Scheduling would be beneficial?  Can it be
>> exploited via OpenACC/OpenMP directly?  Can it somehow be used to
>> increase performance?  Can it be used to simplify parts of the GCC/nvptx
>> back end implementation (without sacrificing performance -- is it a
>> zero-cost abstraction, compared to "warp-synchronicity")?
>
> Is it something that is always enabled on sm_70 and later hw or does a PTX
> program ask for independent thread scheduling?

As I understand it: always enabled; basically kind of a "hardware change".
In quotes, because: for the time being, you might avoid it by not compiling
for sm_7x (for example, compile for sm_6x, which does load on sm_7x
hardware), but that will also prohibit you from using other sm_7x features
(not relevant right now, but eventually), and eventually support for sm_6x
and earlier will be removed.  So we'll have to get this addressed at some
point.

> If threads in the warp no longer execute in lockstep, then I think it is not
> compliant to use the model we have for OpenMP with warps being OpenMP
> threads and threads in warp being SIMD lanes and we'd need to switch to
> have each thread in a warp being an OpenMP thread (so have 32 times more
> threads than before) and only a single SIMD lane in each thread (i.e. SIMD
> not used).

Maybe I do understand your concern -- or maybe don't.  Will you please
provide an example?
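To illustrate the kind of code I'll be discussing, here is a hypothetical
PTX-level sketch (invented register/symbol names; not actual nvptx back
end output): lane 0 publishes a value to the other lanes of its warp via
CTA shared memory, without any implicitly synchronizing communication
instruction:

    // Hypothetical sketch; %r1..%r3, %p0, 'scratch' are invented names.
    .shared .u32 scratch;
    mov.u32 %r1, %laneid;               // lane index within the warp
    setp.eq.u32 %p0, %r1, 0;
    @%p0 st.shared.u32 [scratch], %r2;  // lane 0 writes
    // Pre-sm_70, implicit "warp-synchronicity" makes the following read
    // safe; with sm_70+ Independent Thread Scheduling it is a data race,
    // unless an explicit warp barrier is inserted here:
    bar.warp.sync 0xffffffff;
    ld.shared.u32 %r3, [scratch];       // all lanes read lane 0's value

(For pre-sm_70 targets, the PTX JIT presumably optimizes such a
'bar.warp.sync' away, again due to implicit "warp-synchronicity".)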
If there is direct PTX thread-level communication (for example, "shuffle"
instructions; PTX old: 'shfl', new: 'shfl.sync'), that already includes --
and continues to include -- corresponding synchronization, implicitly (old:
all threads of a warp; new: 'membermask' to specify the participating
threads).  So that continues to work as before, with unchanged user-visible
semantics, and a 'membermask' of '0xffffffff' simply prohibits Independent
Thread Scheduling (again, at least as far as user-visible, via
communication instructions).

The concern I'm having is whether there are bits in the nvptx back end
where we use communication *without* the implicitly synchronizing PTX
instructions -- via device global or CTA shared memory -- and currently
rely on "warp-synchronicity" (that means: there may be divergent control
flow, but it has been guaranteed that individual PTX threads don't advance
their PCs individually).  This would then run into erroneous behavior with
sm_70+, and we'd need to insert explicit PTX synchronization instructions
(I suppose: 'bar.warp.sync': "Barrier synchronization for threads in a
warp") (... which the PTX JIT would optimize out for pre-sm_70, due to
implicit "warp-synchronicity").  For example, see section "Warp
Synchronization" in , or code pattern 2. in .  (CUDA '__syncwarp' maps to
PTX 'bar.warp.sync'.)

So this concern would mostly (only?) relate to avoiding "Implicit
Warp-Synchronous Programming" (see above, Google, etc.) in the PTX code
synthesized by the nvptx back end, via RTL or PTX code templates.


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201,
80634 Munich, Germany; limited liability company; Managing Directors:
Thomas Heurung, Frank Thürauf; Registered office: Munich; Commercial
register: Munich, HRB 106955