public inbox for gcc@gcc.gnu.org
* Nvidia GPU Volta+ (sm_70+) Independent Thread Scheduling
       [not found] <4d5c0b62-15ae-13d6-e6f5-bdf6693bac7e@mentor.com>
@ 2021-07-13 15:48 ` Thomas Schwinge
  2021-07-13 15:59   ` Jakub Jelinek
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Schwinge @ 2021-07-13 15:48 UTC (permalink / raw)
  To: gcc, Chung-Lin Tang, Alexander Monakov, Tom de Vries, Jakub Jelinek

Hi!

Starting with the Volta family (sm_70+), Nvidia GPUs introduced
Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
constitute a warp, which means "execution state per thread, including a
program counter", succeeding the previous "warp-synchronous" abstraction
where "warps used a single program counter shared amongst all 32 threads
in the warp together with an active mask specifying the active threads of
the warp".  See
<https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>,
etc.

Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
"[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
hard-coding a 'membermask' of '0xffffffff' (all threads participate).
This I understand to be the direct translation, avoiding the
deprecated/removed non-'.sync' variants of these instructions, but
otherwise maintaining the pre-Independent Thread Scheduling semantics (as
well as performance level, supposedly).  Unless there are further
changes relevant to GCC/nvptx that I'm not currently seeing, this means
that we now comply with the sm_70+ Independent Thread Scheduling
requirements -- but don't actually use its capabilities.

Now, I haven't spent much thought on it yet, but it would seem to me (gut
feeling?) that continuing to maintain "warp-synchronicity" (that is,
avoid using Independent Thread Scheduling) should still yield best
performance?  Or, given the GCC/nvptx offloading context via
OpenACC/OpenMP, has anyone already given any thought to how actually
using Independent Thread Scheduling would be beneficial?  Can it be
exploited via OpenACC/OpenMP directly?  Can it somehow be used to
increase performance?  Can it be used to simplify parts of the GCC/nvptx
back end implementation (without sacrificing performance -- is it a
zero-cost abstraction, compared to "warp-synchronicity")?


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955


* Re: Nvidia GPU Volta+ (sm_70+) Independent Thread Scheduling
  2021-07-13 15:48 ` Nvidia GPU Volta+ (sm_70+) Independent Thread Scheduling Thomas Schwinge
@ 2021-07-13 15:59   ` Jakub Jelinek
  2021-07-15 11:24     ` Thomas Schwinge
  0 siblings, 1 reply; 3+ messages in thread
From: Jakub Jelinek @ 2021-07-13 15:59 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: gcc, Chung-Lin Tang, Alexander Monakov, Tom de Vries

On Tue, Jul 13, 2021 at 05:48:51PM +0200, Thomas Schwinge wrote:
> Starting with the Volta family (sm_70+), Nvidia GPUs introduced
> Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
> constitute a warp, which means "execution state per thread, including a
> program counter", succeeding the previous "warp-synchronous" abstraction
> where "warps used a single program counter shared amongst all 32 threads
> in the warp together with an active mask specifying the active threads of
> the warp".  See
> <https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>,
> etc.
> 
> Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
> "[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
> necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
> hard-coding a 'membermask' of '0xffffffff' (all threads participate).
> This I understand to be the direct translation, avoiding the
> deprecated/removed non-'.sync' variants of these instructions, but
> otherwise maintaining the pre-Independent Thread Scheduling semantics (as
> well as performance level, supposedly).  Unless there are further
> changes relevant to GCC/nvptx that I'm not currently seeing, this means
> that we now comply with the sm_70+ Independent Thread Scheduling
> requirements -- but don't actually use its capabilities.
> 
> Now, I haven't spent much thought on it yet, but it would seem to me (gut
> feeling?) that continuing to maintain "warp-synchronicity" (that is,
> avoid using Independent Thread Scheduling) should still yield best
> performance?  Or, given the GCC/nvptx offloading context via
> OpenACC/OpenMP, has anyone already given any thought to how actually
> using Independent Thread Scheduling would be beneficial?  Can it be
> exploited via OpenACC/OpenMP directly?  Can it somehow be used to
> increase performance?  Can it be used to simplify parts of the GCC/nvptx
> back end implementation (without sacrificing performance -- is it a
> zero-cost abstraction, compared to "warp-synchronicity")?

Is it something that is always enabled on sm_70 and later hw or does a PTX
program ask for independent thread scheduling?
If threads in the warp no longer execute in lockstep, then I think it is not
compliant to use the model we have for OpenMP with warps being OpenMP
threads and threads in warp being SIMD lanes and we'd need to switch to
have each thread in a warp being an OpenMP thread (so have 32 times more
threads than before) and only a single SIMD lane in each thread (i.e. SIMD
not used).

	Jakub



* Re: Nvidia GPU Volta+ (sm_70+) Independent Thread Scheduling
  2021-07-13 15:59   ` Jakub Jelinek
@ 2021-07-15 11:24     ` Thomas Schwinge
  0 siblings, 0 replies; 3+ messages in thread
From: Thomas Schwinge @ 2021-07-15 11:24 UTC (permalink / raw)
  To: Jakub Jelinek, Alexander Monakov, Tom de Vries; +Cc: gcc, Chung-Lin Tang

Hi!

On 2021-07-13T17:59:43+0200, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jul 13, 2021 at 05:48:51PM +0200, Thomas Schwinge wrote:
>> Starting with the Volta family (sm_70+), Nvidia GPUs introduced
>> Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
>> constitute a warp, which means "execution state per thread, including a
>> program counter", succeeding the previous "warp-synchronous" abstraction
>> where "warps used a single program counter shared amongst all 32 threads
>> in the warp together with an active mask specifying the active threads of
>> the warp".  See
>> <https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
>> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
>> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>,
>> etc.
>>
>> Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
>> "[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
>> necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
>> hard-coding a 'membermask' of '0xffffffff' (all threads participate).
>> This I understand to be the direct translation, avoiding the
>> deprecated/removed non-'.sync' variants of these instructions, but
>> otherwise maintaining the pre-Independent Thread Scheduling semantics (as
>> well as performance level, supposedly).  Unless there are further
>> changes relevant to GCC/nvptx that I'm not currently seeing, this means
>> that we now comply with the sm_70+ Independent Thread Scheduling
>> requirements -- but don't actually use its capabilities.
>>
>> Now, I haven't spent much thought on it yet, but it would seem to me (gut
>> feeling?) that continuing to maintain "warp-synchronicity" (that is,
>> avoid using Independent Thread Scheduling) should still yield best
>> performance?  Or, given the GCC/nvptx offloading context via
>> OpenACC/OpenMP, has anyone already given any thought to how actually
>> using Independent Thread Scheduling would be beneficial?  Can it be
>> exploited via OpenACC/OpenMP directly?  Can it somehow be used to
>> increase performance?  Can it be used to simplify parts of the GCC/nvptx
>> back end implementation (without sacrificing performance -- is it a
>> zero-cost abstraction, compared to "warp-synchronicity")?
>
> Is it something that is always enabled on sm_70 and later hw or does a PTX
> program ask for independent thread scheduling?

As I understand it: always enabled; basically a kind of "hardware
change".  In quotes, because: for the time being, you might avoid it by
not compiling for sm_7x (for example, compile for sm_6x, which does load
on sm_7x hardware), but that will also keep you from using other sm_7x
features (not relevant right now, but eventually), and eventually support
for sm_6x and earlier will be removed.  So we'll have to get this
addressed at some point.


> If threads in the warp no longer execute in lockstep, then I think it is not
> compliant to use the model we have for OpenMP with warps being OpenMP
> threads and threads in warp being SIMD lanes and we'd need to switch to
> have each thread in a warp being an OpenMP thread (so have 32 times more
> threads than before) and only a single SIMD lane in each thread (i.e. SIMD
> not used).

Maybe I do understand your concern -- or maybe don't.  Will you please
provide an example?

If there is direct PTX thread-level communication (for example, "shuffle"
instructions; PTX old: 'shfl', new: 'shfl.sync'), that has always
included, and continues to include, the corresponding synchronization,
implicitly (old: all threads of a warp, new: 'membermask' to specify the
participating threads).  So that continues to work as before, with
unchanged user-visible semantics, and a 'membermask' of '0xffffffff'
simply prohibits Independent Thread Scheduling (again, at least as far as
user-visible via communication instructions).


The concern I'm having is whether there are bits in the nvptx back end
where we use communication *without* the implicitly synchronizing PTX
instructions, via device global or CTA shared memory, and currently rely
on "warp-synchronicity" (that means: there may be divergent control flow,
but it has been guaranteed that individual PTX threads don't advance
their PC individually).  This would then run into erroneous behavior with
sm_70+, and we'd need to insert explicit PTX synchronization instructions
(I suppose: 'bar.warp.sync': "Barrier synchronization for threads in a
warp") (... which the PTX JIT would optimize out for pre-sm_70, due to
implicit "warp-synchronicity").

For example, see section "Warp Synchronization" in
<https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/>, or
code pattern 2. in
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>.
(CUDA '__syncwarp' maps to PTX 'bar.warp.sync'.)

So this concern would mostly (only?) relate to avoiding "Implicit
Warp-Synchronous Programming" (see above, Google, etc.) in the nvptx back
end-synthesized PTX code, via RTL or PTX code templates.


Grüße
 Thomas
-----------------

