public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
@ 2022-03-14 13:00 vries at gcc dot gnu.org
  2022-03-14 13:08 ` [Bug target/104916] " vries at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: vries at gcc dot gnu.org @ 2022-03-14 13:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

            Bug ID: 104916
           Summary: [nvptx] Handle Independent Thread Scheduling for
                    sm_70+ with -muniform-simt
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org
  Target Milestone: ---

The problem -muniform-simt is trying to address is to make sure that a register
produced outside an openmp simd region is available when used in a lane inside
an simd region.

The solution is to, outside an simd region, execute in all warp lanes, thus
producing consistent values in result registers in each warp thread.

[ Note that this solution is: as-produced, asap.  Openacc has the same problem,
but deals with it: as-needed, alap. ]

This approach doesn't work when executing in all warp lanes multiplies the side
effects from 1 to 32 separate side effects, which is the case for for instance
atomic insns.  So atomic insns are rewritten to execute only in the master
lane, and if there are any results, propagate those to the other threads in the
warp.
[ And likewise for system calls malloc, free, vprintf. ]

[ The corresponding reorg pass nvptx_reorg_uniform_simt potentially rewrites
all statements, be those inside or outside an simd region.  But care is taken
that the rewrite only has effect outside the simd region. ]

Now, take a non-atomic update: ld, add, store.  The store has side effects, are
those multiplied as well?

Now, pre-sm_70 we have the guarantee that warps execute in lock step.  So:
- the load will load the same value into the result register across the warp,
- the add will write the same value into the result register across the warp,
- the store will write the same value to the same memory location, 32 times,
  at once, having the result of a single store.
So, no side-effect multiplication (well, at least that's the observation).

Starting sm_70, the threads in a warp are no longer guaranteed to execute in
lockstep.  Consequently, we can have the following execution trace:
- some threads load a value into the result register
- those threads do an add and write the result into the result register
- that result is stored
- the other threads arrive, and now load the now updated, thus different value
  into the result register
- the other threads do an add and write a different result into their
  result register
- the updated result is stored
So, we both have now the side effect multiplied, and the registers are no
longer in sync.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/104916] [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
  2022-03-14 13:00 [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt vries at gcc dot gnu.org
@ 2022-03-14 13:08 ` vries at gcc dot gnu.org
  2022-03-15 10:14 ` vries at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: vries at gcc dot gnu.org @ 2022-03-14 13:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
We could try the same solution as for atomic: predicate ld/st to only execute
in lane 0, and propagate ld result.

Another solution might be to wrap each ld/st in two bar.warp.sync.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/104916] [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
  2022-03-14 13:00 [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt vries at gcc dot gnu.org
  2022-03-14 13:08 ` [Bug target/104916] " vries at gcc dot gnu.org
@ 2022-03-15 10:14 ` vries at gcc dot gnu.org
  2022-03-15 11:17 ` vries at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: vries at gcc dot gnu.org @ 2022-03-15 10:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

--- Comment #2 from Tom de Vries <vries at gcc dot gnu.org> ---
Created attachment 52629
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52629&action=edit
Attempt, runs into driver internal error

FTR, this is an attempt at a fix.

It does the "predicate ld/st to only execute in lane 0, and propagate ld
result", expect for modes where we run into a problem with producing a shuffle
(HFmode, V2SImode), where we use the "wrap each ld/st in two bar.warp.sync"
approach instead.

For libgomp/testsuite/libgomp.c++/target_teams_distribute__parallel_for__simd.C
(copied from OvO) I end up with:
...
Linking
Link complete: 0.000000ms
Link log warning : Stack size for entry function
'_Z48test_target_teams_distribute__parallel_for__simdv$_omp_fn$0' cannot be
statically determined

libgomp: Link error log fatal   : Internal error: reference to deleted section
...

This may be due to a problem with the patch or the driver, I'm not sure.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/104916] [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
  2022-03-14 13:00 [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt vries at gcc dot gnu.org
  2022-03-14 13:08 ` [Bug target/104916] " vries at gcc dot gnu.org
  2022-03-15 10:14 ` vries at gcc dot gnu.org
@ 2022-03-15 11:17 ` vries at gcc dot gnu.org
  2022-03-22 13:40 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: vries at gcc dot gnu.org @ 2022-03-15 11:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

--- Comment #3 from Tom de Vries <vries at gcc dot gnu.org> ---
Anyway, having reread the volta architecture whitepaper again, I think it's ok
to use the solution I already found that does work (see PR104783): add a warp
sync at simt exit.

The tricky bit is that we rely on the warps to stay uniform after that warp
sync until the next simt region entry.  Which AFAICT is not something enforced
by specification, but rather by a assuming that the 'Convergence optimizer' has
a sane implementation.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/104916] [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
  2022-03-14 13:00 [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt vries at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2022-03-15 11:17 ` vries at gcc dot gnu.org
@ 2022-03-22 13:40 ` cvs-commit at gcc dot gnu.org
  2022-03-22 13:47 ` vries at gcc dot gnu.org
  2022-03-22 13:47 ` vries at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-03-22 13:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

--- Comment #4 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:a624388b9546b066250be8baa118b7d50c403c25

commit r12-7765-ga624388b9546b066250be8baa118b7d50c403c25
Author: Tom de Vries <tdevries@suse.de>
Date:   Wed Mar 9 10:35:14 2022 +0100

    [nvptx] Add warp sync at simt exit

    Consider this code (with N defined to 1024):
    ...
      float v = 0.0;
      #pragma omp target map(tofrom: v)
      #pragma omp parallel for simd
      for (int i = 0 ; i < N; i++)
        {
          #pragma omp atomic update
          v = v + 1.0;
        }
    ...

    It hangs when executing on target board unix/-foffload=-misa=sm_75, using
    drivers 470.103.01 and 510.54 on a T400 board (sm_75).

    I'm tentatively identifying the problem as a bug in -muniform-simt for
    architectures that support Independent Thread Scheduling (sm_70 and later).

    The problem -muniform-simt is trying to address is to make sure that a
    register produced outside an openmp simd region is available when used in
any
    lane inside an simd region.

    The solution is to, outside an simd region, execute in all warp lanes, thus
    producing consistent values in result registers in each warp thread.

    This approach doesn't work when executing in all warp lanes multiplies the
    side effects from 1 to 32 separate side effects, which is the case for
atomic
    insns.  So atomic insns are rewritten to execute only in lane 0, and if
    there are any results, those are propagated to the other threads in the
warp.
    [ And likewise for system calls malloc, free, vprintf. ]

    Now, consider a non-atomic update: ld, add, store.  The store has side
    effects, are those multiplied or not?

    Pre-sm_70 we can assume that at the end of an SIMT region, any divergent
    control flow has reconverged, and we have a uniform warp, executing in lock
    step.  So:
    - the load will load the same value into the result register across the
warp,
    - the add will write the same value into the result register across the
warp,
    - the store will write the same value to the same memory location, 32
times,
      at once, having the result of a single store.
    So, no side-effect multiplication (well, at least that's the observation).

    Starting sm_70, the threads in a warp are no longer guaranteed to
reconverge
    after divergence.  There's a "Convergence Optimizer" that can can identify
    that it is safe for a warp to reconverge, but that works only as long as
the
    code does not contain "synchronizing operations".

    Consequently, the ld, add, store sequence can be executed by a non-uniform
    warp, which means the side effects can have multiplied, and the registers
are
    no longer guarantueed to be in sync.

    The atomic update in the example above is translated using an atom.cas
loop,
    which means that we have divergence (because only one thread is allowed to
    succeed at a time) and the "Convergence Optimizer" doesn't reconverge
probably
    because the atom.cas counts as a "synchronizing operation".  So, it seems
    plausible that the root cause for the mentioned hang is the problem
described
    above.

    Fix this by adding an explicit warp sync at simt exit.

    Note that we're assuming here that the warp will stay uniform until the
next
    SIMT region entry.

    Tested on x86_64 with nvptx accelerator.

    gcc/ChangeLog:

    2022-03-09  Tom de Vries  <tdevries@suse.de>

            PR target/104916
            PR target/104783
            * config/nvptx/nvptx.md (define_expand "omp_simt_exit"): Emit warp
            sync (or uniform warp check for mptx < 6.0).

    libgomp/ChangeLog:

    2022-03-15  Tom de Vries  <tdevries@suse.de>

            PR target/104916
            PR target/104783
            * testsuite/libgomp.c/pr104783-2.c: New test.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/104916] [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
  2022-03-14 13:00 [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt vries at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2022-03-22 13:40 ` cvs-commit at gcc dot gnu.org
@ 2022-03-22 13:47 ` vries at gcc dot gnu.org
  2022-03-22 13:47 ` vries at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: vries at gcc dot gnu.org @ 2022-03-22 13:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

Tom de Vries <vries at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Tom de Vries <vries at gcc dot gnu.org> ---
Fixed.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/104916] [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
  2022-03-14 13:00 [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt vries at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2022-03-22 13:47 ` vries at gcc dot gnu.org
@ 2022-03-22 13:47 ` vries at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: vries at gcc dot gnu.org @ 2022-03-22 13:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

Tom de Vries <vries at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.0

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-03-22 13:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-14 13:00 [Bug target/104916] New: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt vries at gcc dot gnu.org
2022-03-14 13:08 ` [Bug target/104916] " vries at gcc dot gnu.org
2022-03-15 10:14 ` vries at gcc dot gnu.org
2022-03-15 11:17 ` vries at gcc dot gnu.org
2022-03-22 13:40 ` cvs-commit at gcc dot gnu.org
2022-03-22 13:47 ` vries at gcc dot gnu.org
2022-03-22 13:47 ` vries at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).