From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 1FC5E3899427; Tue, 22 Mar 2022 13:40:41 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 1FC5E3899427
From: "cvs-commit at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104916] [nvptx] Handle Independent Thread Scheduling for
 sm_70+ with -muniform-simt
Date: Tue, 22 Mar 2022 13:40:40 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: cvs-commit at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-104916-4-fgllSSlfkj@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104916-4@http.gcc.gnu.org/bugzilla/>
References: <bug-104916-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 22 Mar 2022 13:40:41 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916
--- Comment #4 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:a624388b9546b066250be8baa118b7d50c403c25

commit r12-7765-ga624388b9546b066250be8baa118b7d50c403c25
Author: Tom de Vries <tdevries@suse.de>
Date:   Wed Mar 9 10:35:14 2022 +0100

    [nvptx] Add warp sync at simt exit

    Consider this code (with N defined to 1024):
    ...
      float v = 0.0;
      #pragma omp target map(tofrom: v)
      #pragma omp parallel for simd
      for (int i = 0 ; i < N; i++)
        {
          #pragma omp atomic update
          v = v + 1.0;
        }
    ...

    It hangs when executing on target board unix/-foffload=-misa=sm_75, using
    drivers 470.103.01 and 510.54 on a T400 board (sm_75).
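
    For readers who want to try the loop, here is a minimal self-contained
    sketch of the snippet above.  The function name run_atomic_update is
    hypothetical and this is not necessarily the test case added by the
    commit.  Built without -fopenmp the pragmas are ignored and the loop runs
    sequentially on the host, so the hang only shows up when offloading to an
    affected nvptx device.

```c
#define N 1024

/* Hypothetical wrapper around the loop from the commit message.  Returns
   the final value of v, which should be N when every update is applied
   exactly once.  */
float
run_atomic_update (void)
{
  float v = 0.0;
  #pragma omp target map(tofrom: v)
  #pragma omp parallel for simd
  for (int i = 0; i < N; i++)
    {
      /* Each iteration increments the shared v atomically.  */
      #pragma omp atomic update
      v = v + 1.0;
    }
  return v;
}
```

    Calling run_atomic_update () from main and comparing the result against
    N gives a simple pass/fail check.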

    I'm tentatively identifying the problem as a bug in -muniform-simt for
    architectures that support Independent Thread Scheduling (sm_70 and later).

    The problem -muniform-simt is trying to address is to make sure that a
    register produced outside an OpenMP simd region is available when used in
    any lane inside an simd region.

    The solution is to, outside an simd region, execute in all warp lanes, thus
    producing consistent values in result registers in each warp thread.

    This approach doesn't work when executing in all warp lanes multiplies the
    side effects from 1 to 32 separate side effects, which is the case for
    atomic insns.  So atomic insns are rewritten to execute only in lane 0, and
    if there are any results, those are propagated to the other threads in the
    warp.
    [ And likewise for system calls malloc, free, vprintf. ]
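
    The lane 0 rewrite can be illustrated with a host-side C model.  This is
    only a sketch: the names atomic_add_model, warp_atomic_naive and
    warp_atomic_lane0 are hypothetical, the warp is modelled as a plain loop
    over 32 lanes, and none of this is the code the compiler emits.

```c
#define WARP_SIZE 32

/* Stand-in for one atomic add on the device: updates memory and returns
   the old value.  Hypothetical helper for this model only.  */
static int
atomic_add_model (int *mem, int val)
{
  int old = *mem;
  *mem = old + val;
  return old;
}

/* Every lane performs the atomic, so the side effect happens 32 times and
   each lane sees a different result.  Returns the final memory value.  */
int
warp_atomic_naive (int *mem, int result[WARP_SIZE])
{
  for (int lane = 0; lane < WARP_SIZE; lane++)
    result[lane] = atomic_add_model (mem, 1);
  return *mem;
}

/* Only lane 0 performs the atomic; its result is then copied to all
   lanes, so the side effect happens once and the registers agree.  */
int
warp_atomic_lane0 (int *mem, int result[WARP_SIZE])
{
  int lane0_result = atomic_add_model (mem, 1);
  for (int lane = 0; lane < WARP_SIZE; lane++)
    result[lane] = lane0_result;
  return *mem;
}
```

    In this model, starting from a memory value of 0, the naive variant
    leaves 32 in memory while the lane 0 variant leaves 1.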

    Now, consider a non-atomic update: ld, add, store.  The store has side
    effects; are those multiplied or not?

    Pre-sm_70 we can assume that at the end of an SIMT region, any divergent
    control flow has reconverged, and we have a uniform warp, executing in lock
    step.  So:
    - the load will load the same value into the result register across the
      warp,
    - the add will write the same value into the result register across the
      warp,
    - the store will write the same value to the same memory location, 32
      times, at once, having the result of a single store.
    So, no side-effect multiplication (well, at least that's the observation).

    Starting with sm_70, the threads in a warp are no longer guaranteed to
    reconverge after divergence.  There's a "Convergence Optimizer" that can
    identify that it is safe for a warp to reconverge, but that works only as
    long as the code does not contain "synchronizing operations".

    Consequently, the ld, add, store sequence can be executed by a non-uniform
    warp, which means the side effects can have multiplied, and the registers
    are no longer guaranteed to be in sync.
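
    The difference between the two cases can be sketched with a host-side C
    model.  The names update_lockstep and update_diverged are hypothetical,
    and the diverged variant shows just one possible interleaving (each lane
    running the whole sequence on its own); real hardware scheduling is not
    specified here.

```c
#define WARP_SIZE 32

/* Uniform warp in lock step: all lanes load, then all lanes add, then all
   lanes store.  Every lane stores the same value, so the net effect is
   that of a single update.  */
float
update_lockstep (float v)
{
  float reg[WARP_SIZE];
  for (int lane = 0; lane < WARP_SIZE; lane++)
    reg[lane] = v;              /* ld  */
  for (int lane = 0; lane < WARP_SIZE; lane++)
    reg[lane] += 1.0f;          /* add */
  for (int lane = 0; lane < WARP_SIZE; lane++)
    v = reg[lane];              /* st  */
  return v;
}

/* Non-uniform warp, one possible interleaving: each lane runs the whole
   ld, add, st sequence before the next lane starts, so every lane sees
   the previous lane's store and the update is applied 32 times.  */
float
update_diverged (float v)
{
  for (int lane = 0; lane < WARP_SIZE; lane++)
    {
      float reg = v;            /* ld  */
      reg += 1.0f;              /* add */
      v = reg;                  /* st  */
    }
  return v;
}
```

    Starting from 0.0, the lock-step model yields 1.0 and the diverged model
    yields 32.0, which is the side-effect multiplication described above.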

    The atomic update in the example above is translated using an atom.cas
    loop, which means that we have divergence (because only one thread is
    allowed to succeed at a time) and the "Convergence Optimizer" doesn't
    reconverge, probably because the atom.cas counts as a "synchronizing
    operation".  So, it seems plausible that the root cause for the mentioned
    hang is the problem described above.

    Fix this by adding an explicit warp sync at simt exit.

    Note that we're assuming here that the warp will stay uniform until the
    next SIMT region entry.

    Tested on x86_64 with nvptx accelerator.

    gcc/ChangeLog:

    2022-03-09  Tom de Vries  <tdevries@suse.de>

            PR target/104916
            PR target/104783
            * config/nvptx/nvptx.md (define_expand "omp_simt_exit"): Emit warp
            sync (or uniform warp check for mptx < 6.0).

    libgomp/ChangeLog:

    2022-03-15  Tom de Vries  <tdevries@suse.de>

            PR target/104916
            PR target/104783
            * testsuite/libgomp.c/pr104783-2.c: New test.