public inbox for gcc-bugs@sourceware.org
* [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
@ 2021-03-11 16:36 tschwinge at gcc dot gnu.org
  2021-03-12 15:53 ` [Bug target/99555] " vries at gcc dot gnu.org
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2021-03-11 16:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

            Bug ID: 99555
           Summary: [OpenMP/nvptx] Execution-time hang for simple nested
                    OpenMP 'target'/'parallel'/'task' constructs
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: openmp
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: jakub at gcc dot gnu.org, kcy at codesourcery dot com,
                    vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

Discovered during OpenMP 'task' 'detach' development.  See PR98738,
<http://mid.mail-archive.com/e7796b0a-c8ee-e695-3775-9edfa254c552@codesourcery.com>;
when offloaded to nvptx at '-O0', the following hangs consistently:

    #pragma omp target
    #pragma omp parallel
    #pragma omp task
      ;

This doesn't hang when offloaded to GCN or the host device, or if
'num_threads(1)' is specified on the 'parallel'.

---

Not yet determined whether this is a regression, nor when this started.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
@ 2021-03-12 15:53 ` vries at gcc dot gnu.org
  2021-03-25 12:00 ` cvs-commit at gcc dot gnu.org
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2021-03-12 15:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
I see this as well:
...
PASS: libgomp.c/../libgomp.c-c++-common/task-detach-6.c (test for excess errors)
WARNING: program timed out.
...

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
  2021-03-12 15:53 ` [Bug target/99555] " vries at gcc dot gnu.org
@ 2021-03-25 12:00 ` cvs-commit at gcc dot gnu.org
  2021-03-29  8:41 ` cvs-commit at gcc dot gnu.org
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-03-25 12:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #2 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Thomas Schwinge <tschwinge@gcc.gnu.org>:

https://gcc.gnu.org/g:d99111fd8e12deffdd9a965ce17e8a760d531ec3

commit r11-7824-gd99111fd8e12deffdd9a965ce17e8a760d531ec3
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Thu Mar 11 17:01:22 2021 +0100

    Avoid OpenMP/nvptx execution-time hangs for simple nested OpenMP
    'target'/'parallel'/'task' constructs [PR99555]

    ... awaiting proper resolution, of course.

            libgomp/
            PR target/99555
            * testsuite/lib/on_device_arch.c: New file.
            * testsuite/libgomp.c/pr99555-1.c: Likewise.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: Until resolved,
            skip for nvptx offloading, with error status.
            * testsuite/libgomp.fortran/task-detach-6.f90: Likewise.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
  2021-03-12 15:53 ` [Bug target/99555] " vries at gcc dot gnu.org
  2021-03-25 12:00 ` cvs-commit at gcc dot gnu.org
@ 2021-03-29  8:41 ` cvs-commit at gcc dot gnu.org
  2021-04-15  8:02 ` vries at gcc dot gnu.org
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-03-29  8:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #3 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tobias Burnus <burnus@gcc.gnu.org>:

https://gcc.gnu.org/g:d579e2e76f9469e1b386d693af57c5c4f0ede410

commit r11-7886-gd579e2e76f9469e1b386d693af57c5c4f0ede410
Author: Tobias Burnus <tobias@codesourcery.com>
Date:   Mon Mar 29 10:38:39 2021 +0200

    libgomp: Fix on_device_arch.c aux-file handling [PR99555]

    libgomp/ChangeLog:

            PR target/99555
            * testsuite/lib/on_device_arch.c: Move to ...
            * testsuite/libgomp.c-c++-common/on_device_arch.h: ... here.
            * testsuite/libgomp.fortran/on_device_arch.c: New file;
            #include on_device_arch.h.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: #include
            on_device_arch.h instead of using dg-additional-source.
            * testsuite/libgomp.c/pr99555-1.c: Likewise.
            * testsuite/libgomp.fortran/task-detach-6.f90: Update to use
            on_device_arch.c without relative paths.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-03-29  8:41 ` cvs-commit at gcc dot gnu.org
@ 2021-04-15  8:02 ` vries at gcc dot gnu.org
  2021-04-15  9:14 ` cvs-commit at gcc dot gnu.org
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2021-04-15  8:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #4 from Tom de Vries <vries at gcc dot gnu.org> ---
Investigated using cuda-gdb.

After typing ^c, we investigate the state:
...
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask GridDim BlockDim Invocation 
*      0      -   0    1 Active 0x00000010 (1,1,1) (32,8,1) main$_omp_fn() 
...

So, we have 256 threads in the CTA, or 8 warps.

The threads have the following state:
...
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)   (0,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (0,1,0)     (0,0,0)   (0,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (1,0,0)     (0,0,0)   (1,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (1,1,0)     (0,0,0)   (1,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (2,0,0)     (0,0,0)   (2,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (2,1,0)     (0,0,0)   (2,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (3,0,0)     (0,0,0)   (3,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (3,1,0)     (0,0,0)   (3,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (4,0,0)     (0,0,0)   (4,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (4,1,0)     (0,0,0)   (4,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (5,0,0)     (0,0,0)   (5,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (5,1,0)     (0,0,0)   (5,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (6,0,0)     (0,0,0)   (6,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (6,1,0)     (0,0,0)   (6,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (7,0,0)     (0,0,0)   (7,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (7,1,0)     (0,0,0)   (7,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (8,0,0)     (0,0,0)   (8,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (8,1,0)     (0,0,0)   (8,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)   (9,0,0)     (0,0,0)   (9,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)   (9,1,0)     (0,0,0)   (9,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (10,0,0)     (0,0,0)  (10,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (10,1,0)     (0,0,0)  (10,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (11,0,0)     (0,0,0)  (11,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (11,1,0)     (0,0,0)  (11,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (12,0,0)     (0,0,0)  (12,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (12,1,0)     (0,0,0)  (12,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (13,0,0)     (0,0,0)  (13,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (13,1,0)     (0,0,0)  (13,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (14,0,0)     (0,0,0)  (14,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (14,1,0)     (0,0,0)  (14,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (15,0,0)     (0,0,0)  (15,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (15,1,0)     (0,0,0)  (15,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (16,0,0)     (0,0,0)  (16,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (16,1,0)     (0,0,0)  (16,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (17,0,0)     (0,0,0)  (17,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (17,1,0)     (0,0,0)  (17,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (18,0,0)     (0,0,0)  (18,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (18,1,0)     (0,0,0)  (18,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (19,0,0)     (0,0,0)  (19,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (19,1,0)     (0,0,0)  (19,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (20,0,0)     (0,0,0)  (20,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (20,1,0)     (0,0,0)  (20,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (21,0,0)     (0,0,0)  (21,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (21,1,0)     (0,0,0)  (21,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (22,0,0)     (0,0,0)  (22,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (22,1,0)     (0,0,0)  (22,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (23,0,0)     (0,0,0)  (23,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (23,1,0)     (0,0,0)  (23,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (24,0,0)     (0,0,0)  (24,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (24,1,0)     (0,0,0)  (24,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (25,0,0)     (0,0,0)  (25,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (25,1,0)     (0,0,0)  (25,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (26,0,0)     (0,0,0)  (26,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (26,1,0)     (0,0,0)  (26,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (27,0,0)     (0,0,0)  (27,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (27,1,0)     (0,0,0)  (27,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (28,0,0)     (0,0,0)  (28,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (28,1,0)     (0,0,0)  (28,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (29,0,0)     (0,0,0)  (29,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (29,1,0)     (0,0,0)  (29,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (30,0,0)     (0,0,0)  (30,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (30,1,0)     (0,0,0)  (30,7,0)     7 0x0000000000b2f350      n/a    0
   (0,0,0)  (31,0,0)     (0,0,0)  (31,0,0)     1 0x0000000000b5f638      n/a    0
   (0,0,0)  (31,1,0)     (0,0,0)  (31,7,0)     7 0x0000000000b2f350      n/a    0
...

It seems that we're stuck at two locations: one warp in one, and 7 warps in
the other.  Here (in thread (0,0,0)):
...
(cuda-gdb) bt
#0  0x0000000000b5f638 in gomp_team_barrier_wait_end ()
#1  0x0000000000a9e638 in gomp_team_barrier_wait_final ()
#2  0x0000000000b31ad8 in gomp_team_end ()
#3  0x0000000000b394d8 in GOMP_parallel_end ()
#4  0x0000000000a7e620 in GOMP_parallel ()
#5  0x0000000000b48cc0 in main$_omp_fn$0$impl ()
#6  0x0000000000b2f020 in gomp_nvptx_main ()
#7  0x0000000000b4a2c0 in main$_omp_fn<<<(1,1,1),(32,8,1)>>> ()
...
and here:
...
(cuda-gdb) cuda thread (0,1,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,1,0),
device 0, sm 4, warp 1, lane 0]
0x0000000000b2f350 in gomp_nvptx_main ()
(cuda-gdb) bt
#0  0x0000000000b2f350 in gomp_nvptx_main ()
#1  0x0000000000b4a2c0 in main$_omp_fn<<<(1,1,1),(32,8,1)>>> ()
...

Looking at the specific addresses, we have two bar.sync insns:
...
   0x0000000000b5f630 <+1648>:  BAR.SYNC 0x1, R0
=> 0x0000000000b5f638 <+1656>:  MEMBAR.CTA
...
and:
...
   0x0000000000b2f340 <+4000>:
   0x0000000000b2f348 <+4008>:  BAR.SYNC 0x0, R4
=> 0x0000000000b2f350 <+4016>:  MEMBAR.CTA
...

Printing the registers holding the thread-count operand gives us:
...
(cuda-gdb) p $R0
$1 = 256
...
and:
...
(cuda-gdb) cuda thread (0,1,0)
  ...
(cuda-gdb) p $R4
$2 = 256
...

So we seem to be stuck at two barrier instructions, both requiring 256
threads, but each belonging to a different logical barrier.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-04-15  8:02 ` vries at gcc dot gnu.org
@ 2021-04-15  9:14 ` cvs-commit at gcc dot gnu.org
  2021-04-17  8:07 ` vries at gcc dot gnu.org
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-04-15  9:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Thomas Schwinge <tschwinge@gcc.gnu.org>:

https://gcc.gnu.org/g:4dd9e1c541e0eb921d62c8652c854b1259e56aac

commit r11-8189-g4dd9e1c541e0eb921d62c8652c854b1259e56aac
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Wed Apr 7 10:36:36 2021 +0200

    XFAIL OpenMP/nvptx execution-time hangs for simple nested OpenMP
    'target'/'parallel'/'task' constructs [PR99555]

    ... still awaiting proper resolution, of course.

            libgomp/
            PR target/99555
            * testsuite/lib/libgomp.exp
            (check_effective_target_offload_device_nvptx): New.
            * testsuite/libgomp.c/pr99555-1.c <nvptx offload device>: Until
            resolved, make sure that we exit quickly, with error status,
            XFAILed.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: Likewise.
            * testsuite/libgomp.fortran/task-detach-6.f90: Likewise.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-04-15  9:14 ` cvs-commit at gcc dot gnu.org
@ 2021-04-17  8:07 ` vries at gcc dot gnu.org
  2021-04-19 10:44 ` vries at gcc dot gnu.org
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2021-04-17  8:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

Tom de Vries <vries at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Tom de Vries <vries at gcc dot gnu.org> ---
Current theory ...

All omp-threads are supposed to participate in a team barrier, and then all
together move on.  The master omp-thread participates from gomp_team_end, the
other omp-threads from the worker loop in gomp_thread_start.

Instead, it seems the master omp-thread gets stuck at the team barrier, while
all other omp-threads move on to the thread pool barrier, and that state
corresponds to the observed hang.

AFAICT, the problem starts when gomp_team_barrier_wake is called with
count == 1:
...
void
gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
{
  if (bar->total > 1)
    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
}
...
The count argument is ignored, and instead all omp-threads are woken up, which
causes omp-threads to escape the team barrier.

All of this results from the gomp_barrier_handle_tasks path being taken in
gomp_team_barrier_wait_end; I haven't figured out why that is triggered, so
the root cause may still lie elsewhere.

Anyway, the nvptx bar.{c,h} is copied from linux/bar.{c,h}, which is
implemented using futexes, with the futex uses replaced by bar.sync uses.

FWIW, replacing libgomp/config/nvptx/bar.{c,h} with libgomp/config/posix/bar.{c,h}
fixes the problem.  I did a full libgomp test run; all problems fixed.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-04-17  8:07 ` vries at gcc dot gnu.org
@ 2021-04-19 10:44 ` vries at gcc dot gnu.org
  2021-04-19 11:15 ` vries at gcc dot gnu.org
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2021-04-19 10:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #7 from Tom de Vries <vries at gcc dot gnu.org> ---
Created attachment 50627
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50627&action=edit
debug patch

A bit more analysis.

I'm working with this example, which has an actual task body, so that a check
can be performed afterwards:
...
#include <assert.h>

int i = 1;

int
main (void)
{

#pragma omp target map(tofrom:i)
#pragma omp parallel num_threads(2)
#pragma omp task
  {
    __atomic_add_fetch (&i, 1, __ATOMIC_SEQ_CST);
  }

  assert (i == 3);

  return 0;
}
...

And I've forced the plugin to launch with two omp-threads, to limit the
dimensions to the minimum:
...
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask GridDim BlockDim Invocation 
*      0      -   0    1 Active 0x00000010 (1,1,1) (32,2,1) main$_omp_fn() 
...

Furthermore, I've made specific instances for the bar.sync team barrier, to
get more meaningful backtraces.  The lifetimes of the two omp-threads look
like this:

THREAD 0:
...
#0  0x0000000000b73aa8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b72598 in bar_sync_1 ()
#3  0x0000000000b760b8 in gomp_team_barrier_wake ()
#4  0x0000000000b5bc38 in GOMP_task ()
#5  0x0000000000b36a58 in main$_omp_fn () # $1
#6  0x0000000000a7e618 in GOMP_parallel ()
#7  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#8  0x0000000000b3c700 in gomp_nvptx_main ()
#9  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b380e8 in main$_omp_fn () # $2
#1  0x0000000000b95178 in gomp_barrier_handle_tasks ()
#2  0x0000000000b76e38 in gomp_team_barrier_wait_end ()
#3  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#4  0x0000000000b2a1b8 in gomp_team_end ()
#5  0x0000000000b318d8 in GOMP_parallel_end ()
#6  0x0000000000a7e620 in GOMP_parallel ()
#7  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#8  0x0000000000b3c700 in gomp_nvptx_main ()
#9  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b380e8 in main$_omp_fn () # $2
#1  0x0000000000b95178 in gomp_barrier_handle_tasks ()
#2  0x0000000000b76e38 in gomp_team_barrier_wait_end ()
#3  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#4  0x0000000000b2a1b8 in gomp_team_end ()
#5  0x0000000000b318d8 in GOMP_parallel_end ()
#6  0x0000000000a7e620 in GOMP_parallel ()
#7  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#8  0x0000000000b3c700 in gomp_nvptx_main ()
#9  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b73aa8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b72598 in bar_sync_1 ()
#3  0x0000000000b760b8 in gomp_team_barrier_wake ()
#4  0x0000000000b94c98 in gomp_barrier_handle_tasks ()
#5  0x0000000000b76e38 in gomp_team_barrier_wait_end ()
#6  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#7  0x0000000000b2a1b8 in gomp_team_end ()
#8  0x0000000000b318d8 in GOMP_parallel_end ()
#9  0x0000000000a7e620 in GOMP_parallel ()
#10 0x0000000000b377a0 in main$_omp_fn$0$impl ()
#11 0x0000000000b3c700 in gomp_nvptx_main ()
#12 0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b73aa8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b719b8 in bar_sync_3 ()
#3  0x0000000000b76f50 in gomp_team_barrier_wait_end ()
#4  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#5  0x0000000000b2a1b8 in gomp_team_end ()
#6  0x0000000000b318d8 in GOMP_parallel_end ()
#7  0x0000000000a7e620 in GOMP_parallel ()
#8  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#9  0x0000000000b3c700 in gomp_nvptx_main ()
#10 0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

^C

#0  0x0000000000b73da8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b719b8 in bar_sync_3 ()
#3  0x0000000000b76f50 in gomp_team_barrier_wait_end ()
#4  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#5  0x0000000000b2a1b8 in gomp_team_end ()
#6  0x0000000000b318d8 in GOMP_parallel_end ()
#7  0x0000000000a7e620 in GOMP_parallel ()
#8  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#9  0x0000000000b3c700 in gomp_nvptx_main ()
#10 0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()
...

THREAD 1:
...
#0  0x0000000000b70ae8 in bar_sync_thread_1 ()
#1  0x0000000000b74b80 in bar_sync_n ()
#2  0x0000000000b72598 in bar_sync_1 ()
#3  0x0000000000b760b8 in gomp_team_barrier_wake ()
#4  0x0000000000b5bc38 in GOMP_task ()
#5  0x0000000000b36a58 in main$_omp_fn () # $1
#6  0x0000000000b3cbb8 in gomp_nvptx_main ()
#7  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b70ae8 in bar_sync_thread_1 ()
#1  0x0000000000b74b80 in bar_sync_n ()
#2  0x0000000000b719b8 in bar_sync_3 ()
#3  0x0000000000b76f50 in gomp_team_barrier_wait_end ()
#4  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#5  0x0000000000b3cd50 in gomp_nvptx_main ()
#6  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

^C

#0  0x0000000000b3ca30 in gomp_nvptx_main ()
#1  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()
...


Weaving together this information, I get the following scenario:
- both threads execute GOMP_task and deposit a task and execute
  gomp_team_barrier_wake
- thread 1 proceeds to wait at the team barrier
- thread 0 proceeds to execute both tasks
- thread 0 then executes a gomp_team_barrier_wake from
  gomp_barrier_handle_tasks, which makes thread 1 exit the team barrier
- thread 0 then goes to wait at the team barrier, which results in a hang,
  given that thread 1 has already exited.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-04-19 10:44 ` vries at gcc dot gnu.org
@ 2021-04-19 11:15 ` vries at gcc dot gnu.org
  2021-04-19 15:39 ` vries at gcc dot gnu.org
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2021-04-19 11:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #8 from Tom de Vries <vries at gcc dot gnu.org> ---
This fixes the hang:
...
@@ -91,14 +129,16 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
+         gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+         if (gen == state + BAR_INCR)
+           return;
        }
       else
        {
...

I'm not yet sure about the implementation, but the idea is to detect that
gomp_team_barrier_done was called during gomp_barrier_handle_tasks, and then
bail out.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2021-04-19 11:15 ` vries at gcc dot gnu.org
@ 2021-04-19 15:39 ` vries at gcc dot gnu.org
  2021-04-20 11:24 ` vries at gcc dot gnu.org
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2021-04-19 15:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #9 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #8)
> This fixes the hang:

This is a less intrusive solution, and is easier to transplant into
gomp_team_barrier_wait_cancel_end:
...
diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index c5c2fa8829b..cb7b299c6a8 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -91,6 +91,9 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
+         if (team->task_count != 0)
+           __builtin_abort ();
+         bar->total = 1;
        }
       else
        {
@@ -157,6 +160,9 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
+         if (team->task_count != 0)
+           __builtin_abort ();
+         bar->total = 1;
        }
       else
        {
...

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2021-04-19 15:39 ` vries at gcc dot gnu.org
@ 2021-04-20 11:24 ` vries at gcc dot gnu.org
  2022-02-22 14:53 ` cvs-commit at gcc dot gnu.org
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2021-04-20 11:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #10 from Tom de Vries <vries at gcc dot gnu.org> ---
Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568295.html

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2021-04-20 11:24 ` vries at gcc dot gnu.org
@ 2022-02-22 14:53 ` cvs-commit at gcc dot gnu.org
  2022-02-22 14:54 ` vries at gcc dot gnu.org
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-02-22 14:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:5ed77fb3ed1ee0289a0ec9499ef52b99b39421f1

commit r12-7332-g5ed77fb3ed1ee0289a0ec9499ef52b99b39421f1
Author: Tom de Vries <tdevries@suse.de>
Date:   Tue Apr 20 08:47:03 2021 +0200

    [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

    Consider the following omp fragment.
    ...
      #pragma omp target
      #pragma omp parallel num_threads (2)
      #pragma omp task
        ;
    ...

    This hangs at -O0 for nvptx.

    Investigating the behaviour gives us the following trace of events:
    - both threads execute GOMP_task, where they:
      - deposit a task, and
      - execute gomp_team_barrier_wake
    - thread 1 executes gomp_team_barrier_wait_end and, not being the last
      thread, proceeds to wait at the team barrier
    - thread 0 executes gomp_team_barrier_wait_end and, being the last
      thread, it calls gomp_barrier_handle_tasks, where it:
      - executes both tasks and marks the team barrier done
      - executes a gomp_team_barrier_wake which wakes up thread 1
    - thread 1 exits the team barrier
    - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
      the team barrier.
    - thread 0 hangs.

    To understand why there is a hang here, it's good to understand how things
    are set up for nvptx.  The libgomp/config/nvptx/bar.c implementation is
    a copy of the libgomp/config/linux/bar.c implementation, with uses of both
    futex_wake and do_wait replaced with uses of ptx insn bar.sync:
    ...
      if (bar->total > 1)
        asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
    ...

    The point where thread 0 goes to wait at the team barrier corresponds in
    the linux implementation to a do_wait.  In the linux case, the call to
    do_wait doesn't hang, because it's waiting for bar->generation to become
    a certain value, and if bar->generation already has that value, it just
    proceeds, without any need for coordination with other threads.

    In the nvptx case, the bar.sync waits until thread 1 joins it in the same
    logical barrier, which never happens: thread 1 is lingering in the
    thread pool at the thread pool barrier (using a different logical barrier),
    waiting to join a new team.

    The easiest way to fix this is to revert to the posix implementation for
    bar.{c,h}.  That however falls back on a busy-waiting approach, and
    does not take advantage of the ptx bar.sync insn.

    Instead, we revert to the linux implementation for bar.c,
    and implement bar.c local functions futex_wait and futex_wake using the
    bar.sync insn.

    The bar.sync insn takes an argument specifying how many threads are
    participating, and that doesn't play well with futex semantics, where it's
    not clear in advance how many threads will be woken up.

    This is solved by waking up all waiting threads each time a futex_wait or
    futex_wake happens, and possibly going back to sleep with an updated thread
    count.
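
    The wake-all-and-recheck scheme can be modeled on the host.  The
    following is an illustrative sketch only, not the actual libgomp code:
    the names futex_wait/futex_wake match the libgomp-internal helpers, but
    the pthread-based bodies are invented, with a condition-variable
    broadcast standing in for bar.sync so that every waiter re-checks its
    condition on each wake:

    ```c
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static int generation = 0;	/* models bar->generation */

    /* Wait until *ADDR no longer equals VAL.  As in the linux do_wait, if
       the value already changed we just proceed; otherwise every wake
       re-checks the condition before possibly sleeping again.  */
    static void
    futex_wait (int *addr, int val)
    {
      pthread_mutex_lock (&lock);
      while (*addr == val)
        pthread_cond_wait (&cond, &lock);
      pthread_mutex_unlock (&lock);
    }

    /* Wake all waiters, regardless of COUNT: each one re-evaluates its own
       condition, mirroring how on nvptx every bar.sync participant rejoins
       the barrier with an updated thread count.  */
    static void
    futex_wake (int *addr, int count)
    {
      (void) addr;
      (void) count;
      pthread_mutex_lock (&lock);
      pthread_cond_broadcast (&cond);
      pthread_mutex_unlock (&lock);
    }

    static void *
    waiter (void *arg)
    {
      futex_wait (&generation, 0);
      return arg;
    }

    int
    main (void)
    {
      pthread_t t;
      pthread_create (&t, NULL, waiter, NULL);
      pthread_mutex_lock (&lock);
      generation = 1;		/* thread 0 advances the generation ...  */
      pthread_mutex_unlock (&lock);
      futex_wake (&generation, 1);	/* ... and wakes everyone up.  */
      pthread_join (t, NULL);
      puts ("waiter released; no hang");
      return 0;
    }
    ```

    Because the generation is advanced under the lock before the broadcast,
    the waiter either never sleeps or is reliably woken, so this model
    cannot lose a wakeup the way the plain bar.sync mapping did.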

    Tested libgomp on x86_64 with nvptx accelerator.

    libgomp/ChangeLog:

    2021-04-20  Tom de Vries  <tdevries@suse.de>

            PR target/99555
            * config/nvptx/bar.c (generation_to_barrier): New function, copied
            from config/rtems/bar.c.
            (futex_wait, futex_wake): New function.
            (do_spin, do_wait): New function, copied from config/linux/wait.h.
            (gomp_barrier_wait_end, gomp_barrier_wait_last)
            (gomp_team_barrier_wake, gomp_team_barrier_wait_end)
            (gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel):
            Remove and replace with include of config/linux/bar.c.
            * config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
            (gomp_barrier_init): Init new fields.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: Remove
            nvptx-specific workarounds.
            * testsuite/libgomp.c/pr99555-1.c: Same.
            * testsuite/libgomp.fortran/task-detach-6.f90: Same.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2022-02-22 14:53 ` cvs-commit at gcc dot gnu.org
@ 2022-02-22 14:54 ` vries at gcc dot gnu.org
  2022-03-17 12:16 ` tschwinge at gcc dot gnu.org
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2022-02-22 14:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

Tom de Vries <vries at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |12.0

--- Comment #12 from Tom de Vries <vries at gcc dot gnu.org> ---
Fixed in "[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end".

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2022-02-22 14:54 ` vries at gcc dot gnu.org
@ 2022-03-17 12:16 ` tschwinge at gcc dot gnu.org
  2022-05-13 13:16 ` tschwinge at gcc dot gnu.org
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-03-17 12:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #13 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Thanks -- I'm confirming:

    PASS: libgomp.c/../libgomp.c-c++-common/task-detach-6.c (test for excess
errors)
    [-XFAIL:-]{+PASS:+} libgomp.c/../libgomp.c-c++-common/task-detach-6.c
execution test

    PASS: libgomp.c/pr99555-1.c (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.c/pr99555-1.c execution test

    PASS: libgomp.c++/../libgomp.c-c++-common/task-detach-6.c (test for excess
errors)
    [-XFAIL:-]{+PASS:+} libgomp.c++/../libgomp.c-c++-common/task-detach-6.c
execution test

    PASS: libgomp.fortran/task-detach-6.f90   -O0  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O0  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O1  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O1  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O2  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O2  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess
errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions 
execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O3 -g  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O3 -g  execution
test
    PASS: libgomp.fortran/task-detach-6.f90   -Os  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -Os  execution test

..., but on one system (only!), I'm also seeing regressions as follows:

    PASS: libgomp.c/../libgomp.c-c++-common/task-detach-10.c (test for excess
errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/task-detach-10.c
execution test

    PASS: libgomp.c/../libgomp.c-c++-common/task-detach-8.c (test for excess
errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/task-detach-8.c
execution test

    PASS: libgomp.c++/../libgomp.c-c++-common/task-detach-10.c (test for excess
errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c++/../libgomp.c-c++-common/task-detach-10.c
execution test

    PASS: libgomp.c++/../libgomp.c-c++-common/task-detach-8.c (test for excess
errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c++/../libgomp.c-c++-common/task-detach-8.c
execution test

    PASS: libgomp.fortran/task-detach-10.f90   -O0  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O0  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O1  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O1  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O2  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O2  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess
errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions 
execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O3 -g  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O3 -g  execution
test
    PASS: libgomp.fortran/task-detach-10.f90   -Os  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -Os  execution test

    PASS: libgomp.fortran/task-detach-8.f90   -O0  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O0  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O1  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O1  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O2  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O2  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess
errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions 
execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O3 -g  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O3 -g  execution
test
    PASS: libgomp.fortran/task-detach-8.f90   -Os  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -Os  execution test

(Accumulated over a few runs; not always seeing all of those.)

That's with a Nvidia Tesla K20c GPU, Driver Version: 346.46.
As that version is "a bit old", I shall first update this, before we spend any
further time on analyzing this.

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2022-03-17 12:16 ` tschwinge at gcc dot gnu.org
@ 2022-05-13 13:16 ` tschwinge at gcc dot gnu.org
  2022-09-06 13:32 ` vries at gcc dot gnu.org
  2022-12-21 13:59 ` cvs-commit at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-05-13 13:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #14 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Regarding my previous report that after
commit r12-7332-g5ed77fb3ed1ee0289a0ec9499ef52b99b39421f1
"[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end"...

(In reply to Thomas Schwinge from comment #13)
> [...] on one system (only!), I'm [...] seeing regressions as follows:
> 
>     PASS: libgomp.c/../libgomp.c-c++-common/task-detach-10.c (test for excess errors)
>     {+WARNING: program timed out.+}
>     [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/task-detach-10.c execution test

..., and similar for all 'libgomp.c-c++-common/task-detach-10.c',
'libgomp.c-c++-common/task-detach-8.c', 'libgomp.fortran/task-detach-10.f90',
'libgomp.fortran/task-detach-8.f90' test cases:

> (Accumulated over a few runs; not always seeing all of those.)
> 
> That's with a Nvidia Tesla K20c GPU, Driver Version: 346.46.
> As that version is "a bit old", I shall first update this, before we spend
> any further time on analyzing this.

Cross-checking on another system with Nvidia Tesla K20c GPU but more recent
Driver Version I'm not seeing such an issue.

On the "old" system, gradually upgrading Driver Version: 346.46 to 352.99,
361.93.02, 375.88 (always the latest (?) version of the respective series),
these all did not resolve the problem.

Only starting with 384.59 (that is, early version of the 384.X series), that
then did resolve the issue.  That's still using the GCC/nvptx '-mptx=3.1'
multilib.

(We couldn't with earlier series, but given this is 384.X, we may now also
cross-check with the default multilib, and that also was fine.)

Now, I don't know if at all we would like to spend any more effort on this
issue, given that it only appears with rather old pre-384.X versions -- but on
the other hand, the GCC/nvptx '-mptx=3.1' multilib is meant to keep these
supported?  (... which is why I'm running such testing; and certainly the
timeouts are annoying there.)

It might be another issue with pre-384.X versions of the Nvidia PTX JIT, or is
there the slight possibility that GCC is generating/libgomp contains some
"weird" code that post-384.X version happen to "fix up" -- probably the former
rather than the latter?  (Or, the chance of GPU hardware/firmware or some other
system weirdness -- unlikely, otherwise behaves totally fine?)

I don't know where to find complete Nvidia Driver/JIT release notes, where the
375.X -> 384.X notes might provide an idea of what got fixed, and we might then
add another 'WORKAROUND_PTXJIT_BUG' for that -- maybe simple, maybe not.

Any thoughts, Tom?

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2022-05-13 13:16 ` tschwinge at gcc dot gnu.org
@ 2022-09-06 13:32 ` vries at gcc dot gnu.org
  2022-12-21 13:59 ` cvs-commit at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: vries at gcc dot gnu.org @ 2022-09-06 13:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #17 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Thomas Schwinge from comment #14)
> > That's with a Nvidia Tesla K20c GPU, Driver Version: 346.46.
> > As that version is "a bit old", I shall first update this, before we spend
> > any further time on analyzing this.
> 
> Cross-checking on another system with Nvidia Tesla K20c GPU but more recent
> Driver Version I'm not seeing such an issue.
> 
> On the "old" system, gradually upgrading Driver Version: 346.46 to 352.99,
> 361.93.02, 375.88 (always the latest (?) version of the respective series),
> these all did not resolve the problem.
> 
> Only starting with 384.59 (that is, early version of the 384.X series), that
> then did resolve the issue.  That's still using the GCC/nvptx '-mptx=3.1'
> multilib.
> 
> (We couldn't with earlier series, but given this is 384.X, we may now also
> cross-check with the default multilib, and that also was fine.)
> 
> Now, I don't know if at all we would like to spend any more effort on this
> issue, given that it only appears with rather old pre-384.X versions -- but
> on the other hand, the GCC/nvptx '-mptx=3.1' multilib is meant to keep these
> supported?  (... which is why I'm running such testing; and certainly the
> timeouts are annoying there.)
> 
> It might be another issue with pre-384.X versions of the Nvidia PTX JIT, or
> is there the slight possibility that GCC is generating/libgomp contains some
> "weird" code that post-384.X version happen to "fix up" -- probably the
> former rather than the latter?  (Or, the chance of GPU hardware/firmware or
> some other system weirdness -- unlikely, otherwise behaves totally fine?)
> 
> I don't know where to find complete Nvidia Driver/JIT release notes, where
> the 375.X -> 384.X notes might provide an idea of what got fixed, and we
> might then add another 'WORKAROUND_PTXJIT_BUG' for that -- maybe simple,
> maybe not.
> 
> Any thoughts, Tom?

I care about old cards, not about old drivers.  The oldest card we support is
an sm_30, and the last driver series that supports it is 470.x (and AFAIU, it
is therefore still supported by nvidia for that arch).

There's the legacy series, 390.x, which is the last to support fermi, but we
don't support any fermi cards or earlier.  I did do some testing with this one
for later cards, but reported issues are acknowledged but not fixed by nvidia,
so ... this is already out of scope for me.

So yeah, IWBN to come up with workarounds for various older drivers, but I'm
not investing time in that.  Is there a problem for you to move to 470.x or
later (515.x)?  Is there a card for which that causes problems?

* [Bug target/99555] [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
  2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2022-09-06 13:32 ` vries at gcc dot gnu.org
@ 2022-12-21 13:59 ` cvs-commit at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-12-21 13:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555

--- Comment #18 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Chung-Lin Tang <cltang@gcc.gnu.org>:

https://gcc.gnu.org/g:fdc7469cf597ec11229ddfc3e9c7a06f3d0fba9d

commit r13-4832-gfdc7469cf597ec11229ddfc3e9c7a06f3d0fba9d
Author: Chung-Lin Tang <cltang@codesourcery.com>
Date:   Wed Dec 21 05:57:45 2022 -0800

    nvptx: reimplement libgomp barriers [PR99555]

    Instead of trying to have the GPU do CPU-with-OS-like things, this new
    barriers implementation for NVPTX uses simplistic bar.* synchronization
    instructions.  Tasks are processed after threads have joined, and only if
    team->task_count != 0.

    It is noted that: there might be a little bit of performance forfeited for
    cases where earlier arriving threads could've been used to process tasks
    ahead of other threads, but that has the requirement of implementing
    complex futex-wait/wake like behavior, which is what we're trying to avoid
    with this patch.  It is deemed that task processing is not what GPU target
    offloading is usually used for.

    Implementation highlight notes:
    1. gomp_team_barrier_wake() is now an empty function (threads never
       "wake" in the usual manner)
    2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
    3. gomp_barrier_wait_last() now is implemented using "bar.arrive"

    4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
       The main synchronization is done using a 'bar.red' instruction.  This
       reduces across all threads the condition (team->task_count != 0), to
       enable the task processing down below if any thread created a task.
       (this bar.red usage means that this patch is dependent on the prior
       NVPTX bar.red GCC patch)

            PR target/99555

    libgomp/ChangeLog:

            * config/nvptx/bar.c (generation_to_barrier): Remove.
            (futex_wait,futex_wake,do_spin,do_wait): Remove.
            (GOMP_WAIT_H): Remove.
            (#include "../linux/bar.c"): Remove.
            (gomp_barrier_wait_end): New function.
            (gomp_barrier_wait): Likewise.
            (gomp_barrier_wait_last): Likewise.
            (gomp_team_barrier_wait_end): Likewise.
            (gomp_team_barrier_wait): Likewise.
            (gomp_team_barrier_wait_final): Likewise.
            (gomp_team_barrier_wait_cancel_end): Likewise.
            (gomp_team_barrier_wait_cancel): Likewise.
            (gomp_team_barrier_cancel): Likewise.
            * config/nvptx/bar.h (gomp_barrier_t): Remove waiters, lock fields.
            (gomp_barrier_init): Remove init of waiters, lock fields.
            (gomp_team_barrier_wake): Remove prototype, add new static inline
            function.

end of thread, other threads:[~2022-12-21 13:59 UTC | newest]

Thread overview: 17+ messages
2021-03-11 16:36 [Bug target/99555] New: [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs tschwinge at gcc dot gnu.org
2021-03-12 15:53 ` [Bug target/99555] " vries at gcc dot gnu.org
2021-03-25 12:00 ` cvs-commit at gcc dot gnu.org
2021-03-29  8:41 ` cvs-commit at gcc dot gnu.org
2021-04-15  8:02 ` vries at gcc dot gnu.org
2021-04-15  9:14 ` cvs-commit at gcc dot gnu.org
2021-04-17  8:07 ` vries at gcc dot gnu.org
2021-04-19 10:44 ` vries at gcc dot gnu.org
2021-04-19 11:15 ` vries at gcc dot gnu.org
2021-04-19 15:39 ` vries at gcc dot gnu.org
2021-04-20 11:24 ` vries at gcc dot gnu.org
2022-02-22 14:53 ` cvs-commit at gcc dot gnu.org
2022-02-22 14:54 ` vries at gcc dot gnu.org
2022-03-17 12:16 ` tschwinge at gcc dot gnu.org
2022-05-13 13:16 ` tschwinge at gcc dot gnu.org
2022-09-06 13:32 ` vries at gcc dot gnu.org
2022-12-21 13:59 ` cvs-commit at gcc dot gnu.org
