public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/9] [COVER-LETTER, nvptx] Add support for warp-multiple openacc vector length
@ 2019-01-12 22:21 Tom de Vries
From: Tom de Vries @ 2019-01-12 22:21 UTC
  To: gcc-patches; +Cc: Thomas Schwinge

I. Current state

The current openacc implementation sets vector length to warp-size.

There are two aspects that need to be implemented for an openacc implementation
to work: communication and synchronization.  Synchronization is needed at the
end of worker and vector loops.  Communication is needed at the start of worker
and vector loops, to propagate state that has not been calculated redundantly in
vector-single and worker-single mode to vector-partitioned and
worker-partitioned mode.

For worker loops, synchronization at the end of the loop is done using the
inter-warp synchronization instruction 'bar.sync 0'.  Communication is done
using a buffer in shared memory (and synchronization is used to ensure that the
buffer is used properly).

For vector loops with warp-sized vector length, synchronization at the end of
the loop is not needed, since warps are synchronized by definition.
Communication is done using the intra-warp communication instruction shfl.

These vector and worker schemes do not change if we nest a vector loop in a
worker loop.  OTOH, a vector-and-worker loop uses the worker scheme.

II. Patch series

This patch series adds the possibility to use warp-multiple openacc vector
length.
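
From the user's side, a warp-multiple vector length is requested with the
standard OpenACC vector_length clause.  The following is a minimal sketch, not
taken from the patch series: the function name sum_vl128 is made up for
illustration, and it assumes a warp size of 32, so that 128 is a warp multiple.
Without -fopenacc the pragmas are ignored and the loop simply runs serially.

```c
/* Sum the integers 0 .. n-1 in a vector loop that requests a vector
   length of 128, i.e. four warps assuming a warp size of 32.
   sum_vl128 is a hypothetical name used only for this sketch.  */
int
sum_vl128 (int n)
{
  int sum = 0;

#pragma acc parallel num_gangs (1) vector_length (128) copy (sum)
  {
#pragma acc loop vector reduction (+:sum)
    for (int i = 0; i < n; i++)
      sum += i;
  }

  return sum;
}
```

The result has to be the same regardless of the vector length that ends up
being used, which is what makes a loop like this usable as a test-case.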

This means we can no longer rely on the same mechanisms for communication and
synchronization of vector loops, and need to apply the same ones as we do for
worker loops.

II.a Vector loop

A vector loop with warp-sized vector length looks as before.  A vector loop with
warp-multiple vector length looks like a simple worker loop.
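
The choice between the two schemes for a plain vector loop can be summarized
as a small decision function.  This is only a sketch of the rule described
above, not code from nvptx.c: the type and function names are invented, and
WARP_SIZE = 32 is an assumption.

```c
#include <stdbool.h>

#define WARP_SIZE 32  /* Assumption: warp size of the target.  */

/* How state is propagated at the start of the loop.  */
enum propagation
{
  PROP_SHFL,           /* Intra-warp shfl instruction.  */
  PROP_SHARED_MEMORY   /* Buffer in shared memory, as for worker loops.  */
};

/* Hypothetical summary of the scheme used for a vector loop.  */
struct loop_scheme
{
  enum propagation propagation;
  bool needs_barrier;  /* Synchronization needed at the end of the loop.  */
};

struct loop_scheme
vector_loop_scheme (unsigned vector_length)
{
  struct loop_scheme scheme;

  if (vector_length == WARP_SIZE)
    {
      /* Warp-sized vector: warps are synchronized by definition.  */
      scheme.propagation = PROP_SHFL;
      scheme.needs_barrier = false;
    }
  else
    {
      /* Warp-multiple vector: handled like a simple worker loop.  */
      scheme.propagation = PROP_SHARED_MEMORY;
      scheme.needs_barrier = true;
    }

  return scheme;
}
```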

II.b Vector-and-worker loop

A vector-and-worker loop is handled as a worker loop, as before.

II.c Vector loop in worker loop

A vector loop in a worker loop with warp-sized vector length looks as before.

A vector loop in a worker loop with warp-multiple vector length is handled as
follows.

We use the 'bar.sync 0' instruction (which synchronizes all threads in a CTA)
for worker synchronization, but to synchronize only the warps that form a
vector together, we use 'bar.sync <id>, <vector-length>', where <id> uniquely
identifies the vector (we use the worker id, offset by one so as not to clash
with logical barrier resource '0' used by worker synchronization, so:
%tid.y + 1).
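
The two operands of the per-vector barrier can be sketched as follows.  The
helper names are hypothetical and WARP_SIZE = 32 is an assumption; the
assertion reflects the PTX requirement that an explicit bar.sync thread count
be a multiple of the warp size, which a warp-multiple vector length satisfies
by construction.

```c
#include <assert.h>

#define WARP_SIZE 32          /* Assumption: warp size of the target.  */
#define WORKER_BARRIER_ID 0   /* 'bar.sync 0': all threads in the CTA.  */

/* Barrier resource for the vector that belongs to worker WORKER_ID
   (%tid.y): offset by one to stay clear of WORKER_BARRIER_ID.  */
unsigned
vector_barrier_id (unsigned worker_id)
{
  return worker_id + 1;
}

/* Thread count operand of 'bar.sync <id>, <vector-length>'.  */
unsigned
vector_barrier_threads (unsigned vector_length)
{
  assert (vector_length % WARP_SIZE == 0);
  return vector_length;
}
```

For example, with a vector length of 128 the vector of worker 2 would
synchronize on barrier 3 with a thread count of 128.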

Furthermore, the fact that vectors synchronize independently means that vector
state needs to be propagated independently.  We handle this by allocating a
state propagation buffer for each vector.  So, the shared memory buffer is
partitioned into a part for worker propagation, and num_worker parts for vector
propagation.

We'll name the first part worker-generic and the other parts worker-specific
(but we've got one vector per worker, so confusingly you might also call it
vector-specific).
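
One possible layout of the partitioned buffer is sketched below.  The cover
letter does not spell out the order or the sizes of the parts, so the layout
(worker-generic part first, followed by equally sized worker-specific parts)
and all names are assumptions made for illustration.

```c
#include <stddef.h>

/* Offset of the worker-generic part: assumed to sit at the start.  */
size_t
worker_generic_offset (void)
{
  return 0;
}

/* Offset of the worker-specific part used by the vector of WORKER_ID,
   assuming the parts follow the worker-generic part back to back.  */
size_t
worker_specific_offset (size_t generic_size, size_t vector_part_size,
                        unsigned worker_id)
{
  return generic_size + (size_t) worker_id * vector_part_size;
}

/* Total shared memory needed: one worker-generic part plus
   NUM_WORKERS worker-specific parts.  */
size_t
propagation_buffer_size (size_t generic_size, size_t vector_part_size,
                         unsigned num_workers)
{
  return generic_size + (size_t) num_workers * vector_part_size;
}
```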

In a vector loop in a worker loop, we first transition from worker-single to
worker-partitioned, and then from vector-single to vector-partitioned, which
means state propagation from W0V0 to WAV0, and then state propagation from WAV0
to WAVA (using W for worker, V for vector, and A for all).
For branch condition propagation however, a condition calculated in
worker-single-vector-single mode is propagated from W0V0 to WAVA directly (so we use
the worker-generic buffer for that).
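
The order of the two propagation steps can be illustrated with a small
host-side simulation, where state[w][v] stands for the copy of a value held
by vector lane v of worker w.  This is a model of the data flow only, not of
the generated PTX; the dimensions and function names are made up.

```c
#define NUM_WORKERS 4     /* Hypothetical number of workers.  */
#define VECTOR_LENGTH 64  /* Hypothetical warp-multiple vector length.  */

/* Worker-single to worker-partitioned: W0V0 -> WAV0
   (in the scheme above, via the worker-generic buffer).  */
void
propagate_worker (int state[NUM_WORKERS][VECTOR_LENGTH])
{
  for (int w = 1; w < NUM_WORKERS; w++)
    state[w][0] = state[0][0];
}

/* Vector-single to vector-partitioned: WAV0 -> WAVA
   (in the scheme above, via each worker's own buffer part).  */
void
propagate_vector (int state[NUM_WORKERS][VECTOR_LENGTH])
{
  for (int w = 0; w < NUM_WORKERS; w++)
    for (int v = 1; v < VECTOR_LENGTH; v++)
      state[w][v] = state[w][0];
}

/* Branch condition: W0V0 -> WAVA in a single step.  */
void
propagate_branch_cond (int cond[NUM_WORKERS][VECTOR_LENGTH])
{
  for (int w = 0; w < NUM_WORKERS; w++)
    for (int v = 0; v < VECTOR_LENGTH; v++)
      cond[w][v] = cond[0][0];
}
```

After propagate_worker only lane 0 of each worker holds the value; the other
lanes receive it in propagate_vector.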

II.d Routines

There's a question on how to handle vector-partitionable routines in such a
scheme, given these can now be called from a context with a warp-multiple vector
length, while the current implementation of routines assumes warp-sized vector
length.  This patch series takes a conservative approach: keep routine
generation as is, and detect if we're calling a vector-partitionable routine
from an offloading region, and if so we fall back to warp-sized vector length
in that region.
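
The situation looks roughly like the following sketch, which is written for
illustration and is not one of the test-cases in the series (the function
names are made up).  The region asks for a vector length of 128 but calls a
vector-partitionable routine, so with the conservative approach described
above the region would be expected to fall back to a warp-sized vector
length.  The result must be the same either way.

```c
/* A vector-partitionable routine: scales one row in a vector loop.  */
#pragma acc routine vector
void
scale_row (int *row, int n, int factor)
{
#pragma acc loop vector
  for (int i = 0; i < n; i++)
    row[i] *= factor;
}

/* Offloading region that requests vector_length (128) and calls the
   routine above from a gang loop.  */
void
scale_matrix (int *a, int rows, int cols, int factor)
{
#pragma acc parallel vector_length (128) copy (a[0:rows * cols])
  {
#pragma acc loop gang
    for (int r = 0; r < rows; r++)
      scale_row (a + r * cols, cols, factor);
  }
}
```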

III. Testing

Built and reg-tested on x86_64 with nvptx accelerator.

Built and reg-tested on x86_64 with nvptx accelerator with
PTX_DEFAULT_VECTOR_LENGTH set to various sizes.

IV. Patches

     1  [nvptx] Enable large vectors
     2  [nvptx] Update insufficient launch message for variable vector_length
     3  [nvptx] Enable large vectors -- test-cases
     4  [nvptx] Enable large vectors -- reduction testcases
     5  [nvptx] Don't emit barriers for empty loops -- test-cases
     6  [nvptx] Force vl32 if calling vector-partitionable routines -- test-cases
     7  [nvptx] Add vector_length 64 test-cases
     8  [nvptx] Enable setting vector length using -fopenacc-dim
     9  [nvptx] Enable setting vector length using -fopenacc-dim -- testcases


Tom de Vries (9):
  [nvptx] Enable large vectors
  [nvptx] Update insufficient launch message for variable vector_length
  [nvptx] Enable large vectors -- test-cases
  [nvptx] Enable large vectors -- reduction testcases
  [nvptx] Don't emit barriers for empty loops -- test-cases
  [nvptx] Force vl32 if calling vector-partitionable routines --
    test-cases
  [nvptx] Add vector_length 64 test-cases
  [nvptx] Enable setting vector length using -fopenacc-dim
  [nvptx] Enable setting vector length using -fopenacc-dim -- testcases

 gcc/config/nvptx/nvptx.c                           |  5 +-
 libgomp/plugin/plugin-nvptx.c                      | 20 +++---
 .../libgomp.oacc-c-c++-common/parallel-dims.c      |  4 +-
 .../libgomp.oacc-c-c++-common/pr85381-5.c          | 24 +++++++
 .../testsuite/libgomp.oacc-c-c++-common/pr85381.c  | 18 +++++
 .../libgomp.oacc-c-c++-common/pr85486-2.c          | 52 ++++++++++++++
 .../libgomp.oacc-c-c++-common/pr85486-3.c          | 54 +++++++++++++++
 .../testsuite/libgomp.oacc-c-c++-common/pr85486.c  | 51 ++++++++++++++
 .../vector-length-128-1.c                          |  5 +-
 .../vector-length-128-10.c                         | 39 +++++++++++
 .../vector-length-128-2.c                          | 39 +++++++++++
 .../vector-length-128-4.c                          | 40 +++++++++++
 .../vector-length-128-5.c                          | 41 +++++++++++
 .../vector-length-128-6.c                          | 41 +++++++++++
 .../vector-length-128-7.c                          | 40 +++++++++++
 .../libgomp.oacc-c-c++-common/vector-length-64-1.c | 17 +++++
 .../libgomp.oacc-c-c++-common/vector-length-64-2.c | 21 ++++++
 .../libgomp.oacc-c-c++-common/vector-length-64-3.c | 17 +++++
 .../libgomp.oacc-c-c++-common/vred2d-128.c         | 55 +++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm-2.f90  | 80 ++++++++++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    | 79 +++++++++++++++++++++
 21 files changed, 726 insertions(+), 16 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm-2.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90

-- 
2.16.4
