From: Tom de Vries
To: gcc-patches@gcc.gnu.org
Cc: Thomas Schwinge
Subject: [PATCH 0/9] [COVER-LETTER, nvptx] Add support for warp-multiple openacc vector length
Date: Sat, 12 Jan 2019 22:21:00 -0000
Message-Id: <20190112222131.29519-1-tdevries@suse.de>

I. Current state

The current openacc implementation sets vector length to warp-size.

There are two aspects that need to be implemented for an openacc
implementation to work: communication and synchronization.

Synchronization is needed at the end of worker and vector loops.

Communication is needed at the start of worker and vector loops, to
propagate state that has not been calculated redundantly in
vector-single and worker-single mode to vector-partitioned and
worker-partitioned mode.

For worker loops, synchronization at the end of the loop is done using
the inter-warp synchronization instruction 'bar.sync 0'.  Communication
is done using a buffer in shared memory (and synchronization is used to
ensure that the buffer is used properly).

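To make the modes above concrete, here is a minimal sketch of a worker
loop containing a vector loop, with warp-sized vector length.  The
function name, array sizes and clause values are assumptions chosen for
illustration, and the comments only restate the scheme described above;
they do not show what the compiler actually emits.  Without OpenACC
support the pragmas are ignored and the loops run sequentially with the
same result.

```c
#include <assert.h>

#define N_OUTER 8
#define N_INNER 64

/* Hypothetical example: fills out[i][j] = 2 * seed + i * N_INNER + j
   inside an OpenACC region and returns the sum of all elements.  */
long
fill_and_sum (int seed)
{
  int out[N_OUTER][N_INNER];
  long sum = 0;

  assert (seed >= 0);

#pragma acc parallel num_gangs(1) num_workers(4) vector_length(32) copy(out)
  {
    /* Worker-single, vector-single mode: calculated once, so it has to
       be propagated to the other workers when the worker loop starts.  */
    int base = seed * 2;

#pragma acc loop worker
    for (int i = 0; i < N_OUTER; i++)
      {
        /* Worker-partitioned, vector-single mode: calculated once per
           worker, so it has to be propagated to the other vector lanes
           when the vector loop starts.  */
        int row = base + i * N_INNER;

#pragma acc loop vector
        for (int j = 0; j < N_INNER; j++)
          out[i][j] = row + j;
      }
  }

  /* Sum on the host, after the region has copied OUT back.  */
  for (int i = 0; i < N_OUTER; i++)
    for (int j = 0; j < N_INNER; j++)
      sum += out[i][j];

  return sum;
}
```

Calling fill_and_sum (5) fills the array with the values 10 to 521.
Here 'base' is the kind of state that needs worker propagation, and
'row' the kind that needs vector propagation.
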
For vector loops with warp-sized vector length, synchronization at the
end of the loop is not needed, since warps are synchronized by
definition.  Communication is done using the intra-warp communication
instruction shfl.

These vector and worker schemes do not change if we nest a vector loop
in a worker loop.  OTOH, a vector-and-worker loop uses the worker
scheme.

II. Patch series

This patch series adds the possibility to use warp-multiple openacc
vector length.  This means we can no longer rely on the same mechanisms
for communication and synchronization of vector loops, and need to
apply the same ones as we do for worker loops.

II.a Vector loop

A vector loop with warp-sized vector length looks as before.

A vector loop with warp-multiple vector length looks like a simple
worker loop.

II.b Vector-and-worker loop

A vector-and-worker loop is handled as a worker loop, as before.

II.c Vector loop in worker loop

A vector loop in a worker loop with warp-sized vector length looks as
before.

A vector loop in a worker loop with warp-multiple vector length is
handled as follows.

We use the 'bar.sync 0' instruction (which synchronizes all threads in
a CTA) for worker synchronization, but to synchronize only the warps
that form a vector together, we use 'bar.sync ID, THREAD-COUNT', where
ID uniquely identifies the vector (we use the worker id, offset by one
so as not to clash with logical barrier resource '0' used by worker
synchronization, so: %tid.y + 1).

Furthermore, the fact that vectors synchronize independently means that
vector state needs to be propagated independently.  We handle this by
allocating a state propagation buffer for each vector.  So, the shared
memory buffer is partitioned into a part for worker propagation, and
num_workers parts for vector propagation.  We'll name the first part
worker-generic and the other parts worker-specific (but we've got one
vector per worker, so confusingly you might also call it
vector-specific).

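The buffer partitioning and the barrier numbering can be sketched as
plain index arithmetic.  This is only a model under stated assumptions:
the helper names are hypothetical, the worker-generic part is assumed to
come first, and all worker-specific parts are assumed to have the same
size in bytes.  None of this is taken from the nvptx.c changes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers modelling the scheme described above; the names
   and the byte-based sizes are assumptions, not GCC's implementation.  */

/* PTX provides 16 barrier resources per CTA, numbered 0 to 15.  */
#define PTX_NUM_BARRIERS 16

/* Offset of the worker-generic part, assumed to come first.  */
size_t
worker_generic_offset (void)
{
  return 0;
}

/* Offset of the worker-specific part used for vector propagation by
   the vector that belongs to WORKER_ID.  */
size_t
worker_specific_offset (size_t generic_size, size_t specific_size,
                        unsigned num_workers, unsigned worker_id)
{
  assert (worker_id < num_workers);
  return generic_size + (size_t) worker_id * specific_size;
}

/* Total size: one worker-generic part plus NUM_WORKERS
   worker-specific parts.  */
size_t
propagation_buffer_size (size_t generic_size, size_t specific_size,
                         unsigned num_workers)
{
  return generic_size + (size_t) num_workers * specific_size;
}

/* Logical barrier resource for the vector of WORKER_ID: the worker id
   (%tid.y) offset by one, since barrier 0 is used for worker
   synchronization.  The range check is an addition of this sketch.  */
unsigned
vector_barrier_id (unsigned worker_id)
{
  assert (worker_id + 1 < PTX_NUM_BARRIERS);
  return worker_id + 1;
}
```

With a 64-byte worker-generic part, 32-byte worker-specific parts and 4
workers, worker 3 would use offset 160 and barrier 4, and the whole
buffer would take 192 bytes.
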
In a vector loop in a worker loop, we first transition from
worker-single to worker-partitioned, and then from vector-single to
vector-partitioned, which means state propagation from W0V0 to WAV0,
and then state propagation from WAV0 to WAVA (using W for worker, V for
vector, and A for all).

For branch condition propagation however, a condition calculated in
worker-single-vector-single mode is propagated from W0V0 to WAVA
directly (so we use the worker-generic buffer for that).

II.d Routines

There's a question on how to handle vector-partitionable routines in
such a scheme, given these can now be called from a context with a
warp-multiple vector length, while the current implementation of
routines assumes warp-sized vector length.

This patch series takes a conservative approach: keep routine
generation as is, and detect if we're calling a vector-partitionable
routine from an offloading region, and if so we fall back to warp-sized
vector length in that region.

III. Testing

Built and reg-tested on x86_64 with nvptx accelerator.

Built and reg-tested on x86_64 with nvptx accelerator with
PTX_DEFAULT_VECTOR_LENGTH set to various sizes.

IV. Patches

     1  [nvptx] Enable large vectors
     2  [nvptx] Update insufficient launch message for variable vector_length
     3  [nvptx] Enable large vectors -- test-cases
     4  [nvptx] Enable large vectors -- reduction testcases
     5  [nvptx] Don't emit barriers for empty loops -- test-cases
     6  [nvptx] Force vl32 if calling vector-partitionable routines -- test-cases
     7  [nvptx] Add vector_length 64 test-cases
     8  [nvptx] Enable setting vector length using -fopenacc-dim
     9  [nvptx] Enable setting vector length using -fopenacc-dim -- testcases

Tom de Vries (9):
  [nvptx] Enable large vectors
  [nvptx] Update insufficient launch message for variable vector_length
  [nvptx] Enable large vectors -- test-cases
  [nvptx] Enable large vectors -- reduction testcases
  [nvptx] Don't emit barriers for empty loops -- test-cases
  [nvptx] Force vl32 if calling vector-partitionable routines --
    test-cases
  [nvptx] Add vector_length 64 test-cases
  [nvptx] Enable setting vector length using -fopenacc-dim
  [nvptx] Enable setting vector length using -fopenacc-dim -- testcases

 gcc/config/nvptx/nvptx.c                           |  5 +-
 libgomp/plugin/plugin-nvptx.c                      | 20 +++---
 .../libgomp.oacc-c-c++-common/parallel-dims.c      |  4 +-
 .../libgomp.oacc-c-c++-common/pr85381-5.c          | 24 +++++++
 .../testsuite/libgomp.oacc-c-c++-common/pr85381.c  | 18 +++++
 .../libgomp.oacc-c-c++-common/pr85486-2.c          | 52 ++++++++++++++
 .../libgomp.oacc-c-c++-common/pr85486-3.c          | 54 +++++++++++++++
 .../testsuite/libgomp.oacc-c-c++-common/pr85486.c  | 51 ++++++++++++++
 .../vector-length-128-1.c                          |  5 +-
 .../vector-length-128-10.c                         | 39 +++++++++++
 .../vector-length-128-2.c                          | 39 +++++++++++
 .../vector-length-128-4.c                          | 40 +++++++++++
 .../vector-length-128-5.c                          | 41 +++++++++++
 .../vector-length-128-6.c                          | 41 +++++++++++
 .../vector-length-128-7.c                          | 40 +++++++++++
 .../libgomp.oacc-c-c++-common/vector-length-64-1.c | 17 +++++
 .../libgomp.oacc-c-c++-common/vector-length-64-2.c | 21 ++++++
 .../libgomp.oacc-c-c++-common/vector-length-64-3.c | 17 +++++
 .../libgomp.oacc-c-c++-common/vred2d-128.c         | 55 +++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm-2.f90  | 80 ++++++++++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    | 79 +++++++++++++++++++++
 21 files changed, 726 insertions(+), 16 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm-2.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90

-- 
2.16.4