public inbox for gcc-patches@gcc.gnu.org
* [gomp-nvptx 06/13] libgomp: add nvptx time.c
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (3 preceding siblings ...)
  2016-01-20 17:27 ` [gomp-nvptx 01/13] nvptx backend: new patterns for OpenMP SIMD-via-SIMT Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 09/13] libgomp: use generic fortran.c on nvptx Alexander Monakov
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

This patch implements time.c on NVPTX with the %clock64 register.  The PTX
documentation describes %globaltimer as explicitly off-limits for us.

	* config/nvptx/time.c: New.
---
 libgomp/ChangeLog.gomp-nvptx |  4 ++++
 libgomp/config/nvptx/time.c  | 49 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/libgomp/config/nvptx/time.c b/libgomp/config/nvptx/time.c
index e69de29..08feafe 100644
--- a/libgomp/config/nvptx/time.c
+++ b/libgomp/config/nvptx/time.c
@@ -0,0 +1,49 @@
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Dmitry Melnik <dm@ispras.ru>
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file implements timer routines for NVPTX.  It uses the %clock64 cycle
+   counter.  */
+
+#include "libgomp.h"
+
+/* This is set from host in plugin-nvptx.c.  */
+double __nvptx_clocktick = 0;
+
+double
+omp_get_wtime (void)
+{
+  uint64_t clock;
+  asm ("mov.u64 %0, %%clock64;" : "=r" (clock));
+  return clock * __nvptx_clocktick;
+}
+
+double
+omp_get_wtick (void)
+{
+  return __nvptx_clocktick;
+}
+
+ialias (omp_get_wtime)
+ialias (omp_get_wtick)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [gomp-nvptx 10/13] libgomp testsuite: add -foffload=-lgfortran
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (6 preceding siblings ...)
  2016-01-20 17:27 ` [gomp-nvptx 03/13] nvptx backend: silence warning Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 05/13] libgomp: remove sections.c, splay-tree.c Alexander Monakov
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

Link libgfortran for offloaded code as well.

	* testsuite/libgomp.fortran/fortran.exp (lang_link_flags): Pass
	-foffload=-lgfortran in addition to -lgfortran.
	* testsuite/libgomp.oacc-fortran/fortran.exp (lang_link_flags): Ditto.
---
 libgomp/ChangeLog.gomp-nvptx                       | 6 ++++++
 libgomp/testsuite/libgomp.fortran/fortran.exp      | 2 +-
 libgomp/testsuite/libgomp.oacc-fortran/fortran.exp | 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/libgomp/testsuite/libgomp.fortran/fortran.exp b/libgomp/testsuite/libgomp.fortran/fortran.exp
index 9e6b643..d848ed4 100644
--- a/libgomp/testsuite/libgomp.fortran/fortran.exp
+++ b/libgomp/testsuite/libgomp.fortran/fortran.exp
@@ -7,7 +7,7 @@ global ALWAYS_CFLAGS
 
 set shlib_ext [get_shlib_extension]
 set lang_library_path	"../libgfortran/.libs"
-set lang_link_flags	"-lgfortran"
+set lang_link_flags	"-lgfortran -foffload=-lgfortran"
 if [info exists lang_include_flags] then {
     unset lang_include_flags
 }
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp b/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp
index 2d6b647..663c932 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp
+++ b/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp
@@ -9,7 +9,7 @@ global ALWAYS_CFLAGS
 
 set shlib_ext [get_shlib_extension]
 set lang_library_path	"../libgfortran/.libs"
-set lang_link_flags	"-lgfortran"
+set lang_link_flags	"-lgfortran -foffload=-lgfortran"
 if [info exists lang_include_flags] then {
     unset lang_include_flags
 }


* [gomp-nvptx 00/13] SIMD, teams, Fortran
@ 2016-01-20 17:27 Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 02/13] omp-low: extend SIMD lowering for SIMT execution Alexander Monakov
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

Hello,

I'm pushing this patch series to the gomp-nvptx branch.  It adds the
following:

  - backend and omp-low.c changes for SIMT-style SIMD region handling
  - libgomp changes for running the fortran testsuite
  - libgomp changes for spawning multiple OpenMP teams

I'll perform a trunk merge and copyright years update on the branch shortly.
There are 4 tests that still fail in the libgomp testsuite with NVPTX offloading:

  - 2 due to missing 'usleep'
  - 2 due to unimplemented 'target nowait'/GOMP_OFFLOAD_run_async.

The most interesting part of the series is the omp-low.c additions for
lowering SIMD regions for SIMT execution.  I've taken care to insert new code
only when the region could be offloaded to NVPTX, and to make sure that the
added code can be easily cleaned up on the host compiler side.

However, there's one infrastructure piece that I didn't manage to nail down
yet.  We are running in non-default mode outside of SIMD regions, with
per-warp soft-stacks and atomics instrumented to have an effect once per warp.
We need to transition to the opposite mode on SIMD region boundaries.  While
switching atomics is easy, I see no way to model stack switching in GCC IL,
except for doing it at function boundaries (which is then also easy from the
backend point of view).  As a result, we need to outline SIMD regions for
NVPTX into separate functions, if they are not already outlined by virtue of
being combined into an 'omp parallel' or 'omp task'.

To achieve that, I think there are two general possibilities:

1) Post LTO stream-in, in omp_device_lower, in the accel compiler only.  I'm
not sure how hard it would be; outlining at that point is not something GCC
does normally, although tree-parloops does something similar.  I think this
isn't preferable.

2) Up front during omp-lowering, properly outline it together with parallel
and task regions, and tweak inlining so that inlining back happens on the
host side only.  It looks like I'd need to invent a new ephemeral GIMPLE
statement, say OMP__SIMTREG_, that is handled like the other 'taskreg' kinds
(OMP_PARALLEL and OMP_TASK), and artificially inject it into the IL.  Or, to
avoid excessive surgery, it may be better to reuse an existing taskreg kind
(OMP_PARALLEL) and instead attach an artificial clause signalling that this
"parallel" is for outlining a SIMD region.

Thoughts, comments?

Thanks.
Alexander


* [gomp-nvptx 07/13] libgomp plugin: set __nvptx_clocktick
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 02/13] omp-low: extend SIMD lowering for SIMT execution Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 04/13] nvptx backend: add support for placing variables in shared memory Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 01/13] nvptx backend: new patterns for OpenMP SIMD-via-SIMT Alexander Monakov
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

This is the libgomp plugin side of omp_get_wtime support on NVPTX.  Query
the GPU frequency and copy the value into the device image.

At the moment the CUDA driver sets the GPU to a fixed frequency when a CUDA
context is created (the default is the highest non-boost frequency, but it
can be altered with the nvidia-smi utility), so as long as dynamic boost is
not implemented and thermal throttling does not happen, the queried value
should correspond to the actual frequency of %clock64 updates.  However, on
GTX Titan we observed that the driver returns a GPU frequency midway between
the actual frequency and the boost frequency -- we consider that a driver
bug.  Thus, the implementation comes with a caveat: device-side measurements
are less reliable than host-side ones.

	* plugin/plugin-nvptx.c (struct ptx_device): New field (clock_khz).
	(nvptx_open_device): Set it.
	(nvptx_set_clocktick): New.  Use it...
	(GOMP_OFFLOAD_load_image): ...here.
---
 libgomp/ChangeLog.gomp-nvptx  |  7 +++++++
 libgomp/plugin/plugin-nvptx.c | 28 +++++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index e687586..87e0494 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -287,8 +287,9 @@ struct ptx_device
   bool overlap;
   bool map;
   bool concur;
-  int  mode;
   bool mkern;
+  int  mode;
+  int clock_khz;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -641,6 +642,12 @@ nvptx_open_device (int n)
 
   ptx_dev->mkern = pi;
 
+  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_CLOCK_RATE, dev);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
+
+  ptx_dev->clock_khz = pi;
+
   r = cuDeviceGetAttribute (&async_engines,
 			    CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
@@ -1505,6 +1512,23 @@ GOMP_OFFLOAD_version (void)
   return GOMP_VERSION;
 }
 
+/* Initialize __nvptx_clocktick, if present in MODULE.  */
+
+static void
+nvptx_set_clocktick (CUmodule module, struct ptx_device *dev)
+{
+  CUdeviceptr dptr;
+  CUresult r = cuModuleGetGlobal (&dptr, NULL, module, "__nvptx_clocktick");
+  if (r == CUDA_ERROR_NOT_FOUND)
+    return;
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+  double __nvptx_clocktick = 1e-3 / dev->clock_khz;
+  r = cuMemcpyHtoD (dptr, &__nvptx_clocktick, sizeof (__nvptx_clocktick));
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+}
+
 /* Load the (partial) program described by TARGET_DATA to device
    number ORD.  Allocate and return TARGET_TABLE.  */
 
@@ -1590,6 +1614,8 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       targ_tbl->end = targ_tbl->start + bytes;
     }
 
+  nvptx_set_clocktick (module, dev);
+
   return fn_entries + var_entries;
 }
 


* [gomp-nvptx 04/13] nvptx backend: add support for placing variables in shared memory
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 02/13] omp-low: extend SIMD lowering for SIMT execution Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 07/13] libgomp plugin: set __nvptx_clocktick Alexander Monakov
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

This patch makes it possible to use __attribute__((shared)) to place
non-automatic variables in shared memory.

	* config/nvptx/nvptx.c (nvptx_encode_section_info): Handle "shared"
	attribute.
	(nvptx_handle_shared_attribute): New.  Use it...
	(nvptx_attribute_table): ... here (new entry).
---
 gcc/ChangeLog.gomp-nvptx |  7 +++++++
 gcc/config/nvptx/nvptx.c | 33 ++++++++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index f63f840..5c8c28b 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -228,9 +228,12 @@ nvptx_encode_section_info (tree decl, rtx rtl, int first)
       if (TREE_CONSTANT (decl))
 	area = DATA_AREA_CONST;
       else if (TREE_CODE (decl) == VAR_DECL)
-	/* TODO: This would be a good place to check for a .shared or
-	   other section name.  */
-	area = TREE_READONLY (decl) ? DATA_AREA_CONST : DATA_AREA_GLOBAL;
+	{
+	  if (lookup_attribute ("shared", DECL_ATTRIBUTES (decl)))
+	    area = DATA_AREA_SHARED;
+	  else
+	    area = TREE_READONLY (decl) ? DATA_AREA_CONST : DATA_AREA_GLOBAL;
+	}
 
       SET_SYMBOL_DATA_AREA (XEXP (rtl, 0), area);
     }
@@ -4047,12 +4050,36 @@ nvptx_handle_kernel_attribute (tree *node, tree name, tree ARG_UNUSED (args),
   return NULL_TREE;
 }
 
+/* Handle a "shared" attribute; arguments as in
+   struct attribute_spec.handler.  */
+
+static tree
+nvptx_handle_shared_attribute (tree *node, tree name, tree ARG_UNUSED (args),
+			       int ARG_UNUSED (flags), bool *no_add_attrs)
+{
+  tree decl = *node;
+
+  if (TREE_CODE (decl) != VAR_DECL)
+    {
+      error ("%qE attribute only applies to variables", name);
+      *no_add_attrs = true;
+    }
+  else if (current_function_decl && !TREE_STATIC (decl))
+    {
+      error ("%qE attribute only applies to non-stack variables", name);
+      *no_add_attrs = true;
+    }
+
+  return NULL_TREE;
+}
+
 /* Table of valid machine attributes.  */
 static const struct attribute_spec nvptx_attribute_table[] =
 {
   /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
        affects_type_identity } */
   { "kernel", 0, 0, true, false,  false, nvptx_handle_kernel_attribute, false },
+  { "shared", 0, 0, true, false,  false, nvptx_handle_shared_attribute, false },
   { NULL, 0, 0, false, false, false, NULL, false }
 };
 \f


* [gomp-nvptx 09/13] libgomp: use generic fortran.c on nvptx
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (4 preceding siblings ...)
  2016-01-20 17:27 ` [gomp-nvptx 06/13] libgomp: add nvptx time.c Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 03/13] nvptx backend: silence warning Alexander Monakov
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

This patch removes the nvptx fortran.c stub that provides only
_gfortran_abort.  It is possible to link libgfortran on NVPTX with
-foffload=-lgfortran.

	* config/nvptx/fortran.c: Delete.
---
 libgomp/ChangeLog.gomp-nvptx   |  4 ++++
 libgomp/config/nvptx/fortran.c | 40 ----------------------------------------
 2 files changed, 4 insertions(+), 40 deletions(-)
 delete mode 100644 libgomp/config/nvptx/fortran.c

diff --git a/libgomp/config/nvptx/fortran.c b/libgomp/config/nvptx/fortran.c
deleted file mode 100644
index 58ca790..0000000
--- a/libgomp/config/nvptx/fortran.c
+++ /dev/null
@@ -1,40 +0,0 @@
-/* OpenACC Runtime Fortran wrapper routines
-
-   Copyright (C) 2014-2015 Free Software Foundation, Inc.
-
-   Contributed by Mentor Embedded.
-
-   This file is part of the GNU Offloading and Multi Processing Library
-   (libgomp).
-
-   Libgomp is free software; you can redistribute it and/or modify it
-   under the terms of the GNU General Public License as published by
-   the Free Software Foundation; either version 3, or (at your option)
-   any later version.
-
-   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
-   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
-   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
-   more details.
-
-   Under Section 7 of GPL version 3, you are granted additional
-   permissions described in the GCC Runtime Library Exception, version
-   3.1, as published by the Free Software Foundation.
-
-   You should have received a copy of the GNU General Public License and
-   a copy of the GCC Runtime Library Exception along with this program;
-   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-/* Temporary hack; this will be provided by libgfortran.  */
-
-extern void _gfortran_abort (void);
-
-__asm__ ("// BEGIN GLOBAL FUNCTION DECL: _gfortran_abort\n"
-	 ".visible .func _gfortran_abort;\n"
-	 "// BEGIN GLOBAL FUNCTION DEF: _gfortran_abort\n"
-	 ".visible .func _gfortran_abort\n"
-	 "{\n"
-	 "trap;\n"
-	 "ret;\n"
-	 "}\n");


* [gomp-nvptx 01/13] nvptx backend: new patterns for OpenMP SIMD-via-SIMT
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (2 preceding siblings ...)
  2016-01-20 17:27 ` [gomp-nvptx 07/13] libgomp plugin: set __nvptx_clocktick Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 06/13] libgomp: add nvptx time.c Alexander Monakov
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

This patch adds a few insn patterns used for OpenMP SIMD
reduction/lastprivate/ordered lowering for SIMT execution.  OpenMP lowering
produces GOMP_SIMT_... internal functions when lowering SIMD constructs that
can be offloaded to a SIMT device.  After lto stream-in, those internal
functions are trivially folded when compiling for non-SIMT execution;
otherwise they are kept, and expanded into these insns.

	* config/nvptx/nvptx-protos.h (nvptx_shuffle_kind): Move enum
	declaration from nvptx.c.
	(nvptx_gen_shuffle): Declare.
	* config/nvptx/nvptx.c (nvptx_shuffle_kind): Moved to nvptx-protos.h.
	(nvptx_gen_shuffle): No longer static.
	* config/nvptx/nvptx.md (UNSPEC_VOTE_BALLOT): New unspec.
	(UNSPEC_LANEID): Ditto.
	(UNSPECV_NOUNROLL): Ditto.
	(nvptx_vote_ballot): New pattern.
	(omp_simt_lane): Ditto.
	(nvptx_nounroll): Ditto.
	(omp_simt_last_lane): Ditto.
	(omp_simt_ordered): Ditto.
	(omp_simt_vote_any): Ditto.
	(omp_simt_xchg_bfly): Ditto.
	(omp_simt_xchg_idx): Ditto.
	* target-insns.def (omp_simt_lane): New.
	(omp_simt_last_lane): New.
	(omp_simt_ordered): New.
	(omp_simt_vote_any): New.
	(omp_simt_xchg_bfly): New.
	(omp_simt_xchg_idx): New.
---
 gcc/ChangeLog.gomp-nvptx        | 25 +++++++++++++
 gcc/config/nvptx/nvptx-protos.h | 11 ++++++
 gcc/config/nvptx/nvptx.c        | 12 +-----
 gcc/config/nvptx/nvptx.md       | 81 +++++++++++++++++++++++++++++++++++++++++
 gcc/target-insns.def            |  6 +++
 5 files changed, 124 insertions(+), 11 deletions(-)

diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index 7e0c296..e38c6ad 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -21,6 +21,16 @@
 #ifndef GCC_NVPTX_PROTOS_H
 #define GCC_NVPTX_PROTOS_H
 
+/* The kind of shuffe instruction.  */
+enum nvptx_shuffle_kind
+{
+  SHUFFLE_UP,
+  SHUFFLE_DOWN,
+  SHUFFLE_BFLY,
+  SHUFFLE_IDX,
+  SHUFFLE_MAX
+};
+
 extern void nvptx_declare_function_name (FILE *, const char *, const_tree decl);
 extern void nvptx_declare_object_name (FILE *file, const char *name,
 				       const_tree decl);
@@ -36,6 +46,7 @@ extern void nvptx_register_pragmas (void);
 extern void nvptx_expand_oacc_fork (unsigned);
 extern void nvptx_expand_oacc_join (unsigned);
 extern void nvptx_expand_call (rtx, rtx);
+extern rtx nvptx_gen_shuffle (rtx, rtx, rtx, nvptx_shuffle_kind);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
 extern const char *nvptx_output_mov_insn (rtx, rtx);
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index d557646..45aebdd 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -70,16 +70,6 @@
 /* This file should be included last.  */
 #include "target-def.h"
 
-/* The kind of shuffe instruction.  */
-enum nvptx_shuffle_kind
-{
-  SHUFFLE_UP,
-  SHUFFLE_DOWN,
-  SHUFFLE_BFLY,
-  SHUFFLE_IDX,
-  SHUFFLE_MAX
-};
-
 /* The various PTX memory areas an object might reside in.  */
 enum nvptx_data_area
 {
@@ -1400,7 +1390,7 @@ nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
 /* Generate an instruction or sequence to broadcast register REG
    across the vectors of a single warp.  */
 
-static rtx
+rtx
 nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, nvptx_shuffle_kind kind)
 {
   rtx res;
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 130c809..1522aa3 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -43,6 +43,10 @@ (define_c_enum "unspec" [
 
    UNSPEC_BIT_CONV
 
+   UNSPEC_VOTE_BALLOT
+
+   UNSPEC_LANEID
+
    UNSPEC_SHUFFLE
    UNSPEC_BR_UNIFIED
 ])
@@ -58,6 +62,8 @@ (define_c_enum "unspecv" [
    UNSPECV_FORKED
    UNSPECV_JOINING
    UNSPECV_JOIN
+
+   UNSPECV_NOUNROLL
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -1239,6 +1245,81 @@ (define_insn "nvptx_shuffle<mode>"
   ""
   "%.\\tshfl%S3.b32\\t%0, %1, %2, 31;")
 
+(define_insn "nvptx_vote_ballot"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
+	(unspec:SI [(match_operand:BI 1 "nvptx_register_operand" "R")]
+		   UNSPEC_VOTE_BALLOT))]
+  ""
+  "%.\\tvote.ballot.b32\\t%0, %1;")
+
+(define_insn "omp_simt_lane"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+	(unspec:SI [(const_int 0)] UNSPEC_LANEID))]
+  ""
+  "%.\\tmov.u32\\t%0, %%laneid;")
+
+(define_insn "nvptx_nounroll"
+  [(unspec_volatile [(const_int 0)] UNSPECV_NOUNROLL)]
+  ""
+  "\\t.pragma \\\"nounroll\\\";"
+  [(set_attr "predicable" "false")])
+
+(define_expand "omp_simt_last_lane"
+  [(match_operand:SI 0 "nvptx_register_operand" "=R")
+   (match_operand:SI 1 "nvptx_register_operand" "R")]
+  ""
+{
+  rtx pred = gen_reg_rtx (BImode);
+  rtx tmp = gen_reg_rtx (SImode);
+  emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx));
+  emit_insn (gen_nvptx_vote_ballot (tmp, pred));
+  emit_insn (gen_ctzsi2 (operands[0], tmp));
+  DONE;
+})
+
+(define_expand "omp_simt_ordered"
+  [(match_operand:SI 0 "nvptx_register_operand" "=R")
+   (match_operand:SI 1 "nvptx_register_operand" "R")]
+  ""
+{
+  emit_move_insn (operands[0], operands[1]);
+  emit_insn (gen_nvptx_nounroll ());
+  DONE;
+})
+
+(define_expand "omp_simt_vote_any"
+  [(match_operand:SI 0 "nvptx_register_operand" "=R")
+   (match_operand:SI 1 "nvptx_register_operand" "R")]
+  ""
+{
+  rtx pred = gen_reg_rtx (BImode);
+  emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx));
+  emit_insn (gen_nvptx_vote_ballot (operands[0], pred));
+  DONE;
+})
+
+(define_expand "omp_simt_xchg_bfly"
+  [(match_operand 0 "nvptx_register_operand" "=R")
+   (match_operand 1 "nvptx_register_operand" "R")
+   (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")]
+  ""
+{
+  emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2],
+				SHUFFLE_BFLY));
+  DONE;
+})
+
+(define_expand "omp_simt_xchg_idx"
+  [(match_operand 0 "nvptx_register_operand" "=R")
+   (match_operand 1 "nvptx_register_operand" "R")
+   (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")]
+  ""
+{
+  emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2],
+				SHUFFLE_IDX));
+  DONE;
+})
+
 ;; extract parts of a 64 bit object into 2 32-bit ints
 (define_insn "unpack<mode>si2"
   [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index 0353bb5..9f998f0 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -68,6 +68,12 @@ DEF_TARGET_INSN (oacc_dim_pos, (rtx x0, rtx x1))
 DEF_TARGET_INSN (oacc_dim_size, (rtx x0, rtx x1))
 DEF_TARGET_INSN (oacc_fork, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (oacc_join, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (omp_simt_lane, (rtx x0))
+DEF_TARGET_INSN (omp_simt_last_lane, (rtx x0, rtx x1))
+DEF_TARGET_INSN (omp_simt_ordered, (rtx x0, rtx x1))
+DEF_TARGET_INSN (omp_simt_vote_any, (rtx x0, rtx x1))
+DEF_TARGET_INSN (omp_simt_xchg_bfly, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (omp_simt_xchg_idx, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (prefetch, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (probe_stack, (rtx x0))
 DEF_TARGET_INSN (probe_stack_address, (rtx x0))


* [gomp-nvptx 02/13] omp-low: extend SIMD lowering for SIMT execution
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 04/13] nvptx backend: add support for placing variables in shared memory Alexander Monakov
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

This patch extends SIMD-via-SIMT lowering in omp-low.c to handle all loops,
lowering reduction/lastprivate/ordered appropriately (but it still chickens
out on collapsed loops, handling them as if safelen=1).  New SIMT lowering
snippets use new internal functions that are folded for non-SIMT targets in
omp_device_lower, allowing subsequent optimizations to clean up.

	* internal-fn.c (expand_GOMP_SIMT_LANE): Update.
	(expand_GOMP_SIMT_LAST_LANE): New.
	(expand_GOMP_SIMT_ORDERED_PRED): New.
	(expand_GOMP_SIMT_VOTE_ANY): New.
	(expand_GOMP_SIMT_XCHG_BFLY): New.
	(expand_GOMP_SIMT_XCHG_IDX): New.
	* internal-fn.def (GOMP_SIMT_LAST_LANE): New.
	(GOMP_SIMT_ORDERED_PRED): New.
	(GOMP_SIMT_VOTE_ANY): New.
	(GOMP_SIMT_XCHG_BFLY): New.
	(GOMP_SIMT_XCHG_IDX): New.
	* omp-low.c (omp_maybe_offloaded_ctx): New, outlined from...
	(create_omp_child_function): ...here.  Simplify.
	(omp_max_simt_vf): New.  Use it...
	(omp_max_vf): ...here.
	(lower_rec_input_clauses): Add reduction lowering for SIMT execution.
	(lower_lastprivate_clauses): Likewise, for lastprivate lowering.
	(lower_omp_ordered): Likewise, for "ordered" lowering.
	(expand_omp_simd): Update SIMT transforms.
	(execute_omp_device_lower): Update.  Fold SIMD ifns on SIMT targets.
---
 gcc/ChangeLog.gomp-nvptx |  23 +++
 gcc/internal-fn.c        | 110 ++++++++++++-
 gcc/internal-fn.def      |   5 +
 gcc/omp-low.c            | 399 ++++++++++++++++++++++++++++++++++-------------
 4 files changed, 427 insertions(+), 110 deletions(-)

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index f730548..6eba12f 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -161,11 +161,12 @@ static void
 expand_GOMP_SIMT_LANE (internal_fn, gcall *stmt)
 {
   tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
 
   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
-  /* FIXME: use a separate pattern for OpenMP?  */
-  gcc_assert (targetm.have_oacc_dim_pos ());
-  emit_insn (targetm.gen_oacc_dim_pos (target, const2_rtx));
+  gcc_assert (targetm.have_omp_simt_lane ());
+  emit_insn (targetm.gen_omp_simt_lane (target));
 }
 
 /* This should get expanded in omp_device_lower pass.  */
@@ -176,6 +177,109 @@ expand_GOMP_SIMT_VF (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+/* Lane index of the first SIMT lane that supplies a non-zero argument.
+   This is a SIMT counterpart to GOMP_SIMD_LAST_LANE, used to represent the
+   lane that executed the last iteration for handling OpenMP lastprivate.  */
+
+static void
+expand_GOMP_SIMT_LAST_LANE (internal_fn, gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rtx cond = expand_normal (gimple_call_arg (stmt, 0));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+  struct expand_operand ops[2];
+  create_output_operand (&ops[0], target, mode);
+  create_input_operand (&ops[1], cond, mode);
+  gcc_assert (targetm.have_omp_simt_last_lane ());
+  expand_insn (targetm.code_for_omp_simt_last_lane, 2, ops);
+}
+
+/* Non-transparent predicate used in SIMT lowering of OpenMP "ordered".  */
+
+static void
+expand_GOMP_SIMT_ORDERED_PRED (internal_fn, gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rtx ctr = expand_normal (gimple_call_arg (stmt, 0));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+  struct expand_operand ops[2];
+  create_output_operand (&ops[0], target, mode);
+  create_input_operand (&ops[1], ctr, mode);
+  gcc_assert (targetm.have_omp_simt_ordered ());
+  expand_insn (targetm.code_for_omp_simt_ordered, 2, ops);
+}
+
+/* "Or" boolean reduction across SIMT lanes: return non-zero in all lanes if
+   any lane supplies a non-zero argument.  */
+
+static void
+expand_GOMP_SIMT_VOTE_ANY (internal_fn, gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rtx cond = expand_normal (gimple_call_arg (stmt, 0));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+  struct expand_operand ops[2];
+  create_output_operand (&ops[0], target, mode);
+  create_input_operand (&ops[1], cond, mode);
+  gcc_assert (targetm.have_omp_simt_vote_any ());
+  expand_insn (targetm.code_for_omp_simt_vote_any, 2, ops);
+}
+
+/* Exchange between SIMT lanes with a "butterfly" pattern: source lane index
+   is destination lane index XOR given offset.  */
+
+static void
+expand_GOMP_SIMT_XCHG_BFLY (internal_fn, gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rtx src = expand_normal (gimple_call_arg (stmt, 0));
+  rtx idx = expand_normal (gimple_call_arg (stmt, 1));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+  struct expand_operand ops[3];
+  create_output_operand (&ops[0], target, mode);
+  create_input_operand (&ops[1], src, mode);
+  create_input_operand (&ops[2], idx, SImode);
+  gcc_assert (targetm.have_omp_simt_xchg_bfly ());
+  expand_insn (targetm.code_for_omp_simt_xchg_bfly, 3, ops);
+}
+
+/* Exchange between SIMT lanes according to given source lane index.  */
+
+static void
+expand_GOMP_SIMT_XCHG_IDX (internal_fn, gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rtx src = expand_normal (gimple_call_arg (stmt, 0));
+  rtx idx = expand_normal (gimple_call_arg (stmt, 1));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+  struct expand_operand ops[3];
+  create_output_operand (&ops[0], target, mode);
+  create_input_operand (&ops[1], src, mode);
+  create_input_operand (&ops[2], idx, SImode);
+  gcc_assert (targetm.have_omp_simt_xchg_idx ());
+  expand_insn (targetm.code_for_omp_simt_xchg_idx, 3, ops);
+}
+
 /* This should get expanded in adjust_simduid_builtins.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 998ec80..110bb4e 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -143,6 +143,11 @@ DEF_INTERNAL_INT_FN (POPCOUNT, ECF_CONST, popcount, unary)
 
 DEF_INTERNAL_FN (GOMP_SIMT_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMT_VF, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_LAST_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_ORDERED_PRED, ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_VOTE_ANY, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_XCHG_BFLY, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_XCHG_IDX, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index cfdc3bc..8996b8d 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -2391,6 +2391,20 @@ cilk_for_check_loop_diff_type (tree type)
     }
 }
 
+/* Return true if CTX may belong to offloaded code: either if the current
+   function is offloaded, or if any enclosing context is a target region.  */
+
+static bool
+omp_maybe_offloaded_ctx (omp_context *ctx)
+{
+  if (cgraph_node::get (current_function_decl)->offloadable)
+    return true;
+  for (; ctx; ctx = ctx->outer)
+    if (is_gimple_omp_offloaded (ctx->stmt))
+      return true;
+  return false;
+}
+
 /* Build a decl for the omp child function.  It'll not contain a body
    yet, just the bare decl.  */
 
@@ -2438,20 +2452,11 @@ create_omp_child_function (omp_context *ctx, bool task_copy)
   DECL_EXTERNAL (decl) = 0;
   DECL_CONTEXT (decl) = NULL_TREE;
   DECL_INITIAL (decl) = make_node (BLOCK);
-  if (cgraph_node::get (current_function_decl)->offloadable)
-    cgraph_node::get_create (decl)->offloadable = 1;
-  else
+  if (omp_maybe_offloaded_ctx (ctx))
     {
-      omp_context *octx;
-      for (octx = ctx; octx; octx = octx->outer)
-	if (is_gimple_omp_offloaded (octx->stmt))
-	  {
-	    cgraph_node::get_create (decl)->offloadable = 1;
-	    if (ENABLE_OFFLOADING)
-	      g->have_offload = true;
-
-	    break;
-	  }
+      cgraph_node::get_create (decl)->offloadable = 1;
+      if (ENABLE_OFFLOADING)
+	g->have_offload = true;
     }
 
   if (cgraph_node::get_create (decl)->offloadable
@@ -4205,6 +4210,23 @@ omp_clause_aligned_alignment (tree clause)
   return build_int_cst (integer_type_node, al);
 }
 
+
+/* Return maximum SIMT width if offloading may target SIMT hardware.  */
+
+static int
+omp_max_simt_vf (void)
+{
+  if (!optimize)
+    return 0;
+  if (ENABLE_OFFLOADING)
+    for (const char *c = getenv ("OFFLOAD_TARGET_NAMES"); c; )
+      if (!strncmp (c, "nvptx", strlen ("nvptx")))
+	return 32;
+      else if ((c = strchr (c, ',')))
+	c++;
+  return 0;
+}
+
 /* Return maximum possible vectorization factor for the target.  */
 
 static int
@@ -4218,16 +4240,18 @@ omp_max_vf (void)
               || global_options_set.x_flag_tree_vectorize)))
     return 1;
 
+  int vf = 1;
   int vs = targetm.vectorize.autovectorize_vector_sizes ();
   if (vs)
+    vf = 1 << floor_log2 (vs);
+  else
     {
-      vs = 1 << floor_log2 (vs);
-      return vs;
+      machine_mode vqimode = targetm.vectorize.preferred_simd_mode (QImode);
+      if (GET_MODE_CLASS (vqimode) == MODE_VECTOR_INT)
+	vf = GET_MODE_NUNITS (vqimode);
     }
-  machine_mode vqimode = targetm.vectorize.preferred_simd_mode (QImode);
-  if (GET_MODE_CLASS (vqimode) == MODE_VECTOR_INT)
-    return GET_MODE_NUNITS (vqimode);
-  return 1;
+  int svf = omp_max_simt_vf ();
+  return MAX (vf, svf);
 }
 
 /* Helper function of lower_rec_input_clauses, used for #pragma omp simd
@@ -4313,10 +4337,13 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
   int pass;
   bool is_simd = (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
 		  && gimple_omp_for_kind (ctx->stmt) & GF_OMP_FOR_SIMD);
+  bool maybe_simt
+    = is_simd && omp_maybe_offloaded_ctx (ctx) && omp_max_simt_vf () > 1;
   int max_vf = 0;
   tree lane = NULL_TREE, idx = NULL_TREE;
+  tree simt_lane = NULL_TREE;
   tree ivar = NULL_TREE, lvar = NULL_TREE;
-  gimple_seq llist[2] = { NULL, NULL };
+  gimple_seq llist[3] = { };
 
   copyin_seq = NULL;
 
@@ -5188,6 +5215,16 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
 
 		      gimplify_assign (unshare_expr (ivar), x, &llist[0]);
 
+		      if (maybe_simt)
+			{
+			  if (!simt_lane)
+			    simt_lane = create_tmp_var (unsigned_type_node);
+			  x = build_call_expr_internal_loc
+			    (UNKNOWN_LOCATION, IFN_GOMP_SIMT_XCHG_BFLY,
+			     TREE_TYPE (ivar), 2, ivar, simt_lane);
+			  x = build2 (code, TREE_TYPE (ivar), ivar, x);
+			  gimplify_assign (ivar, x, &llist[2]);
+			}
 		      x = build2 (code, TREE_TYPE (ref), ref, ivar);
 		      ref = build_outer_var_ref (var, ctx);
 		      gimplify_assign (ref, x, &llist[1]);
@@ -5240,6 +5277,39 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
       g = gimple_build_assign (lane, INTEGER_CST,
 			       build_int_cst (unsigned_type_node, 0));
       gimple_seq_add_stmt (ilist, g);
+      /* Emit reductions across SIMT lanes in log_2(simt_vf) steps.  */
+      if (llist[2])
+	{
+	  tree simt_vf = create_tmp_var (unsigned_type_node);
+	  g = gimple_build_call_internal (IFN_GOMP_SIMT_VF, 0);
+	  gimple_call_set_lhs (g, simt_vf);
+	  gimple_seq_add_stmt (dlist, g);
+
+	  tree t = build_int_cst (unsigned_type_node, 1);
+	  g = gimple_build_assign (simt_lane, INTEGER_CST, t);
+	  gimple_seq_add_stmt (dlist, g);
+
+	  t = build_int_cst (unsigned_type_node, 0);
+	  g = gimple_build_assign (idx, INTEGER_CST, t);
+	  gimple_seq_add_stmt (dlist, g);
+
+	  tree body = create_artificial_label (UNKNOWN_LOCATION);
+	  tree header = create_artificial_label (UNKNOWN_LOCATION);
+	  tree end = create_artificial_label (UNKNOWN_LOCATION);
+	  gimple_seq_add_stmt (dlist, gimple_build_goto (header));
+	  gimple_seq_add_stmt (dlist, gimple_build_label (body));
+
+	  gimple_seq_add_seq (dlist, llist[2]);
+
+	  g = gimple_build_assign (simt_lane, LSHIFT_EXPR, simt_lane, integer_one_node);
+	  gimple_seq_add_stmt (dlist, g);
+
+	  gimple_seq_add_stmt (dlist, gimple_build_label (header));
+	  g = gimple_build_cond (LT_EXPR, simt_lane, simt_vf, body, end);
+	  gimple_seq_add_stmt (dlist, g);
+
+	  gimple_seq_add_stmt (dlist, gimple_build_label (end));
+	}
       for (int i = 0; i < 2; i++)
 	if (llist[i])
 	  {
@@ -5326,7 +5396,7 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *stmt_list,
 {
   tree x, c, label = NULL, orig_clauses = clauses;
   bool par_clauses = false;
-  tree simduid = NULL, lastlane = NULL;
+  tree simduid = NULL, lastlane = NULL, simtcond = NULL, simtlast = NULL;
 
   /* Early exit if there are no lastprivate or linear clauses.  */
   for (; clauses ; clauses = OMP_CLAUSE_CHAIN (clauses))
@@ -5353,6 +5423,16 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *stmt_list,
       par_clauses = true;
     }
 
+  bool maybe_simt = false;
+  if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
+      && gimple_omp_for_kind (ctx->stmt) & GF_OMP_FOR_SIMD)
+    {
+      maybe_simt = omp_maybe_offloaded_ctx (ctx) && omp_max_simt_vf () > 1;
+      simduid = find_omp_clause (orig_clauses, OMP_CLAUSE__SIMDUID_);
+      if (simduid)
+	simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
+    }
+
   if (predicate)
     {
       gcond *stmt;
@@ -5364,20 +5444,27 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *stmt_list,
       arm2 = TREE_OPERAND (predicate, 1);
       gimplify_expr (&arm1, stmt_list, NULL, is_gimple_val, fb_rvalue);
       gimplify_expr (&arm2, stmt_list, NULL, is_gimple_val, fb_rvalue);
-      stmt = gimple_build_cond (TREE_CODE (predicate), arm1, arm2,
-				label_true, label);
+      if (maybe_simt)
+	{
+	  c = build2 (TREE_CODE (predicate), boolean_type_node, arm1, arm2);
+	  c = fold_convert (integer_type_node, c);
+	  simtcond = create_tmp_var (integer_type_node);
+	  gimplify_assign (simtcond, c, stmt_list);
+	  gcall *g = gimple_build_call_internal (IFN_GOMP_SIMT_VOTE_ANY,
+						 1, simtcond);
+	  c = create_tmp_var (integer_type_node);
+	  gimple_call_set_lhs (g, c);
+	  gimple_seq_add_stmt (stmt_list, g);
+	  stmt = gimple_build_cond (NE_EXPR, c, integer_zero_node,
+				    label_true, label);
+	}
+      else
+	stmt = gimple_build_cond (TREE_CODE (predicate), arm1, arm2,
+				  label_true, label);
       gimple_seq_add_stmt (stmt_list, stmt);
       gimple_seq_add_stmt (stmt_list, gimple_build_label (label_true));
     }
 
-  if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
-      && gimple_omp_for_kind (ctx->stmt) & GF_OMP_FOR_SIMD)
-    {
-      simduid = find_omp_clause (orig_clauses, OMP_CLAUSE__SIMDUID_);
-      if (simduid)
-	simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
-    }
-
   for (c = clauses; c ;)
     {
       tree var, new_var;
@@ -5420,6 +5507,23 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *stmt_list,
 		  new_var = build4 (ARRAY_REF, TREE_TYPE (val),
 				    TREE_OPERAND (val, 0), lastlane,
 				    NULL_TREE, NULL_TREE);
+		  if (maybe_simt)
+		    {
+		      gcall *g;
+		      if (simtlast == NULL)
+			{
+			  simtlast = create_tmp_var (unsigned_type_node);
+			  g = gimple_build_call_internal
+			    (IFN_GOMP_SIMT_LAST_LANE, 1, simtcond);
+			  gimple_call_set_lhs (g, simtlast);
+			  gimple_seq_add_stmt (stmt_list, g);
+			}
+		      x = build_call_expr_internal_loc
+			(UNKNOWN_LOCATION, IFN_GOMP_SIMT_XCHG_IDX,
+			 TREE_TYPE (new_var), 2, new_var, simtlast);
+		      gimplify_assign (new_var, x, stmt_list);
+		      new_var = unshare_expr (new_var);
+		    }
 		}
 	    }
 
@@ -10332,12 +10436,23 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
   edge e, ne;
   tree *counts = NULL;
   int i;
+  int safelen_int = INT_MAX;
   tree safelen = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 				  OMP_CLAUSE_SAFELEN);
   tree simduid = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 				  OMP_CLAUSE__SIMDUID_);
-  tree n1, n2, step, simt_lane;
+  tree n1, n2;
 
+  if (safelen)
+    {
+      safelen = OMP_CLAUSE_SAFELEN_EXPR (safelen);
+      if (TREE_CODE (safelen) != INTEGER_CST)
+	safelen_int = 0;
+      else if (tree_fits_uhwi_p (safelen) && tree_to_uhwi (safelen) < INT_MAX)
+	safelen_int = tree_to_uhwi (safelen);
+      if (safelen_int == 1)
+	safelen_int = 0;
+    }
   type = TREE_TYPE (fd->loop.v);
   entry_bb = region->entry;
   cont_bb = region->cont;
@@ -10381,61 +10496,63 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
 
   n1 = fd->loop.n1;
   n2 = fd->loop.n2;
-  step = fd->loop.step;
-  bool offloaded = cgraph_node::get (current_function_decl)->offloadable;
-  for (struct omp_region *reg = region; !offloaded && reg; reg = reg->outer)
-    offloaded = reg->type == GIMPLE_OMP_TARGET;
-  bool do_simt_transform
-    = offloaded && !broken_loop && !safelen && !simduid && !(fd->collapse > 1);
-  if (do_simt_transform)
-    {
-      simt_lane
-	= build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_LANE,
-					integer_type_node, 0);
-      simt_lane = fold_convert (TREE_TYPE (step), simt_lane);
-      simt_lane = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_lane);
-      cfun->curr_properties &= ~PROP_gimple_lomp_dev;
-    }
-
   if (gimple_omp_for_combined_into_p (fd->for_stmt))
     {
       tree innerc = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 				     OMP_CLAUSE__LOOPTEMP_);
       gcc_assert (innerc);
       n1 = OMP_CLAUSE_DECL (innerc);
-      if (do_simt_transform)
-	{
-	  n1 = fold_convert (type, n1);
-	  if (POINTER_TYPE_P (type))
-	    n1 = fold_build_pointer_plus (n1, simt_lane);
-	  else
-	    n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, simt_lane));
-	}
       innerc = find_omp_clause (OMP_CLAUSE_CHAIN (innerc),
 				OMP_CLAUSE__LOOPTEMP_);
       gcc_assert (innerc);
       n2 = OMP_CLAUSE_DECL (innerc);
-      expand_omp_build_assign (&gsi, fd->loop.v,
-			       fold_convert (type, n1));
+    }
+  tree step = fd->loop.step;
+
+  bool offloaded = cgraph_node::get (current_function_decl)->offloadable;
+  for (struct omp_region *rgn = region; !offloaded && rgn; rgn = rgn->outer)
+    offloaded = rgn->type == GIMPLE_OMP_TARGET;
+  bool is_simt = offloaded && omp_max_simt_vf () > 1 && safelen_int > 1;
+  tree simt_lane = NULL_TREE, simt_maxlane = NULL_TREE;
+  if (is_simt)
+    {
+      cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+      simt_lane = create_tmp_var (unsigned_type_node);
+      gimple *g = gimple_build_call_internal (IFN_GOMP_SIMT_LANE, 0);
+      gimple_call_set_lhs (g, simt_lane);
+      gsi_insert_before (&gsi, g, GSI_SAME_STMT);
+      tree offset = fold_build2 (MULT_EXPR, TREE_TYPE (step), step,
+				 fold_convert (TREE_TYPE (step), simt_lane));
+      n1 = fold_convert (type, n1);
+      if (POINTER_TYPE_P (type))
+	n1 = fold_build_pointer_plus (n1, offset);
+      else
+	n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, offset));
+
+      /* Collapsed loops not handled for SIMT yet: limit to one lane only.  */
       if (fd->collapse > 1)
+	simt_maxlane = build_one_cst (unsigned_type_node);
+      else if (safelen_int < omp_max_simt_vf ())
+	simt_maxlane = build_int_cst (unsigned_type_node, safelen_int);
+      tree vf
+	= build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_VF,
+					unsigned_type_node, 0);
+      if (simt_maxlane)
+	vf = fold_build2 (MIN_EXPR, unsigned_type_node, vf, simt_maxlane);
+      vf = fold_convert (TREE_TYPE (step), vf);
+      step = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, vf);
+    }
+
+  expand_omp_build_assign (&gsi, fd->loop.v, fold_convert (type, n1));
+  if (fd->collapse > 1)
+    {
+      if (gimple_omp_for_combined_into_p (fd->for_stmt))
 	{
 	  gsi_prev (&gsi);
 	  expand_omp_for_init_vars (fd, &gsi, counts, NULL, n1);
 	  gsi_next (&gsi);
 	}
-    }
-  else
-    {
-      if (do_simt_transform)
-	{
-	  n1 = fold_convert (type, n1);
-	  if (POINTER_TYPE_P (type))
-	    n1 = fold_build_pointer_plus (n1, simt_lane);
-	  else
-	    n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, simt_lane));
-	}
-      expand_omp_build_assign (&gsi, fd->loop.v, fold_convert (type, n1));
-      if (fd->collapse > 1)
+      else
 	for (i = 0; i < fd->collapse; i++)
 	  {
 	    tree itype = TREE_TYPE (fd->loops[i].v);
@@ -10444,7 +10561,7 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
 	    t = fold_convert (TREE_TYPE (fd->loops[i].v), fd->loops[i].n1);
 	    expand_omp_build_assign (&gsi, fd->loops[i].v, t);
 	  }
-      }
+    }
 
   /* Remove the GIMPLE_OMP_FOR statement.  */
   gsi_remove (&gsi, true);
@@ -10456,14 +10573,6 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
       stmt = gsi_stmt (gsi);
       gcc_assert (gimple_code (stmt) == GIMPLE_OMP_CONTINUE);
 
-      if (do_simt_transform)
-	{
-	  tree simt_vf
-	    = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_VF,
-					    integer_type_node, 0);
-	  simt_vf = fold_convert (TREE_TYPE (step), simt_vf);
-	  step = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_vf);
-	}
       if (POINTER_TYPE_P (type))
 	t = fold_build_pointer_plus (fd->loop.v, step);
       else
@@ -10541,6 +10650,18 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
       gimple_regimplify_operands (cond_stmt, &gsi);
     }
 
+  /* Add 'V -= STEP * (SIMT_VF - 1)' after the loop.  */
+  if (is_simt)
+    {
+      gsi = gsi_start_bb (l2_bb);
+      step = fold_build2 (MINUS_EXPR, TREE_TYPE (step), fd->loop.step, step);
+      if (POINTER_TYPE_P (type))
+	t = fold_build_pointer_plus (fd->loop.v, step);
+      else
+	t = fold_build2 (PLUS_EXPR, type, fd->loop.v, step);
+      expand_omp_build_assign (&gsi, fd->loop.v, t);
+    }
+
   /* Remove GIMPLE_OMP_RETURN.  */
   gsi = gsi_last_bb (exit_bb);
   gsi_remove (&gsi, true);
@@ -10570,30 +10691,29 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
   ne->probability = REG_BR_PROB_BASE / 8;
 
   set_immediate_dominator (CDI_DOMINATORS, l1_bb, entry_bb);
-  set_immediate_dominator (CDI_DOMINATORS, l2_bb, l2_dom_bb);
   set_immediate_dominator (CDI_DOMINATORS, l0_bb, l1_bb);
 
+  if (simt_maxlane)
+    {
+      cond_stmt = gimple_build_cond (LT_EXPR, simt_lane, simt_maxlane,
+				     NULL_TREE, NULL_TREE);
+      gsi = gsi_last_bb (entry_bb);
+      gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
+      make_edge (entry_bb, l2_bb, EDGE_FALSE_VALUE);
+      FALLTHRU_EDGE (entry_bb)->flags = EDGE_TRUE_VALUE;
+      FALLTHRU_EDGE (entry_bb)->probability = REG_BR_PROB_BASE * 7 / 8;
+      BRANCH_EDGE (entry_bb)->probability = REG_BR_PROB_BASE / 8;
+      l2_dom_bb = entry_bb;
+    }
+  set_immediate_dominator (CDI_DOMINATORS, l2_bb, l2_dom_bb);
+
   if (!broken_loop)
     {
       struct loop *loop = alloc_loop ();
       loop->header = l1_bb;
       loop->latch = cont_bb;
       add_loop (loop, l1_bb->loop_father);
-      if (safelen == NULL_TREE)
-	loop->safelen = INT_MAX;
-      else
-	{
-	  safelen = OMP_CLAUSE_SAFELEN_EXPR (safelen);
-	  if (TREE_CODE (safelen) != INTEGER_CST)
-	    loop->safelen = 0;
-	  else if (!tree_fits_uhwi_p (safelen)
-		   || tree_to_uhwi (safelen) > INT_MAX)
-	    loop->safelen = INT_MAX;
-	  else
-	    loop->safelen = tree_to_uhwi (safelen);
-	  if (loop->safelen == 1)
-	    loop->safelen = 0;
-	}
+      loop->safelen = safelen_int;
       if (simduid)
 	{
 	  loop->simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
@@ -14139,12 +14259,14 @@ static void
 lower_omp_ordered (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 {
   tree block;
-  gimple *stmt = gsi_stmt (*gsi_p);
+  gimple *stmt = gsi_stmt (*gsi_p), *g;
   gomp_ordered *ord_stmt = as_a <gomp_ordered *> (stmt);
   gcall *x;
   gbind *bind;
   bool simd = find_omp_clause (gimple_omp_ordered_clauses (ord_stmt),
 			       OMP_CLAUSE_SIMD);
+  bool maybe_simt
+    = simd && omp_maybe_offloaded_ctx (ctx) && omp_max_simt_vf () > 1;
   bool threads = find_omp_clause (gimple_omp_ordered_clauses (ord_stmt),
 				  OMP_CLAUSE_THREADS);
 
@@ -14178,11 +14300,56 @@ lower_omp_ordered (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 			   0);
   gimple_bind_add_stmt (bind, x);
 
+  tree counter = NULL_TREE, test = NULL_TREE, body = NULL_TREE;
+  if (maybe_simt)
+    {
+      counter = create_tmp_var (integer_type_node);
+      g = gimple_build_call_internal (IFN_GOMP_SIMT_LANE, 0);
+      gimple_call_set_lhs (g, counter);
+      gimple_bind_add_stmt (bind, g);
+
+      body = create_artificial_label (UNKNOWN_LOCATION);
+      test = create_artificial_label (UNKNOWN_LOCATION);
+      gimple_bind_add_stmt (bind, gimple_build_label (body));
+
+      tree simt_pred = create_tmp_var (integer_type_node);
+      g = gimple_build_call_internal (IFN_GOMP_SIMT_ORDERED_PRED, 1, counter);
+      gimple_call_set_lhs (g, simt_pred);
+      gimple_bind_add_stmt (bind, g);
+
+      tree t = create_artificial_label (UNKNOWN_LOCATION);
+      g = gimple_build_cond (EQ_EXPR, simt_pred, integer_zero_node, t, test);
+      gimple_bind_add_stmt (bind, g);
+
+      gimple_bind_add_stmt (bind, gimple_build_label (t));
+    }
   lower_omp (gimple_omp_body_ptr (stmt), ctx);
   gimple_omp_set_body (stmt, maybe_catch_exception (gimple_omp_body (stmt)));
   gimple_bind_add_seq (bind, gimple_omp_body (stmt));
   gimple_omp_set_body (stmt, NULL);
 
+  if (maybe_simt)
+    {
+      gimple_bind_add_stmt (bind, gimple_build_label (test));
+      g = gimple_build_assign (counter, MINUS_EXPR, counter, integer_one_node);
+      gimple_bind_add_stmt (bind, g);
+
+      tree c = build2 (GE_EXPR, boolean_type_node, counter, integer_zero_node);
+      tree nonneg = create_tmp_var (integer_type_node);
+      gimple_seq tseq = NULL;
+      gimplify_assign (nonneg, fold_convert (integer_type_node, c), &tseq);
+      gimple_bind_add_seq (bind, tseq);
+
+      g = gimple_build_call_internal (IFN_GOMP_SIMT_VOTE_ANY, 1, nonneg);
+      gimple_call_set_lhs (g, nonneg);
+      gimple_bind_add_stmt (bind, g);
+
+      tree end = create_artificial_label (UNKNOWN_LOCATION);
+      g = gimple_build_cond (NE_EXPR, nonneg, integer_zero_node, body, end);
+      gimple_bind_add_stmt (bind, g);
+
+      gimple_bind_add_stmt (bind, gimple_build_label (end));
+    }
   if (simd)
     x = gimple_build_call_internal (IFN_GOMP_SIMD_ORDERED_END, 1,
 				    build_int_cst (NULL_TREE, threads));
@@ -19879,15 +20046,14 @@ make_pass_oacc_device_lower (gcc::context *ctxt)
 
 /* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
    VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
-   LANE is kept to be expanded to RTL later on.  */
+   LANE is kept to be expanded to RTL later on.  Also clean up all other SIMT
+   internal functions on non-SIMT targets, and likewise some SIMD internal
+   functions on SIMT targets.  */
 
 static unsigned int
 execute_omp_device_lower ()
 {
-  int vf = 1;
-  if (targetm.simt.vf)
-    vf = targetm.simt.vf ();
-  tree vf_tree = build_int_cst (integer_type_node, vf);
+  int vf = targetm.simt.vf ? targetm.simt.vf () : 1;
   basic_block bb;
   gimple_stmt_iterator gsi;
   FOR_EACH_BB_FN (bb, cfun)
@@ -19897,20 +20063,39 @@ execute_omp_device_lower ()
 	if (!is_gimple_call (stmt) || !gimple_call_internal_p (stmt))
 	  continue;
 	tree lhs = gimple_call_lhs (stmt), rhs = NULL_TREE;
+	tree type = lhs ? TREE_TYPE (lhs) : integer_type_node;
 	switch (gimple_call_internal_fn (stmt))
 	  {
 	  case IFN_GOMP_SIMT_LANE:
-	    rhs = vf == 1 ? integer_zero_node : NULL_TREE;
+	  case IFN_GOMP_SIMT_LAST_LANE:
+	    rhs = vf == 1 ? build_zero_cst (type) : NULL_TREE;
 	    break;
 	  case IFN_GOMP_SIMT_VF:
-	    rhs = vf_tree;
+	    rhs = build_int_cst (type, vf);
 	    break;
-	  default:
+	  case IFN_GOMP_SIMT_ORDERED_PRED:
+	    rhs = vf == 1 ? integer_zero_node : NULL_TREE;
+	    if (rhs || !lhs)
+	      unlink_stmt_vdef (stmt);
+	    break;
+	  case IFN_GOMP_SIMT_VOTE_ANY:
+	  case IFN_GOMP_SIMT_XCHG_BFLY:
+	  case IFN_GOMP_SIMT_XCHG_IDX:
+	    rhs = vf == 1 ? gimple_call_arg (stmt, 0) : NULL_TREE;
 	    break;
+	  case IFN_GOMP_SIMD_LANE:
+	  case IFN_GOMP_SIMD_LAST_LANE:
+	    rhs = vf != 1 ? build_zero_cst (type) : NULL_TREE;
+	    break;
+	  case IFN_GOMP_SIMD_VF:
+	    rhs = vf != 1 ? build_one_cst (type) : NULL_TREE;
+	    break;
+	  default:
+	    continue;
 	  }
-	if (!rhs)
+	if (lhs && !rhs)
 	  continue;
-	stmt = gimple_build_assign (lhs, rhs);
+	stmt = lhs ? gimple_build_assign (lhs, rhs) : gimple_build_nop ();
 	gsi_replace (&gsi, stmt, false);
       }
   if (vf != 1)

^ permalink raw reply	[flat|nested] 14+ messages in thread
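
The reduction sequence emitted above (the llist[2] statements inside a loop
that doubles simt_lane each iteration) is a classic butterfly all-reduce: in
each of log2(VF) steps a lane combines its value with that of the lane whose
index differs in the current bit.  A minimal scalar simulation in C, using a
hypothetical width of 8 for illustration (real NVPTX warps have 32 lanes);
this is a sketch, not GCC code:

```c
#include <assert.h>

#define VF 8  /* hypothetical SIMT width; NVPTX uses 32 */

/* Simulate IFN_GOMP_SIMT_XCHG_BFLY: lane LANE receives the value held by
   lane LANE ^ MASK.  All lanes' values live in one array here.  */
static int
xchg_bfly (const int vals[VF], int lane, int mask)
{
  return vals[lane ^ mask];
}

/* Butterfly sum-reduction: after log2 (VF) steps with masks 1, 2, 4, ...
   every lane holds the sum of all lanes' initial values.  */
static void
simt_reduce_sum (int vals[VF])
{
  for (int mask = 1; mask < VF; mask <<= 1)
    {
      int next[VF];
      for (int lane = 0; lane < VF; lane++)
	next[lane] = vals[lane] + xchg_bfly (vals, lane, mask);
      for (int lane = 0; lane < VF; lane++)
	vals[lane] = next[lane];
    }
}
```

Because every lane ends up with the full result, the copy-out to the shared
reduction variable can then be done from any single lane.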

* [gomp-nvptx 05/13] libgomp: remove sections.c, splay-tree.c
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (7 preceding siblings ...)
  2016-01-20 17:27 ` [gomp-nvptx 10/13] libgomp testsuite: add -foffload=-lgfortran Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:45 ` [gomp-nvptx 08/13] libgomp: add nvptx lock.c Alexander Monakov
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

This patch removes two zero-size stubs; there is no need for these overrides.

	* config/nvptx/sections.c: Delete.
	* config/nvptx/splay-tree.c: Delete.
---
 libgomp/ChangeLog.gomp-nvptx      | 5 +++++
 libgomp/config/nvptx/sections.c   | 0
 libgomp/config/nvptx/splay-tree.c | 0
 3 files changed, 5 insertions(+)
 delete mode 100644 libgomp/config/nvptx/sections.c
 delete mode 100644 libgomp/config/nvptx/splay-tree.c

diff --git a/libgomp/config/nvptx/sections.c b/libgomp/config/nvptx/sections.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/splay-tree.c b/libgomp/config/nvptx/splay-tree.c
deleted file mode 100644
index e69de29..0000000

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [gomp-nvptx 03/13] nvptx backend: silence warning
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (5 preceding siblings ...)
  2016-01-20 17:27 ` [gomp-nvptx 09/13] libgomp: use generic fortran.c on nvptx Alexander Monakov
@ 2016-01-20 17:27 ` Alexander Monakov
  2016-01-20 17:27 ` [gomp-nvptx 10/13] libgomp testsuite: add -foffload=-lgfortran Alexander Monakov
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:27 UTC (permalink / raw)
  To: gcc-patches

	* config/nvptx/nvptx.c (nvptx_declare_function_name): Fix warning.
---
 gcc/ChangeLog.gomp-nvptx | 4 ++++
 gcc/config/nvptx/nvptx.c | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 45aebdd..f63f840 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -890,7 +890,7 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (sz == 0 && cfun->machine->has_call_with_sc)
     sz = 1;
   bool need_sp = cfun->calls_alloca || cfun->machine->has_call_with_varargs;
-  if (sz > 0 || TARGET_SOFT_STACK && need_sp)
+  if (sz > 0 || (TARGET_SOFT_STACK && need_sp))
     {
       int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
 

^ permalink raw reply	[flat|nested] 14+ messages in thread
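
For context, the warning silenced above is -Wparentheses: `&&` binds tighter
than `||`, so `sz > 0 || TARGET_SOFT_STACK && need_sp` already parses as the
patch's explicit grouping; the change only makes that intent visible.  A tiny
standalone illustration (hypothetical helper, not GCC code):

```c
#include <assert.h>

/* Return 1 when the implicit grouping a || (b && c) differs from the
   other plausible reading (a || b) && c.  */
static int
grouping_matters (int a, int b, int c)
{
  int implicit = a || (b && c);  /* what `a || b && c` means in C */
  int grouped = (a || b) && c;   /* what a reader might assume */
  return implicit != grouped;
}
```

The two readings disagree whenever the first operand is true but the third is
false, which is exactly why the compiler asks for explicit parentheses.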

* [gomp-nvptx 13/13] libgomp plugin: handle multiple teams
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (9 preceding siblings ...)
  2016-01-20 17:45 ` [gomp-nvptx 08/13] libgomp: add nvptx lock.c Alexander Monakov
@ 2016-01-20 17:45 ` Alexander Monakov
  2016-01-20 17:45 ` [gomp-nvptx 12/13] libgomp: handle multiple teams on NVPTX Alexander Monakov
  2016-01-20 17:45 ` [gomp-nvptx 11/13] pick GOMP_target_ext changes from the hsa branch Alexander Monakov
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:45 UTC (permalink / raw)
  To: gcc-patches

This complements the multiple teams support on the libgomp plugin side.

	* plugin/plugin-nvptx.c (struct targ_fn_descriptor): Add new fields.
	(struct ptx_device): Ditto.  Set them...
	(nvptx_open_device): ...here.
	(GOMP_OFFLOAD_load_image): Set new targ_fn_descriptor fields.
	(nvptx_adjust_launch_bounds): New.  Use it...
	(GOMP_OFFLOAD_run): ...here.
---
 libgomp/ChangeLog.gomp-nvptx  |   9 ++++
 libgomp/plugin/plugin-nvptx.c | 106 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 109 insertions(+), 6 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 87e0494..b7bf59b 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -254,6 +254,8 @@ struct targ_fn_descriptor
 {
   CUfunction fn;
   const struct targ_fn_launch *launch;
+  int regs_per_thread;
+  int max_threads_per_block;
 };
 
 /* A loaded PTX image.  */
@@ -290,6 +292,9 @@ struct ptx_device
   bool mkern;
   int  mode;
   int clock_khz;
+  int num_sms;
+  int regs_per_block;
+  int regs_per_sm;
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
@@ -648,6 +653,36 @@ nvptx_open_device (int n)
 
   ptx_dev->clock_khz = pi;
 
+  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
+
+  ptx_dev->num_sms = pi;
+
+  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK,
+			    dev);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
+
+  ptx_dev->regs_per_block = pi;
+
+  /* CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82 is defined only
+     in CUDA 6.0 and newer.  */
+  r = cuDeviceGetAttribute (&pi, 82, dev);
+  /* Fallback: use the limit of registers per block, which is usually equal.  */
+  if (r == CUDA_ERROR_INVALID_VALUE)
+    pi = ptx_dev->regs_per_block;
+  else if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
+
+  ptx_dev->regs_per_sm = pi;
+
+  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev);
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
+  if (pi != 32)
+    GOMP_PLUGIN_fatal ("Only warp size 32 is supported");
+
   r = cuDeviceGetAttribute (&async_engines,
 			    CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
@@ -1589,13 +1624,23 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
     {
       CUfunction function;
+      int nregs, mthrs;
 
       r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
+      r = cuFuncGetAttribute (&nregs, CU_FUNC_ATTRIBUTE_NUM_REGS, function);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuFuncGetAttribute error: %s", cuda_error (r));
+      r = cuFuncGetAttribute (&mthrs, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
+			      function);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuFuncGetAttribute error: %s", cuda_error (r));
 
       targ_fns->fn = function;
       targ_fns->launch = &fn_descs[i];
+      targ_fns->regs_per_thread = nregs;
+      targ_fns->max_threads_per_block = mthrs;
 
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
@@ -1822,19 +1867,67 @@ GOMP_OFFLOAD_openacc_set_cuda_stream (int async, void *stream)
   return nvptx_set_cuda_stream (async, stream);
 }
 
+/* Adjust launch dimensions: pick good values for the number of blocks and
+   warps, and ensure that the number of warps exceeds neither CUDA's limits
+   nor GCC's own limits.  */
+
+static void
+nvptx_adjust_launch_bounds (struct targ_fn_descriptor *fn,
+			    struct ptx_device *ptx_dev,
+			    long *teams_p, long *threads_p)
+{
+  int max_warps_block = fn->max_threads_per_block / 32;
+  /* Maximum 32 warps per block is an implementation limit in the NVPTX
+     backend and libgcc, matching the documented GPU limit as of 2015.  */
+  if (max_warps_block > 32)
+    max_warps_block = 32;
+  if (*threads_p <= 0)
+    *threads_p = 8;
+  if (*threads_p > max_warps_block)
+    *threads_p = max_warps_block;
+
+  int regs_per_block = fn->regs_per_thread * 32 * *threads_p;
+  /* This is an estimate of how many blocks the device can host simultaneously.
+     The actual limit, which may be lower, can be queried via the "occupancy
+     control" driver interface (since CUDA 6.0).  */
+  int max_blocks = ptx_dev->regs_per_sm / regs_per_block * ptx_dev->num_sms;
+  if (*teams_p <= 0 || *teams_p > max_blocks)
+    *teams_p = max_blocks;
+}
+
 void
-GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars)
+GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
   CUfunction function = ((struct targ_fn_descriptor *) tgt_fn)->fn;
   CUresult r;
   struct ptx_device *ptx_dev = ptx_devices[ord];
   const char *maybe_abort_msg = "(perhaps abort was called)";
-  void *args = &tgt_vars;
+  void *fn_args = &tgt_vars;
+  long teams = 0, threads = 0;
+
+  if (!args)
+    GOMP_PLUGIN_fatal ("No target arguments provided");
+  while (*args)
+    {
+      long id = (long) *args++, val;
+      if (id & GOMP_TARGET_ARG_SUBSEQUENT_PARAM)
+	val = (long) *args++;
+      else
+        val = id >> GOMP_TARGET_ARG_VALUE_SHIFT;
+      if ((id & GOMP_TARGET_ARG_DEVICE_MASK) != GOMP_TARGET_ARG_DEVICE_ALL)
+	continue;
+      id &= GOMP_TARGET_ARG_ID_MASK;
+      if (id == GOMP_TARGET_ARG_NUM_TEAMS)
+	teams = val;
+      else if (id == GOMP_TARGET_ARG_THREAD_LIMIT)
+	threads = val;
+    }
+  nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   r = cuLaunchKernel (function,
-		      1, 1, 1,
-		      32, 8, 1,
-		      0, ptx_dev->null_stream->stream, &args, 0);
+		      teams, 1, 1,
+		      32, threads, 1,
+		      0, ptx_dev->null_stream->stream, &fn_args, 0);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
@@ -1847,7 +1940,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars)
 }
 
 void
-GOMP_OFFLOAD_async_run (int ord, void *tgt_fn, void *tgt_vars, void *async_data)
+GOMP_OFFLOAD_async_run (int ord, void *tgt_fn, void *tgt_vars, void **args,
+			void *async_data)
 {
   GOMP_PLUGIN_fatal ("GOMP_OFFLOAD_async_run unimplemented");
 }
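
For readers unfamiliar with the encoding consumed by the argument-parsing loop in GOMP_OFFLOAD_run above: each element of the NULL-terminated args array packs a device number, a value identifier and, for small values, the value itself into one pointer-sized word. The following standalone sketch restates that layout with hypothetical ARG_* macros mirroring the GOMP_TARGET_ARG_* constants from gomp-constants.h; it is an illustration, not libgomp code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restatement of the GOMP_TARGET_ARG_* layout from
   gomp-constants.h: bits 0-6 hold the device, bit 7 the SUBSEQUENT_PARAM
   flag, bits 8-15 the value identifier, and bits 16 and up an optional
   inline 16-bit value.  */
#define ARG_DEVICE_MASK       ((1 << 7) - 1)
#define ARG_DEVICE_ALL        0
#define ARG_SUBSEQUENT_PARAM  (1 << 7)
#define ARG_ID_MASK           (((1 << 8) - 1) << 8)
#define ARG_NUM_TEAMS         (1 << 8)
#define ARG_THREAD_LIMIT      (2 << 8)
#define ARG_VALUE_SHIFT       16

/* Pack a small value inline, as the compiler-generated code does for
   constants that fit in 16 bits.  */
static void *
encode_inline (int device, int id, int value)
{
  return (void *) (((intptr_t) value << ARG_VALUE_SHIFT) | id | device);
}

/* Walk a NULL-terminated args array the same way GOMP_OFFLOAD_run does,
   extracting NUM_TEAMS and THREAD_LIMIT from DEVICE_ALL entries.  */
static void
decode_args (void **args, long *teams, long *threads)
{
  while (*args)
    {
      intptr_t id = (intptr_t) *args++, val;
      if (id & ARG_SUBSEQUENT_PARAM)
	val = (intptr_t) *args++;	/* Value lives in the next element.  */
      else
	val = id >> ARG_VALUE_SHIFT;	/* Value is embedded inline.  */
      if ((id & ARG_DEVICE_MASK) != ARG_DEVICE_ALL)
	continue;
      id &= ARG_ID_MASK;
      if (id == ARG_NUM_TEAMS)
	*teams = val;
      else if (id == ARG_THREAD_LIMIT)
	*threads = val;
    }
}
```

Decoding an array built with encode_inline recovers the two launch parameters without touching any subsequent elements.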


* [gomp-nvptx 11/13] pick GOMP_target_ext changes from the hsa branch
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (11 preceding siblings ...)
  2016-01-20 17:45 ` [gomp-nvptx 12/13] libgomp: handle multiple teams on NVPTX Alexander Monakov
@ 2016-01-20 17:45 ` Alexander Monakov
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:45 UTC (permalink / raw)
  To: gcc-patches

This adds the necessary plumbing to spawn multiple teams.

To be reverted on this branch prior to merge.
---
 gcc/builtin-types.def                            |   7 +-
 gcc/fortran/types.def                            |   5 +-
 gcc/omp-builtins.def                             |   2 +-
 gcc/omp-low.c                                    | 149 ++++++++++++++++---
 include/gomp-constants.h                         |  21 +++
 libgomp/libgomp.h                                |  12 +-
 libgomp/libgomp_g.h                              |   3 +-
 libgomp/oacc-host.c                              |   3 +-
 libgomp/target.c                                 | 179 +++++++++++++++++------
 libgomp/task.c                                   |   3 +-
 liboffloadmic/plugin/libgomp-plugin-intelmic.cpp |   4 +-
 11 files changed, 299 insertions(+), 89 deletions(-)

diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index c68fb19..33bee1d 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -555,10 +555,9 @@ DEF_FUNCTION_TYPE_9 (BT_FN_VOID_OMPFN_PTR_OMPCPYFN_LONG_LONG_BOOL_UINT_PTR_INT,
 		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR,
 		     BT_PTR_FN_VOID_PTR_PTR, BT_LONG, BT_LONG,
 		     BT_BOOL, BT_UINT, BT_PTR, BT_INT)
-
-DEF_FUNCTION_TYPE_10 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_INT_INT,
-		      BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
-		      BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_INT, BT_INT)
+DEF_FUNCTION_TYPE_9 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
+		     BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
+		     BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_PTR)
 
 DEF_FUNCTION_TYPE_11 (BT_FN_VOID_OMPFN_PTR_OMPCPYFN_LONG_LONG_UINT_LONG_INT_LONG_LONG_LONG,
 		      BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR,
diff --git a/gcc/fortran/types.def b/gcc/fortran/types.def
index a37e856..5838f04 100644
--- a/gcc/fortran/types.def
+++ b/gcc/fortran/types.def
@@ -220,10 +220,9 @@ DEF_FUNCTION_TYPE_9 (BT_FN_VOID_OMPFN_PTR_OMPCPYFN_LONG_LONG_BOOL_UINT_PTR_INT,
 		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR,
 		     BT_PTR_FN_VOID_PTR_PTR, BT_LONG, BT_LONG,
 		     BT_BOOL, BT_UINT, BT_PTR, BT_INT)
-
-DEF_FUNCTION_TYPE_10 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_INT_INT,
+DEF_FUNCTION_TYPE_9 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
 		      BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
-		      BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_INT, BT_INT)
+		      BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_PTR)
 
 DEF_FUNCTION_TYPE_11 (BT_FN_VOID_OMPFN_PTR_OMPCPYFN_LONG_LONG_UINT_LONG_INT_LONG_LONG_LONG,
 		      BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR,
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index 35f5014..35c2724 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -341,7 +341,7 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_START, "GOMP_single_copy_start",
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_END, "GOMP_single_copy_end",
 		  BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TARGET, "GOMP_target_ext",
-		  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_INT_INT,
+		  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
 		  ATTR_NOTHROW_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TARGET_DATA, "GOMP_target_data_ext",
 		  BT_FN_VOID_INT_SIZE_PTR_PTR_PTR, ATTR_NOTHROW_LIST)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 8996b8d..2e02c6f 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -12731,6 +12731,130 @@ mark_loops_in_oacc_kernels_region (basic_block region_entry,
     loop->in_oacc_kernels_region = true;
 }
 
+/* Build target argument identifier from the DEVICE identifier, value
+   identifier ID and whether the element also has a SUBSEQUENT_PARAM.  */
+
+static tree
+get_target_argument_identifier_1 (int device, bool subsequent_param, int id)
+{
+  tree t = build_int_cst (integer_type_node, device);
+  if (subsequent_param)
+    t = fold_build2 (BIT_IOR_EXPR, integer_type_node, t,
+		     build_int_cst (integer_type_node,
+				    GOMP_TARGET_ARG_SUBSEQUENT_PARAM));
+  t = fold_build2 (BIT_IOR_EXPR, integer_type_node, t,
+		   build_int_cst (integer_type_node, id));
+  return t;
+}
+
+/* Like the above, but return the identifier in a type that can be directly
+   stored as an element of the argument array.  */
+
+static tree
+get_target_argument_identifier (int device, bool subsequent_param, int id)
+{
+  tree t = get_target_argument_identifier_1 (device, subsequent_param, id);
+  return fold_convert (ptr_type_node, t);
+}
+
+/* Return a target argument consisting of the DEVICE identifier, value
+   identifier ID, and the actual VALUE.  */
+
+static tree
+get_target_argument_value (gimple_stmt_iterator *gsi, int device, int id,
+			   tree value)
+{
+  tree t = fold_build2 (LSHIFT_EXPR, integer_type_node,
+			fold_convert (integer_type_node, value),
+			build_int_cst (unsigned_type_node,
+				       GOMP_TARGET_ARG_VALUE_SHIFT));
+  t = fold_build2 (BIT_IOR_EXPR, integer_type_node, t,
+		   get_target_argument_identifier_1 (device, false, id));
+  t = fold_convert (ptr_type_node, t);
+  return force_gimple_operand_gsi (gsi, t, true, NULL, true, GSI_SAME_STMT);
+}
+
+/* If VALUE is an integer constant greater than -2^15 and smaller than 2^15,
+   push one argument to ARGS with both the DEVICE, ID and VALUE embedded in
+   it, otherwise push an identifier (with DEVICE and ID) and the VALUE in two
+   arguments.  */
+
+static void
+push_target_argument_according_to_value (gimple_stmt_iterator *gsi, int device,
+					 int id, tree value, vec <tree> *args)
+{
+  if (tree_fits_shwi_p (value)
+      && tree_to_shwi (value) > -(1 << 15)
+      && tree_to_shwi (value) < (1 << 15))
+    args->quick_push (get_target_argument_value (gsi, device, id, value));
+  else
+    {
+      args->quick_push (get_target_argument_identifier (device, true, id));
+      value = fold_convert (ptr_type_node, value);
+      value = force_gimple_operand_gsi (gsi, value, true, NULL, true,
+					GSI_SAME_STMT);
+      args->quick_push (value);
+    }
+}
+
+/* Create an array of arguments that is then passed to GOMP_target.   */
+
+static tree
+get_target_arguments (gimple_stmt_iterator *gsi, gomp_target *tgt_stmt)
+{
+  auto_vec <tree, 6> args;
+  tree clauses = gimple_omp_target_clauses (tgt_stmt);
+  tree t, c = find_omp_clause (clauses, OMP_CLAUSE_NUM_TEAMS);
+  if (c)
+    t = OMP_CLAUSE_NUM_TEAMS_EXPR (c);
+  else
+    t = integer_minus_one_node;
+  push_target_argument_according_to_value (gsi, GOMP_TARGET_ARG_DEVICE_ALL,
+					   GOMP_TARGET_ARG_NUM_TEAMS, t, &args);
+
+  c = find_omp_clause (clauses, OMP_CLAUSE_THREAD_LIMIT);
+  if (c)
+    t = OMP_CLAUSE_THREAD_LIMIT_EXPR (c);
+  else
+    t = integer_minus_one_node;
+  push_target_argument_according_to_value (gsi, GOMP_TARGET_ARG_DEVICE_ALL,
+					   GOMP_TARGET_ARG_THREAD_LIMIT, t,
+					   &args);
+
+#if 0
+  /* Add HSA-specific grid sizes, if available.  */
+  if (find_omp_clause (gimple_omp_target_clauses (tgt_stmt),
+		       OMP_CLAUSE__GRIDDIM_))
+    {
+      t = get_target_argument_identifier (GOMP_DEVICE_HSA, true,
+					  GOMP_TARGET_ARG_HSA_KERNEL_ATTRIBUTES);
+      args.quick_push (t);
+      args.quick_push (grid_get_kernel_launch_attributes (gsi, tgt_stmt));
+    }
+#endif
+
+  /* Produce more, perhaps device specific, arguments here.  */
+
+  tree argarray = create_tmp_var (build_array_type_nelts (ptr_type_node,
+							  args.length () + 1),
+				  ".omp_target_args");
+  for (unsigned i = 0; i < args.length (); i++)
+    {
+      tree ref = build4 (ARRAY_REF, ptr_type_node, argarray,
+			 build_int_cst (integer_type_node, i),
+			 NULL_TREE, NULL_TREE);
+      gsi_insert_before (gsi, gimple_build_assign (ref, args[i]),
+			 GSI_SAME_STMT);
+    }
+  tree ref = build4 (ARRAY_REF, ptr_type_node, argarray,
+		     build_int_cst (integer_type_node, args.length ()),
+		     NULL_TREE, NULL_TREE);
+  gsi_insert_before (gsi, gimple_build_assign (ref, null_pointer_node),
+		     GSI_SAME_STMT);
+  TREE_ADDRESSABLE (argarray) = 1;
+  return build_fold_addr_expr (argarray);
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -13148,30 +13272,7 @@ expand_omp_target (struct omp_region *region)
 	depend = build_int_cst (ptr_type_node, 0);
       args.quick_push (depend);
       if (start_ix == BUILT_IN_GOMP_TARGET)
-	{
-	  c = find_omp_clause (clauses, OMP_CLAUSE_NUM_TEAMS);
-	  if (c)
-	    {
-	      t = fold_convert (integer_type_node,
-				OMP_CLAUSE_NUM_TEAMS_EXPR (c));
-	      t = force_gimple_operand_gsi (&gsi, t, true, NULL,
-					    true, GSI_SAME_STMT);
-	    }
-	  else
-	    t = integer_minus_one_node;
-	  args.quick_push (t);
-	  c = find_omp_clause (clauses, OMP_CLAUSE_THREAD_LIMIT);
-	  if (c)
-	    {
-	      t = fold_convert (integer_type_node,
-				OMP_CLAUSE_THREAD_LIMIT_EXPR (c));
-	      t = force_gimple_operand_gsi (&gsi, t, true, NULL,
-					    true, GSI_SAME_STMT);
-	    }
-	  else
-	    t = integer_minus_one_node;
-	  args.quick_push (t);
-	}
+	args.quick_push (get_target_arguments (&gsi, entry_stmt));
       break;
     case BUILT_IN_GOACC_PARALLEL:
       {
diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index dffd631..fef27e4 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -228,4 +228,25 @@ enum gomp_map_kind
 #define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
 #define GOMP_LAUNCH_OP_MAX 0xffff
 
+/* Bitmask to apply in order to find out the intended device of a target
+   argument.  */
+#define GOMP_TARGET_ARG_DEVICE_MASK		((1 << 7) - 1)
+/* The target argument is significant for all devices.  */
+#define GOMP_TARGET_ARG_DEVICE_ALL		0
+
+/* Flag set when the actual value is stored in the subsequent element of
+   the device-specific argument array.  */
+#define GOMP_TARGET_ARG_SUBSEQUENT_PARAM	(1 << 7)
+
+/* Bitmask to apply to a target argument to find out the value identifier.  */
+#define GOMP_TARGET_ARG_ID_MASK			(((1 << 8) - 1) << 8)
+/* Target argument index of NUM_TEAMS.  */
+#define GOMP_TARGET_ARG_NUM_TEAMS		(1 << 8)
+/* Target argument index of THREAD_LIMIT.  */
+#define GOMP_TARGET_ARG_THREAD_LIMIT		(2 << 8)
+
+/* If the value is directly embedded in the target argument, it should be at
+   most 16 bits wide and shifted by this many bits.  */
+#define GOMP_TARGET_ARG_VALUE_SHIFT		16
+
 #endif
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 31ffba0..1d137f1 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -499,6 +499,10 @@ struct gomp_target_task
   struct target_mem_desc *tgt;
   struct gomp_task *task;
   struct gomp_team *team;
+  /* Copies of firstprivate mapped data for shared memory accelerators.  */
+  void *firstprivate_copies;
+  /* Device-specific target arguments.  */
+  void **args;
   void *hostaddrs[];
 };
 
@@ -765,7 +769,8 @@ extern void gomp_task_maybe_wait_for_dependencies (void **);
 extern bool gomp_create_target_task (struct gomp_device_descr *,
 				     void (*) (void *), size_t, void **,
 				     size_t *, unsigned short *, unsigned int,
-				     void **, enum gomp_target_task_state);
+				     void **, void **,
+				     enum gomp_target_task_state);
 
 static void inline
 gomp_finish_task (struct gomp_task *task)
@@ -939,8 +944,9 @@ struct gomp_device_descr
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void *(*dev2dev_func) (int, void *, const void *, size_t);
-  void (*run_func) (int, void *, void *);
-  void (*async_run_func) (int, void *, void *, void *);
+  bool (*can_run_func) (void *);
+  void (*run_func) (int, void *, void *, void **);
+  void (*async_run_func) (int, void *, void *, void **, void *);
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s mem_map;
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index c238e6a..9c90d59 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -278,8 +278,7 @@ extern void GOMP_single_copy_end (void *);
 extern void GOMP_target (int, void (*) (void *), const void *,
 			 size_t, void **, size_t *, unsigned char *);
 extern void GOMP_target_ext (int, void (*) (void *), size_t, void **, size_t *,
-			     unsigned short *, unsigned int, void **,
-			     int, int);
+			     unsigned short *, unsigned int, void **, void **);
 extern void GOMP_target_data (int, const void *,
 			      size_t, void **, size_t *, unsigned char *);
 extern void GOMP_target_data_ext (int, size_t, void **, size_t *,
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 9874804..a769211 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -123,7 +123,8 @@ host_host2dev (int n __attribute__ ((unused)),
 }
 
 static void
-host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars)
+host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars,
+	  void **args __attribute__((unused)))
 {
   void (*fn)(void *) = (void (*)(void *)) fn_ptr;
 
diff --git a/libgomp/target.c b/libgomp/target.c
index cf9d0e6..f990a9e 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1261,15 +1261,38 @@ gomp_target_fallback (void (*fn) (void *), void **hostaddrs)
   *thr = old_thr;
 }
 
-/* Host fallback with firstprivate map-type handling.  */
+/* Calculate alignment and size requirements of a private copy of data shared
+   as GOMP_MAP_FIRSTPRIVATE and store them in TGT_ALIGN and TGT_SIZE.  */
 
-static void
-gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
-				   void **hostaddrs, size_t *sizes,
-				   unsigned short *kinds)
+static inline void
+calculate_firstprivate_requirements (size_t mapnum, size_t *sizes,
+				     unsigned short *kinds, size_t *tgt_align,
+				     size_t *tgt_size)
 {
-  size_t i, tgt_align = 0, tgt_size = 0;
-  char *tgt = NULL;
+  size_t i;
+  for (i = 0; i < mapnum; i++)
+    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
+      {
+	size_t align = (size_t) 1 << (kinds[i] >> 8);
+	if (*tgt_align < align)
+	  *tgt_align = align;
+	*tgt_size = (*tgt_size + align - 1) & ~(align - 1);
+	*tgt_size += sizes[i];
+      }
+}
+
+/* Copy data shared as GOMP_MAP_FIRSTPRIVATE to TGT.  */
+
+static inline void
+copy_firstprivate_data (char *tgt, size_t mapnum, void **hostaddrs,
+			size_t *sizes, unsigned short *kinds, size_t tgt_align,
+			size_t tgt_size)
+{
+  uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
+  if (al)
+    tgt += tgt_align - al;
+  tgt_size = 0;
+  size_t i;
   for (i = 0; i < mapnum; i++)
     if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
       {
@@ -1277,28 +1300,53 @@ gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
 	if (tgt_align < align)
 	  tgt_align = align;
 	tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	tgt_size += sizes[i];
+	memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
+	hostaddrs[i] = tgt + tgt_size;
+	tgt_size = tgt_size + sizes[i];
       }
+}
+
+/* Host fallback with firstprivate map-type handling.  */
+
+static void
+gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
+				   void **hostaddrs, size_t *sizes,
+				   unsigned short *kinds)
+{
+  size_t tgt_align = 0, tgt_size = 0;
+  calculate_firstprivate_requirements (mapnum, sizes, kinds, &tgt_align,
+				       &tgt_size);
   if (tgt_align)
     {
-      tgt = gomp_alloca (tgt_size + tgt_align - 1);
-      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
-      if (al)
-	tgt += tgt_align - al;
-      tgt_size = 0;
-      for (i = 0; i < mapnum; i++)
-	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
-	  {
-	    size_t align = (size_t) 1 << (kinds[i] >> 8);
-	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
-	    hostaddrs[i] = tgt + tgt_size;
-	    tgt_size = tgt_size + sizes[i];
-	  }
+      char *tgt = gomp_alloca (tgt_size + tgt_align - 1);
+      copy_firstprivate_data (tgt, mapnum, hostaddrs, sizes, kinds, tgt_align,
+			      tgt_size);
     }
   gomp_target_fallback (fn, hostaddrs);
 }
 
+/* Handle firstprivate map-type for shared memory devices and the host
+   fallback.  Return a pointer to the firstprivate copies, which has to be
+   freed after use.  */
+
+static void *
+gomp_target_unshare_firstprivate (size_t mapnum, void **hostaddrs,
+				  size_t *sizes, unsigned short *kinds)
+{
+  size_t tgt_align = 0, tgt_size = 0;
+  char *tgt = NULL;
+
+  calculate_firstprivate_requirements (mapnum, sizes, kinds, &tgt_align,
+				       &tgt_size);
+  if (tgt_align)
+    {
+      tgt = gomp_malloc (tgt_size + tgt_align - 1);
+      copy_firstprivate_data (tgt, mapnum, hostaddrs, sizes, kinds, tgt_align,
+			      tgt_size);
+    }
+  return tgt;
+}
+
 /* Helper function of GOMP_target{,_ext} routines.  */
 
 static void *
@@ -1348,7 +1396,8 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     GOMP_MAP_VARS_TARGET);
-  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
+  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start,
+		     NULL);
   gomp_unmap_vars (tgt_vars, true);
 }
 
@@ -1356,6 +1405,15 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
    and several arguments have been added:
    FLAGS is a bitmask, see GOMP_TARGET_FLAG_* in gomp-constants.h.
    DEPEND is array of dependencies, see GOMP_task for details.
+
+   ARGS is a pointer to an array consisting of a variable number of both
+   device-independent and device-specific arguments, each of which can take
+   one or two elements where the first specifies for which device it is
+   intended, the type and optionally also the value.  If the value is not
+   present in the first element, the whole second element is the actual
+   value.  The last element of the array is a single NULL.  Among the
+   device-independent arguments are, for example, NUM_TEAMS and THREAD_LIMIT.
+
    NUM_TEAMS is positive if GOMP_teams will be called in the body with
    that value, or 1 if teams construct is not present, or 0, if
    teams construct does not have num_teams clause and so the choice is
@@ -1369,14 +1427,10 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
 void
 GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
-		 unsigned int flags, void **depend, int num_teams,
-		 int thread_limit)
+		 unsigned int flags, void **depend, void **args)
 {
   struct gomp_device_descr *devicep = resolve_device (device);
 
-  (void) num_teams;
-  (void) thread_limit;
-
   if (flags & GOMP_TARGET_FLAG_NOWAIT)
     {
       struct gomp_thread *thr = gomp_thread ();
@@ -1413,7 +1467,7 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 	  && !thr->task->final_task)
 	{
 	  gomp_create_target_task (devicep, fn, mapnum, hostaddrs,
-				   sizes, kinds, flags, depend,
+				   sizes, kinds, flags, depend, args,
 				   GOMP_TARGET_TASK_BEFORE_MAP);
 	  return;
 	}
@@ -1430,20 +1484,33 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 	gomp_task_maybe_wait_for_dependencies (depend);
     }
 
+  void *fn_addr;
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || !(fn_addr = gomp_get_target_fn_addr (devicep, fn))
+      || (devicep->can_run_func && !devicep->can_run_func (fn_addr)))
     {
       gomp_target_fallback_firstprivate (fn, mapnum, hostaddrs, sizes, kinds);
       return;
     }
 
-  void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
-
-  struct target_mem_desc *tgt_vars
-    = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		     GOMP_MAP_VARS_TARGET);
-  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
-  gomp_unmap_vars (tgt_vars, true);
+  struct target_mem_desc *tgt_vars;
+  void *fpc = NULL;
+  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+    {
+      fpc = gomp_target_unshare_firstprivate (mapnum, hostaddrs, sizes, kinds);
+      tgt_vars = NULL;
+    }
+  else
+    tgt_vars = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds,
+			      true, GOMP_MAP_VARS_TARGET);
+  devicep->run_func (devicep->target_id, fn_addr,
+		     tgt_vars ? (void *) tgt_vars->tgt_start : hostaddrs,
+		     args);
+  if (tgt_vars)
+    gomp_unmap_vars (tgt_vars, true);
+  else
+    free (fpc);
 }
 
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
@@ -1552,7 +1619,7 @@ GOMP_target_update_ext (int device, size_t mapnum, void **hostaddrs,
 	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
 					   mapnum, hostaddrs, sizes, kinds,
 					   flags | GOMP_TARGET_FLAG_UPDATE,
-					   depend, GOMP_TARGET_TASK_DATA))
+					   depend, NULL, GOMP_TARGET_TASK_DATA))
 		return;
 	    }
 	  else
@@ -1673,7 +1740,7 @@ GOMP_target_enter_exit_data (int device, size_t mapnum, void **hostaddrs,
 	    {
 	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
 					   mapnum, hostaddrs, sizes, kinds,
-					   flags, depend,
+					   flags, depend, NULL,
 					   GOMP_TARGET_TASK_DATA))
 		return;
 	    }
@@ -1729,8 +1796,11 @@ gomp_target_task_fn (void *data)
 
   if (ttask->fn != NULL)
     {
+      void *fn_addr;
       if (devicep == NULL
-	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	  || !(fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn))
+	  || (devicep->can_run_func && !devicep->can_run_func (fn_addr)))
 	{
 	  ttask->state = GOMP_TARGET_TASK_FALLBACK;
 	  gomp_target_fallback_firstprivate (ttask->fn, ttask->mapnum,
@@ -1745,19 +1815,31 @@ gomp_target_task_fn (void *data)
 	  return false;
 	}
 
-      void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
-      ttask->tgt
-	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
-			 ttask->sizes, ttask->kinds, true,
-			 GOMP_MAP_VARS_TARGET);
+      void *actual_arguments;
+      if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+	{
+	  ttask->tgt = NULL;
+	  ttask->firstprivate_copies
+	    = gomp_target_unshare_firstprivate (ttask->mapnum, ttask->hostaddrs,
+						ttask->sizes, ttask->kinds);
+	  actual_arguments = ttask->hostaddrs;
+	}
+      else
+	{
+	  ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs,
+				      NULL, ttask->sizes, ttask->kinds, true,
+				      GOMP_MAP_VARS_TARGET);
+	  actual_arguments = (void *) ttask->tgt->tgt_start;
+	}
       ttask->state = GOMP_TARGET_TASK_READY_TO_RUN;
 
-      devicep->async_run_func (devicep->target_id, fn_addr,
-			       (void *) ttask->tgt->tgt_start, (void *) ttask);
+      devicep->async_run_func (devicep->target_id, fn_addr, actual_arguments,
+			       ttask->args, (void *) ttask);
       return true;
     }
   else if (devicep == NULL
-	   || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+	   || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	   || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return false;
 
   size_t i;
@@ -2225,6 +2307,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
     {
       DLSYM (run);
       DLSYM (async_run);
+      DLSYM_OPT (can_run, can_run);
       DLSYM (dev2dev);
     }
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
diff --git a/libgomp/task.c b/libgomp/task.c
index 620facd..f3b05e5 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -593,7 +593,7 @@ bool
 gomp_create_target_task (struct gomp_device_descr *devicep,
 			 void (*fn) (void *), size_t mapnum, void **hostaddrs,
 			 size_t *sizes, unsigned short *kinds,
-			 unsigned int flags, void **depend,
+			 unsigned int flags, void **depend, void **args,
 			 enum gomp_target_task_state state)
 {
   struct gomp_thread *thr = gomp_thread ();
@@ -653,6 +653,7 @@ gomp_create_target_task (struct gomp_device_descr *devicep,
   ttask->devicep = devicep;
   ttask->fn = fn;
   ttask->mapnum = mapnum;
+  ttask->args = args;
   memcpy (ttask->hostaddrs, hostaddrs, mapnum * sizeof (void *));
   ttask->sizes = (size_t *) &ttask->hostaddrs[mapnum];
   memcpy (ttask->sizes, sizes, mapnum * sizeof (size_t));
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index f8c1725..48599dd 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -539,7 +539,7 @@ GOMP_OFFLOAD_dev2dev (int device, void *dst_ptr, const void *src_ptr,
 
 extern "C" void
 GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
-			void *async_data)
+			void **, void *async_data)
 {
   TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p, async_data = %p)", device,
 	 tgt_fn, tgt_vars, async_data);
@@ -555,7 +555,7 @@ GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
 }
 
 extern "C" void
-GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars)
+GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars, void **)
 {
   TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p)", device, tgt_fn, tgt_vars);
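
The firstprivate handling split out in this patch (calculate_firstprivate_requirements / copy_firstprivate_data) is a classic two-pass pack: first compute the buffer's worst-case alignment and rounded-up size, then copy each mapping at an aligned offset and redirect its host address to the private copy. A minimal sketch of the same arithmetic, using a hypothetical MAP_FIRSTPRIVATE stand-in for GOMP_MAP_FIRSTPRIVATE and the libgomp kinds encoding (low byte: map kind; high byte: log2 of the alignment):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAP_FIRSTPRIVATE 0x18	/* hypothetical stand-in value */

/* Pass 1: accumulate the maximum alignment and the total size, rounding
   each mapping's offset up to its own alignment.  */
static void
calc_requirements (size_t mapnum, size_t *sizes, unsigned short *kinds,
		   size_t *tgt_align, size_t *tgt_size)
{
  for (size_t i = 0; i < mapnum; i++)
    if ((kinds[i] & 0xff) == MAP_FIRSTPRIVATE)
      {
	size_t align = (size_t) 1 << (kinds[i] >> 8);
	if (*tgt_align < align)
	  *tgt_align = align;
	*tgt_size = (*tgt_size + align - 1) & ~(align - 1);
	*tgt_size += sizes[i];
      }
}

/* Pass 2: align the buffer start, copy each mapping at its aligned
   offset, and point hostaddrs[i] at the private copy.  */
static void
copy_data (char *tgt, size_t mapnum, void **hostaddrs, size_t *sizes,
	   unsigned short *kinds, size_t tgt_align)
{
  uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
  if (al)
    tgt += tgt_align - al;
  size_t off = 0;
  for (size_t i = 0; i < mapnum; i++)
    if ((kinds[i] & 0xff) == MAP_FIRSTPRIVATE)
      {
	size_t align = (size_t) 1 << (kinds[i] >> 8);
	off = (off + align - 1) & ~(align - 1);
	memcpy (tgt + off, hostaddrs[i], sizes[i]);
	hostaddrs[i] = tgt + off;
	off += sizes[i];
      }
}
```

Allocating tgt_size + tgt_align - 1 bytes, as both gomp_target_fallback_firstprivate and gomp_target_unshare_firstprivate do, guarantees the aligned start still leaves room for all copies.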
 


* [gomp-nvptx 12/13] libgomp: handle multiple teams on NVPTX
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (10 preceding siblings ...)
  2016-01-20 17:45 ` [gomp-nvptx 13/13] libgomp plugin: handle multiple teams Alexander Monakov
@ 2016-01-20 17:45 ` Alexander Monakov
  2016-01-20 17:45 ` [gomp-nvptx 11/13] pick GOMP_target_ext changes from the hsa branch Alexander Monakov
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:45 UTC (permalink / raw)
  To: gcc-patches

	* config/nvptx/icv-device.c (omp_get_num_teams): Update.
	(omp_get_team_num): Ditto.
	* config/nvptx/target.c (GOMP_teams): Update.
	* config/nvptx/team.c (nvptx_thrs): Place in shared memory.
	* icv.c (gomp_num_teams_var): Define.
	* libgomp.h (gomp_num_teams_var): Declare.
	(nvptx_thrs): Place in shared memory.
---
 libgomp/ChangeLog.gomp-nvptx      | 10 ++++++++++
 libgomp/config/nvptx/icv-device.c |  8 ++++----
 libgomp/config/nvptx/target.c     | 13 ++++++++++++-
 libgomp/config/nvptx/team.c       |  2 +-
 libgomp/icv.c                     |  1 +
 libgomp/libgomp.h                 |  3 ++-
 6 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 0e5fef0..bd11002 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -47,15 +47,15 @@ omp_get_num_devices (void)
 int
 omp_get_num_teams (void)
 {
-  /* FORNOW.  */
-  return 1;
+  return gomp_num_teams_var + 1;
 }
 
 int
 omp_get_team_num (void)
 {
-  /* FORNOW.  */
-  return 0;
+  int ctaid;
+  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (ctaid));
+  return ctaid;
 }
 
 int
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index ad36013..9f34ae8 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -35,5 +35,16 @@ GOMP_teams (unsigned int num_teams, unsigned int thread_limit)
       icv->thread_limit_var
 	= thread_limit > INT_MAX ? UINT_MAX : thread_limit;
     }
-  (void) num_teams;
+  unsigned int num_blocks, block_id;
+  asm ("mov.u32 %0, %%nctaid.x;" : "=r" (num_blocks));
+  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (block_id));
+  if (!num_teams || num_teams >= num_blocks)
+    num_teams = num_blocks;
+  else if (block_id >= num_teams)
+    {
+      gomp_free_thread (nvptx_thrs);
+      free (nvptx_thrs);
+      asm ("exit;");
+    }
+  gomp_num_teams_var = num_teams - 1;
 }
diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
index c18517a..909f296 100644
--- a/libgomp/config/nvptx/team.c
+++ b/libgomp/config/nvptx/team.c
@@ -29,7 +29,7 @@
 #include "libgomp.h"
 #include <stdlib.h>
 
-struct gomp_thread *nvptx_thrs;
+struct gomp_thread *nvptx_thrs __attribute__((shared));
 
 static void gomp_thread_start (struct gomp_thread_pool *);
 
diff --git a/libgomp/icv.c b/libgomp/icv.c
index aa79423..18e35e5 100644
--- a/libgomp/icv.c
+++ b/libgomp/icv.c
@@ -56,6 +56,7 @@ unsigned long gomp_bind_var_list_len;
 void **gomp_places_list;
 unsigned long gomp_places_list_len;
 int gomp_debug_var;
+unsigned int gomp_num_teams_var;
 char *goacc_device_type;
 int goacc_device_num;
 
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 1d137f1..0ef2a05 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -363,6 +363,7 @@ extern char *gomp_bind_var_list;
 extern unsigned long gomp_bind_var_list_len;
 extern void **gomp_places_list;
 extern unsigned long gomp_places_list_len;
+extern unsigned int gomp_num_teams_var;
 extern int gomp_debug_var;
 extern int goacc_device_num;
 extern char *goacc_device_type;
@@ -648,7 +649,7 @@ enum gomp_cancel_kind
 /* ... and here is that TLS data.  */
 
 #if defined __nvptx__
-extern struct gomp_thread *nvptx_thrs;
+extern struct gomp_thread *nvptx_thrs __attribute__((shared));
 static inline struct gomp_thread *gomp_thread (void)
 {
   int tid;
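
The GOMP_teams change in this patch reads %nctaid.x and %ctaid.x and then either caps num_teams at the grid size or makes surplus blocks exit. That decision logic, lifted out into plain host-testable C with hypothetical helper names (the real code reads the PTX special registers inline and calls "exit;" directly):

```c
#include <assert.h>

/* Effective team count: num_teams == 0 means the teams construct had no
   num_teams clause, so every launched block becomes a team; otherwise
   cap at the grid size, since only num_blocks CTAs exist.  */
static unsigned int
effective_teams (unsigned int num_teams, unsigned int num_blocks)
{
  if (num_teams == 0 || num_teams >= num_blocks)
    return num_blocks;
  return num_teams;
}

/* A block whose id is at or beyond the effective team count has no work;
   in the real GOMP_teams it frees its thread pool and executes "exit;".  */
static int
block_is_surplus (unsigned int num_teams, unsigned int num_blocks,
		  unsigned int block_id)
{
  return block_id >= effective_teams (num_teams, num_blocks);
}
```

Note that gomp_num_teams_var stores the effective count minus one, so omp_get_num_teams adds one back.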


* [gomp-nvptx 08/13] libgomp: add nvptx lock.c
  2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
                   ` (8 preceding siblings ...)
  2016-01-20 17:27 ` [gomp-nvptx 05/13] libgomp: remove sections.c, splay-tree.c Alexander Monakov
@ 2016-01-20 17:45 ` Alexander Monakov
  2016-01-20 17:45 ` [gomp-nvptx 13/13] libgomp plugin: handle multiple teams Alexander Monakov
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Alexander Monakov @ 2016-01-20 17:45 UTC (permalink / raw)
  To: gcc-patches

This patch implements lock.c on NVPTX by moving a bunch of generic
implementations (in terms of gomp_mutex_t) from config/linux/lock.c to lock.c
and reusing them on NVPTX.

	* config/linux/lock.c (gomp_init_lock_30): Move to generic lock.c.
	(gomp_destroy_lock_30): Ditto.
	(gomp_set_lock_30): Ditto.
	(gomp_unset_lock_30): Ditto.
	(gomp_test_lock_30): Ditto.
	(gomp_init_nest_lock_30): Ditto.
	(gomp_destroy_nest_lock_30): Ditto.
	(gomp_set_nest_lock_30): Ditto.
	(gomp_unset_nest_lock_30): Ditto.
	(gomp_test_nest_lock_30): Ditto.
	* lock.c: New.
	* config/nvptx/lock.c: New.
---
 libgomp/ChangeLog.gomp-nvptx |  15 ++++++
 libgomp/config/linux/lock.c  |  94 +--------------------------------
 libgomp/config/nvptx/lock.c  |  41 +++++++++++++++
 libgomp/lock.c               | 123 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 181 insertions(+), 92 deletions(-)
 create mode 100644 libgomp/lock.c

diff --git a/libgomp/config/linux/lock.c b/libgomp/config/linux/lock.c
index 32cd21d..a80d7c5 100644
--- a/libgomp/config/linux/lock.c
+++ b/libgomp/config/linux/lock.c
@@ -32,98 +32,8 @@
 #include <sys/syscall.h>
 #include "wait.h"
 
-
-/* The internal gomp_mutex_t and the external non-recursive omp_lock_t
-   have the same form.  Re-use it.  */
-
-void
-gomp_init_lock_30 (omp_lock_t *lock)
-{
-  gomp_mutex_init (lock);
-}
-
-void
-gomp_destroy_lock_30 (omp_lock_t *lock)
-{
-  gomp_mutex_destroy (lock);
-}
-
-void
-gomp_set_lock_30 (omp_lock_t *lock)
-{
-  gomp_mutex_lock (lock);
-}
-
-void
-gomp_unset_lock_30 (omp_lock_t *lock)
-{
-  gomp_mutex_unlock (lock);
-}
-
-int
-gomp_test_lock_30 (omp_lock_t *lock)
-{
-  int oldval = 0;
-
-  return __atomic_compare_exchange_n (lock, &oldval, 1, false,
-				      MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
-}
-
-void
-gomp_init_nest_lock_30 (omp_nest_lock_t *lock)
-{
-  memset (lock, '\0', sizeof (*lock));
-}
-
-void
-gomp_destroy_nest_lock_30 (omp_nest_lock_t *lock)
-{
-}
-
-void
-gomp_set_nest_lock_30 (omp_nest_lock_t *lock)
-{
-  void *me = gomp_icv (true);
-
-  if (lock->owner != me)
-    {
-      gomp_mutex_lock (&lock->lock);
-      lock->owner = me;
-    }
-
-  lock->count++;
-}
-
-void
-gomp_unset_nest_lock_30 (omp_nest_lock_t *lock)
-{
-  if (--lock->count == 0)
-    {
-      lock->owner = NULL;
-      gomp_mutex_unlock (&lock->lock);
-    }
-}
-
-int
-gomp_test_nest_lock_30 (omp_nest_lock_t *lock)
-{
-  void *me = gomp_icv (true);
-  int oldval;
-
-  if (lock->owner == me)
-    return ++lock->count;
-
-  oldval = 0;
-  if (__atomic_compare_exchange_n (&lock->lock, &oldval, 1, false,
-				   MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
-    {
-      lock->owner = me;
-      lock->count = 1;
-      return 1;
-    }
-
-  return 0;
-}
+/* Reuse the generic implementation in terms of gomp_mutex_t.  */
+#include "../../lock.c"
 
 #ifdef LIBGOMP_GNU_SYMBOL_VERSIONING
 /* gomp_mutex_* can be safely locked in one thread and
diff --git a/libgomp/config/nvptx/lock.c b/libgomp/config/nvptx/lock.c
index e69de29..7731704 100644
--- a/libgomp/config/nvptx/lock.c
+++ b/libgomp/config/nvptx/lock.c
@@ -0,0 +1,41 @@
+/* Copyright (C) 2016 Free Software Foundation, Inc.
+   Contributed by Alexander Monakov <amonakov@ispras.ru>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This is a NVPTX specific implementation of the public OpenMP locking
+   primitives.  */
+
+/* Reuse the generic implementation in terms of gomp_mutex_t.  */
+#include "../../lock.c"
+
+ialias (omp_init_lock)
+ialias (omp_init_nest_lock)
+ialias (omp_destroy_lock)
+ialias (omp_destroy_nest_lock)
+ialias (omp_set_lock)
+ialias (omp_set_nest_lock)
+ialias (omp_unset_lock)
+ialias (omp_unset_nest_lock)
+ialias (omp_test_lock)
+ialias (omp_test_nest_lock)
diff --git a/libgomp/lock.c b/libgomp/lock.c
new file mode 100644
index 0000000..783bd77
--- /dev/null
+++ b/libgomp/lock.c
@@ -0,0 +1,123 @@
+/* Copyright (C) 2005-2016 Free Software Foundation, Inc.
+   Contributed by Richard Henderson <rth@redhat.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This is a generic implementation of the public OpenMP locking primitives in
+   terms of internal gomp_mutex_t.  It is not meant to be compiled on its own.
+   It is #include'd from config/{linux,nvptx}/lock.c.  */
+
+#include <string.h>
+#include "libgomp.h"
+
+/* The internal gomp_mutex_t and the external non-recursive omp_lock_t
+   have the same form.  Re-use it.  */
+
+void
+gomp_init_lock_30 (omp_lock_t *lock)
+{
+  gomp_mutex_init (lock);
+}
+
+void
+gomp_destroy_lock_30 (omp_lock_t *lock)
+{
+  gomp_mutex_destroy (lock);
+}
+
+void
+gomp_set_lock_30 (omp_lock_t *lock)
+{
+  gomp_mutex_lock (lock);
+}
+
+void
+gomp_unset_lock_30 (omp_lock_t *lock)
+{
+  gomp_mutex_unlock (lock);
+}
+
+int
+gomp_test_lock_30 (omp_lock_t *lock)
+{
+  int oldval = 0;
+
+  return __atomic_compare_exchange_n (lock, &oldval, 1, false,
+				      MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+}
+
+void
+gomp_init_nest_lock_30 (omp_nest_lock_t *lock)
+{
+  memset (lock, '\0', sizeof (*lock));
+}
+
+void
+gomp_destroy_nest_lock_30 (omp_nest_lock_t *lock)
+{
+}
+
+void
+gomp_set_nest_lock_30 (omp_nest_lock_t *lock)
+{
+  void *me = gomp_icv (true);
+
+  if (lock->owner != me)
+    {
+      gomp_mutex_lock (&lock->lock);
+      lock->owner = me;
+    }
+
+  lock->count++;
+}
+
+void
+gomp_unset_nest_lock_30 (omp_nest_lock_t *lock)
+{
+  if (--lock->count == 0)
+    {
+      lock->owner = NULL;
+      gomp_mutex_unlock (&lock->lock);
+    }
+}
+
+int
+gomp_test_nest_lock_30 (omp_nest_lock_t *lock)
+{
+  void *me = gomp_icv (true);
+  int oldval;
+
+  if (lock->owner == me)
+    return ++lock->count;
+
+  oldval = 0;
+  if (__atomic_compare_exchange_n (&lock->lock, &oldval, 1, false,
+				   MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
+    {
+      lock->owner = me;
+      lock->count = 1;
+      return 1;
+    }
+
+  return 0;
+}


end of thread, other threads:[~2016-01-20 17:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-20 17:27 [gomp-nvptx 00/13] SIMD, teams, Fortran Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 02/13] omp-low: extend SIMD lowering for SIMT execution Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 04/13] nvptx backend: add support for placing variables in shared memory Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 07/13] libgomp plugin: set __nvptx_clocktick Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 01/13] nvptx backend: new patterns for OpenMP SIMD-via-SIMT Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 06/13] libgomp: add nvptx time.c Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 09/13] libgomp: use generic fortran.c on nvptx Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 03/13] nvptx backend: silence warning Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 10/13] libgomp testsuite: add -foffload=-lgfortran Alexander Monakov
2016-01-20 17:27 ` [gomp-nvptx 05/13] libgomp: remove sections.c, splay-tree.c Alexander Monakov
2016-01-20 17:45 ` [gomp-nvptx 08/13] libgomp: add nvptx lock.c Alexander Monakov
2016-01-20 17:45 ` [gomp-nvptx 13/13] libgomp plugin: handle multiple teams Alexander Monakov
2016-01-20 17:45 ` [gomp-nvptx 12/13] libgomp: handle multiple teams on NVPTX Alexander Monakov
2016-01-20 17:45 ` [gomp-nvptx 11/13] pick GOMP_target_ext changes from the hsa branch Alexander Monakov
