public inbox for gcc-patches@gcc.gnu.org
* [gomp4 12/14] libgomp: fixup error.c on nvptx
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-21 10:03   ` Jakub Jelinek
  2015-10-20 18:34 ` [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main Alexander Monakov
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

NVPTX provides vprintf, but there's no stream separation: everything is
printed as if to stdout.  This is the minimal change needed to get error.c
working.
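
For illustration (not part of the patch), a gomp_verror-style sequence like

fputs ("\nlibgomp: ", stderr);
vfprintf (stderr, fmt, list);
fputc ('\n', stderr);

compiles on NVPTX, under the macros below, as

printf ("%s", "\nlibgomp: ");
vprintf (fmt, list);
printf ("%c", '\n');

so all diagnostics end up on the single vprintf-backed output stream.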

	* error.c [__nvptx__]: Replace vfprintf, fputs, fputc with [v]printf.
---
 libgomp/error.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/libgomp/error.c b/libgomp/error.c
index 094c24a..009efdc 100644
--- a/libgomp/error.c
+++ b/libgomp/error.c
@@ -35,6 +35,11 @@
 #include <stdio.h>
 #include <stdlib.h>
 
+#ifdef __nvptx__
+#define vfprintf(stream, fmt, list) vprintf(fmt, list)
+#define fputs(s, stream) printf("%s", s)
+#define fputc(c, stream) printf("%c", c)
+#endif
 
 #undef gomp_vdebug
 void

* [gomp4 11/14] libgomp: avoid variable-length stack allocation in task.c
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (4 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-20 20:48   ` Bernd Schmidt
  2015-10-21  9:59   ` Jakub Jelinek
  2015-10-20 18:34 ` [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints Alexander Monakov
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

NVPTX does not support alloca or variable-length stack allocations, so heap
allocation must be used instead.  I've opted to make this a generic change
rather than guarding it with an #ifdef: libgomp usually leaves thread stack
size up to libc, so avoiding unbounded stack allocation makes sense in general.
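
For reference, the replacement code keeps GOMP_task's alignment guarantee by
over-allocating and rounding the pointer up; a minimal standalone sketch of
that idiom (helper name hypothetical):

#include <stdint.h>

/* Round BUF up to the next ALIGN boundary; ALIGN must be a power of two.
   Over-allocating the buffer by ALIGN - 1 bytes guarantees the rounded
   pointer still has the requested number of usable bytes.  */
static char *
align_up (char *buf, uintptr_t align)
{
  return (char *) (((uintptr_t) buf + align - 1) & ~(align - 1));
}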

	* task.c (GOMP_task): Use a fixed-size on-stack buffer or a heap
        allocation instead of a variable-size on-stack allocation.
---
 libgomp/task.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/libgomp/task.c b/libgomp/task.c
index 74920d5..ffb7ed2 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -162,11 +162,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
       thr->task = &task;
       if (__builtin_expect (cpyfn != NULL, 0))
 	{
-	  char buf[arg_size + arg_align - 1];
+	  long buf_size = arg_size + arg_align - 1;
+	  char buf_fixed[2048], *buf = buf_fixed;
+	  if (sizeof(buf_fixed) < buf_size)
+	    buf = gomp_malloc (buf_size);
 	  char *arg = (char *) (((uintptr_t) buf + arg_align - 1)
 				& ~(uintptr_t) (arg_align - 1));
 	  cpyfn (arg, data);
 	  fn (arg);
+	  if (buf != buf_fixed)
+	    free (buf);
 	}
       else
 	fn (data);

* [gomp4 14/14] libgomp: use more generic implementations on nvptx
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
  2015-10-20 18:34 ` [gomp4 12/14] libgomp: fixup error.c on nvptx Alexander Monakov
  2015-10-20 18:34 ` [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-21 10:17   ` Jakub Jelinek
  2015-10-20 18:34 ` [gomp4 08/14] libgomp nvptx: populate proc.c Alexander Monakov
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

This patch removes 0-size libgomp stubs where generic implementations can be
compiled for the NVPTX target.

It also removes non-stub critical.c, which contains assembly implementations
for GOMP_atomic_{start,end}, but does not contain implementations for
GOMP_critical_*.  My understanding is that OpenACC offloading uses
GOMP_atomic_* routines (by virtue of OpenMP lowering using them).  Linking in
GOMP_critical_* and dependencies would be pointless for OpenACC.

If OpenACC indeed uses GOMP_atomic_*, then it makes sense to split them out
into a separate file (atomic.c?).
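
If they are split out, a plain-C equivalent of the deleted PTX spinlock could
serve as a starting point; a sketch using the __atomic builtins, with an
exchange loop standing in for the atom.global.cas.b32 loop (placement and
file name hypothetical):

/* 0 = unlocked, 1 = locked; mirrors libgomp_ptx_lock in the deleted file.  */
static int libgomp_ptx_lock;

void
GOMP_atomic_start (void)
{
  /* Spin until we atomically flip the lock from 0 to 1.  */
  while (__atomic_exchange_n (&libgomp_ptx_lock, 1, __ATOMIC_ACQUIRE))
    ;
}

void
GOMP_atomic_end (void)
{
  __atomic_store_n (&libgomp_ptx_lock, 0, __ATOMIC_RELEASE);
}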

After this patch, a few 0-size stubs remain in libgomp/config/nvptx.  They
fall roughly into these categories:

  - Files which must remain a 0-size stub, like env.c, because they implement
    something that doesn't make sense on an accelerator-only target.

  - Files that are 0-size because all functionality is implemented in the
    corresponding header for now, like mutex.c.

  - Files that are 0-size, but probably should be changed to use generic
    implementations soon, like sections.c.

  - Files that are 0-size and will need a custom implementation for
    nvptx, like time.c.
---
 libgomp/config/nvptx/alloc.c    |  0
 libgomp/config/nvptx/barrier.c  |  0
 libgomp/config/nvptx/critical.c | 57 -----------------------------------------
 libgomp/config/nvptx/error.c    |  0
 libgomp/config/nvptx/iter.c     |  0
 libgomp/config/nvptx/iter_ull.c |  0
 libgomp/config/nvptx/loop.c     |  0
 libgomp/config/nvptx/loop_ull.c |  0
 libgomp/config/nvptx/ordered.c  |  0
 libgomp/config/nvptx/parallel.c |  0
 libgomp/config/nvptx/single.c   |  0
 libgomp/config/nvptx/task.c     |  0
 libgomp/config/nvptx/work.c     |  0
 13 files changed, 57 deletions(-)
 delete mode 100644 libgomp/config/nvptx/alloc.c
 delete mode 100644 libgomp/config/nvptx/barrier.c
 delete mode 100644 libgomp/config/nvptx/critical.c
 delete mode 100644 libgomp/config/nvptx/error.c
 delete mode 100644 libgomp/config/nvptx/iter.c
 delete mode 100644 libgomp/config/nvptx/iter_ull.c
 delete mode 100644 libgomp/config/nvptx/loop.c
 delete mode 100644 libgomp/config/nvptx/loop_ull.c
 delete mode 100644 libgomp/config/nvptx/ordered.c
 delete mode 100644 libgomp/config/nvptx/parallel.c
 delete mode 100644 libgomp/config/nvptx/single.c
 delete mode 100644 libgomp/config/nvptx/task.c
 delete mode 100644 libgomp/config/nvptx/work.c

diff --git a/libgomp/config/nvptx/alloc.c b/libgomp/config/nvptx/alloc.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/barrier.c b/libgomp/config/nvptx/barrier.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/critical.c b/libgomp/config/nvptx/critical.c
deleted file mode 100644
index 1f55aad..0000000
--- a/libgomp/config/nvptx/critical.c
+++ /dev/null
@@ -1,57 +0,0 @@
-/* GOMP atomic routines
-
-   Copyright (C) 2014-2015 Free Software Foundation, Inc.
-
-   Contributed by Mentor Embedded.
-
-   This file is part of the GNU Offloading and Multi Processing Library
-   (libgomp).
-
-   Libgomp is free software; you can redistribute it and/or modify it
-   under the terms of the GNU General Public License as published by
-   the Free Software Foundation; either version 3, or (at your option)
-   any later version.
-
-   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
-   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
-   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
-   more details.
-
-   Under Section 7 of GPL version 3, you are granted additional
-   permissions described in the GCC Runtime Library Exception, version
-   3.1, as published by the Free Software Foundation.
-
-   You should have received a copy of the GNU General Public License and
-   a copy of the GCC Runtime Library Exception along with this program;
-   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-__asm__ ("// BEGIN VAR DEF: libgomp_ptx_lock\n"
-	 ".global .align 4 .u32 libgomp_ptx_lock;\n"
-	 "\n"
-	 "// BEGIN GLOBAL FUNCTION DECL: GOMP_atomic_start\n"
-	 ".visible .func GOMP_atomic_start;\n"
-	 "// BEGIN GLOBAL FUNCTION DEF: GOMP_atomic_start\n"
-	 ".visible .func GOMP_atomic_start\n"
-	 "{\n"
-	 "	.reg .pred 	%p<2>;\n"
-	 "	.reg .s32 	%r<2>;\n"
-	 "	.reg .s64 	%rd<2>;\n"
-	 "BB5_1:\n"
-	 "	mov.u64 	%rd1, libgomp_ptx_lock;\n"
-	 "	atom.global.cas.b32 	%r1, [%rd1], 0, 1;\n"
-	 "	setp.ne.s32	%p1, %r1, 0;\n"
-	 "	@%p1 bra 	BB5_1;\n"
-	 "	ret;\n"
-	 "	}\n"
-	 "// BEGIN GLOBAL FUNCTION DECL: GOMP_atomic_end\n"
-	 ".visible .func GOMP_atomic_end;\n"
-	 "// BEGIN GLOBAL FUNCTION DEF: GOMP_atomic_end\n"
-	 ".visible .func GOMP_atomic_end\n"
-	 "{\n"
-	 "	.reg .s32 	%r<2>;\n"
-	 "	.reg .s64 	%rd<2>;\n"
-	 "	mov.u64 	%rd1, libgomp_ptx_lock;\n"
-	 "	atom.global.exch.b32 	%r1, [%rd1], 0;\n"
-	 "	ret;\n"
-	 "	}");
diff --git a/libgomp/config/nvptx/error.c b/libgomp/config/nvptx/error.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/iter.c b/libgomp/config/nvptx/iter.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/iter_ull.c b/libgomp/config/nvptx/iter_ull.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/loop.c b/libgomp/config/nvptx/loop.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/loop_ull.c b/libgomp/config/nvptx/loop_ull.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/ordered.c b/libgomp/config/nvptx/ordered.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/parallel.c b/libgomp/config/nvptx/parallel.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/single.c b/libgomp/config/nvptx/single.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/task.c b/libgomp/config/nvptx/task.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/config/nvptx/work.c b/libgomp/config/nvptx/work.c
deleted file mode 100644
index e69de29..0000000

* [gomp4 03/14] nvptx: expand support for address spaces
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (6 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-20 20:56   ` Bernd Schmidt
  2015-10-20 18:34 ` [gomp4 04/14] nvptx: fix output of _Bool global variables Alexander Monakov
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

This allows the middle-end to emit decls in 'shared' memory.
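
For context, a middle-end consumer of this support builds such a decl roughly
the way patch 06 in this series does; a sketch, where orig_type stands for
the type of the object being placed in shared memory:

/* Create a static decl living in PTX shared memory.  */
int quals = ENCODE_QUAL_ADDR_SPACE (ADDR_SPACE_SHARED);
tree type = build_qualified_type (orig_type, quals);
tree decl = create_tmp_var (type, "omp_data_shared");
TREE_STATIC (decl) = 1;
TREE_ADDRESSABLE (decl) = 1;
varpool_node::finalize_decl (decl);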

	* config/nvptx/nvptx.c (nvptx_legitimate_address_p): Adjust prototype.
        (nvptx_section_for_decl): If type of decl has a specific address
        space, return it.
        (nvptx_addr_space_from_address): Ditto.
        (TARGET_ADDR_SPACE_POINTER_MODE): Define.
        (TARGET_ADDR_SPACE_ADDRESS_MODE): Ditto.
        (TARGET_ADDR_SPACE_SUBSET_P): Ditto.
        (TARGET_ADDR_SPACE_CONVERT): Ditto.
        (TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P): Ditto.
---
 gcc/config/nvptx/nvptx.c | 49 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index a619e4c..779b018 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1571,7 +1571,7 @@ nvptx_maybe_convert_symbolic_operand (rtx orig_op)
 /* Returns true if X is a valid address for use in a memory reference.  */
 
 static bool
-nvptx_legitimate_address_p (machine_mode, rtx x, bool)
+nvptx_legitimate_address_p (machine_mode, rtx x, bool, addr_space_t)
 {
   enum rtx_code code = GET_CODE (x);
 
@@ -1642,7 +1642,10 @@ nvptx_section_for_decl (const_tree decl)
   if (is_const)
     return ".const";
 
-  return ".global";
+  addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (decl));
+  if (as == ADDR_SPACE_GENERIC)
+    as = ADDR_SPACE_GLOBAL;
+  return nvptx_section_from_addr_space (as);
 }
 
 /* Look for a SYMBOL_REF in ADDR and return the address space to be used
@@ -1666,6 +1669,9 @@ nvptx_addr_space_from_address (rtx addr)
   if (is_const)
     return ADDR_SPACE_CONST;
 
+  if (TYPE_ADDR_SPACE (TREE_TYPE (decl)) != ADDR_SPACE_GENERIC)
+    return TYPE_ADDR_SPACE (TREE_TYPE (decl));
+
   return ADDR_SPACE_GLOBAL;
 }
 \f
@@ -4916,14 +4922,49 @@ nvptx_use_anchors_for_symbol (const_rtx ARG_UNUSED (symbol))
   return false;
 }
 \f
+#undef TARGET_ADDR_SPACE_POINTER_MODE
+#define TARGET_ADDR_SPACE_POINTER_MODE nvptx_addr_space_pointer_mode
+static enum machine_mode
+nvptx_addr_space_pointer_mode (addr_space_t)
+{
+  return Pmode;
+}
+
+#undef TARGET_ADDR_SPACE_ADDRESS_MODE
+#define TARGET_ADDR_SPACE_ADDRESS_MODE nvptx_addr_space_address_mode
+static enum machine_mode
+nvptx_addr_space_address_mode (addr_space_t)
+{
+  return Pmode;
+}
+
+#undef TARGET_ADDR_SPACE_SUBSET_P
+#define TARGET_ADDR_SPACE_SUBSET_P nvptx_addr_space_subset_p
+static bool
+nvptx_addr_space_subset_p (addr_space_t /*subset*/,
+                           addr_space_t superset)
+{
+  return superset == ADDR_SPACE_GENERIC;
+}
+
+#undef  TARGET_ADDR_SPACE_CONVERT
+#define TARGET_ADDR_SPACE_CONVERT nvptx_addr_space_convert
+
+static rtx
+nvptx_addr_space_convert (rtx op, tree /*from_type*/, tree to_type)
+{
+  gcc_checking_assert (TYPE_ADDR_SPACE (to_type) == ADDR_SPACE_GENERIC);
+  return nvptx_maybe_convert_symbolic_operand (op);
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
 #undef TARGET_ATTRIBUTE_TABLE
 #define TARGET_ATTRIBUTE_TABLE nvptx_attribute_table
 
-#undef TARGET_LEGITIMATE_ADDRESS_P
-#define TARGET_LEGITIMATE_ADDRESS_P nvptx_legitimate_address_p
+#undef TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P
+#define TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P nvptx_legitimate_address_p
 
 #undef  TARGET_PROMOTE_FUNCTION_MODE
 #define TARGET_PROMOTE_FUNCTION_MODE nvptx_promote_function_mode

* [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (5 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 11/14] libgomp: avoid variable-length stack allocation in task.c Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-20 23:57   ` Bernd Schmidt
  2015-10-21  8:20   ` Jakub Jelinek
  2015-10-20 18:34 ` [gomp4 03/14] nvptx: expand support for address spaces Alexander Monakov
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

(note to reviewers: I'm not sure what we're after here, on the high level;
will be happy to rework the patch in a saner manner based on feedback, or even
drop it for now)

At the moment the attribute setting logic in omp-low.c is such that if a
function that should be present in target code does not already have the
'omp declare target' attribute, it receives 'omp target entrypoint'.  That is
wasteful: clearly not all user-declared target functions will be target region
entry points in OpenMP.

The motivating example for this change is OpenMP parallel target regions.  The
'parallel' part is outlined into its own function.  We don't want that
function to be an 'entrypoint' on PTX (but only as a matter of optimality rather
than correctness).

	* omp-low.c (create_omp_child_function): Set "omp target entrypoint"
        or "omp declare target" attribute based on is_gimple_omp_offloaded.
---
 gcc/omp-low.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 06b4a5e..6481163 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -2242,11 +2242,14 @@ create_omp_child_function (omp_context *ctx, bool task_copy)
 	  }
     }
 
+  const char *target_attr = (is_gimple_omp_offloaded (ctx->stmt)
+			     ? "omp target entrypoint"
+			     : "omp declare target");
   if (cgraph_node::get_create (decl)->offloadable
       && !lookup_attribute ("omp declare target",
                            DECL_ATTRIBUTES (current_function_decl)))
     DECL_ATTRIBUTES (decl)
-      = tree_cons (get_identifier ("omp target entrypoint"),
+      = tree_cons (get_identifier (target_attr),
                    NULL_TREE, DECL_ATTRIBUTES (decl));
 
   t = build_decl (DECL_SOURCE_LOCATION (decl),

* [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (8 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 04/14] nvptx: fix output of _Bool global variables Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-20 23:48   ` Bernd Schmidt
  2015-10-21  8:11   ` Jakub Jelinek
  2015-10-20 18:52 ` [gomp4 13/14] libgomp: provide minimal GOMP_teams Alexander Monakov
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

The NVPTX backend emits each function either as .func (callable only from the
device code) or as .kernel (entry point for a parallel region).  OpenMP
lowering adds "omp target entrypoint" attribute to functions outlined from
target regions.  Unlike OpenACC offloading, OpenMP offloading does not invoke
such outlined functions directly, but instead passes their address to
'gomp_nvptx_main'.  Restrict the special attribute treatment to OpenACC only.

	* config/nvptx/nvptx.c (write_as_kernel): Additionally test
	flag_openacc for "omp target entrypoint".
---
 gcc/config/nvptx/nvptx.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 21c59ef..df7b61f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -401,8 +401,10 @@ write_one_arg (std::stringstream &s, tree type, int i, machine_mode mode,
 static bool
 write_as_kernel (tree attrs)
 {
-  return (lookup_attribute ("kernel", attrs) != NULL_TREE
-	  || lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE);
+  if (flag_openacc
+      && lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE)
+    return true;
+  return lookup_attribute ("kernel", attrs) != NULL_TREE;
 }
 
 /* Write a function decl for DECL to S, where NAME is the name to be used.

* [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
  2015-10-20 18:34 ` [gomp4 12/14] libgomp: fixup error.c on nvptx Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-20 21:12   ` Bernd Schmidt
  2015-10-20 18:34 ` [gomp4 14/14] libgomp: use more generic implementations on nvptx Alexander Monakov
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

The approach I've taken in libgomp/nvptx is to have a single entry point,
gomp_nvptx_main, that can take care of initial allocation, transferring
control to the target region function, and finalization.

At the moment it has the prototype:
void gomp_nvptx_main(void (*fn)(void*), void *fndata);

but it's plausible that down the road we'll need other arguments for passing
data allocated by the plugin.

I see two possible ways to arrange that.

1.  Make gomp_nvptx_main a .kernel function.  This is what this patch assumes.
This requires emitting pointers-to-target-region-functions from the compiler,
and looking them up via cuModuleLoadGlobal/cuMemcpyDtoH in the plugin.

2.  Make gomp_nvptx_main a device (.func) function.  To have that work, we'd
need to additionally emit a "trampoline" of sorts in the NVPTX backend.  For
each OpenMP target entrypoint foo$_omp_fn$0, we'd have to additionally emit

__global__ void foo$_omp_fn$0$entry(void *args)
{
   gomp_nvptx_main(foo$_omp_fn$0, args);
}

(or perhaps better, rename the original function, and emit the trampoline
under the original name)

In approach 1, the prototype of gomp_nvptx_main is the internal business of
libgomp.  We are free to add arguments to it as needed.  The ABI between
libgomp and the backend is the name 'gomp_nvptx_main' and the '__ptr_' prefix
for exported function pointers.

In approach 2, the prototype of gomp_nvptx_main becomes an ABI detail between
libgomp and nvptx backend.  Adding more arguments to gomp_nvptx_main gets a
bit harder.  On the positive side, we won't need to export
function pointers anymore, so this plugin change won't be needed.

In both cases the ABI of gomp_nvptx_main matters to libgomp-nvptx-plugin.
Perhaps we should freeze it right from the beginning, like this:
void gomp_nvptx_main(void (*fn)(void*), void *fndata, void *auxdata, size_t size);

I think I like approach 2 more (it'll need time to materialize due to required
legwork in the nvptx backend).  Thoughts?

(admittedly this patch is rather crude: storing a CUdeviceptr in 'function'
is an aliasing violation, and repeated lookups of gomp_nvptx_main in
GOMP_OFFLOAD_run are pointless; it will all go away if ultimately we go with
approach 2)

The plugin launches 1 team of 8 warps.  The number 8 is not an ABI matter:
gomp_nvptx_main can look up the team size it is launched with.  The static choice
of 8 warps, while unfortunate, shouldn't be a show-stopper: the implementation
can spawn as many threads as it wishes, and if fewer threads are requested
via the num_threads clause, we'll have idle warps.
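
For reference, the lookup mentioned above reads the %ntid.y special register,
the same way patch 10 in this series does; a sketch (helper name
hypothetical):

static inline int
nvptx_team_size (void)
{
  int ntids;
  /* %ntid.y holds the number of threads launched along the y dimension,
     i.e. the number of warps in this team.  */
  asm ("mov.u32 %0, %%ntid.y;" : "=r" (ntids));
  return ntids;
}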

	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_load_image): Try loading
        OpenMP-specific function pointer __ptr_NAME first.
        (GOMP_OFFLOAD_run): Launch gomp_nvptx_main.
---
 libgomp/plugin/plugin-nvptx.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 47ed074..4e9c054 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1566,8 +1566,15 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version,
   for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
     {
       CUfunction function;
+      CUdeviceptr dptr;
+      char buf[sizeof("__ptr_") + strlen(fn_descs[i].fn)];
 
-      r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
+      strcat(strcpy(buf, "__ptr_"), fn_descs[i].fn);
+      r = cuModuleGetGlobal (&dptr, NULL, module, buf);
+      if (r == CUDA_SUCCESS)
+	cuMemcpyDtoH (&function, dptr, sizeof (void*));
+      else
+	r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
 
@@ -1793,12 +1800,18 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars)
   CUresult r;
   struct ptx_device *ptx_dev = ptx_devices[ord];
   const char *maybe_abort_msg = "(perhaps abort was called)";
-  void *args = &tgt_vars;
+  void *args[] = {&function, &tgt_vars};
 
-  r = cuLaunchKernel (function,
-		      1, 1, 1,
+  CUfunction mainfunc;
+
+  r = cuModuleGetFunction (&mainfunc, ptx_dev->images->module, "gomp_nvptx_main");
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
+
+  r = cuLaunchKernel (mainfunc,
 		      1, 1, 1,
-		      0, ptx_dev->null_stream->stream, &args, 0);
+		      32, 8, 1,
+		      0, ptx_dev->null_stream->stream, args, 0);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 

* [gomp4 00/14] NVPTX: further porting
@ 2015-10-20 18:34 Alexander Monakov
  2015-10-20 18:34 ` [gomp4 12/14] libgomp: fixup error.c on nvptx Alexander Monakov
                   ` (16 more replies)
  0 siblings, 17 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

Hello,

This patch series moves libgomp/nvptx porting further along to get initial
bits of parallel execution working, mostly unbreaking the testsuite.  Please
have a look!  I'm interested in feedback, and would like to know if it's
suitable to become a part of a branch.

This patch series ports enough of libgomp to get warp-level parallelism
working for OpenMP offloading.  The overall approach is as follows.

I've opted not to use dynamic parallelism.  It increases the hardware
requirement from sm_30 to sm_35, needs a library from CUDA Toolkit at link
time (libcudadevrt.a), and imposes overhead at run time.  The last point might
be moot if we don't manage to make libgomp's own overhead low, but still my
judgement is that a hard dependency on dynamic parallelism is problematic.

The plugin launches one (for now) thread block with 8 warps, which begin
executing a new function in libgomp, gomp_nvptx_main.  The warps for a
(pre-allocated) pool.  Warp 0 is responsible for initialization and final
cleanup, and proceeds to execute target region functions.  Other warps proceed
to gomp_thread_start.

With these patches, it's possible to have the libgomp testsuite mostly passing.
The failures are as follows:

libgomp.c/target-{1,7,critical-1}.c: segfault in accelerator code

libgomp.c/thread-limit-2.c: fails to link because 'usleep' is unavailable on
NVPTX.  Note, the test does not run anything on the device because the target
region has an 'if (0)' clause.

libgomp.c++/examples-4/declare_target-2.C: libgomp: Can't map target variables
(size mismatch).  Will investigate later.

libgomp.c++/target-1.C: same as libgomp.c/target-1.c, segfault on device.

I didn't run the libgomp/gfortran testsuite yet.  I'd like your input on
dealing with testsuite breaks (XFAIL?).

I have not rebased my private branch in a while, so context in
gcc/config/nvptx is probably out-of-date in places.

Yours,
Alexander


  nvptx: emit kernels for 'omp target entrypoint' only for OpenACC
  nvptx: emit pointers to OpenMP target region entry points
  nvptx: expand support for address spaces
  nvptx: fix output of _Bool global variables
  omp-low: set 'omp target entrypoint' only on entrypoints
  omp-low: copy omp_data_o to shared memory on NVPTX
  libgomp nvptx plugin: launch target functions via gomp_nvptx_main
  libgomp nvptx: populate proc.c
  libgomp: provide barriers on NVPTX
  libgomp: arrange a team of pre-started threads via gomp_nvptx_main
  libgomp: avoid variable-length stack allocation in task.c
  libgomp: fixup error.c on nvptx
  libgomp: provide minimal GOMP_teams
  libgomp: use more generic implementations on nvptx

 gcc/config/nvptx/nvptx.c        |  78 +++++++++++++--
 gcc/omp-low.c                   |  58 +++++++++--
 libgomp/config/nvptx/alloc.c    |   0
 libgomp/config/nvptx/bar.c      | 210 ++++++++++++++++++++++++++++++++++++++++
 libgomp/config/nvptx/bar.h      | 129 +++++++++++++++++++++++-
 libgomp/config/nvptx/barrier.c  |   0
 libgomp/config/nvptx/critical.c |  57 -----------
 libgomp/config/nvptx/error.c    |   0
 libgomp/config/nvptx/iter.c     |   0
 libgomp/config/nvptx/iter_ull.c |   0
 libgomp/config/nvptx/loop.c     |   0
 libgomp/config/nvptx/loop_ull.c |   0
 libgomp/config/nvptx/ordered.c  |   0
 libgomp/config/nvptx/parallel.c |   0
 libgomp/config/nvptx/proc.c     |  40 ++++++++
 libgomp/config/nvptx/single.c   |   0
 libgomp/config/nvptx/target.c   |  39 ++++++++
 libgomp/config/nvptx/task.c     |   0
 libgomp/config/nvptx/team.c     |   0
 libgomp/config/nvptx/work.c     |   0
 libgomp/error.c                 |   5 +
 libgomp/libgomp.h               |  10 +-
 libgomp/plugin/plugin-nvptx.c   |  23 ++++-
 libgomp/task.c                  |   7 +-
 libgomp/team.c                  |  92 +++++++++++++++++-
 25 files changed, 664 insertions(+), 84 deletions(-)
 delete mode 100644 libgomp/config/nvptx/alloc.c
 delete mode 100644 libgomp/config/nvptx/barrier.c
 delete mode 100644 libgomp/config/nvptx/critical.c
 delete mode 100644 libgomp/config/nvptx/error.c
 delete mode 100644 libgomp/config/nvptx/iter.c
 delete mode 100644 libgomp/config/nvptx/iter_ull.c
 delete mode 100644 libgomp/config/nvptx/loop.c
 delete mode 100644 libgomp/config/nvptx/loop_ull.c
 delete mode 100644 libgomp/config/nvptx/ordered.c
 delete mode 100644 libgomp/config/nvptx/parallel.c
 delete mode 100644 libgomp/config/nvptx/single.c
 delete mode 100644 libgomp/config/nvptx/task.c
 delete mode 100644 libgomp/config/nvptx/team.c
 delete mode 100644 libgomp/config/nvptx/work.c

* [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (3 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 08/14] libgomp nvptx: populate proc.c Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-21  0:07   ` Bernd Schmidt
                     ` (2 more replies)
  2015-10-20 18:34 ` [gomp4 11/14] libgomp: avoid variable-length stack allocation in task.c Alexander Monakov
                   ` (11 subsequent siblings)
  16 siblings, 3 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

(This patch serves as a straw man proposal to have something concrete for
discussion and further patches)

On PTX, stack memory is private to each thread.  When the master thread
constructs 'omp_data_o' on its own stack and passes it to other threads by
reference via GOMP_parallel, the other threads cannot use the resulting
pointer.  We need to arrange for structures passed between threads to be in
global, or better, PTX __shared__ memory (private to each CUDA thread block).

We cannot easily adjust expansion of 'omp parallel' because it is done before
LTO streamout.  I've opted to adjust calls to GOMP_parallel in
pass_late_lower_omp instead.

As I see it, there are two possible approaches.  Either arrange for the
structure to be in shared memory from the compiler, or have GOMP_parallel
perform the copies.
The latter requires passing sizeof(omp_data_o) to GOMP_parallel, and also to
GOMP_OFFLOAD_run (to reserve shared memory), so doing it from the compiler
seems simpler.

Using static storage may preclude nested parallelism.  Not sure we want to
support it for offloading anyway (but there needs to be a clear decision).

Using separate variables is wasteful: they should go into a union to reduce
shared memory consumption.
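
A sketch of that union idea (type names hypothetical; each decl would carry
the shared address space just like omp_data_shared in the pass below):

/* One static shared-memory slot overlaying the per-region omp_data_o
   copies, instead of one static variable per parallel region.  */
union omp_data_shared_u
{
  struct omp_data_s_0 region0;
  struct omp_data_s_1 region1;
  /* ... one member per GOMP_parallel call site ...  */
};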

	* omp-low.c (expand_parallel_call): Mark function for
        pass_late_lower_omp transforms.
        (pass_late_lower_omp::execute): Copy omp_data_o to/from
        'shared' memory on NVPTX.
---
 gcc/omp-low.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 6481163..5b75bf6 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -5384,7 +5384,10 @@ expand_parallel_call (struct omp_region *region, basic_block bb,
   if (t == NULL)
     t1 = null_pointer_node;
   else
-    t1 = build_fold_addr_expr (t);
+    {
+      t1 = build_fold_addr_expr (t);
+      cfun->curr_properties &= ~PROP_gimple_lompifn;
+    }
   t2 = build_fold_addr_expr (gimple_omp_parallel_child_fn (entry_stmt));
 
   vec_alloc (args, 4 + vec_safe_length (ws_args));
@@ -14703,15 +14706,51 @@ pass_late_lower_omp::execute (function *fun)
     for (i = gsi_start_bb (bb); !gsi_end_p (i); gsi_next (&i))
       {
 	gimple stmt = gsi_stmt (i);
-	if (!(is_gimple_call (stmt)
-	      && gimple_call_internal_p (stmt)
-	      && gimple_call_internal_fn (stmt) == IFN_GOACC_DATA_END_WITH_ARG))
+
+	if (!is_gimple_call (stmt))
 	  continue;
 
-	tree fn = builtin_decl_explicit (BUILT_IN_GOACC_DATA_END);
-	gimple g = gimple_build_call (fn, 0);
+#ifdef ADDR_SPACE_SHARED
+	/* Transform "GOMP_parallel (fn, &omp_data_o, ...)" call to
+
+	   static __shared__ typeof(omp_data_o) omp_data_shared;
+	   omp_data_shared = omp_data_o;
+	   GOMP_parallel(fn, &omp_data_shared, ...);
+	   omp_data_o = omp_data_shared; */
+	if (gimple_call_builtin_p (stmt, BUILT_IN_GOMP_PARALLEL))
+	  {
+	    tree omp_data_ptr = gimple_call_arg (stmt, 1);
+	    if (TREE_CODE (omp_data_ptr) == ADDR_EXPR)
+	      {
+		tree omp_data = TREE_OPERAND (omp_data_ptr, 0);
+		tree type = TREE_TYPE (omp_data);
+		int quals = ENCODE_QUAL_ADDR_SPACE (ADDR_SPACE_SHARED);
+		type = build_qualified_type (type, quals);
+		tree decl = create_tmp_var (type, "omp_data_shared");
+		TREE_STATIC (decl) = 1;
+		TREE_ADDRESSABLE (decl) = 1;
+		varpool_node::finalize_decl (decl);
+
+		gimple g = gimple_build_assign (decl, omp_data);
+		gsi_insert_before (&i, g, GSI_SAME_STMT);
+
+		g = gimple_build_assign (omp_data, decl);
+		gsi_insert_after (&i, g, GSI_NEW_STMT);
+
+		gimple_call_set_arg (stmt, 1, build_fold_addr_expr (decl));
+	      }
+	    continue;
+	  }
+#endif
+
+	if (gimple_call_internal_p (stmt)
+	    && gimple_call_internal_fn (stmt) == IFN_GOACC_DATA_END_WITH_ARG)
+	  {
+	    tree fn = builtin_decl_explicit (BUILT_IN_GOACC_DATA_END);
+	    gimple g = gimple_build_call (fn, 0);
 
-	gsi_replace (&i, g, false);
+	    gsi_replace (&i, g, false);
+	  }
       }
 
   return TODO_update_ssa;

* [gomp4 08/14] libgomp nvptx: populate proc.c
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (2 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 14/14] libgomp: use more generic implementations on nvptx Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-21  9:15   ` Jakub Jelinek
  2015-10-20 18:34 ` [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX Alexander Monakov
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

This provides minimal implementations of gomp_dynamic_max_threads and
omp_get_num_procs.

	* config/nvptx/proc.c: New.
---
 libgomp/config/nvptx/proc.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/libgomp/config/nvptx/proc.c b/libgomp/config/nvptx/proc.c
index e69de29..6331b8c 100644
--- a/libgomp/config/nvptx/proc.c
+++ b/libgomp/config/nvptx/proc.c
@@ -0,0 +1,40 @@
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains system specific routines related to counting
+   online processors and dynamic load balancing.  */
+
+#include "libgomp.h"
+
+unsigned
+gomp_dynamic_max_threads (void)
+{
+  return gomp_icv (false)->nthreads_var;
+}
+
+int
+omp_get_num_procs (void)
+{
+  return gomp_icv (false)->nthreads_var;
+}

* [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (7 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 03/14] nvptx: expand support for address spaces Alexander Monakov
@ 2015-10-20 18:34 ` Alexander Monakov
  2015-10-20 20:51   ` Bernd Schmidt
  2015-10-20 18:34 ` [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC Alexander Monakov
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

Due to special treatment of types, emitting variables of type _Bool at
global scope is impossible: extern references are emitted with .u8, but
definitions use .u64.  This patch fixes the issue by treating boolean types
the same as integer types.
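
A minimal illustration of the symptom (variable name hypothetical):

/* TU 1: the definition used to be emitted as ".u64 flag".  */
_Bool flag = 1;

/* TU 2: the extern reference is emitted as ".u8 flag", so the two PTX
   declarations conflict when the modules are linked/JITed.  */
extern _Bool flag;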

	* config/nvptx/nvptx.c (init_output_initializer): Also accept
        BOOLEAN_TYPE.
---
 gcc/config/nvptx/nvptx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 779b018..cfb5c4f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1863,6 +1863,7 @@ init_output_initializer (FILE *file, const char *name, const_tree type,
   int sz = int_size_in_bytes (type);
   if ((TREE_CODE (type) != INTEGER_TYPE
        && TREE_CODE (type) != ENUMERAL_TYPE
+       && TREE_CODE (type) != BOOLEAN_TYPE
        && TREE_CODE (type) != REAL_TYPE)
       || sz < 0
       || sz > HOST_BITS_PER_WIDE_INT)

* [gomp4 13/14] libgomp: provide minimal GOMP_teams
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (9 preceding siblings ...)
  2015-10-20 18:34 ` [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC Alexander Monakov
@ 2015-10-20 18:52 ` Alexander Monakov
  2015-10-21 10:12   ` Jakub Jelinek
  2015-10-20 18:52 ` [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main Alexander Monakov
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On NVPTX, we don't need most of target.c functionality, except for GOMP_teams.
Provide it as a copy of the generic implementation for now (it most likely
will need to change down the line: on NVPTX we do need to spawn several
thread blocks for #pragma omp teams).

Alternatively, it might make sense to split GOMP_teams out of target.c into
its own file (teams.c?), leaving target.c a 0-size stub in config/nvptx.

	* config/nvptx/target.c: New.
---
 libgomp/config/nvptx/target.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index e69de29..ad36013 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -0,0 +1,39 @@
+/* Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   Contributed by Jakub Jelinek <jakub@redhat.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "libgomp.h"
+#include <limits.h>
+
+void
+GOMP_teams (unsigned int num_teams, unsigned int thread_limit)
+{
+  if (thread_limit)
+    {
+      struct gomp_task_icv *icv = gomp_icv (true);
+      icv->thread_limit_var
+	= thread_limit > INT_MAX ? UINT_MAX : thread_limit;
+    }
+  (void) num_teams;
+}

* [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (10 preceding siblings ...)
  2015-10-20 18:52 ` [gomp4 13/14] libgomp: provide minimal GOMP_teams Alexander Monakov
@ 2015-10-20 18:52 ` Alexander Monakov
  2015-10-21  9:49   ` Jakub Jelinek
  2015-10-20 18:53 ` [gomp4 09/14] libgomp: provide barriers on NVPTX Alexander Monakov
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

This patch ports team.c to nvptx by arranging an initialization/cleanup
routine, gomp_nvptx_main, that all (pre-started) threads can run.  It
initializes a thread pool and proceeds to run gomp_thread_start in all threads
except thread zero, which runs the original target region function.

Thread-private data is arranged via a linear array, nvptx_thrs, that is
allocated in gomp_nvptx_main.

As in the previous patch, are naked asm() statements OK?

	* libgomp.h [__nvptx__] (gomp_thread): New implementation.
        * config/nvptx/team.c: Delete.
        * team.c: Guard uses of PThreads-specific interfaces by
        LIBGOMP_USE_PTHREADS.
        (gomp_nvptx_main): New.
        (gomp_thread_start) [__nvptx__]: Handle calls from gomp_nvptx_main.
---
 libgomp/config/nvptx/team.c |  0
 libgomp/libgomp.h           | 10 ++++-
 libgomp/team.c              | 92 ++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 96 insertions(+), 6 deletions(-)
 delete mode 100644 libgomp/config/nvptx/team.c

diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
deleted file mode 100644
index e69de29..0000000
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 1454adf..f25b265 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -483,7 +483,15 @@ enum gomp_cancel_kind
 
 /* ... and here is that TLS data.  */
 
-#if defined HAVE_TLS || defined USE_EMUTLS
+#if defined __nvptx__
+extern struct gomp_thread *nvptx_thrs;
+static inline struct gomp_thread *gomp_thread (void)
+{
+  int tid;
+  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
+  return nvptx_thrs + tid;
+}
+#elif defined HAVE_TLS || defined USE_EMUTLS
 extern __thread struct gomp_thread gomp_tls_data;
 static inline struct gomp_thread *gomp_thread (void)
 {
diff --git a/libgomp/team.c b/libgomp/team.c
index 7671b05..5b74532 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -30,6 +30,7 @@
 #include <stdlib.h>
 #include <string.h>
 
+#ifdef LIBGOMP_USE_PTHREADS
 /* This attribute contains PTHREAD_CREATE_DETACHED.  */
 pthread_attr_t gomp_thread_attr;
 
@@ -43,6 +44,7 @@ __thread struct gomp_thread gomp_tls_data;
 #else
 pthread_key_t gomp_tls_key;
 #endif
+#endif
 
 
 /* This structure is used to communicate across pthread_create.  */
@@ -58,6 +60,52 @@ struct gomp_thread_start_data
   bool nested;
 };
 
+#ifdef __nvptx__
+struct gomp_thread *nvptx_thrs;
+
+static struct gomp_thread_pool *gomp_new_thread_pool (void);
+static void *gomp_thread_start (void *);
+
+void __attribute__((kernel))
+gomp_nvptx_main (void (*fn) (void *), void *fn_data)
+{
+  int ntids, tid, laneid;
+  asm ("mov.u32 %0, %%laneid;" : "=r" (laneid));
+  if (laneid)
+    return;
+  static struct gomp_thread_pool *pool;
+  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
+  asm ("mov.u32 %0, %%ntid.y;" : "=r"(ntids));
+  if (tid == 0)
+    {
+      gomp_global_icv.nthreads_var = ntids;
+
+      nvptx_thrs = gomp_malloc_cleared (ntids * sizeof (*nvptx_thrs));
+
+      pool = gomp_new_thread_pool ();
+      pool->threads = gomp_malloc (ntids * sizeof (*pool->threads));
+      pool->threads[0] = nvptx_thrs;
+      pool->threads_size = ntids;
+      pool->threads_used = ntids;
+      gomp_barrier_init (&pool->threads_dock, ntids);
+
+      nvptx_thrs[0].thread_pool = pool;
+      asm ("bar.sync 0;");
+      fn (fn_data);
+
+      gomp_free_thread (nvptx_thrs);
+      free (nvptx_thrs);
+    }
+  else
+    {
+      struct gomp_thread_start_data tsdata = {0};
+      tsdata.ts.team_id = tid;
+      asm ("bar.sync 0;");
+      tsdata.thread_pool = pool;
+      gomp_thread_start (&tsdata);
+    }
+}
+#endif
 
 /* This function is a pthread_create entry point.  This contains the idle
    loop in which a thread waits to be called up to become part of a team.  */
@@ -71,7 +119,9 @@ gomp_thread_start (void *xdata)
   void (*local_fn) (void *);
   void *local_data;
 
-#if defined HAVE_TLS || defined USE_EMUTLS
+#ifdef __nvptx__
+  thr = gomp_thread ();
+#elif defined HAVE_TLS || defined USE_EMUTLS
   thr = &gomp_tls_data;
 #else
   struct gomp_thread local_thr;
@@ -88,7 +138,8 @@ gomp_thread_start (void *xdata)
   thr->task = data->task;
   thr->place = data->place;
 
-  thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;
+  if (thr->ts.team)
+    thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;
 
   /* Make thread pool local. */
   pool = thr->thread_pool;
@@ -110,6 +161,10 @@ gomp_thread_start (void *xdata)
       pool->threads[thr->ts.team_id] = thr;
 
       gomp_barrier_wait (&pool->threads_dock);
+#ifdef __nvptx__
+      local_fn = thr->fn;
+      local_data = thr->data;
+#endif
       do
 	{
 	  struct gomp_team *team = thr->ts.team;
@@ -242,7 +297,13 @@ gomp_free_pool_helper (void *thread_pool)
   gomp_sem_destroy (&thr->release);
   thr->thread_pool = NULL;
   thr->task = NULL;
+#ifdef LIBGOMP_USE_PTHREADS
   pthread_exit (NULL);
+#elif defined(__nvptx__)
+  asm ("exit;");
+#else
+#error gomp_free_pool_helper must terminate the thread
+#endif
 }
 
 /* Free a thread pool and release its threads. */
@@ -300,33 +361,40 @@ void
 gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 		 unsigned flags, struct gomp_team *team)
 {
-  struct gomp_thread_start_data *start_data;
   struct gomp_thread *thr, *nthr;
   struct gomp_task *task;
   struct gomp_task_icv *icv;
   bool nested;
   struct gomp_thread_pool *pool;
   unsigned i, n, old_threads_used = 0;
-  pthread_attr_t thread_attr, *attr;
   unsigned long nthreads_var;
-  char bind, bind_var;
+  char bind_var;
+#ifdef LIBGOMP_USE_PTHREADS
+  char bind;
+  struct gomp_thread_start_data *start_data;
+  pthread_attr_t thread_attr, *attr;
   unsigned int s = 0, rest = 0, p = 0, k = 0;
+#endif
   unsigned int affinity_count = 0;
   struct gomp_thread **affinity_thr = NULL;
 
   thr = gomp_thread ();
   nested = thr->ts.team != NULL;
+#ifdef LIBGOMP_USE_PTHREADS
   if (__builtin_expect (thr->thread_pool == NULL, 0))
     {
       thr->thread_pool = gomp_new_thread_pool ();
       thr->thread_pool->threads_busy = nthreads;
       pthread_setspecific (gomp_thread_destructor, thr);
     }
+#endif
   pool = thr->thread_pool;
   task = thr->task;
   icv = task ? &task->icv : &gomp_global_icv;
+#ifdef LIBGOMP_USE_PTHREADS
   if (__builtin_expect (gomp_places_list != NULL, 0) && thr->place == 0)
     gomp_init_affinity ();
+#endif
 
   /* Always save the previous state, even if this isn't a nested team.
      In particular, we should save any work share state from an outer
@@ -352,10 +420,12 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
   bind_var = icv->bind_var;
   if (bind_var != omp_proc_bind_false && (flags & 7) != omp_proc_bind_false)
     bind_var = flags & 7;
+#ifdef LIBGOMP_USE_PTHREADS
   bind = bind_var;
   if (__builtin_expect (gomp_bind_var_list != NULL, 0)
       && thr->ts.level < gomp_bind_var_list_len)
     bind_var = gomp_bind_var_list[thr->ts.level];
+#endif
   gomp_init_task (thr->task, task, icv);
   team->implicit_task[0].icv.nthreads_var = nthreads_var;
   team->implicit_task[0].icv.bind_var = bind_var;
@@ -365,6 +435,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 
   i = 1;
 
+#ifdef LIBGOMP_USE_PTHREADS
   if (__builtin_expect (gomp_places_list != NULL, 0))
     {
       /* Depending on chosen proc_bind model, set subpartition
@@ -432,6 +503,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
     }
   else
     bind = omp_proc_bind_false;
+#endif
 
   /* We only allow the reuse of idle threads for non-nested PARALLEL
      regions.  This appears to be implied by the semantics of
@@ -481,6 +553,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 	  unsigned int place_partition_off = thr->ts.place_partition_off;
 	  unsigned int place_partition_len = thr->ts.place_partition_len;
 	  unsigned int place = 0;
+#ifdef LIBGOMP_USE_PTHREADS
 	  if (__builtin_expect (gomp_places_list != NULL, 0))
 	    {
 	      switch (bind)
@@ -612,6 +685,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 	      place = p + 1;
 	    }
 	  else
+#endif
 	    nthr = pool->threads[i];
 	  nthr->ts.team = team;
 	  nthr->ts.work_share = &team->work_shares[0];
@@ -635,6 +709,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 	  team->ordered_release[i] = &nthr->release;
 	}
 
+#ifdef LIBGOMP_USE_PTHREADS
       if (__builtin_expect (affinity_thr != NULL, 0))
 	{
 	  /* If AFFINITY_THR is non-NULL just because we had to
@@ -695,9 +770,11 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 
       if (i == nthreads)
 	goto do_release;
+#endif
 
     }
 
+#ifdef LIBGOMP_USE_PTHREADS
   if (__builtin_expect (nthreads + affinity_count > old_threads_used, 0))
     {
       long diff = (long) (nthreads + affinity_count) - (long) old_threads_used;
@@ -829,6 +906,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
     pthread_attr_destroy (&thread_attr);
 
  do_release:
+#endif
   gomp_barrier_wait (nested ? &team->barrier : &pool->threads_dock);
 
   /* Decrease the barrier threshold to match the number of threads
@@ -935,6 +1013,7 @@ gomp_team_end (void)
     }
 }
 
+#ifdef LIBGOMP_USE_PTHREADS
 
 /* Constructors for this file.  */
 
@@ -959,6 +1038,7 @@ team_destructor (void)
      crashes.  */
   pthread_key_delete (gomp_thread_destructor);
 }
+#endif
 
 struct gomp_task_icv *
 gomp_new_icv (void)
@@ -967,6 +1047,8 @@ gomp_new_icv (void)
   struct gomp_task *task = gomp_malloc (sizeof (struct gomp_task));
   gomp_init_task (task, NULL, &gomp_global_icv);
   thr->task = task;
+#ifdef LIBGOMP_USE_PTHREADS
   pthread_setspecific (gomp_thread_destructor, thr);
+#endif
   return &task->icv;
 }

* [gomp4 09/14] libgomp: provide barriers on NVPTX
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (11 preceding siblings ...)
  2015-10-20 18:52 ` [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main Alexander Monakov
@ 2015-10-20 18:53 ` Alexander Monakov
  2015-10-20 20:56   ` Bernd Schmidt
  2015-10-21  9:39   ` Jakub Jelinek
  2015-10-20 19:01 ` [gomp4 02/14] nvptx: emit pointers to OpenMP target region entry points Alexander Monakov
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 18:53 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On NVPTX, there are 16 hardware barriers for each thread team, and each
barrier has a variable waiter count.  The instruction 'bar.sync N, M;' waits
on barrier number N until M threads have arrived.  M must be pre-multiplied
by the warp width.  It's also possible to 'post' the barrier without
suspending via 'bar.arrive'.
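
A sketch of the inline-asm idiom this enables (wrapper name hypothetical; the
patch below open-codes the asm):

/* Wait on hardware barrier 0 until NWARPS warps have arrived; bar.sync
   counts threads, so the count is pre-multiplied by the warp width.  */
static inline void
nvptx_bar_sync (int nwarps)
{
  asm volatile ("bar.sync 0, %0;" : : "r" (32 * nwarps) : "memory");
}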

We should be able to provide the gomp barrier via a combination of PTX
barriers and atomics.  This patch is a first step in that direction.

It's mostly a copy of the Linux implementation, and it's very likely that
functions more complex than gomp_barrier_wait_end are implemented incorrectly.
I will have to review all of that (and optimize, hopefully).

I'm not sure if naked asm()'s are OK.  It's possible to implement a builtin
instead for a minor beautification.  Thoughts?

---
 libgomp/config/nvptx/bar.c | 210 +++++++++++++++++++++++++++++++++++++++++++++
 libgomp/config/nvptx/bar.h | 129 +++++++++++++++++++++++++++-
 2 files changed, 338 insertions(+), 1 deletion(-)

diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index e69de29..5631cb4 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -0,0 +1,210 @@
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This is an NVPTX specific implementation of a barrier synchronization
+   mechanism for libgomp.  This type is private to the library.  This
+   implementation uses atomic instructions and bar.sync instruction.  */
+
+#include <limits.h>
+#include "libgomp.h"
+
+
+void
+gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
+			MEMMODEL_RELEASE);
+    }
+  asm ("bar.sync 0, %0;" : : "r"(32*bar->total));
+}
+
+void
+gomp_barrier_wait (gomp_barrier_t *bar)
+{
+  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
+}
+
+/* Like gomp_barrier_wait, except that if the encountering thread
+   is not the last one to hit the barrier, it returns immediately.
+   The intended usage is that a thread which intends to gomp_barrier_destroy
+   this barrier calls gomp_barrier_wait, while all other threads
+   call gomp_barrier_wait_last.  When gomp_barrier_wait returns,
+   the barrier can be safely destroyed.  */
+
+void
+gomp_barrier_wait_last (gomp_barrier_t *bar)
+{
+#if 0
+  gomp_barrier_state_t state = gomp_barrier_wait_start (bar);
+  if (state & BAR_WAS_LAST)
+    gomp_barrier_wait_end (bar, state);
+#else
+  gomp_barrier_wait (bar);
+#endif
+}
+
+void
+gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
+{
+  asm ("bar.sync 0, %0;" : : "r"(32*bar->total));
+}
+
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  unsigned int generation, gen;
+
+  gomp_barrier_wait_end (bar, state);
+
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      struct gomp_thread *thr = gomp_thread ();
+      struct gomp_team *team = thr->ts.team;
+
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
+      if (__builtin_expect (team->task_count, 0))
+	{
+	  gomp_barrier_handle_tasks (state);
+	  state &= ~BAR_WAS_LAST;
+	}
+      else
+	{
+	  state &= ~BAR_CANCELLED;
+	  state += BAR_INCR - BAR_WAS_LAST;
+	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
+	  asm ("bar.sync 0, %0;" : : "r"(32*bar->total));
+	  return;
+	}
+    }
+
+  generation = state;
+  state &= ~BAR_CANCELLED;
+  do
+    {
+      asm ("bar.sync 0, %0;" : : "r"(32*bar->total));
+      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
+	{
+	  gomp_barrier_handle_tasks (state);
+	  gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+	}
+      generation |= gen & BAR_WAITING_FOR_TASK;
+    }
+  while (gen != state + BAR_INCR);
+}
+
+void
+gomp_team_barrier_wait (gomp_barrier_t *bar)
+{
+  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
+}
+
+void
+gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+{
+  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    bar->awaited_final = bar->total;
+  gomp_team_barrier_wait_end (bar, state);
+}
+
+bool
+gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
+				   gomp_barrier_state_t state)
+{
+  unsigned int generation, gen;
+
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      /* BAR_CANCELLED should never be set in state here, because
+	 cancellation means that at least one of the threads has been
+	 cancelled, thus on a cancellable barrier we should never see
+	 all threads to arrive.  */
+      struct gomp_thread *thr = gomp_thread ();
+      struct gomp_team *team = thr->ts.team;
+
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
+      if (__builtin_expect (team->task_count, 0))
+	{
+	  gomp_barrier_handle_tasks (state);
+	  state &= ~BAR_WAS_LAST;
+	}
+      else
+	{
+	  state += BAR_INCR - BAR_WAS_LAST;
+	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
+	  asm ("bar.sync 0, %0;" : : "r"(32*bar->total));
+	  return false;
+	}
+    }
+
+  if (__builtin_expect (state & BAR_CANCELLED, 0))
+    return true;
+
+  generation = state;
+  do
+    {
+      asm ("bar.sync 0, %0;" : : "r"(32*bar->total));
+      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (gen & BAR_CANCELLED, 0))
+	return true;
+      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
+	{
+	  gomp_barrier_handle_tasks (state);
+	  gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+	}
+      generation |= gen & BAR_WAITING_FOR_TASK;
+    }
+  while (gen != state + BAR_INCR);
+
+  return false;
+}
+
+bool
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
+{
+  return gomp_team_barrier_wait_cancel_end (bar, gomp_barrier_wait_start (bar));
+}
+
+void
+gomp_team_barrier_cancel (struct gomp_team *team)
+{
+  gomp_mutex_lock (&team->task_lock);
+  if (team->barrier.generation & BAR_CANCELLED)
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+  team->barrier.generation |= BAR_CANCELLED;
+  gomp_mutex_unlock (&team->task_lock);
+  gomp_team_barrier_wake (&team->barrier, INT_MAX);
+}
diff --git a/libgomp/config/nvptx/bar.h b/libgomp/config/nvptx/bar.h
index 009d85f..bbdc466 100644
--- a/libgomp/config/nvptx/bar.h
+++ b/libgomp/config/nvptx/bar.h
@@ -24,15 +24,142 @@
 
 /* This is an NVPTX specific implementation of a barrier synchronization
    mechanism for libgomp.  This type is private to the library.  This
-   implementation is a stub, for now.  */
+   implementation uses atomic instructions and the bar.sync instruction.  */
 
 #ifndef GOMP_BARRIER_H
 #define GOMP_BARRIER_H 1
 
+#include "mutex.h"
+
 typedef struct
 {
+  unsigned total;
+  unsigned generation;
+  unsigned awaited;
+  unsigned awaited_final;
 } gomp_barrier_t;
 
 typedef unsigned int gomp_barrier_state_t;
 
+/* The generation field contains a counter in the high bits, with a few
+   low bits dedicated to flags.  Note that TASK_PENDING and WAS_LAST can
+   share space because WAS_LAST is never stored back to generation.  */
+#define BAR_TASK_PENDING	1
+#define BAR_WAS_LAST		1
+#define BAR_WAITING_FOR_TASK	2
+#define BAR_CANCELLED		4
+#define BAR_INCR		8
+
+static inline void gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
+{
+  bar->total = count;
+  bar->awaited = count;
+  bar->awaited_final = count;
+  bar->generation = 0;
+}
+
+static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
+{
+  __atomic_add_fetch (&bar->awaited, count - bar->total, MEMMODEL_ACQ_REL);
+  bar->total = count;
+}
+
+static inline void gomp_barrier_destroy (gomp_barrier_t *bar)
+{
+}
+
+extern void gomp_barrier_wait (gomp_barrier_t *);
+extern void gomp_barrier_wait_last (gomp_barrier_t *);
+extern void gomp_barrier_wait_end (gomp_barrier_t *, gomp_barrier_state_t);
+extern void gomp_team_barrier_wait (gomp_barrier_t *);
+extern void gomp_team_barrier_wait_final (gomp_barrier_t *);
+extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
+					gomp_barrier_state_t);
+extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
+extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
+					       gomp_barrier_state_t);
+extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
+struct gomp_team;
+extern void gomp_team_barrier_cancel (struct gomp_team *);
+
+static inline gomp_barrier_state_t
+gomp_barrier_wait_start (gomp_barrier_t *bar)
+{
+  unsigned int ret = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+  ret &= -BAR_INCR | BAR_CANCELLED;
+  /* A memory barrier is needed before exiting from the various forms
+     of gomp_barrier_wait, to satisfy OpenMP API version 3.1 section
+     2.8.6 flush Construct, which says there is an implicit flush during
+     a barrier region.  This is a convenient place to add the barrier,
+     so we use MEMMODEL_ACQ_REL here rather than MEMMODEL_ACQUIRE.  */
+  if (__atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0)
+    ret |= BAR_WAS_LAST;
+  return ret;
+}
+
+static inline gomp_barrier_state_t
+gomp_barrier_wait_cancel_start (gomp_barrier_t *bar)
+{
+  return gomp_barrier_wait_start (bar);
+}
+
+/* This is like gomp_barrier_wait_start, except it decrements
+   bar->awaited_final rather than bar->awaited and should be used
+   for the gomp_team_end barrier only.  */
+static inline gomp_barrier_state_t
+gomp_barrier_wait_final_start (gomp_barrier_t *bar)
+{
+  unsigned int ret = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+  ret &= -BAR_INCR | BAR_CANCELLED;
+  /* See above gomp_barrier_wait_start comment.  */
+  if (__atomic_add_fetch (&bar->awaited_final, -1, MEMMODEL_ACQ_REL) == 0)
+    ret |= BAR_WAS_LAST;
+  return ret;
+}
+
+static inline bool
+gomp_barrier_last_thread (gomp_barrier_state_t state)
+{
+  return state & BAR_WAS_LAST;
+}
+
+/* All the inlines below must be called with team->task_lock
+   held.  */
+
+static inline void
+gomp_team_barrier_set_task_pending (gomp_barrier_t *bar)
+{
+  bar->generation |= BAR_TASK_PENDING;
+}
+
+static inline void
+gomp_team_barrier_clear_task_pending (gomp_barrier_t *bar)
+{
+  bar->generation &= ~BAR_TASK_PENDING;
+}
+
+static inline void
+gomp_team_barrier_set_waiting_for_tasks (gomp_barrier_t *bar)
+{
+  bar->generation |= BAR_WAITING_FOR_TASK;
+}
+
+static inline bool
+gomp_team_barrier_waiting_for_tasks (gomp_barrier_t *bar)
+{
+  return (bar->generation & BAR_WAITING_FOR_TASK) != 0;
+}
+
+static inline bool
+gomp_team_barrier_cancelled (gomp_barrier_t *bar)
+{
+  return __builtin_expect ((bar->generation & BAR_CANCELLED) != 0, 0);
+}
+
+static inline void
+gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  bar->generation = (state & -BAR_INCR) + BAR_INCR;
+}
+
 #endif /* GOMP_BARRIER_H */

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [gomp4 02/14] nvptx: emit pointers to OpenMP target region entry points
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (12 preceding siblings ...)
  2015-10-20 18:53 ` [gomp4 09/14] libgomp: provide barriers on NVPTX Alexander Monakov
@ 2015-10-20 19:01 ` Alexander Monakov
  2015-10-21  7:55 ` [gomp4 00/14] NVPTX: further porting Martin Jambor
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 19:01 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

Note: this patch will have to be more complex if we go with 'approach 2'
described in a later patch, 07/14 "launch target functions via gomp_nvptx_main".

For OpenMP offloading, libgomp invokes 'gomp_nvptx_main' as the accelerator
kernel, passing it a pointer to the outlined target region function.  That
function needs to be a device function (.func rather than .kernel), unless we
want to bump the GPU requirement from sm_30 to sm_35.  To retrieve the device
function pointer from the host side, we need to emit a global-visibility
pointer to that function in PTX.  The naming scheme that derives the pointer
name from the function name is an ABI item shared between the backend and the
libgomp plugin.
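
For illustration, for a target region outlined as foo$_omp_fn$0 the backend
would emit roughly the following (a sketch assuming 64-bit Pmode; the mangled
name is compiler-generated):

    // BEGIN GLOBAL FUNCTION DECL: gomp_nvptx_main
    .visible .global .u64 __ptr_foo$_omp_fn$0 = foo$_omp_fn$0;

The plugin can then resolve __ptr_foo$_omp_fn$0 with cuModuleGetGlobal and
read the .func pointer back with cuMemcpyDtoH.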

	* config/nvptx/nvptx.c (write_libgomp_anchor): New function.  Use it...
	(nvptx_declare_function_name): ...here to emit pointers for libgomp.
---
 gcc/config/nvptx/nvptx.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index df7b61f..a619e4c 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -657,6 +657,26 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* For function DECL outlined for an OpenMP 'target' region, emit a global
+   pointer: void *__ptr_NAME = NAME; to be used in the libgomp nvptx plugin.  */
+
+static void
+write_libgomp_anchor (std::stringstream &s, const char *name, const_tree decl)
+{
+  if (!lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)))
+    return;
+
+  /* OpenMP target regions are entered via gomp_nvptx_main.  */
+  static bool gomp_nvptx_main_declared;
+  if (!gomp_nvptx_main_declared)
+    {
+      gomp_nvptx_main_declared = true;
+      s << "// BEGIN GLOBAL FUNCTION DECL: gomp_nvptx_main\n";
+    }
+  s << ".visible .global " << nvptx_ptx_type_from_mode (Pmode, false);
+  s << " __ptr_" << name << " = " << name << ";\n";
+}
+
 /* Implement ASM_DECLARE_FUNCTION_NAME.  Writes the start of a ptx
    function, including local var decls and copies from the arguments to
    local regs.  */
@@ -671,6 +691,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 
   std::stringstream s;
   write_function_decl_and_comment (s, name, decl);
+  if (flag_openmp)
+    write_libgomp_anchor (s, name, decl);
   s << "// BEGIN";
   if (TREE_PUBLIC (decl))
     s << " GLOBAL";

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 11/14] libgomp: avoid variable-length stack allocation in team.c
  2015-10-20 18:34 ` [gomp4 11/14] libgomp: avoid variable-length stack allocation in team.c Alexander Monakov
@ 2015-10-20 20:48   ` Bernd Schmidt
  2015-10-20 21:41     ` Alexander Monakov
  2015-10-21  9:59   ` Jakub Jelinek
  1 sibling, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 20:48 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> NVPTX does not support alloca or variable-length stack allocations, thus
> heap allocation needs to be used instead.  I've opted to make this a generic
> change instead of guarding it with an #ifdef: libgomp usually leaves thread
> stack size up to libc, so avoiding unbounded stack allocation makes sense.
>
> 	* task.c (GOMP_task): Use a fixed-size on-stack buffer or a heap
>          allocation instead of a variable-size on-stack allocation.

> +	  char buf_fixed[2048], *buf = buf_fixed;

This might also not be the best of ideas on a GPU - the stack size isn't 
all that unlimited, what with there being lots of threads. If I do

   size_t stack, heap;
   cuCtxGetLimit (&stack, CU_LIMIT_STACK_SIZE);

in the nvptx-run program we've used for testing, it shows a default 
stack size of just 1kB.
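
A plugin could query and raise that limit before launching, along these
lines (a sketch only; error handling omitted, and it assumes a current CUDA
context):

    #include <cuda.h>

    size_t stack;
    /* Query the per-thread stack size, then bump it if it looks too small
       for the kernel about to be launched.  */
    cuCtxGetLimit (&stack, CU_LIMIT_STACK_SIZE);
    if (stack < 8 * 1024)
      cuCtxSetLimit (CU_LIMIT_STACK_SIZE, 8 * 1024);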


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-20 18:34 ` [gomp4 04/14] nvptx: fix output of _Bool global variables Alexander Monakov
@ 2015-10-20 20:51   ` Bernd Schmidt
  2015-10-20 21:04     ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 20:51 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> Due to special treatment of types, emitting variables of type _Bool in
> global scope is impossible: extern references are emitted with .u8, but
> definitions use .u64.  This patch fixes the issue by treating boolean types
> as integer types.
>
> 	* config/nvptx/nvptx.c (init_output_initializer): Also accept
>          BOOLEAN_TYPE.

Interesting, what was the testcase? I didn't stumble over this one. In 
any case, I think this patch is ok for trunk.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 09/14] libgomp: provide barriers on NVPTX
  2015-10-20 18:53 ` [gomp4 09/14] libgomp: provide barriers on NVPTX Alexander Monakov
@ 2015-10-20 20:56   ` Bernd Schmidt
  2015-10-20 22:00     ` Alexander Monakov
  2015-10-21  9:39   ` Jakub Jelinek
  1 sibling, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 20:56 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> On NVPTX, there are 16 hardware barriers for each thread team, and each
> barrier has a variable waiter count.  The instruction 'bar.sync N, M;'
> allows waiting on barrier number N until M threads have arrived.  M should
> be pre-multiplied by the warp width.  It's also possible to 'post' the
> barrier without suspending with 'bar.arrive'.
>
> We should be able to provide gomp barrier via a combination of ptx barriers
> and atomics.  This patch is a first step in that direction.
>
> It's mostly a copy of the Linux implementation, and it's very likely that
> functions more complex than gomp_barrier_wait_end are implemented incorrectly.
> I will have to review all of that (and optimize, hopefully).
>
> I'm not sure if naked asm()'s are OK.  It's possible to implement a builtin
> instead for a minor beautification.  Thoughts?

I have no concerns about naked asms. I'm more concerned about whether 
this actually works - how much testing has this had? My experience has 
been that there is practically no way of using bar.sync reliably, since 
we can't control warp divergence and reconvergence at the ptx level, but 
the hardware bar.sync instruction only works when executed by all 
threads in a warp at the same time.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 03/14] nvptx: expand support for address spaces
  2015-10-20 18:34 ` [gomp4 03/14] nvptx: expand support for address spaces Alexander Monakov
@ 2015-10-20 20:56   ` Bernd Schmidt
  2015-10-20 21:06     ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 20:56 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> This allows emitting decls in 'shared' memory from the middle-end.
>
> 	* config/nvptx/nvptx.c (nvptx_legitimate_address_p): Adjust prototype.
>          (nvptx_section_for_decl): If type of decl has a specific address
>          space, return it.
>          (nvptx_addr_space_from_address): Ditto.
>          (TARGET_ADDR_SPACE_POINTER_MODE): Define.
>          (TARGET_ADDR_SPACE_ADDRESS_MODE): Ditto.
>          (TARGET_ADDR_SPACE_SUBSET_P): Ditto.
>          (TARGET_ADDR_SPACE_CONVERT): Ditto.
>          (TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P): Ditto.

Not a fan of this, I'm afraid. I used to have address space support in 
the nvptx backend, but the middle-end was too broken for it to work, so 
I made nvptx deal with all the address space complications internally. 
Is there a reason why this approach can't work for what you want to do? 
(Also, where are you using this?)


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-20 20:51   ` Bernd Schmidt
@ 2015-10-20 21:04     ` Alexander Monakov
  2015-10-28 16:56       ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 21:04 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Tue, 20 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > Due to special treatment of types, emitting variables of type _Bool in
> > global scope is impossible: extern references are emitted with .u8, but
> > definitions use .u64.  This patch fixes the issue by treating boolean
> > types as integer types.
> >
> >  * config/nvptx/nvptx.c (init_output_initializer): Also accept
> >          BOOLEAN_TYPE.
> 
> Interesting, what was the testcase? I didn't stumble over this one. In any
> case, I think this patch is ok for trunk.

libgomp has 'bool gomp_cancel_var' in global scope, and since it is not
compiled with -ffunction-sections, GOMP_parallel pulls in
GOMP_cancel (same TU, parallel.c), which references the variable.  Anything
with "#pragma omp parallel" would fail to link.

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 03/14] nvptx: expand support for address spaces
  2015-10-20 20:56   ` Bernd Schmidt
@ 2015-10-20 21:06     ` Alexander Monakov
  2015-10-20 21:13       ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 21:06 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Tue, 20 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > This allows emitting decls in 'shared' memory from the middle-end.
> >
> >  * config/nvptx/nvptx.c (nvptx_legitimate_address_p): Adjust prototype.
> >          (nvptx_section_for_decl): If type of decl has a specific address
> >          space, return it.
> >          (nvptx_addr_space_from_address): Ditto.
> >          (TARGET_ADDR_SPACE_POINTER_MODE): Define.
> >          (TARGET_ADDR_SPACE_ADDRESS_MODE): Ditto.
> >          (TARGET_ADDR_SPACE_SUBSET_P): Ditto.
> >          (TARGET_ADDR_SPACE_CONVERT): Ditto.
> >          (TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P): Ditto.
> 
> Not a fan of this I'm afraid. I used to have address space support in the
> nvptx backend, but the middle-end was too broken for it to work, so I made
> nvptx deal with all the address space complications internally. Is there a
> reason why this approach can't work for what you want to do? (Also, where are
> you using this?)

It is used in patch 06/14, to copy omp_data_o to shared memory.  I don't see
any other sane approach.  Note that patch 06/14 itself will need to be cleaned
up, presumably to use a target hook in pass_late_omp_lower.  I expect there
will be other instances where we'll need to place something into ptx shared
memory from the middle end, so having something readily usable without much
churn would be nice.
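
For reference, the middle-end side is essentially retyping the decl into the
'shared' address space, something along these lines (simplified; the exact
form in patch 06/14 may differ):

    /* Requalify VAR's type with the 'shared' address space so that the
       backend places the decl in .shared.  */
    tree type = TREE_TYPE (var);
    int quals = TYPE_QUALS (type)
		| ENCODE_QUAL_ADDR_SPACE (ADDR_SPACE_SHARED);
    TREE_TYPE (var) = build_qualified_type (type, quals);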

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main
  2015-10-20 18:34 ` [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main Alexander Monakov
@ 2015-10-20 21:12   ` Bernd Schmidt
  2015-10-20 21:19     ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 21:12 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> The approach I've taken in libgomp/nvptx is to have a single entry point,
> gomp_nvptx_main, that can take care of initial allocation, transferring
> control to target region function, and finalization.
>
> At the moment it has the prototype:
> void gomp_nvptx_main(void (*fn)(void*), void *fndata);
>
> but it's plausible that down the road we'll need other arguments for passing
> data allocated by the plugin.
>
> I see two possible ways to arrange that.
>
> 1.  Make gomp_nvptx_main a .kernel function.  This is what this patch assumes.
> This requires emitting pointers-to-target-region-functions from the compiler,
> and looking them up via cuModuleGetGlobal/cuMemcpyDtoH in the plugin.
>
> 2.  Make gomp_nvptx_main a device (.func) function.  To have that work, we'd
> need to additionally emit a "trampoline" of sorts in the NVPTX backend.  For
> each OpenMP target entrypoint foo$_omp_fn$0, we'd have to additionally emit
>
> __global__ void foo$_omp_fn$0$entry(void *args)
> {
>     gomp_nvptx_main(foo$_omp_fn$0, args);
> }

Wouldn't it be simpler to generate a .kernel for every target region 
function (as OpenACC does)? That could be a small stub in each case 
which just calls gomp_nvptx_main with the right function pointer. We 
already have the machinery to look up the right kernel corresponding to 
a host address and invoke it, so I think we should just reuse that 
functionality.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 03/14] nvptx: expand support for address spaces
  2015-10-20 21:06     ` Alexander Monakov
@ 2015-10-20 21:13       ` Bernd Schmidt
  2015-10-20 21:41         ` Cesar Philippidis
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 21:13 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik, Cesar Philippidis

On 10/20/2015 11:04 PM, Alexander Monakov wrote:
> On Tue, 20 Oct 2015, Bernd Schmidt wrote:
>
>> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
>>> This allows to emit decls in 'shared' memory from the middle-end.
>>>
>>>   * config/nvptx/nvptx.c (nvptx_legitimate_address_p): Adjust prototype.
>>>           (nvptx_section_for_decl): If type of decl has a specific address
>>>           space, return it.
>>>           (nvptx_addr_space_from_address): Ditto.
>>>           (TARGET_ADDR_SPACE_POINTER_MODE): Define.
>>>           (TARGET_ADDR_SPACE_ADDRESS_MODE): Ditto.
>>>           (TARGET_ADDR_SPACE_SUBSET_P): Ditto.
>>>           (TARGET_ADDR_SPACE_CONVERT): Ditto.
>>>           (TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P): Ditto.
>>
>> Not a fan of this I'm afraid. I used to have address space support in the
>> nvptx backend, but the middle-end was too broken for it to work, so I made
>> nvptx deal with all the address space complications internally. Is there a
>> reason why this approach can't work for what you want to do? (Also, where are
>> you using this?)
>
> It is used in patch 06/14, to copy omp_data_o to shared memory.  I don't see
> any other sane approach.

There is an alternative - decorate anything you'd like to go to shared 
memory with a special attribute, then handle that attribute in 
nvptx_addr_space_from_address and nvptx_section_for_decl. I actually 
made such a patch for Cesar a while ago, maybe he still has it?

This would avoid the pitfalls with gcc's middle-end address space 
handling, and the #ifdef ADDR_SPACE_SHARED in patch 6, which is a bit ugly.
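
Roughly, the backend side of the attribute check would look like this (a
sketch from memory; the exact attribute name is whatever we settle on):

    static addr_space_t
    nvptx_addr_space_for_decl (const_tree decl)
    {
      /* Decls carrying the special attribute live in .shared.  */
      if (lookup_attribute ("oacc ganglocal", DECL_ATTRIBUTES (decl)))
        return ADDR_SPACE_SHARED;
      return ADDR_SPACE_GLOBAL;
    }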


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main
  2015-10-20 21:12   ` Bernd Schmidt
@ 2015-10-20 21:19     ` Alexander Monakov
  2015-10-20 21:27       ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 21:19 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Tue, 20 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > 2.  Make gomp_nvptx_main a device (.func) function.  To have that work, we'd
> > need to additionally emit a "trampoline" of sorts in the NVPTX backend.  For
> > each OpenMP target entrypoint foo$_omp_fn$0, we'd have to additionally emit
> >
> > __global__ void foo$_omp_fn$0$entry(void *args)
> > {
> >     gomp_nvptx_main(foo$_omp_fn$0, args);
> > }
> 
> Wouldn't it be simpler to generate a .kernel for every target region function
> (as OpenACC does)? That could be a small stub in each case which just calls
> gomp_nvptx_main with the right function pointer. We already have the machinery
> to look up the right kernel corresponding to a host address and invoke it, so
> I think we should just reuse that functionality.

As I see it, we are describing the same thing in different words.

In what you describe, and in my quoted paragraph, both gomp_nvptx_main and the
function originally outlined for a target region are device-only (.func)
functions.  The .kernel function that the plugin looks up and launches is a
small piece of code that calls gomp_nvptx_main, passing it a pointer to the
target region function.

Unless I didn't fully catch what you said?  As I said in the email, I do like
this approach more.

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main
  2015-10-20 21:19     ` Alexander Monakov
@ 2015-10-20 21:27       ` Bernd Schmidt
  2015-10-21  9:07         ` Jakub Jelinek
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 21:27 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/20/2015 11:13 PM, Alexander Monakov wrote:
> On Tue, 20 Oct 2015, Bernd Schmidt wrote:
>
>> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
>>> 2.  Make gomp_nvptx_main a device (.func) function.  To have that work, we'd
>>> need to additionally emit a "trampoline" of sorts in the NVPTX backend.  For
>>> each OpenMP target entrypoint foo$_omp_fn$0, we'd have to additionally emit
>>>
>>> __global__ void foo$_omp_fn$0$entry(void *args)
>>> {
>>>      gomp_nvptx_main(foo$_omp_fn$0, args);
>>> }
>>
>> Wouldn't it be simpler to generate a .kernel for every target region function
>> (as OpenACC does)? That could be a small stub in each case which just calls
>> gomp_nvptx_main with the right function pointer. We already have the machinery
>> to look up the right kernel corresponding to a host address and invoke it, so
>> I think we should just reuse that functionality.
>
> As I see we are describing the same thing in different words.
>
> In what you describe, and in my quoted paragraph, both gomp_nvptx_main and the
> function originally outlined for a target region are device-only (.func)
> functions.  The .kernel function that the plugin looks up and launches is a
> small piece of code that calls gomp_nvptx_main, passing it a pointer to the
> target region function.
>
> Unless I didn't fully catch what you say?  Like I said in the email, I do like
> this approach more.

Could be that we're talking about the same thing. I think I was confused 
by a reference to .func vs .kernel and sm_30 vs sm_35 in patch 2/14. So 
let's go for this approach.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 03/14] nvptx: expand support for address spaces
  2015-10-20 21:13       ` Bernd Schmidt
@ 2015-10-20 21:41         ` Cesar Philippidis
  2015-10-20 21:51           ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Cesar Philippidis @ 2015-10-20 21:41 UTC (permalink / raw)
  To: Bernd Schmidt, Alexander Monakov
  Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]

On 10/20/2015 02:13 PM, Bernd Schmidt wrote:
> On 10/20/2015 11:04 PM, Alexander Monakov wrote:
>> On Tue, 20 Oct 2015, Bernd Schmidt wrote:
>>
>>> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
>>>> This allows to emit decls in 'shared' memory from the middle-end.
>>>>
>>>>   * config/nvptx/nvptx.c (nvptx_legitimate_address_p): Adjust
>>>> prototype.
>>>>           (nvptx_section_for_decl): If type of decl has a specific
>>>> address
>>>>           space, return it.
>>>>           (nvptx_addr_space_from_address): Ditto.
>>>>           (TARGET_ADDR_SPACE_POINTER_MODE): Define.
>>>>           (TARGET_ADDR_SPACE_ADDRESS_MODE): Ditto.
>>>>           (TARGET_ADDR_SPACE_SUBSET_P): Ditto.
>>>>           (TARGET_ADDR_SPACE_CONVERT): Ditto.
>>>>           (TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P): Ditto.
>>>
>>> Not a fan of this I'm afraid. I used to have address space support in
>>> the
>>> nvptx backend, but the middle-end was too broken for it to work, so I
>>> made
>>> nvptx deal with all the address space complications internally. Is
>>> there a
>>> reason why this approach can't work for what you want to do? (Also,
>>> where are
>>> you using this?)
>>
>> It is used in patch 06/14, to copy omp_data_o to shared memory.  I
>> don't see
>> any other sane approach.
> 
> There is an alternative - decorate anything you'd like to go to shared
> memory with a special attribute, then handled that attribute in
> nvptx_addr_space_from_address and nvptx_section_for_decl. I actually
> made such a patch for Cesar a while ago, maybe he still has it?
> 
> This would avoid the pitfalls with gcc's middle-end address space
> handling, and the #ifdef ADDR_SPACE_SHARED in patch 6 which is a bit ugly.

Was it this one that you're referring to, Bernd? I think this is the
patch that introduces the "oacc ganglocal" attribute. It has bitrotted
significantly, though.

Regardless, keep in mind that we're abandoning dynamically allocated
shared memory in gcc 6.0. Right now in gomp-4_0-branch the two use cases
for shared memory are spill-and-fill for worker variable broadcasting
and worker reductions.

What are you planning on using shared memory for? It's an extremely
limited resource and it has some quirks.

Cesar

[-- Attachment #2: forcesar.diff --]
[-- Type: text/x-patch, Size: 60688 bytes --]

Index: gcc/cgraphunit.c
===================================================================
--- gcc/cgraphunit.c	(revision 224547)
+++ gcc/cgraphunit.c	(working copy)
@@ -2171,6 +2171,23 @@ ipa_passes (void)
       execute_ipa_pass_list (passes->all_small_ipa_passes);
       if (seen_error ())
 	return;
+
+      if (g->have_offload)
+	{
+	  extern void write_offload_lto ();
+	  section_name_prefix = OFFLOAD_SECTION_NAME_PREFIX;
+	  write_offload_lto ();
+	}
+    }
+  bool do_local_opts = !in_lto_p;
+#ifdef ACCEL_COMPILER
+  do_local_opts = true;
+#endif
+  if (do_local_opts)
+    {
+      execute_ipa_pass_list (passes->all_local_opt_passes);
+      if (seen_error ())
+	return;
     }
 
   /* This extra symtab_remove_unreachable_nodes pass tends to catch some
@@ -2182,7 +2199,7 @@ ipa_passes (void)
   if (symtab->state < IPA_SSA)
     symtab->state = IPA_SSA;
 
-  if (!in_lto_p)
+  if (do_local_opts)
     {
       /* Generate coverage variables and constructors.  */
       coverage_finish ();
@@ -2285,6 +2302,14 @@ symbol_table::compile (void)
   if (seen_error ())
     return;
 
+#ifdef ACCEL_COMPILER
+  {
+    cgraph_node *node;
+    FOR_EACH_DEFINED_FUNCTION (node)
+      node->get_untransformed_body ();
+  }
+#endif
+
 #ifdef ENABLE_CHECKING
   symtab_node::verify_symtab_nodes ();
 #endif
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 224547)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -1171,18 +1171,42 @@ nvptx_section_from_addr_space (addr_spac
     }
 }
 
-/* Determine whether DECL goes into .const or .global.  */
+/* Determine the address space DECL lives in.  */
 
-const char *
-nvptx_section_for_decl (const_tree decl)
+static addr_space_t
+nvptx_addr_space_for_decl (const_tree decl)
 {
+  if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL)
+    return ADDR_SPACE_GENERIC;
+
+  if (lookup_attribute ("oacc ganglocal", DECL_ATTRIBUTES (decl)) != NULL_TREE)
+    return ADDR_SPACE_SHARED;
+
   bool is_const = (CONSTANT_CLASS_P (decl)
 		   || TREE_CODE (decl) == CONST_DECL
 		   || TREE_READONLY (decl));
   if (is_const)
-    return ".const";
+    return ADDR_SPACE_CONST;
 
-  return ".global";
+  return ADDR_SPACE_GLOBAL;
+}
+
+/* Return a ptx string representing the address space for a variable DECL.  */
+
+const char *
+nvptx_section_for_decl (const_tree decl)
+{
+  switch (nvptx_addr_space_for_decl (decl))
+    {
+    case ADDR_SPACE_CONST:
+      return ".const";
+    case ADDR_SPACE_SHARED:
+      return ".shared";
+    case ADDR_SPACE_GLOBAL:
+      return ".global";
+    default:
+      gcc_unreachable ();
+    }
 }
 
 /* Look for a SYMBOL_REF in ADDR and return the address space to be used
@@ -1196,17 +1220,7 @@ nvptx_addr_space_from_address (rtx addr)
   if (GET_CODE (addr) != SYMBOL_REF)
     return ADDR_SPACE_GENERIC;
 
-  tree decl = SYMBOL_REF_DECL (addr);
-  if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL)
-    return ADDR_SPACE_GENERIC;
-
-  bool is_const = (CONSTANT_CLASS_P (decl)
-		   || TREE_CODE (decl) == CONST_DECL
-		   || TREE_READONLY (decl));
-  if (is_const)
-    return ADDR_SPACE_CONST;
-
-  return ADDR_SPACE_GLOBAL;
+  return nvptx_addr_space_for_decl (SYMBOL_REF_DECL (addr));
 }
 \f
 /* Machinery to output constant initializers.  */
Index: gcc/gimple-pretty-print.c
===================================================================
--- gcc/gimple-pretty-print.c	(revision 224547)
+++ gcc/gimple-pretty-print.c	(working copy)
@@ -1175,11 +1175,12 @@ dump_gimple_omp_for (pretty_printer *buf
       dump_gimple_fmt (buffer, spc, flags, " >,");
       for (i = 0; i < gimple_omp_for_collapse (gs); i++)
 	dump_gimple_fmt (buffer, spc, flags,
-			 "%+%T, %T, %T, %s, %T,%n",
+			 "%+%T, %T, %T, %s, %s, %T,%n",
 			 gimple_omp_for_index (gs, i),
 			 gimple_omp_for_initial (gs, i),
 			 gimple_omp_for_final (gs, i),
 			 get_tree_code_name (gimple_omp_for_cond (gs, i)),
+			 get_tree_code_name (gimple_omp_for_incr_code (gs, i)),
 			 gimple_omp_for_incr (gs, i));
       dump_gimple_fmt (buffer, spc, flags, "PRE_BODY <%S>%->",
 		       gimple_omp_for_pre_body (gs));
@@ -1259,6 +1260,20 @@ dump_gimple_omp_for (pretty_printer *buf
 	  dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc,
 			     flags, false);
 	  pp_string (buffer, " = ");
+	  dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc,
+			     flags, false);
+	  switch (gimple_omp_for_incr_code (gs, i))
+	    {
+	    case POINTER_PLUS_EXPR:
+	    case PLUS_EXPR:
+	      pp_plus (buffer);
+	      break;
+	    case MINUS_EXPR:
+	      pp_minus (buffer);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	    }
 	  dump_generic_node (buffer, gimple_omp_for_incr (gs, i), spc,
 			     flags, false);
 	  pp_right_paren (buffer);
Index: gcc/gimple-streamer-in.c
===================================================================
--- gcc/gimple-streamer-in.c	(revision 224547)
+++ gcc/gimple-streamer-in.c	(working copy)
@@ -176,6 +176,7 @@ input_gimple_stmt (struct lto_input_bloc
       }
       /* Fallthru  */
 
+    case GIMPLE_OMP_ENTRY_END:
     case GIMPLE_ASSIGN:
     case GIMPLE_CALL:
     case GIMPLE_RETURN:
@@ -225,6 +226,7 @@ input_gimple_stmt (struct lto_input_bloc
 
     case GIMPLE_NOP:
     case GIMPLE_PREDICT:
+    case GIMPLE_OMP_RETURN:
       break;
 
     case GIMPLE_TRANSACTION:
@@ -232,6 +234,42 @@ input_gimple_stmt (struct lto_input_bloc
 				    stream_read_tree (ib, data_in));
       break;
 
+    case GIMPLE_OMP_FOR:
+      {
+	gomp_for *for_stmt = as_a <gomp_for *> (stmt);
+	gimple_omp_for_set_clauses (for_stmt, stream_read_tree (ib, data_in));
+	size_t collapse = streamer_read_hwi (ib);
+	for_stmt->collapse = collapse;
+	for_stmt->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse);
+	for (size_t i = 0; i < collapse; i++)
+	  {
+	    gimple_omp_for_set_cond (stmt, i, streamer_read_enum (ib, tree_code,
+							       MAX_TREE_CODES));
+	    gimple_omp_for_set_incr_code (stmt, i, streamer_read_enum (ib, tree_code,
+								       MAX_TREE_CODES));
+	    gimple_omp_for_set_index (stmt, i, stream_read_tree (ib, data_in));
+	    gimple_omp_for_set_initial (stmt, i, stream_read_tree (ib, data_in));
+	    gimple_omp_for_set_final (stmt, i, stream_read_tree (ib, data_in));
+	    gimple_omp_for_set_incr (stmt, i, stream_read_tree (ib, data_in));
+	  }
+      }
+      break;
+
+    case GIMPLE_OMP_CONTINUE:
+      {
+	gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt);
+	gimple_omp_continue_set_control_def (cont_stmt, stream_read_tree (ib, data_in));
+	gimple_omp_continue_set_control_use (cont_stmt, stream_read_tree (ib, data_in));
+      }
+      break;
+
+    case GIMPLE_OMP_TARGET:
+      {
+	gomp_target *tgt_stmt = as_a <gomp_target *> (stmt);
+	gimple_omp_target_set_clauses (tgt_stmt, stream_read_tree (ib, data_in));
+      }
+      break;
+
     default:
       internal_error ("bytecode stream: unknown GIMPLE statement tag %s",
 		      lto_tag_name (tag));
@@ -239,9 +277,9 @@ input_gimple_stmt (struct lto_input_bloc
 
   /* Update the properties of symbols, SSA names and labels associated
      with STMT.  */
-  if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL)
+  if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL || code == GIMPLE_OMP_CONTINUE)
     {
-      tree lhs = gimple_get_lhs (stmt);
+      tree lhs = gimple_op (stmt, 0);
       if (lhs && TREE_CODE (lhs) == SSA_NAME)
 	SSA_NAME_DEF_STMT (lhs) = stmt;
     }
@@ -257,7 +295,16 @@ input_gimple_stmt (struct lto_input_bloc
 	    SSA_NAME_DEF_STMT (op) = stmt;
 	}
     }
-
+  else if (code == GIMPLE_OMP_FOR)
+    {
+      gomp_for *for_stmt = as_a <gomp_for *> (stmt);
+      for (unsigned i = 0; i < gimple_omp_for_collapse (for_stmt); i++)
+	{
+	  tree op = gimple_omp_for_index (for_stmt, i);
+	  if (TREE_CODE (op) == SSA_NAME)
+	    SSA_NAME_DEF_STMT (op) = stmt;
+	}
+    }
   /* Reset alias information.  */
   if (code == GIMPLE_CALL)
     gimple_call_reset_alias_info (as_a <gcall *> (stmt));
Index: gcc/gimple-streamer-out.c
===================================================================
--- gcc/gimple-streamer-out.c	(revision 224547)
+++ gcc/gimple-streamer-out.c	(working copy)
@@ -147,6 +147,7 @@ output_gimple_stmt (struct output_block
       }
       /* Fallthru  */
 
+    case GIMPLE_OMP_ENTRY_END:
     case GIMPLE_ASSIGN:
     case GIMPLE_CALL:
     case GIMPLE_RETURN:
@@ -201,6 +202,7 @@ output_gimple_stmt (struct output_block
 
     case GIMPLE_NOP:
     case GIMPLE_PREDICT:
+    case GIMPLE_OMP_RETURN:
       break;
 
     case GIMPLE_TRANSACTION:
@@ -211,6 +213,45 @@ output_gimple_stmt (struct output_block
       }
       break;
 
+    case GIMPLE_OMP_FOR:
+      {
+	gomp_for *for_stmt = as_a <gomp_for *> (stmt);
+	stream_write_tree (ob, gimple_omp_for_clauses (for_stmt), true);
+	size_t collapse_count = gimple_omp_for_collapse (for_stmt);
+	streamer_write_hwi (ob, collapse_count);
+	for (size_t i = 0; i < collapse_count; i++)
+	  {
+	    streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES,
+				 gimple_omp_for_cond (for_stmt, i));
+	    streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES,
+				 gimple_omp_for_incr_code (for_stmt, i));
+	    stream_write_tree (ob, gimple_omp_for_index (for_stmt, i), true);
+	    stream_write_tree (ob, gimple_omp_for_initial (for_stmt, i), true);
+	    stream_write_tree (ob, gimple_omp_for_final (for_stmt, i), true);
+	    stream_write_tree (ob, gimple_omp_for_incr (for_stmt, i), true);
+	  }
+	/* No need to write out the pre-body, it's empty by the time we
+	   get here.  */
+      }
+      break;
+
+    case GIMPLE_OMP_CONTINUE:
+      {
+	gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt);
+	stream_write_tree (ob, gimple_omp_continue_control_def (cont_stmt),
+			   true);
+	stream_write_tree (ob, gimple_omp_continue_control_use (cont_stmt),
+			   true);
+      }
+      break;
+
+    case GIMPLE_OMP_TARGET:
+      {
+	gomp_target *tgt_stmt = as_a <gomp_target *> (stmt);
+	stream_write_tree (ob, gimple_omp_target_clauses (tgt_stmt), true);
+      }
+      break;
+
     default:
       gcc_unreachable ();
     }
Index: gcc/gimple.c
===================================================================
--- gcc/gimple.c	(revision 224547)
+++ gcc/gimple.c	(working copy)
@@ -855,9 +855,11 @@ gimple_build_debug_source_bind_stat (tre
 /* Build a GIMPLE_OMP_ENTRY_END statement.  */
 
 gimple
-gimple_build_omp_entry_end (void)
+gimple_build_omp_entry_end (tree var)
 {
-  return gimple_alloc (GIMPLE_OMP_ENTRY_END, 0);
+  gimple t = gimple_alloc (GIMPLE_OMP_ENTRY_END, 1);
+  gimple_set_op (t, 0, var);
+  return t;
 }
 
 
@@ -890,13 +892,14 @@ gomp_for *
 gimple_build_omp_for (gimple_seq body, int kind, tree clauses, size_t collapse,
 		      gimple_seq pre_body)
 {
-  gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, 0));
+  int nops = collapse * 4;
+  gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, nops));
   if (body)
     gimple_omp_set_body (p, body);
   gimple_omp_for_set_clauses (p, clauses);
   gimple_omp_for_set_kind (p, kind);
   p->collapse = collapse;
-  p->iter =  ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse);
+  p->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse);
 
   if (pre_body)
     gimple_omp_for_set_pre_body (p, pre_body);
@@ -1011,7 +1014,7 @@ gomp_continue *
 gimple_build_omp_continue (tree control_def, tree control_use)
 {
   gomp_continue *p
-    = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 0));
+    = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 2));
   gimple_omp_continue_set_control_def (p, control_def);
   gimple_omp_continue_set_control_use (p, control_use);
   return p;
Index: gcc/gimple.def
===================================================================
--- gcc/gimple.def	(revision 224547)
+++ gcc/gimple.def	(working copy)
@@ -225,11 +225,11 @@ DEFGSCODE(GIMPLE_OMP_ATOMIC_STORE, "gimp
 
 /* GIMPLE_OMP_CONTINUE marks the location of the loop or sections
    iteration in partially lowered OpenMP code.  */
-DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_OMP_CONTINUE)
+DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_WITH_OPS)
 
 /* GIMPLE_OMP_ENTRY_END marks the end of the unpredicated entry block
    into an offloaded region.  */
-DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_BASE)
+DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_WITH_OPS)
 
 /* GIMPLE_OMP_CRITICAL <NAME, BODY> represents
 
Index: gcc/gimple.h
===================================================================
--- gcc/gimple.h	(revision 224547)
+++ gcc/gimple.h	(working copy)
@@ -301,7 +301,7 @@ struct GTY((tag("GSS_CALL")))
 /* OMP statements.  */
 
 struct GTY((tag("GSS_OMP")))
-  gimple_statement_omp : public gimple_statement_base
+  gimple_statement_omp : public gimple_statement_with_ops_base
 {
   /* [ WORD 1-6 ] : base class */
 
@@ -520,20 +520,8 @@ struct GTY((tag("GSS_OMP_CRITICAL")))
 
 
 struct GTY(()) gimple_omp_for_iter {
-  /* Condition code.  */
-  enum tree_code cond;
-
-  /* Index variable.  */
-  tree index;
-
-  /* Initial value.  */
-  tree initial;
-
-  /* Final value.  */
-  tree final;
-
-  /* Increment.  */
-  tree incr;
+  /* Condition code and increment code.  */
+  enum tree_code cond, incr;
 };
 
 /* GIMPLE_OMP_FOR */
@@ -556,6 +544,12 @@ struct GTY((tag("GSS_OMP_FOR")))
   /* [ WORD 11 ]
      Pre-body evaluated before the loop body begins.  */
   gimple_seq pre_body;
+
+  /* [ WORD 12 ]
+     Operand vector.  NOTE!  This must always be the last field
+     of this structure.  In particular, this means that this
+     structure cannot be embedded inside another one.  */
+  tree GTY((length ("%h.num_ops"))) op[1];
 };
 
 
@@ -581,10 +575,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -655,16 +645,10 @@ struct GTY((tag("GSS_OMP_SECTIONS")))
    Note: This does not inherit from gimple_statement_omp, because we
          do not need the body field.  */
 
-struct GTY((tag("GSS_OMP_CONTINUE")))
-  gomp_continue : public gimple_statement_base
+struct GTY((tag("GSS_WITH_OPS")))
+  gomp_continue : public gimple_statement_with_ops
 {
-  /* [ WORD 1-6 ] : base class */
-
-  /* [ WORD 7 ]  */
-  tree control_def;
-
-  /* [ WORD 8 ]  */
-  tree control_use;
+  /* no additional fields; this uses the layout for GSS_WITH_OPS. */
 };
 
 /* GIMPLE_OMP_SINGLE, GIMPLE_OMP_TEAMS */
@@ -1356,7 +1340,7 @@ gimple gimple_build_omp_taskgroup (gimpl
 gomp_continue *gimple_build_omp_continue (tree, tree);
 gimple gimple_build_omp_ordered (gimple_seq);
 gimple gimple_build_omp_return (bool);
-gimple gimple_build_omp_entry_end ();
+gimple gimple_build_omp_entry_end (tree);
 gomp_sections *gimple_build_omp_sections (gimple_seq, tree);
 gimple gimple_build_omp_sections_switch (void);
 gomp_single *gimple_build_omp_single (gimple_seq, tree);
@@ -1853,7 +1837,10 @@ gimple_init_singleton (gimple g)
 static inline bool
 gimple_has_ops (const_gimple g)
 {
-  return gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN;
+  return ((gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN)
+	  || gimple_code (g) == GIMPLE_OMP_FOR
+	  || gimple_code (g) == GIMPLE_OMP_ENTRY_END
+	  || gimple_code (g) == GIMPLE_OMP_CONTINUE);
 }
 
 template <>
@@ -4559,6 +4546,27 @@ gimple_omp_for_set_cond (gimple gs, size
   omp_for_stmt->iter[i].cond = cond;
 }
 
+/* Return the increment code associated with the OMP_FOR statement GS.  */
+
+static inline enum tree_code
+gimple_omp_for_incr_code (const_gimple gs, size_t i)
+{
+  const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
+  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
+  return omp_for_stmt->iter[i].incr;
+}
+
+
+/* Set INCR to be the increment code for the OMP_FOR statement GS.  */
+
+static inline void
+gimple_omp_for_set_incr_code (gimple gs, size_t i, enum tree_code incr)
+{
+  gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
+  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
+  omp_for_stmt->iter[i].incr = incr;
+}
+
 
 /* Return the index variable for the OMP_FOR statement GS.  */
 
@@ -4567,7 +4575,7 @@ gimple_omp_for_index (const_gimple gs, s
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
   gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].index;
+  return gimple_op (gs, i);
 }
 
 
@@ -4578,7 +4586,7 @@ gimple_omp_for_index_ptr (gimple gs, siz
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
   gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].index;
+  return gimple_op_ptr (gs, i);
 }
 
 
@@ -4588,8 +4596,9 @@ static inline void
 gimple_omp_for_set_index (gimple gs, size_t i, tree index)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].index = index;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i, index);
 }
 
 
@@ -4599,8 +4608,9 @@ static inline tree
 gimple_omp_for_initial (const_gimple gs, size_t i)
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].initial;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op (gs, i + c);
 }
 
 
@@ -4610,8 +4620,9 @@ static inline tree *
 gimple_omp_for_initial_ptr (gimple gs, size_t i)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].initial;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op_ptr (gs, i + c);
 }
 
 
@@ -4621,8 +4632,9 @@ static inline void
 gimple_omp_for_set_initial (gimple gs, size_t i, tree initial)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].initial = initial;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i + c, initial);
 }
 
 
@@ -4632,8 +4644,9 @@ static inline tree
 gimple_omp_for_final (const_gimple gs, size_t i)
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].final;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op (gs, i + c * 2);
 }
 
 
@@ -4643,8 +4656,9 @@ static inline tree *
 gimple_omp_for_final_ptr (gimple gs, size_t i)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].final;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op_ptr (gs, i + c * 2);
 }
 
 
@@ -4654,8 +4668,9 @@ static inline void
 gimple_omp_for_set_final (gimple gs, size_t i, tree final)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].final = final;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i + c * 2, final);
 }
 
 
@@ -4665,8 +4680,9 @@ static inline tree
 gimple_omp_for_incr (const_gimple gs, size_t i)
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].incr;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op (gs, i + c * 3);
 }
 
 
@@ -4676,8 +4692,9 @@ static inline tree *
 gimple_omp_for_incr_ptr (gimple gs, size_t i)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].incr;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op_ptr (gs, i + c * 3);
 }
 
 
@@ -4687,8 +4704,9 @@ static inline void
 gimple_omp_for_set_incr (gimple gs, size_t i, tree incr)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].incr = incr;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i + c * 3, incr);
 }
 
 
@@ -5248,25 +5266,6 @@ gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
@@ -5446,7 +5445,7 @@ gimple_omp_atomic_load_rhs_ptr (gomp_ato
 static inline tree
 gimple_omp_continue_control_def (const gomp_continue *cont_stmt)
 {
-  return cont_stmt->control_def;
+  return gimple_op (cont_stmt, 0);
 }
 
 /* The same as above, but return the address.  */
@@ -5454,7 +5453,7 @@ gimple_omp_continue_control_def (const g
 static inline tree *
 gimple_omp_continue_control_def_ptr (gomp_continue *cont_stmt)
 {
-  return &cont_stmt->control_def;
+  return gimple_op_ptr (cont_stmt, 0);
 }
 
 /* Set the definition of the control variable in a GIMPLE_OMP_CONTINUE.  */
@@ -5462,7 +5461,7 @@ gimple_omp_continue_control_def_ptr (gom
 static inline void
 gimple_omp_continue_set_control_def (gomp_continue *cont_stmt, tree def)
 {
-  cont_stmt->control_def = def;
+  gimple_set_op (cont_stmt, 0, def);
 }
 
 
@@ -5471,7 +5470,7 @@ gimple_omp_continue_set_control_def (gom
 static inline tree
 gimple_omp_continue_control_use (const gomp_continue *cont_stmt)
 {
-  return cont_stmt->control_use;
+  return gimple_op (cont_stmt, 1);
 }
 
 
@@ -5480,7 +5479,7 @@ gimple_omp_continue_control_use (const g
 static inline tree *
 gimple_omp_continue_control_use_ptr (gomp_continue *cont_stmt)
 {
-  return &cont_stmt->control_use;
+  return gimple_op_ptr (cont_stmt, 1);
 }
 
 
@@ -5489,7 +5488,7 @@ gimple_omp_continue_control_use_ptr (gom
 static inline void
 gimple_omp_continue_set_control_use (gomp_continue *cont_stmt, tree use)
 {
-  cont_stmt->control_use = use;
+  gimple_set_op (cont_stmt, 1, use);
 }
 
 /* Return a pointer to the body for the GIMPLE_TRANSACTION statement
Index: gcc/gimplify.c
===================================================================
--- gcc/gimplify.c	(revision 224547)
+++ gcc/gimplify.c	(working copy)
@@ -7582,12 +7582,15 @@ gimplify_omp_for (tree *expr_p, gimple_s
   for (i = 0; i < TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)); i++)
     {
       t = TREE_VEC_ELT (OMP_FOR_INIT (for_stmt), i);
-      gimple_omp_for_set_index (gfor, i, TREE_OPERAND (t, 0));
+      tree idxvar = TREE_OPERAND (t, 0);
+      gimple_omp_for_set_index (gfor, i, idxvar);
       gimple_omp_for_set_initial (gfor, i, TREE_OPERAND (t, 1));
       t = TREE_VEC_ELT (OMP_FOR_COND (for_stmt), i);
       gimple_omp_for_set_cond (gfor, i, TREE_CODE (t));
       gimple_omp_for_set_final (gfor, i, TREE_OPERAND (t, 1));
       t = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i);
+      t = TREE_OPERAND (t, 1);
+      gimple_omp_for_set_incr_code (gfor, i, TREE_CODE (t));
       gimple_omp_for_set_incr (gfor, i, TREE_OPERAND (t, 1));
     }
 
Index: gcc/gsstruct.def
===================================================================
--- gcc/gsstruct.def	(revision 224547)
+++ gcc/gsstruct.def	(working copy)
@@ -42,12 +42,11 @@ DEFGSSTRUCT(GSS_EH_ELSE, geh_else, false
 DEFGSSTRUCT(GSS_WCE, gimple_statement_wce, false)
 DEFGSSTRUCT(GSS_OMP, gimple_statement_omp, false)
 DEFGSSTRUCT(GSS_OMP_CRITICAL, gomp_critical, false)
-DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, false)
+DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, true)
 DEFGSSTRUCT(GSS_OMP_PARALLEL_LAYOUT, gimple_statement_omp_parallel_layout, false)
 DEFGSSTRUCT(GSS_OMP_TASK, gomp_task, false)
 DEFGSSTRUCT(GSS_OMP_SECTIONS, gomp_sections, false)
 DEFGSSTRUCT(GSS_OMP_SINGLE_LAYOUT, gimple_statement_omp_single_layout, false)
-DEFGSSTRUCT(GSS_OMP_CONTINUE, gomp_continue, false)
 DEFGSSTRUCT(GSS_OMP_ATOMIC_LOAD, gomp_atomic_load, false)
 DEFGSSTRUCT(GSS_OMP_ATOMIC_STORE_LAYOUT, gomp_atomic_store, false)
 DEFGSSTRUCT(GSS_TRANSACTION, gtransaction, false)
Index: gcc/ipa-inline-analysis.c
===================================================================
--- gcc/ipa-inline-analysis.c	(revision 224547)
+++ gcc/ipa-inline-analysis.c	(working copy)
@@ -4122,10 +4122,12 @@ inline_generate_summary (void)
 {
   struct cgraph_node *node;
 
+#ifndef ACCEL_COMPILER
   /* When not optimizing, do not bother to analyze.  Inlining is still done
      because edge redirection needs to happen there.  */
   if (!optimize && !flag_generate_lto && !flag_generate_offload && !flag_wpa)
     return;
+#endif
 
   if (!inline_summaries)
     inline_summaries = (inline_summary_t*) inline_summary_t::create_ggc (symtab);
Index: gcc/lto/lto.c
===================================================================
--- gcc/lto/lto.c	(revision 224547)
+++ gcc/lto/lto.c	(working copy)
@@ -3115,8 +3115,10 @@ read_cgraph_and_symbols (unsigned nfiles
   /* Read the IPA summary data.  */
   if (flag_ltrans)
     ipa_read_optimization_summaries ();
+#ifndef ACCEL_COMPILER
   else
     ipa_read_summaries ();
+#endif
 
   for (i = 0; all_file_decl_data[i]; i++)
     {
Index: gcc/lto-streamer-out.c
===================================================================
--- gcc/lto-streamer-out.c	(revision 224547)
+++ gcc/lto-streamer-out.c	(working copy)
@@ -1800,27 +1800,32 @@ output_ssa_names (struct output_block *o
 {
   unsigned int i, len;
 
-  len = vec_safe_length (SSANAMES (fn));
-  streamer_write_uhwi (ob, len);
-
-  for (i = 1; i < len; i++)
+  if (cfun->gimple_df)
     {
-      tree ptr = (*SSANAMES (fn))[i];
+      len = vec_safe_length (SSANAMES (fn));
+      streamer_write_uhwi (ob, len);
 
-      if (ptr == NULL_TREE
-	  || SSA_NAME_IN_FREE_LIST (ptr)
-	  || virtual_operand_p (ptr))
-	continue;
+      for (i = 1; i < len; i++)
+	{
+	  tree ptr = (*SSANAMES (fn))[i];
 
-      streamer_write_uhwi (ob, i);
-      streamer_write_char_stream (ob->main_stream,
-				  SSA_NAME_IS_DEFAULT_DEF (ptr));
-      if (SSA_NAME_VAR (ptr))
-	stream_write_tree (ob, SSA_NAME_VAR (ptr), true);
-      else
-	/* ???  This drops SSA_NAME_IDENTIFIER on the floor.  */
-	stream_write_tree (ob, TREE_TYPE (ptr), true);
+	  if (ptr == NULL_TREE
+	      || SSA_NAME_IN_FREE_LIST (ptr)
+	      || virtual_operand_p (ptr))
+	    continue;
+
+	  streamer_write_uhwi (ob, i);
+	  streamer_write_char_stream (ob->main_stream,
+				      SSA_NAME_IS_DEFAULT_DEF (ptr));
+	  if (SSA_NAME_VAR (ptr))
+	    stream_write_tree (ob, SSA_NAME_VAR (ptr), true);
+	  else
+	    /* ???  This drops SSA_NAME_IDENTIFIER on the floor.  */
+	    stream_write_tree (ob, TREE_TYPE (ptr), true);
+	}
     }
+  else
+    streamer_write_zero (ob);
 
   streamer_write_zero (ob);
 }
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 224547)
+++ gcc/omp-low.c	(working copy)
@@ -110,7 +110,7 @@ along with GCC; see the file COPYING3.
 #include "gomp-constants.h"
 #include "gimple-pretty-print.h"
 #include "set"
-
+#include "output.h"
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
    phases.  The first phase scans the function looking for OMP statements
@@ -597,17 +597,17 @@ extract_omp_for_data (gomp_for *for_stmt
 	}
 
       t = gimple_omp_for_incr (for_stmt, i);
-      gcc_assert (TREE_OPERAND (t, 0) == var);
-      switch (TREE_CODE (t))
+      enum tree_code incr_code = gimple_omp_for_incr_code (for_stmt, i);
+      switch (incr_code)
 	{
 	case PLUS_EXPR:
-	  loop->step = TREE_OPERAND (t, 1);
+	  loop->step = t;
 	  break;
 	case POINTER_PLUS_EXPR:
-	  loop->step = fold_convert (ssizetype, TREE_OPERAND (t, 1));
+	  loop->step = fold_convert (ssizetype, t);
 	  break;
 	case MINUS_EXPR:
-	  loop->step = TREE_OPERAND (t, 1);
+	  loop->step = t;
 	  loop->step = fold_build1_loc (loc,
 				    NEGATE_EXPR, TREE_TYPE (loop->step),
 				    loop->step);
@@ -9721,12 +9721,21 @@ loop_get_oacc_kernels_region_entry (stru
     }
 }
 
+static bool
+was_offloaded_p (tree fn)
+{
+#ifdef ACCEL_COMPILER
+  return true;
+#endif
+  struct cgraph_node *node = cgraph_node::get (fn);
+  return node->offloadable;
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
 expand_omp_target (struct omp_region *region)
 {
-  basic_block entry_bb, exit_bb, new_bb;
   struct function *child_cfun;
   tree child_fn, block, t;
   gimple_stmt_iterator gsi;
@@ -9736,12 +9745,33 @@ expand_omp_target (struct omp_region *re
   bool offloaded, data_region;
   bool do_emit_library_call = true;
   bool do_splitoff = true;
+  bool already_offloaded = was_offloaded_p (current_function_decl);
 
   entry_stmt = as_a <gomp_target *> (last_stmt (region->entry));
+  location_t entry_loc = gimple_location (entry_stmt);
 
-  new_bb = region->entry;
+  basic_block new_bb = region->entry;
+  basic_block entry_bb = region->entry;
+  basic_block exit_bb = region->exit;
+  basic_block entry_succ_bb = single_succ (entry_bb);
 
-  offloaded = is_gimple_omp_offloaded (entry_stmt);
+  if (already_offloaded)
+    {
+      gsi = gsi_for_stmt (entry_stmt);
+      gsi_remove (&gsi, true);
+
+      gsi = gsi_last_bb (exit_bb);
+      gcc_assert (!gsi_end_p (gsi)
+		  && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+      gsi_remove (&gsi, true);
+
+      gsi = gsi_last_bb (entry_succ_bb);
+      if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END)
+	gsi_remove (&gsi, true);
+      return;
+    }
+
+  offloaded = !already_offloaded && is_gimple_omp_offloaded (entry_stmt);
   switch (gimple_omp_target_kind (entry_stmt))
     {
     case GF_OMP_TARGET_KIND_REGION:
@@ -9773,9 +9803,6 @@ expand_omp_target (struct omp_region *re
   if (child_cfun != NULL)
     gcc_checking_assert (!child_cfun->cfg);
 
-  entry_bb = region->entry;
-  exit_bb = region->exit;
-
   if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
     {
       if (!gimple_in_ssa_p (cfun))
@@ -9814,13 +9841,7 @@ expand_omp_target (struct omp_region *re
 	}
     }
 
-  basic_block entry_succ_bb = single_succ (entry_bb);
-  if (offloaded && !gimple_in_ssa_p (cfun))
-    {
-      gsi = gsi_last_bb (entry_succ_bb);
-      if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END)
-	gsi_remove (&gsi, true);
-    }
+  tree data_arg = gimple_omp_target_data_arg (entry_stmt);
 
   if (offloaded
       && do_splitoff)
@@ -9840,7 +9861,6 @@ expand_omp_target (struct omp_region *re
 	 a function call that has been inlined, the original PARM_DECL
 	 .OMP_DATA_I may have been converted into a different local
 	 variable.  In which case, we need to keep the assignment.  */
-      tree data_arg = gimple_omp_target_data_arg (entry_stmt);
       if (data_arg)
 	{
 	  gimple_stmt_iterator gsi;
@@ -9923,8 +9943,12 @@ expand_omp_target (struct omp_region *re
       stmt = gsi_stmt (gsi);
       gcc_assert (stmt
 		  && gimple_code (stmt) == gimple_code (entry_stmt));
+      gsi_prev (&gsi);
+      stmt = gsi_stmt (gsi);
       e = split_block (entry_bb, stmt);
+#if 0
       gsi_remove (&gsi, true);
+#endif
       entry_bb = e->dest;
       single_succ_edge (entry_bb)->flags = EDGE_FALLTHRU;
 
@@ -9932,11 +9956,16 @@ expand_omp_target (struct omp_region *re
       if (exit_bb)
 	{
 	  gsi = gsi_last_bb (exit_bb);
+	  gimple ompret = gsi_stmt (gsi);
 	  gcc_assert (!gsi_end_p (gsi)
-		      && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+		      && gimple_code (ompret) == GIMPLE_OMP_RETURN);
 	  stmt = gimple_build_return (NULL);
 	  gsi_insert_after (&gsi, stmt, GSI_SAME_STMT);
+#if 0
 	  gsi_remove (&gsi, true);
+#endif
+	  edge e1 = split_block (exit_bb, ompret);
+	  exit_bb = e1->dest;
 
 	  /* A vuse in single_succ (exit_bb) may use a vdef from the region
 	     which is about to be split off.  Mark the vdef for renaming.  */
@@ -9955,6 +9984,9 @@ expand_omp_target (struct omp_region *re
       else
 	block = gimple_block (entry_stmt);
 
+      /* Make sure we don't try to copy these.  */
+      gimple_omp_target_set_child_fn (entry_stmt, NULL);
+      gimple_omp_target_set_data_arg (entry_stmt, NULL);
       new_bb = move_sese_region_to_fn (child_cfun, entry_bb, exit_bb, block);
       if (exit_bb)
 	single_succ_edge (new_bb)->flags = EDGE_FALLTHRU;
@@ -9979,6 +10011,8 @@ expand_omp_target (struct omp_region *re
 
       /* Inform the callgraph about the new function.  */
       DECL_STRUCT_FUNCTION (child_fn)->curr_properties = cfun->curr_properties;
+      DECL_STRUCT_FUNCTION (child_fn)->curr_properties &= ~PROP_gimple_eomp;
+      
       cgraph_node::add_new_function (child_fn, true);
       cgraph_node::get (child_fn)->parallelized_function = 1;
 
@@ -10088,7 +10122,7 @@ expand_omp_target (struct omp_region *re
       clause_loc = OMP_CLAUSE_LOCATION (c);
     }
   else
-    clause_loc = gimple_location (entry_stmt);
+    clause_loc = entry_loc;
 
   /* Ensure 'device' is of the correct type.  */
   device = fold_convert_loc (clause_loc, integer_type_node, device);
@@ -10147,7 +10181,7 @@ expand_omp_target (struct omp_region *re
 
   gsi = gsi_last_bb (new_bb);
   t = gimple_omp_target_data_arg (entry_stmt);
-  if (t == NULL)
+  if (data_arg == NULL)
     {
       t1 = size_zero_node;
       t2 = build_zero_cst (ptr_type_node);
@@ -10156,11 +10190,11 @@ expand_omp_target (struct omp_region *re
     }
   else
     {
-      t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (t, 1))));
+      t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (data_arg, 1))));
       t1 = size_binop (PLUS_EXPR, t1, size_int (1));
-      t2 = build_fold_addr_expr (TREE_VEC_ELT (t, 0));
-      t3 = build_fold_addr_expr (TREE_VEC_ELT (t, 1));
-      t4 = build_fold_addr_expr (TREE_VEC_ELT (t, 2));
+      t2 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 0));
+      t3 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 1));
+      t4 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 2));
     }
 
   gimple g;
@@ -10209,8 +10243,7 @@ expand_omp_target (struct omp_region *re
 
 	/* Default values for num_gangs, num_workers, and vector_length.  */
 	t_num_gangs = t_num_workers = t_vector_length
-	  = fold_convert_loc (gimple_location (entry_stmt),
-			      integer_type_node, integer_one_node);
+	  = fold_convert_loc (entry_loc, integer_type_node, integer_one_node);
 	/* ..., but if present, use the value specified by the respective
 	   clause, making sure that are of the correct type.  */
 	c = find_omp_clause (clauses, OMP_CLAUSE_NUM_GANGS);
@@ -10241,8 +10274,7 @@ expand_omp_target (struct omp_region *re
 	int t_wait_idx;
 
 	/* Default values for t_async.  */
-	t_async = fold_convert_loc (gimple_location (entry_stmt),
-				    integer_type_node,
+	t_async = fold_convert_loc (entry_loc, integer_type_node,
 				    build_int_cst (integer_type_node,
 						   GOMP_ASYNC_SYNC));
 	/* ..., but if present, use the value specified by the respective
@@ -10257,8 +10289,7 @@ expand_omp_target (struct omp_region *re
 	/* Save the index, and... */
 	t_wait_idx = args.length ();
 	/* ... push a default value.  */
-	args.quick_push (fold_convert_loc (gimple_location (entry_stmt),
-					   integer_type_node,
+	args.quick_push (fold_convert_loc (entry_loc, integer_type_node,
 					   integer_zero_node));
 	c = find_omp_clause (clauses, OMP_CLAUSE_WAIT);
 	if (c)
@@ -10279,8 +10310,7 @@ expand_omp_target (struct omp_region *re
 	    /* Now that we know the number, replace the default value.  */
 	    args.ordered_remove (t_wait_idx);
 	    args.quick_insert (t_wait_idx,
-			       fold_convert_loc (gimple_location (entry_stmt),
-						 integer_type_node,
+			       fold_convert_loc (entry_loc, integer_type_node,
 						 build_int_cst (integer_type_node, n)));
 	  }
       }
@@ -10290,7 +10320,7 @@ expand_omp_target (struct omp_region *re
     }
 
   g = gimple_build_call_vec (builtin_decl_explicit (start_ix), args);
-  gimple_set_location (g, gimple_location (entry_stmt));
+  gimple_set_location (g, entry_loc);
   gsi_insert_before (&gsi, g, GSI_SAME_STMT);
   if (!offloaded)
     {
@@ -10310,6 +10340,23 @@ expand_omp_target (struct omp_region *re
     update_ssa (TODO_update_ssa_only_virtuals);
 }
 
+static bool
+expand_region_inner_p (omp_region *region)
+{
+  if (!region->inner)
+    return false;
+
+  if (region->type != GIMPLE_OMP_TARGET)
+    return true;
+  if (was_offloaded_p (current_function_decl))
+    return true;
+
+  gomp_target *entry_stmt = as_a <gomp_target *> (last_stmt (region->entry));
+  bool offloaded = is_gimple_omp_offloaded (entry_stmt);
+
+  return !offloaded || !is_gimple_omp_oacc (entry_stmt);
+}
+
 /* Expand the parallel region tree rooted at REGION.  Expansion
    proceeds in depth-first order.  Innermost regions are expanded
    first.  This way, parallel regions that require a new function to
@@ -10340,8 +10387,7 @@ expand_omp (struct omp_region *region)
       if (region->type == GIMPLE_OMP_FOR
 	  && gimple_omp_for_combined_p (last_stmt (region->entry)))
 	inner_stmt = last_stmt (region->inner->entry);
-     
-      if (region->inner)
+      if (expand_region_inner_p (region))
 	expand_omp (region->inner);
 
       saved_location = input_location;
@@ -10439,7 +10485,9 @@ find_omp_target_region_data (struct omp_
     region->gwv_this |= MASK_WORKER;
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= MASK_VECTOR;
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
+  basic_block entry_succ = single_succ (region->entry);
+  gimple ee_stmt = last_stmt (entry_succ);
+  region->broadcast_array = gimple_op (ee_stmt, 0);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10666,6 +10714,7 @@ generate_vector_broadcast (tree dest_var
 	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
 
       gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
+      retval = conv1;
     }
 
   tree decl = builtin_decl_explicit (fn);
@@ -10709,19 +10758,21 @@ generate_oacc_broadcast (omp_region *reg
   omp_region *parent = enclosing_target_region (region);
 
   tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
+  tree ptrtype = build_pointer_type (elttype);
+  tree ptr1 = make_ssa_name (ptrtype);
+  tree ptr2 = make_ssa_name (ptrtype);
+  gassign *cast1 = gimple_build_assign (ptr1, NOP_EXPR,
 				       parent->broadcast_array);
   gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
+  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr1), var);
   gsi_insert_after (&where, st, GSI_NEW_STMT);
 
   gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
 
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
+  gassign *cast2 = gimple_build_assign (ptr2, NOP_EXPR,
 					parent->broadcast_array);
   gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
+  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr2));
   gsi_insert_after (&where, ld, GSI_NEW_STMT);
 
   gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
@@ -10735,7 +10786,8 @@ generate_oacc_broadcast (omp_region *reg
    the bits MASK_VECTOR and/or MASK_WORKER.  */
 
 static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
+make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask,
+		       bool set_dominator)
 {
   basic_block cond_bb = true_edge->src;
   
@@ -10747,7 +10799,7 @@ make_predication_test (edge true_edge, b
   if (mask & MASK_VECTOR)
     {
       gimple call = gimple_build_call (decl, 1, integer_zero_node);
-      vvar = create_tmp_var (unsigned_type_node);
+      vvar = make_ssa_name (unsigned_type_node);
       comp_var = vvar;
       gimple_call_set_lhs (call, vvar);
       gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
@@ -10755,14 +10807,14 @@ make_predication_test (edge true_edge, b
   if (mask & MASK_WORKER)
     {
       gimple call = gimple_build_call (decl, 1, integer_one_node);
-      wvar = create_tmp_var (unsigned_type_node);
+      wvar = make_ssa_name (unsigned_type_node);
       comp_var = wvar;
       gimple_call_set_lhs (call, wvar);
       gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
     }
   if (wvar && vvar)
     {
-      comp_var = create_tmp_var (unsigned_type_node);
+      comp_var = make_ssa_name (unsigned_type_node);
       gassign *ior = gimple_build_assign (comp_var, BIT_IOR_EXPR, wvar, vvar);
       gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
     }
@@ -10782,6 +10834,9 @@ make_predication_test (edge true_edge, b
   basic_block false_abnorm_bb = split_edge (e);
   edge abnorm_edge = single_succ_edge (false_abnorm_bb);
   abnorm_edge->flags |= EDGE_ABNORMAL;
+
+  if (set_dominator)
+    set_immediate_dominator (CDI_DOMINATORS, skip_dest_bb, cond_bb);
 }
 
 /* Apply OpenACC predication to basic block BB which is in
@@ -10791,6 +10846,8 @@ make_predication_test (edge true_edge, b
 static void
 predicate_bb (basic_block bb, struct omp_region *parent, int mask)
 {
+  bool set_dominator = true;
+
   /* We handle worker-single vector-partitioned loops by jumping
      around them if not in the controlling worker.  Don't insert
      unnecessary (and incorrect) predication.  */
@@ -10816,8 +10873,8 @@ predicate_bb (basic_block bb, struct omp
 
   if (gimple_code (stmt) == GIMPLE_COND)
     {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
+      tree cond_var = make_ssa_name (boolean_type_node);
+      tree broadcast_cond = make_ssa_name (boolean_type_node);
       gassign *asgn = gimple_build_assign (cond_var,
 					   gimple_cond_code (stmt),
 					   gimple_cond_lhs (stmt),
@@ -10830,30 +10887,36 @@ predicate_bb (basic_block bb, struct omp
 						   mask);
 
       edge e = split_block (bb, splitpoint);
+      set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src);
       e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
 
       gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
 				 broadcast_cond, boolean_true_node);
+      update_stmt (stmt);
     }
   else if (gimple_code (stmt) == GIMPLE_SWITCH)
     {
       gswitch *sstmt = as_a <gswitch *> (stmt);
       tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
+      tree new_var = make_ssa_name (TREE_TYPE (var));
 
+#if 0
       gassign *asgn = gimple_build_assign (new_var, var);
       gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
       gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
+#endif
+      gsi_prev (&gsi);
       gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
+						   gsi, mask);
 
       edge e = split_block (bb, splitpoint);
+      set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src);
       e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
 
       gimple_switch_set_index (sstmt, new_var);
+      update_stmt (stmt);
     }
   else if (is_gimple_omp (stmt))
     {
@@ -10876,6 +10939,7 @@ predicate_bb (basic_block bb, struct omp
 	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
 	      gsi_prev (&head_gsi);
 	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
+	      set_immediate_dominator (CDI_DOMINATORS, e0->dest, e0->src);
 	      int mask2 = mask;
 	      if (code == GIMPLE_OMP_FOR)
 		mask2 &= ~MASK_VECTOR;
@@ -10885,7 +10949,7 @@ predicate_bb (basic_block bb, struct omp
 		     so we just need to make one branch around the
 		     entire loop.  */
 		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
+		  make_predication_test (e0, skip_dest_bb, mask2, true);
 		  return;
 		}
 	      basic_block for_block = e0->dest;
@@ -10896,9 +10960,9 @@ predicate_bb (basic_block bb, struct omp
 	      edge e2 = split_block (for_block, split_stmt);
 	      basic_block bb2 = e2->dest;
 
-	      make_predication_test (e0, bb2, mask);
+	      make_predication_test (e0, bb2, mask, true);
 	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
+				     mask2, true);
 	      inner->entry = bb3;
 	      return;
 	    }
@@ -10917,6 +10981,7 @@ predicate_bb (basic_block bb, struct omp
 	  if (!split_stmt)
 	    return;
 	  edge e = split_block (bb, split_stmt);
+	  set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src);
 	  skip_dest_bb = e->dest;
 	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
 	    {
@@ -10945,6 +11010,8 @@ predicate_bb (basic_block bb, struct omp
 	gsi_prev (&gsi);
       if (gsi_stmt (gsi) == 0)
 	return;
+      if (get_immediate_dominator (CDI_DOMINATORS, skip_dest_bb) != bb)
+	set_dominator = false;
     }
 
   if (skip_dest_bb != NULL)
@@ -10952,24 +11019,31 @@ predicate_bb (basic_block bb, struct omp
       gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
       gsi_prev (&head_gsi);
       edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
+      set_immediate_dominator (CDI_DOMINATORS, e2->dest, e2->src);
+      make_predication_test (e2, skip_dest_bb, mask, set_dominator);
     }
 }
 
 /* Walk the dominator tree starting at BB to collect basic blocks in
    WORKLIST which need OpenACC vector predication applied to them.  */
 
-static void
+static bool
 find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
 {
+  bool ret = false;
   struct omp_region *parent = *bb_region_map->get (bb);
   if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
+    {
+      worklist.safe_push (bb);
+      ret = true;
+    }
+  
   basic_block son;
   for (son = first_dom_son (CDI_DOMINATORS, bb);
        son;
        son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
+    ret |= find_predicatable_bbs (son, worklist);
+  return ret;
 }
 
 /* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
@@ -10979,7 +11053,9 @@ static void
 predicate_omp_regions (basic_block head_bb)
 {
   vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
+  if (!find_predicatable_bbs (head_bb, worklist))
+    return;
+
   int i;
   basic_block bb;
   FOR_EACH_VEC_ELT (worklist, i, bb)
@@ -10988,6 +11064,11 @@ predicate_omp_regions (basic_block head_
       int mask = required_predication_mask (region);
       predicate_bb (bb, region, mask);
     }
+  free_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_DOMINATORS);
+  mark_virtual_operands_for_renaming (cfun);
+  update_ssa (TODO_update_ssa);
+  verify_ssa (true, true);
 }
 
 /* USE and GET sets for variable broadcasting.  */
@@ -11176,7 +11257,8 @@ oacc_broadcast (basic_block entry_bb, ba
 
   /* Currently, subroutines aren't supported.  */
   gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
+				 DECL_ATTRIBUTES (current_function_decl))
+	      || was_offloaded_p (current_function_decl));
 
   /* Populate live_in.  */
   oacc_populate_live_in (entry_bb, region);
@@ -11236,7 +11318,7 @@ oacc_broadcast (basic_block entry_bb, ba
 	  gsi_prev (&gsi);
 	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
 	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
+	  make_predication_test (e2, dest_bb, mask, true);
 
 	  /* Update entry_bb.  */
 	  entry_bb = dest_bb;
@@ -11249,7 +11331,7 @@ oacc_broadcast (basic_block entry_bb, ba
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
-execute_expand_omp (void)
+execute_expand_omp (bool first)
 {
   bb_region_map = new hash_map<basic_block, omp_region *>;
 
@@ -11264,7 +11346,8 @@ execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+      if (!first)
+	predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
 
       remove_exit_barriers (root_omp_region);
 
@@ -11317,9 +11400,10 @@ public:
       if (!gate)
 	return 0;
 
-      return execute_expand_omp ();
+      return execute_expand_omp (true);
     }
 
+  opt_pass * clone () { return new pass_expand_omp (m_ctxt); }
 }; // class pass_expand_omp
 
 } // anon namespace
@@ -11400,9 +11484,9 @@ public:
     }
   virtual unsigned int execute (function *)
     {
-      unsigned res = execute_expand_omp ();
+      unsigned res = execute_expand_omp (false);
       release_dangling_ssa_names ();
-      return res;
+      return res | TODO_update_ssa;
     }
   opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
@@ -12562,7 +12646,7 @@ lower_omp_for (gimple_stmt_iterator *gsi
       if (!is_gimple_min_invariant (*rhs_p))
 	*rhs_p = get_formal_tmp_var (*rhs_p, &body);
 
-      rhs_p = &TREE_OPERAND (gimple_omp_for_incr (stmt, i), 1);
+      rhs_p = gimple_omp_for_incr_ptr (stmt, i);
       if (!is_gimple_min_invariant (*rhs_p))
 	*rhs_p = get_formal_tmp_var (*rhs_p, &body);
     }
@@ -13547,7 +13631,7 @@ lower_omp_target (gimple_stmt_iterator *
 
   if (offloaded)
     {
-      gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end ());
+      gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end (ctx->worker_sync_elt));
       if (has_reduction)
 	{
 	  gimple_seq_add_seq (&irlist, tgt_body);
@@ -13583,7 +13667,6 @@ lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
Index: gcc/pass_manager.h
===================================================================
--- gcc/pass_manager.h	(revision 224547)
+++ gcc/pass_manager.h	(working copy)
@@ -28,6 +28,7 @@ struct register_pass_info;
 #define GCC_PASS_LISTS \
   DEF_PASS_LIST (all_lowering_passes) \
   DEF_PASS_LIST (all_small_ipa_passes) \
+  DEF_PASS_LIST (all_local_opt_passes) \
   DEF_PASS_LIST (all_regular_ipa_passes) \
   DEF_PASS_LIST (all_late_ipa_passes) \
   DEF_PASS_LIST (all_passes)
@@ -82,6 +83,7 @@ public:
   /* The root of the compilation pass tree, once constructed.  */
   opt_pass *all_passes;
   opt_pass *all_small_ipa_passes;
+  opt_pass *all_local_opt_passes;
   opt_pass *all_lowering_passes;
   opt_pass *all_regular_ipa_passes;
   opt_pass *all_late_ipa_passes;
Index: gcc/passes.c
===================================================================
--- gcc/passes.c	(revision 224547)
+++ gcc/passes.c	(working copy)
@@ -454,8 +454,12 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      /* Don't bother doing anything if the program has errors.  */
-      return (!seen_error () && !in_lto_p);
+      if (seen_error ())
+	return false;
+#ifdef ACCEL_COMPILER
+      return true;
+#endif
+      return !in_lto_p;
     }
 
 }; // class pass_local_optimization_passes
@@ -952,6 +956,7 @@ pass_manager::dump_passes () const
 
   dump_pass_list (all_lowering_passes, 1);
   dump_pass_list (all_small_ipa_passes, 1);
+  dump_pass_list (all_local_opt_passes, 1);
   dump_pass_list (all_regular_ipa_passes, 1);
   dump_pass_list (all_late_ipa_passes, 1);
   dump_pass_list (all_passes, 1);
@@ -1463,6 +1468,8 @@ pass_manager::register_pass (struct regi
   if (!success || all_instances)
     success |= position_pass (pass_info, &all_small_ipa_passes);
   if (!success || all_instances)
+    success |= position_pass (pass_info, &all_local_opt_passes);
+  if (!success || all_instances)
     success |= position_pass (pass_info, &all_regular_ipa_passes);
   if (!success || all_instances)
     success |= position_pass (pass_info, &all_late_ipa_passes);
@@ -1515,9 +1522,10 @@ pass_manager::register_pass (struct regi
    If we are optimizing, compile is then invoked:
 
    compile ()
-       ipa_passes () 			-> all_small_ipa_passes
+       ipa_passes () 			-> all_small_ipa_passes,
+					   all_local_opt_passes
 					-> Analysis of all_regular_ipa_passes
-	* possible LTO streaming at copmilation time *
+	* possible LTO streaming at compilation time *
 					-> Execution of all_regular_ipa_passes
 	* possible LTO streaming at link time *
 					-> all_late_ipa_passes
@@ -1541,8 +1549,8 @@ pass_manager::operator delete (void *ptr
 }
 
 pass_manager::pass_manager (context *ctxt)
-: all_passes (NULL), all_small_ipa_passes (NULL), all_lowering_passes (NULL),
-  all_regular_ipa_passes (NULL),
+: all_passes (NULL), all_small_ipa_passes (NULL), all_local_opt_passes (NULL),
+  all_lowering_passes (NULL), all_regular_ipa_passes (NULL),
   all_late_ipa_passes (NULL), passes_by_id (NULL), passes_by_id_size (0),
   m_ctxt (ctxt)
 {
@@ -1592,6 +1600,7 @@ pass_manager::pass_manager (context *ctx
   /* Register the passes with the tree dump code.  */
   register_dump_files (all_lowering_passes);
   register_dump_files (all_small_ipa_passes);
+  register_dump_files (all_local_opt_passes);
   register_dump_files (all_regular_ipa_passes);
   register_dump_files (all_late_ipa_passes);
   register_dump_files (all_passes);
@@ -2463,24 +2472,15 @@ ipa_write_summaries_1 (lto_symtab_encode
   lto_delete_out_decl_state (state);
 }
 
-/* Write out summaries for all the nodes in the callgraph.  */
-
-void
-ipa_write_summaries (void)
+static lto_symtab_encoder_t
+build_symtab_encoder (void)
 {
-  lto_symtab_encoder_t encoder;
+  lto_symtab_encoder_t encoder = lto_symtab_encoder_new (false);
   int i, order_pos;
   varpool_node *vnode;
   struct cgraph_node *node;
   struct cgraph_node **order;
 
-  if ((!flag_generate_lto && !flag_generate_offload) || seen_error ())
-    return;
-
-  select_what_to_stream ();
-
-  encoder = lto_symtab_encoder_new (false);
-
   /* Create the callgraph set in the same order used in
      cgraph_expand_all_functions.  This mostly facilitates debugging,
      since it causes the gimple file to be processed in the same order
@@ -2515,10 +2515,50 @@ ipa_write_summaries (void)
   FOR_EACH_DEFINED_VARIABLE (vnode)
     if (vnode->need_lto_streaming)
       lto_set_symtab_encoder_in_partition (encoder, vnode);
+  free (order);
+  return encoder;
+}
 
+/* Write out summaries for all the nodes in the callgraph.  */
+
+void
+ipa_write_summaries (void)
+{
+  if ((!flag_generate_lto && !flag_generate_offload) || seen_error ())
+    return;
+
+  select_what_to_stream ();
+  lto_symtab_encoder_t encoder = build_symtab_encoder ();
   ipa_write_summaries_1 (compute_ltrans_boundary (encoder));
+}
 
-  free (order);
+void
+write_offload_lto (void)
+{
+  if (!flag_generate_offload || seen_error ())
+    return;
+
+  lto_stream_offload_p = true;
+
+  select_what_to_stream ();
+  lto_symtab_encoder_t encoder = build_symtab_encoder ();
+  encoder = compute_ltrans_boundary (encoder);
+
+  struct lto_out_decl_state *state = lto_new_out_decl_state ();
+  state->symtab_node_encoder = encoder;
+
+  lto_output_init_mode_table ();
+  lto_push_out_decl_state (state);
+
+  gcc_assert (!flag_wpa);
+
+  write_lto ();
+
+  gcc_assert (lto_get_out_decl_state () == state);
+  lto_pop_out_decl_state ();
+  lto_delete_out_decl_state (state);
+
+  lto_stream_offload_p = false;
 }
 
 /* Same as execute_pass_list but assume that subpasses of IPA passes
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	(revision 224547)
+++ gcc/passes.def	(working copy)
@@ -60,6 +60,10 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_early_warn_uninitialized);
       NEXT_PASS (pass_nothrow);
   POP_INSERT_PASSES ()
+  TERMINATE_PASS_LIST ()
+
+  /* Local optimization passes.  */
+  INSERT_PASSES_AFTER (all_local_opt_passes)
 
   NEXT_PASS (pass_chkp_instrumentation_passes);
   PUSH_INSERT_PASSES_WITHIN (pass_chkp_instrumentation_passes)
@@ -70,6 +74,7 @@ along with GCC; see the file COPYING3.
 
   NEXT_PASS (pass_local_optimization_passes);
   PUSH_INSERT_PASSES_WITHIN (pass_local_optimization_passes)
+      NEXT_PASS (pass_expand_omp_ssa);
       NEXT_PASS (pass_fixup_cfg);
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
Index: gcc/ssa-iterators.h
===================================================================
--- gcc/ssa-iterators.h	(revision 224547)
+++ gcc/ssa-iterators.h	(working copy)
@@ -609,17 +609,21 @@ op_iter_init (ssa_op_iter *ptr, gimple s
     {
       switch (gimple_code (stmt))
 	{
-	  case GIMPLE_ASSIGN:
-	  case GIMPLE_CALL:
-	    ptr->numops = 1;
-	    break;
-	  case GIMPLE_ASM:
-	    ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt));
-	    break;
-	  default:
-	    ptr->numops = 0;
-	    flags &= ~(SSA_OP_DEF | SSA_OP_VDEF);
-	    break;
+	case GIMPLE_ASSIGN:
+	case GIMPLE_CALL:
+	case GIMPLE_OMP_CONTINUE:
+	  ptr->numops = 1;
+	  break;
+	case GIMPLE_ASM:
+	  ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt));
+	  break;
+	case GIMPLE_OMP_FOR:
+	  ptr->numops = gimple_omp_for_collapse (stmt);
+	  break;
+	default:
+	  ptr->numops = 0;
+	  flags &= ~(SSA_OP_DEF | SSA_OP_VDEF);
+	  break;
 	}
     }
   ptr->uses = (flags & (SSA_OP_USE|SSA_OP_VUSE)) ? gimple_use_ops (stmt) : NULL;
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 224547)
+++ gcc/tree-cfg.c	(working copy)
@@ -6649,6 +6649,7 @@ move_stmt_r (gimple_stmt_iterator *gsi_p
 
     case GIMPLE_OMP_RETURN:
     case GIMPLE_OMP_CONTINUE:
+    case GIMPLE_OMP_ENTRY_END:
       break;
     default:
       if (is_gimple_omp (stmt))
@@ -6659,7 +6660,7 @@ move_stmt_r (gimple_stmt_iterator *gsi_p
 	     function.  */
 	  bool save_remap_decls_p = p->remap_decls_p;
 	  p->remap_decls_p = false;
-	  *handled_ops_p = true;
+	  //	  *handled_ops_p = true;
 
 	  walk_gimple_seq_mod (gimple_omp_body_ptr (stmt), move_stmt_r,
 			       move_stmt_op, wi);
Index: gcc/tree-into-ssa.c
===================================================================
--- gcc/tree-into-ssa.c	(revision 224547)
+++ gcc/tree-into-ssa.c	(working copy)
@@ -2442,6 +2442,7 @@ pass_build_ssa::execute (function *fun)
 	SET_SSA_NAME_VAR_OR_IDENTIFIER (name, DECL_NAME (decl));
     }
 
+  verify_ssa (false, true);
   return 0;
 }
 
Index: gcc/tree-nested.c
===================================================================
--- gcc/tree-nested.c	(revision 224547)
+++ gcc/tree-nested.c	(working copy)
@@ -673,14 +673,8 @@ walk_gimple_omp_for (gomp_for *for_stmt,
       wi.is_lhs = false;
       walk_tree (gimple_omp_for_final_ptr (for_stmt, i), callback_op,
 		 &wi, NULL);
-
-      t = gimple_omp_for_incr (for_stmt, i);
-      gcc_assert (BINARY_CLASS_P (t));
-      wi.val_only = false;
-      walk_tree (&TREE_OPERAND (t, 0), callback_op, &wi, NULL);
-      wi.val_only = true;
-      wi.is_lhs = false;
-      walk_tree (&TREE_OPERAND (t, 1), callback_op, &wi, NULL);
+      walk_tree (gimple_omp_for_incr_ptr (for_stmt, i), callback_op,
+		 &wi, NULL);
     }
 
   seq = gsi_seq (wi.gsi);
Index: gcc/tree-ssa-operands.c
===================================================================
--- gcc/tree-ssa-operands.c	(revision 224547)
+++ gcc/tree-ssa-operands.c	(working copy)
@@ -942,11 +942,18 @@ parse_ssa_operands (struct function *fn,
       append_vuse (gimple_vop (fn));
       goto do_default;
 
+    case GIMPLE_OMP_FOR:
+      start = gimple_omp_for_collapse (stmt);
+      for (i = 0; i < start; i++)
+	get_expr_operands (fn, stmt, gimple_op_ptr (stmt, i), opf_def);
+      goto do_default;
+      
     case GIMPLE_CALL:
       /* Add call-clobbered operands, if needed.  */
       maybe_add_call_vops (fn, as_a <gcall *> (stmt));
       /* FALLTHRU */
 
+    case GIMPLE_OMP_CONTINUE:
     case GIMPLE_ASSIGN:
       get_expr_operands (fn, stmt, gimple_op_ptr (stmt, 0), opf_def);
       start = 1;

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 11/14] libgomp: avoid variable-length stack allocation in team.c
  2015-10-20 20:48   ` Bernd Schmidt
@ 2015-10-20 21:41     ` Alexander Monakov
  2015-10-20 21:46       ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 21:41 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Tue, 20 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > NVPTX does not support alloca or variable-length stack allocations, thus
> > heap allocation needs to be used instead.  I've opted to make this a generic
> > change instead of guarding it with an #ifdef: libgomp usually leaves thread
> > stack size up to libc, so avoiding unbounded stack allocation makes sense.
> >
> >  * task.c (GOMP_task): Use a fixed-size on-stack buffer or a heap
> >          allocation instead of a variable-size on-stack allocation.
> 
> > +	  char buf_fixed[2048], *buf = buf_fixed;
> 
> This might also not be the best of ideas on a GPU - the stack size isn't all
> that unlimited, what with there being lots of threads. If I do
> 
>   size_t stack, heap;
>   cuCtxGetLimit (&stack, CU_LIMIT_STACK_SIZE);
> 
> in the nvptx-run program we've used for testing, it shows a default stack size
> of just 1kB.

Thanks, NVPTX will need a low buf_fixed size, perhaps 64 bytes or so.
What about the generic case, should it use a more generous threshold,
or revert to existing unbounded alloca?
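
For the generic case, one option would be letting the target override the
threshold -- just a sketch, GOMP_TASK_LOCAL_BUF is a made-up macro, not
something in the tree:

  /* 2048 by default; an nvptx-specific header could define it to 64.  */
  #ifndef GOMP_TASK_LOCAL_BUF
  #define GOMP_TASK_LOCAL_BUF 2048
  #endif

  long buf_size = arg_size + arg_align - 1;
  char buf_fixed[GOMP_TASK_LOCAL_BUF], *buf = buf_fixed;
  if (sizeof (buf_fixed) < buf_size)
    buf = gomp_malloc (buf_size);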

Any ideas how big the required allocation size is in practice?

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 11/14] libgomp: avoid variable-length stack allocation in team.c
  2015-10-20 21:41     ` Alexander Monakov
@ 2015-10-20 21:46       ` Bernd Schmidt
  0 siblings, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 21:46 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/20/2015 11:36 PM, Alexander Monakov wrote:
> Thanks, NVPTX will need a low buf_fixed size, perhaps 64 bytes or so.
> What about the generic case, should it use a more generous threshold,
> or revert to existing unbounded alloca?
>
> Any ideas how big the required allocation size is in practice?

I'll defer to Jakub for questions and patches that are more strongly 
libgomp-related than ptx-related.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 03/14] nvptx: expand support for address spaces
  2015-10-20 21:41         ` Cesar Philippidis
@ 2015-10-20 21:51           ` Bernd Schmidt
  0 siblings, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 21:51 UTC (permalink / raw)
  To: Cesar Philippidis, Alexander Monakov
  Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/20/2015 11:41 PM, Cesar Philippidis wrote:
> Was it this one that you're referring to, Bernd? I think this is the
> patch that introduces the "oacc ganglocal" attribute. It has bitrotted
> significantly though.

Yeah, the bits in nvptx.c are the ones I was referring to. Thanks!

> What are you planning on using shared memory for? It's an extremely
> limited resource and it has some quirks.

Alexander wanted to allocate something in shared memory. I haven't 
really gotten around to looking at the code that would use it.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 09/14] libgomp: provide barriers on NVPTX
  2015-10-20 20:56   ` Bernd Schmidt
@ 2015-10-20 22:00     ` Alexander Monakov
  2015-10-21  2:23       ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-20 22:00 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Tue, 20 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > On NVPTX, there's 16 hardware barriers for each thread team, each barrier
> > has
> > a variable waiter count.  The instruction 'bar.sync N, M;' allows to wait on
> > barrier number N until M threads have arrived.  M should be pre-multiplied
> > by
> > warp width.  It's also possible to 'post' the barrier without suspending
> > with
> > 'bar.arrive'.
> >
> > We should be able to provide gomp barrier via a combination of ptx barriers
> > and atomics.  This patch is a first step in that direction.
> >
> > It's mostly a copy of the Linux implementation, and it's very likely that
> > functions more complex than gomp_barrier_wait_end are implemented
> > incorrectly.
> > I will have to review all of that (and optimize, hopefully).
> >
> > I'm not sure if naked asm()'s are OK.  It's possible to implement a builtin
> > instead for a minor beautification.  Thoughts?
> 
> I have no concerns about naked asms. I'm more concerned about whether this
> actually works - how much testing has this had?

It does survive libgomp c/c++ tests, which make use of the simplest barrier,
gomp_barrier_wait_end, at least.

> My experience has been that there is practically no way of using bar.sync
> reliably, since we can't control warp divergence and reconvergence at the
> ptx level but the hardware bar.sync instruction only works when executed by
> all threads in a warp at the same time.

I don't think it's that bad.  Divergence and reconvergence are implicit: a
non-uniform branch is a divergence point, and the corresponding reconvergence
point is at its immediate post-dominator.  Though I do miss a way to
force reconvergence at a given point, "resurrecting" masked-out warp members.

For bar.sync behavior the documentation gives an explicit guarantee: every
time a warp encounters a bar.sync instruction, it bumps the count by the warp
width (32), irrespective of how many warp members are active at the time of
encounter.
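
As a sketch (not part of this series, and the wrapper names are invented),
the naked asm()s under discussion would look about like this:

  /* Wait on barrier 1 until COUNT threads have arrived; COUNT must be
     pre-multiplied by the warp width (32), as described above.  */
  static inline void
  gomp_ptx_bar_sync (int count)
  {
    asm volatile ("bar.sync 1, %0;" : : "r" (count) : "memory");
  }

  /* Post barrier 1 without suspending.  */
  static inline void
  gomp_ptx_bar_post (int count)
  {
    asm volatile ("bar.arrive 1, %0;" : : "r" (count) : "memory");
  }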

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC
  2015-10-20 18:34 ` [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC Alexander Monakov
@ 2015-10-20 23:48   ` Bernd Schmidt
  2015-10-21  5:40     ` Alexander Monakov
  2015-10-21  8:11   ` Jakub Jelinek
  1 sibling, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 23:48 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> The NVPTX backend emits each function either as .func (callable only from the
> device code) or as .kernel (entry point for a parallel region).  OpenMP
> lowering adds "omp target entrypoint" attribute to functions outlined from
> target regions.  Unlike OpenACC offloading, OpenMP offloading does not invoke
> such outlined functions directly, but instead passes their address to
> 'gomp_nvptx_main'.  Restrict the special attribute treatment to OpenACC only.
>
> 	* config/nvptx/nvptx.c (write_as_kernel): Additionally test
> 	flag_openacc for "omp target entrypoint".

I'm not too keen on this. The idea of the attribute is to make it a 
kernel that can be executed from the host. If that isn't wanted by 
OpenMP, it shouldn't set the attribute.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints
  2015-10-20 18:34 ` [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints Alexander Monakov
@ 2015-10-20 23:57   ` Bernd Schmidt
  2015-10-21  8:20   ` Jakub Jelinek
  1 sibling, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-20 23:57 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> (note to reviewers: I'm not sure what we're after here, on the high level;
> will be happy to rework the patch in a saner manner based on feedback, or even
> drop it for now)
>
> At the moment the attribute setting logic in omp-low.c is such that if a
> function that should be present in target code does not already have 'omp
> declare target' attribute, it receives 'omp target entrypoint'.  That is
> wasteful: clearly not all user-declared target functions will be target region
> entry points in OpenMP.
>
> The motivating example for this change is OpenMP parallel target regions.  The
> 'parallel' part is outlined into its own function.  We don't want that
> function be an 'entrypoint' on PTX (but only as a matter of optimality rather
> than correctness).
>
> 	* omp-low.c (create_omp_child_function): Set "omp target entrypoint"
>          or "omp declare target" attribute based on is_gimple_omp_offloaded.

I think this looks reasonable, but you might want to adjust it in 
whatever way is necessary so that you can drop patch 1/14.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-20 18:34 ` [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX Alexander Monakov
@ 2015-10-21  0:07   ` Bernd Schmidt
  2015-10-21  6:49     ` Alexander Monakov
  2015-10-21  8:48   ` Jakub Jelinek
  2015-11-03 14:25   ` Alexander Monakov
  2 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-21  0:07 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> (This patch serves as a straw man proposal to have something concrete for
> discussion and further patches)
>
> On PTX, stack memory is private to each thread.  When the master thread constructs
> 'omp_data_o' on its own stack and passes it to other threads via
> GOMP_parallel by reference, other threads cannot use the resulting pointer.
> We need to arrange for structures passed between threads to be in global, or better,
> in PTX __shared__ memory (private to each CUDA thread block).

I guess the question is - why is it better? Do you have multiple thread 
blocks active in your execution model, and do they require different 
omp_data_o structures? Are accesses to it performance critical (more so 
than any other access?) If the answers are "no", then I think you 
probably want to fall back to just normal malloced memory or a regular
static variable, as shared memory is a fairly limited resource.

It might be slightly cleaner to have the copy described as a new builtin 
call that is always generated and expanded to nothing on normal targets 
rather than modifying existing calls in the IL. Or maybe:

  p = __builtin_omp_select_location (&stack_local_var, size)
  ....
  __builtin_omp_maybe_free (p);

where the select_location could get simplified to a malloc for nvptx, 
hopefully making the stack variable unused and discarded.
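
Spelled out as plain C, the intended folding would be roughly this (the
function names are invented for illustration; the real builtins would be
folded during expansion rather than called at run time):

  #include <stdlib.h>

  void *
  omp_select_location (void *stack_var, size_t size)
  {
  #ifdef __nvptx__
    return malloc (size);	/* the stack copy becomes unused */
  #else
    (void) size;
    return stack_var;		/* folds back to the on-stack object */
  #endif
  }

  void
  omp_maybe_free (void *p)
  {
  #ifdef __nvptx__
    free (p);
  #else
    (void) p;			/* expands to nothing */
  #endif
  }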

> Using separate variables is wasteful: they should go into a union to reduce
> shared memory consumption.

Not sure what you mean by separate variables?


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 09/14] libgomp: provide barriers on NVPTX
  2015-10-20 22:00     ` Alexander Monakov
@ 2015-10-21  2:23       ` Bernd Schmidt
  0 siblings, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-21  2:23 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/20/2015 11:51 PM, Alexander Monakov wrote:
> On Tue, 20 Oct 2015, Bernd Schmidt wrote:
>
>> My experience has been that there is practically no way of using bar.sync
>> reliably, since we can't control warp divergence and reconvergence at the
>> ptx level but the hardware bar.sync instruction only works when executed by
>> all threads in a warp at the same time.
>
> I don't think it's that bad.  Divergence and reconvergence are implicit: a
> non-uniform branch is a divergence point, and the corresponding reconvergence
> point is at its immediate post-dominator.

That's good in theory, but I have seen cases where very odd things 
seemed to be happening in ptxas, and another problem is that gcc is 
quite unconcerned about maintaining such reconvergence points in its 
optimization passes.

> For bar.sync behavior the documentation gives an explicit guarantee: every
> time a warp encounters a bar.sync instruction, it bumps the count by the warp
> width (32), irrespective of how many warp members are active at the time of
> encounter.

Yeah, but that's undesirable: you can breeze right past a bar.sync 
before the thing you wanted to synchronize has completed.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC
  2015-10-20 23:48   ` Bernd Schmidt
@ 2015-10-21  5:40     ` Alexander Monakov
  0 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21  5:40 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Wed, 21 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > The NVPTX backend emits each function either as .func (callable only from
> > the
> > device code) or as .kernel (entry point for a parallel region).  OpenMP
> > lowering adds "omp target entrypoint" attribute to functions outlined from
> > target regions.  Unlike OpenACC offloading, OpenMP offloading does not
> > invoke
> > such outlined functions directly, but instead passes their address to
> > 'gomp_nvptx_main'.  Restrict the special attribute treatment to OpenACC
> > only.
> >
> >  * config/nvptx/nvptx.c (write_as_kernel): Additionally test
> >  flag_openacc for "omp target entrypoint".
> 
> I'm not too keen on this. The idea of the attribute is to make it a kernel
> that can be executed from the host. If that isn't wanted by OpenMP, it
> shouldn't set the attribute.

I do want the attribute in OpenMP, but I want to use the attribute
differently in the backend, as patch 02/14 demonstrates.

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-21  0:07   ` Bernd Schmidt
@ 2015-10-21  6:49     ` Alexander Monakov
  0 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21  6:49 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Wed, 21 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > (This patch serves as a straw man proposal to have something concrete for
> > discussion and further patches)
> >
> > On PTX, stack memory is private to each thread.  When the master thread
> > constructs
> > 'omp_data_o' on its own stack and passes it to other threads via
> > GOMP_parallel by reference, other threads cannot use the resulting pointer.
> > We need to arrange for structures passed between threads to be in global, or
> > better,
> > in PTX __shared__ memory (private to each CUDA thread block).
> 
> I guess the question is - why is it better? Do you have multiple thread blocks
> active in your execution model, 

'#pragma omp teams' should map to spawning multiple thread blocks, so yes, at
least in my plans I do (but honestly I don't see how it affects the
heap-vs-shared memory decision here)

> and do they require different omp_data_o structures?

yes, each omp_data_o should be private to a team

> Are accesses to it performance critical (more so than any other access?)

Not sure how to address the "more so than ..." part, but since omp_data_o is
accessed by all threads after entering a parallel region, potentially many
times throughout the region, it does seem helpful to arrange it in shared
memory.
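
To make that concrete, here is a sketch (with simplified, invented names)
of the outlined body that every thread in the team executes; it reaches
the master's omp_data_o only through the pointer it was handed:

  struct omp_data_o { int *arr; int n; };

  static void
  outlined_parallel_fn (void *data)
  {
    struct omp_data_o *o = (struct omp_data_o *) data;
    for (int i = 0; i < o->n; i++)
      /* On PTX, DATA must not point into another thread's stack.  */
      o->arr[i] += 1;
  }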

I expect there will be other instances like this one, where some on-stack data
will need to be moved to team-shared storage for nvptx.

> It might be slightly cleaner to have the copy described as a new builtin
> call that is always generated and expanded to nothing on normal targets
> rather than modifying existing calls in the IL. Or maybe:
> 
>  p = __builtin_omp_select_location (&stack_local_var, size) ....
>  __builtin_omp_maybe_free (p);
> 
> where the select_location could get simplified to a malloc for nvptx,
> hopefully making the stack variable unused and discarded.

Agreed.

> > Using separate variables is wasteful: they should go into a union to
> > reduce shared memory consumption.
> 
> Not sure what you mean by separate variables?

If two parallel regions are nested in a target region, there will be two
omp_data_o variables of potentially different types, but they can reuse the
same storage.  The patch does not achieve that, because it simply emits a
static __shared__ declaration for each original variable.
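
Concretely, something like this, sketched in CUDA-flavored C (the struct
layouts and the attribute spelling are illustrative, not what the patch
currently emits):

  struct omp_data_o_1 { int *a; long n; };	/* first nested parallel */
  struct omp_data_o_2 { double *b; };		/* second nested parallel */

  /* One team-shared slot reused by both regions, instead of one
     static __shared__ object per variable.  */
  static union
  {
    struct omp_data_o_1 region1;
    struct omp_data_o_2 region2;
  } omp_data_shared __attribute__ ((shared));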

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (13 preceding siblings ...)
  2015-10-20 19:01 ` [gomp4 02/14] nvptx: emit pointers to OpenMP target region entry points Alexander Monakov
@ 2015-10-21  7:55 ` Martin Jambor
  2015-10-21  8:56 ` Jakub Jelinek
  2015-10-21 12:06 ` Bernd Schmidt
  16 siblings, 0 replies; 99+ messages in thread
From: Martin Jambor @ 2015-10-21  7:55 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

Hi,

On Tue, Oct 20, 2015 at 09:34:22PM +0300, Alexander Monakov wrote:
> Hello,
> 
> This patch series moves libgomp/nvptx porting further along to get initial
> bits of parallel execution working, mostly unbreaking the testsuite.  Please
> have a look!  I'm interested in feedback, and would like to know if it's
> suitable to become a part of a branch.
> 
> This patch series ports enough of libgomp.c to get warp-level parallelism
> working for OpenMP offloading.  The overall approach is as follows.
> 
> I've opted not to use dynamic parallelism.

in that case, I encourage you to have a look at omp-low.c (and
gimple.h) in the hsa branch.  Since Cauldron I have improved the code
that processes constructs such as

#pragma omp target teams distribute parallel for

so that it creates copies of the target bodies that are suitable for
execution as one kernel (all of it is eventually outlined to one
single function).  It does not handle more complicated cases (most
notably, supporting reductions well will require quite some work),
but now I do think this is the way to expand code for GPUs, at least
for the sources where it makes sense.

The code is still OpenMP 4.0; I have only just recently started
porting it to the 4.5 that landed in trunk.  When I am done, I will
write up an overview and post the first patch for review.  Let's
hope it is not going to take too long.

Martin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC
  2015-10-20 18:34 ` [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC Alexander Monakov
  2015-10-20 23:48   ` Bernd Schmidt
@ 2015-10-21  8:11   ` Jakub Jelinek
  2015-10-21  8:36     ` Alexander Monakov
  1 sibling, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  8:11 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:23PM +0300, Alexander Monakov wrote:
> The NVPTX backend emits each function either as .func (callable only from the
> device code) or as .kernel (entry point for a parallel region).  OpenMP
> lowering adds "omp target entrypoint" attribute to functions outlined from
> target regions.  Unlike OpenACC offloading, OpenMP offloading does not invoke
> such outlined functions directly, but instead passes their address to
> 'gomp_nvptx_main'.  Restrict the special attribute treatment to OpenACC only.
> 
> 	* config/nvptx/nvptx.c (write_as_kernel): Additionally test
> 	flag_openacc for "omp target entrypoint".
> ---
>  gcc/config/nvptx/nvptx.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index 21c59ef..df7b61f 100644
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -401,8 +401,10 @@ write_one_arg (std::stringstream &s, tree type, int i, machine_mode mode,
>  static bool
>  write_as_kernel (tree attrs)
>  {
> -  return (lookup_attribute ("kernel", attrs) != NULL_TREE
> -	  || lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE);
> +  if (flag_openacc
> +      && lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE)
> +    return true;
> +  return lookup_attribute ("kernel", attrs) != NULL_TREE;
>  }
>  
>  /* Write a function decl for DECL to S, where NAME is the name to be used.

This is certainly wrong.  People can use -fopenmp -fopenacc together; whether
flag_openacc is set does not tell you whether the particular function is an
outlined openacc or openmp region.
The question is, is .kernel actually harmful, even when you invoke it
through a stub wrapper?  Like, does it not work at all, or is it slower than it
could be?  If it is harmful, you should use a different attribute for
OpenMP and OpenACC target entrypoints, so perhaps
"omp target entrypoint" for OpenMP ones and "acc target entrypoint" for
OpenACC ones?  create_omp_child_function, which adds the attribute, should
have the stmt for which it is created in ctx->stmt, so you could e.g. use
is_gimple_omp_oacc for that.
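
Roughly like this in create_omp_child_function (an untested sketch):

  tree attr_name
    = get_identifier (is_gimple_omp_oacc (ctx->stmt)
		      ? "acc target entrypoint" : "omp target entrypoint");
  DECL_ATTRIBUTES (decl)
    = tree_cons (attr_name, NULL_TREE, DECL_ATTRIBUTES (decl));

with write_as_kernel then checking for "kernel" and the OpenACC spelling
only.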

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints
  2015-10-20 18:34 ` [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints Alexander Monakov
  2015-10-20 23:57   ` Bernd Schmidt
@ 2015-10-21  8:20   ` Jakub Jelinek
  2015-10-30 16:58     ` Alexander Monakov
  1 sibling, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  8:20 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:27PM +0300, Alexander Monakov wrote:
> (note to reviewers: I'm not sure what we're after here, on the high level;
> will be happy to rework the patch in a saner manner based on feedback, or even
> drop it for now)
> 
> At the moment the attribute setting logic in omp-low.c is such that if a
> function that should be present in target code does not already have 'omp
> declare target' attribute, it receives 'omp target entrypoint'.  That is
> wasteful: clearly not all user-declared target functions will be target region
> entry points in OpenMP.
> 
> The motivating example for this change is OpenMP parallel target regions.  The
> 'parallel' part is outlined into its own function.  We don't want that
> function to be an 'entrypoint' on PTX (but only as a matter of optimality rather
> than correctness).
> 
> 	* omp-low.c (create_omp_child_function): Set "omp target entrypoint"
>         or "omp declare target" attribute based on is_gimple_omp_offloaded.

This is ok in principle, but you want to change it for 01/14.
After that I think it is ready for trunk.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC
  2015-10-21  8:11   ` Jakub Jelinek
@ 2015-10-21  8:36     ` Alexander Monakov
  0 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21  8:36 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

On Wed, 21 Oct 2015, Jakub Jelinek wrote:

> On Tue, Oct 20, 2015 at 09:34:23PM +0300, Alexander Monakov wrote:
> > The NVPTX backend emits each functions either as .func (callable only from the
> > device code) or as .kernel (entry point for a parallel region).  OpenMP
> > lowering adds "omp target entrypoint" attribute to functions outlined from
> > target regions.  Unlike OpenACC offloading, OpenMP offloading does not invoke
> > such outlined functions directly, but instead passes their address to
> > 'gomp_nvptx_main'.  Restrict the special attribute treatment to OpenACC only.
> > 
> > 	* config/nvptx/nvptx.c (write_as_kernel): Additionally test
> > 	flag_openacc for "omp target entrypoint".
[...]

> This is certainly wrong.  People can use -fopenmp -fopenacc together; whether
> flag_openacc is set does not tell you whether the particular function is an
> outlined OpenACC or OpenMP region.
> The question is, is .kernel actually harmful, even when you invoke it
> through a stub wrapper?  Like, does it not work at all, or is it slower than
> it could be?

Taking the address of a .kernel function doesn't work (for some reason it's
restricted to sm_35+), and taking the address is necessary to pass it on to
gomp_nvptx_main.

> If it is harmful, you should use a different attribute for OpenMP and
> OpenACC target entrypoints, so perhaps "omp target entrypoint" for OpenMP
> ones and "acc target entrypoint" for OpenACC ones? create_omp_child_function
> that adds the attribute should have the stmt for which it is created in
> ctx->stmt, so you could e.g. use is_gimple_omp_oacc for that.

Thanks, that should resolve the issue nicely.
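
For concreteness, a rough sketch of what that could look like in
create_omp_child_function -- untested, and the exact spelling is just my
assumption:

  tree attr = is_gimple_omp_oacc (ctx->stmt)
              ? get_identifier ("acc target entrypoint")
              : get_identifier ("omp target entrypoint");
  DECL_ATTRIBUTES (decl)
    = tree_cons (attr, NULL_TREE, DECL_ATTRIBUTES (decl));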

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-20 18:34 ` [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX Alexander Monakov
  2015-10-21  0:07   ` Bernd Schmidt
@ 2015-10-21  8:48   ` Jakub Jelinek
  2015-10-21  9:09     ` Alexander Monakov
  2015-11-03 14:25   ` Alexander Monakov
  2 siblings, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  8:48 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:28PM +0300, Alexander Monakov wrote:
> (This patch serves as a straw man proposal to have something concrete for
> discussion and further patches)
> 
> On PTX, stack memory is private to each thread.  When master thread constructs
> 'omp_data_o' on its own stack and passes it to other threads via
> GOMP_parallel by reference, other threads cannot use the resulting pointer.
> We need to arrange structures passed between threads be in global, or better,
> in PTX __shared__ memory (private to each CUDA thread block).

Can you please clarify on what exactly doesn't work and what works and if it
is just a performance issue or some other?
Because .omp_data_o* variables are just one small part of the picture.
That structure holds sometimes the shared variables themselves (in that case
the model is that the variables are first copied into the structure and
after the end of parallel copied back from that back to the original
location), but often just addresses of the shared variables, where the
shared variables then live in their original location and just the address
is stored in .omp_data_o* field.  In this case, the variable could very well
be just a private automatic variable of the initial thread, living on its
stack.  And then .omp_data_o* contains fields for
firstprivate/lastprivate/reduction etc. variables, typically addresses of
the original variables, I believe for firstprivate it can be also the
variables themselves in certain cases.
In any case, user can do stuff like:
#pragma omp declare target
void bar (int *p)
{
  #pragma omp parallel shared (p)
  {
    use (*p);
  }
}
void foo (void)
{
  int a = 6;
  bar (&a);
}
#pragma omp end declare target
void baz (void)
{
  #pragma omp target
  foo ();
}
and then, even if you arrange for the p variable itself to be copied to heap
or .shared for the duration of the parallel region, what it points to is
still living in initial thread's stack.

If this is just a performance thing, can't you e.g. just copy the
.omp_data_o* structure inside of GOMP_parallel into either some .shared
buffer or heap allocated object and copy it back at the end?
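
E.g. something like this wrapper, as a rough sketch -- it assumes the size
of the structure is made available to the runtime somehow (say, via a
hypothetical GOMP_parallel_shared entry point taking data_size; usual
includes assumed):

void
GOMP_parallel_shared (void (*fn) (void *), void *data, size_t data_size,
                      unsigned num_threads, unsigned flags)
{
  void *buf = gomp_malloc (data_size);  /* or a .shared buffer */
  memcpy (buf, data, data_size);
  GOMP_parallel (fn, buf, num_threads, flags);
  memcpy (data, buf, data_size);        /* copy back at the end */
  free (buf);
}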

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (14 preceding siblings ...)
  2015-10-21  7:55 ` [gomp4 00/14] NVPTX: further porting Martin Jambor
@ 2015-10-21  8:56 ` Jakub Jelinek
  2015-10-21  9:17   ` Alexander Monakov
  2015-10-21 12:06 ` Bernd Schmidt
  16 siblings, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  8:56 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:22PM +0300, Alexander Monakov wrote:
> I've opted not to use dynamic parallelism.  It increases the hardware
> requirement from sm_30 to sm_35, needs a library from CUDA Toolkit at link

I'll try to add the thread_limit/num_teams arguments to GOMP_target_41
soon (together with the target teams clause evaluation changes), so
sometimes you'll have that information at target time, but not always.
Using teams/thread preallocation when possible is fine with me, but I think
it is not always possible, if you can't see what teams will require for
number of teams or what thread_limit it will want, or if thread_limit is
unspecified and you have no idea how many threads will be requested...
I think requiring sm_35 should not be a very big deal.

> time (libcudadevrt.a), and imposes overhead at run time.  The last point might

But if this is the case, that is a really serious issue.  Is that really
something that isn't available in a shared library?
E.g. with my distro GCC maintainer hat on, I'd really like to tweak the
libgomp PTX plugin, so that it compiles against a stub cuda.h header and
doesn't link against libcuda*.so at all, but instead dlopens it, to avoid
hard dependencies on the non-free CUDA stuff and more importantly any link
time dependencies on that.  If libcudadevrt is not
available as a shared library, this of course wouldn't work.  Would be nice to
talk to NVidia about this...

> libgomp.c/thread-limit-2.c: fails to link due to 'usleep' unavailable on
> NVPTX.  Note, the test does not run anything on the device because the target
> region has 'if (0)' clause.

As an optimization, perhaps we could avoid adding the "omp target entrypoint"
attribute for the body of if(0) target region, that one always goes to host
fallback, so no offloaded code is needed.

As for other tests, XFAILing them always is undesirable, supposedly we could
add a dejagnu target check whether the default target goes to PTX (if we
don't have it already) and use that to xfail?  Of course that doesn't help
the thread-limit-2.c testcase.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main
  2015-10-20 21:27       ` Bernd Schmidt
@ 2015-10-21  9:07         ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  9:07 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 11:19:06PM +0200, Bernd Schmidt wrote:
> On 10/20/2015 11:13 PM, Alexander Monakov wrote:
> >On Tue, 20 Oct 2015, Bernd Schmidt wrote:
> >
> >>On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> >>>2.  Make gomp_nvptx_main a device (.func) function.  To have that work, we'd
> >>>need to additionally emit a "trampoline" of sorts in the NVPTX backend.  For
> >>>each OpenMP target entrypoint foo$_omp_fn$0, we'd have to additionally emit
> >>>
> >>>__global__ void foo$_omp_fn$0$entry(void *args)
> >>>{
> >>>     gomp_nvptx_main(foo$_omp_fn$0, args);
> >>>}
> >>
> >>Wouldn't it be simpler to generate a .kernel for every target region function
> >>(as OpenACC does)? That could be a small stub in each case which just calls
> >>gomp_nvptx_main with the right function pointer. We already have the machinery
> >>to look up the right kernel corresponding to a host address and invoke it, so
> >>I think we should just reuse that functionality.
> >
> >As I see we are describing the same thing in different words.
> >
> >In what you describe, and in my quoted paragraph, both gomp_nvptx_main and the
> >function originally outlined for a target region are device-only (.func)
> >functions.  The .kernel function that the plugin looks up and launches is a
> >small piece of code that calls gomp_nvptx_main, passing it a pointer to the
> >target region function.
> >
> >Unless I didn't fully catch what you say?  Like I said in the email, I do like
> >this approach more.
> 
> Could be that we're talking about the same thing. I think I was confused by
> a reference to .func vs .kernel and sm_30 vs sm_35 in patch 2/14. So let's
> go for this approach.

But you'd better then rename the .kernel stub to the name that libgomp looks
up and rename the actual outlined body to some other name (add some suffix).
Or maybe another possibility is to use the normal outlined target body
function, and just inject calls to some gomp_nvptx_start and gomp_nvptx_end
functions at the beginning and end of the "omp target entrypoint" function.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-21  8:48   ` Jakub Jelinek
@ 2015-10-21  9:09     ` Alexander Monakov
  2015-10-21  9:24       ` Jakub Jelinek
  2015-10-21 10:42       ` Bernd Schmidt
  0 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21  9:09 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik



On Wed, 21 Oct 2015, Jakub Jelinek wrote:

> On Tue, Oct 20, 2015 at 09:34:28PM +0300, Alexander Monakov wrote:
> > (This patch serves as a straw man proposal to have something concrete for
> > discussion and further patches)
> > 
> > On PTX, stack memory is private to each thread.  When master thread constructs
> > 'omp_data_o' on its own stack and passes it to other threads via
> > GOMP_parallel by reference, other threads cannot use the resulting pointer.
> > We need to arrange structures passed between threads be in global, or better,
> > in PTX __shared__ memory (private to each CUDA thread block).
> 
> Can you please clarify on what exactly doesn't work and what works and if it
> is just a performance issue or some other?

Sadly it's not just performance.

In PTX, stack storage is in .local address space -- and that memory is
thread-private.  A thread can make a pointer to its own stack memory and
successfully dereference it, but dereferencing that pointer from other threads
does not work (I observed it returning garbage values).

The reason for .local addresses being private like that, I think, is that
references to .local memory undergo address translation to make simultaneous
accesses to stack slots from threads in a warp form a coalesced memory
transaction.  So .local memory that looks consecutive from an individual
thread's point of view is actually strided in physical memory.

So yes, when omp_data_o needs to hold a pointer to stack memory, it still won't
work.  For simple cases the compiler could notice it and provide a diagnostic
message, but in general I don't see what can be done, apart from documenting
it as a fundamental limitation.

(exposing shared memory to users might alleviate the issue slightly, but
that is non-trivial in itself)
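
For concreteness, a minimal example of the failing pattern (hypothetical,
not from the testsuite; use() stands for an arbitrary read):

  void foo (void)
  {
    int x = 42;
    int *p = &x;   /* points into the master thread's .local stack */
  #pragma omp parallel firstprivate (p)
    use (*p);      /* other threads read garbage through p */
  }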

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 08/14] libgomp nvptx: populate proc.c
  2015-10-20 18:34 ` [gomp4 08/14] libgomp nvptx: populate proc.c Alexander Monakov
@ 2015-10-21  9:15   ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  9:15 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:30PM +0300, Alexander Monakov wrote:
> This provides minimal implementations of gomp_dynamic_max_threads and
> omp_get_num_procs.
> 
> 	* config/nvptx/proc.c: New.

LGTM. 

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-21  8:56 ` Jakub Jelinek
@ 2015-10-21  9:17   ` Alexander Monakov
  2015-10-21  9:29     ` Jakub Jelinek
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21  9:17 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> > time (libcudadevrt.a), and imposes overhead at run time.  The last point might
> 
> But if this is the case, that is a really serious issue.  Is that really
> something that isn't available in a shared library?
> E.g. with my distro GCC maintainer hat on, I'd really like to tweak the
> libgomp PTX plugin, so that it compiles against a stub cuda.h header and
> doesn't link against libcuda*.so at all, but instead dlopens it, to avoid
> hard dependencies on the non-free CUDA stuff and more importantly any link
> time dependencies on that.  If libcudadevrt is not
> available as a shared library, this of course wouldn't work.  Would be nice to
> talk to NVidia about this...

It's a library of device (PTX) code, not host code, so dynamic linking does
not apply.

> > libgomp.c/thread-limit-2.c: fails to link due to 'usleep' unavailable on
> > NVPTX.  Note, the test does not run anything on the device because the target
> > region has 'if (0)' clause.
> 
> As an optimization, perhaps we could avoid adding the "omp target entrypoint"
> attribute for the body of if(0) target region, that one always goes to host
> fallback, so no offloaded code is needed.
> 
> As for other tests, XFAILing them always is undesirable, supposedly we could
> add a dejagnu target check whether the default target goes to PTX (if we
> don't have it already) and use that to xfail?

Yes, that's what I meant; such a check is already implemented for OpenACC.

> Of course that doesn't help the thread-limit-2.c testcase.

Why not?

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-21  9:09     ` Alexander Monakov
@ 2015-10-21  9:24       ` Jakub Jelinek
  2015-10-21 10:42       ` Bernd Schmidt
  1 sibling, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  9:24 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Wed, Oct 21, 2015 at 12:07:22PM +0300, Alexander Monakov wrote:
> On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> 
> > On Tue, Oct 20, 2015 at 09:34:28PM +0300, Alexander Monakov wrote:
> > > (This patch serves as a straw man proposal to have something concrete for
> > > discussion and further patches)
> > > 
> > > On PTX, stack memory is private to each thread.  When master thread constructs
> > > 'omp_data_o' on its own stack and passes it to other threads via
> > > GOMP_parallel by reference, other threads cannot use the resulting pointer.
> > > We need to arrange structures passed between threads be in global, or better,
> > > in PTX __shared__ memory (private to each CUDA thread block).
> > 
> > Can you please clarify on what exactly doesn't work and what works and if it
> > is just a performance issue or some other?
> 
> Sadly it's not just performance.
> 
> In PTX, stack storage is in .local address space -- and that memory is
> thread-private.  A thread can make a pointer to its own stack memory and
> successfully dereference it, but dereferencing that pointer from other threads
> does not work (I observed it returning garbage values).
> 
> The reason for .local addresses being private like that, I think, is that
> references to .local memory undergo address translation to make simultaneous
> accesses to stack slots from threads in a warp form a coalesced memory
> transaction.  So .local memory that looks consecutive from an individual
> thread's point of view is actually strided in physical memory.
> 
> So yes, when omp_data_o needs to hold a pointer to stack memory, it still won't
> work.  For simple cases the compiler could notice it and provide a diagnostic
> message, but in general I don't see what can be done, apart from documenting
> it as a fundamental limitation.
> 
> (exposing shared memory to users might alleviate the issue slightly, but
> that is non-trivial in itself)

Ugh, that is an extremely serious limitation.  Guess it would be nice to
investigate a little bit what other compilers are doing here.

For variables defined inside the function that contains the parallel region
I guess we could somehow notice, add some attributes or whatever, and
try to allocate those variables in .shared memory or on the heap instead.
But that would surely catch just the easy cases.
Another thing is the copyprivate clause, which needs to broadcast a private
variable of one thread to all threads participating in the parallel.
Right now this is implemented everywhere the standard host way: each thread
but one is told the address of the private var in the one thread that
executed the single region and copies the var back (note, for C++,
this actually means invoking an assignment operator, which can do various
things).

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-21  9:17   ` Alexander Monakov
@ 2015-10-21  9:29     ` Jakub Jelinek
  2015-10-28 17:22       ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  9:29 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Wed, Oct 21, 2015 at 12:16:35PM +0300, Alexander Monakov wrote:
> > Of course that doesn't help the thread-limit-2.c testcase.
> 
> Why not?

Because the compiler can be configured for multiple offloading devices,
and PTX might not be the first device.  So, you'd need to have a tcl
test whether PTX is enabled at all rather than whether it is the default
device.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 09/14] libgomp: provide barriers on NVPTX
  2015-10-20 18:53 ` [gomp4 09/14] libgomp: provide barriers on NVPTX Alexander Monakov
  2015-10-20 20:56   ` Bernd Schmidt
@ 2015-10-21  9:39   ` Jakub Jelinek
  1 sibling, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  9:39 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:31PM +0300, Alexander Monakov wrote:
> +  asm ("bar.sync 0, %0;" : : "r"(32*bar->total));

Formatting: space between "r" and "(", spaces around "*" (in many places).

As for re-convergence of threads in a warp, if we use threads in the warp
other than thread 0 only for simd regions, I'd strongly hope that the end
of a simd region (or "vectorized" loop) is always a convergence point.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main
  2015-10-20 18:52 ` [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main Alexander Monakov
@ 2015-10-21  9:49   ` Jakub Jelinek
  2015-10-21 14:41     ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  9:49 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:32PM +0300, Alexander Monakov wrote:
> diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
> deleted file mode 100644
> index e69de29..0000000
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 1454adf..f25b265 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -483,7 +483,15 @@ enum gomp_cancel_kind
>  
>  /* ... and here is that TLS data.  */
>  
> -#if defined HAVE_TLS || defined USE_EMUTLS
> +#if defined __nvptx__
> +extern struct gomp_thread *nvptx_thrs;

What kind of address space is this variable?  It should be
a per-CTA var, so that different teams have different ones, and
simultaneous target regions have different ones too.

> +static inline struct gomp_thread *gomp_thread (void)
> +{
> +  int tid;
> +  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
> +  return nvptx_thrs + tid;
> +}
> +#elif defined HAVE_TLS || defined USE_EMUTLS

Other than that, yes, such a bit of nvptx-specific stuff is acceptable here.

>  extern __thread struct gomp_thread gomp_tls_data;
>  static inline struct gomp_thread *gomp_thread (void)
>  {
> diff --git a/libgomp/team.c b/libgomp/team.c
> index 7671b05..5b74532 100644
> --- a/libgomp/team.c
> +++ b/libgomp/team.c
> @@ -30,6 +30,7 @@
>  #include <stdlib.h>
>  #include <string.h>
>  
> +#ifdef LIBGOMP_USE_PTHREADS
>  /* This attribute contains PTHREAD_CREATE_DETACHED.  */
>  pthread_attr_t gomp_thread_attr;
>  
> @@ -43,6 +44,7 @@ __thread struct gomp_thread gomp_tls_data;
>  #else
>  pthread_key_t gomp_tls_key;
>  #endif
> +#endif

I'm surprised that for team.c you chose to adjust the shared source,
rather than copy and remove all the cruft you don't need/want.

That includes the LIBGOMP_USE_PTHREADS guarded parts, all the thread binding
stuff etc.  I'd like to see at least for comparison how much actually
remained in there.

> @@ -58,6 +60,52 @@ struct gomp_thread_start_data
>    bool nested;
>  };
>  
> +#ifdef __nvptx__
> +struct gomp_thread *nvptx_thrs;
> +
> +static struct gomp_thread_pool *gomp_new_thread_pool (void);
> +static void *gomp_thread_start (void *);
> +
> +void __attribute__((kernel))
> +gomp_nvptx_main (void (*fn) (void *), void *fn_data)
> +{
> +  int ntids, tid, laneid;
> +  asm ("mov.u32 %0, %%laneid;" : "=r" (laneid));
> +  if (laneid)
> +    return;
> +  static struct gomp_thread_pool *pool;
> +  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
> +  asm ("mov.u32 %0, %%ntid.y;" : "=r"(ntids));
> +  if (tid == 0)
> +    {
> +      gomp_global_icv.nthreads_var = ntids;
> +
> +      nvptx_thrs = gomp_malloc_cleared (ntids * sizeof (*nvptx_thrs));
> +
> +      pool = gomp_new_thread_pool ();
> +      pool->threads = gomp_malloc (ntids * sizeof (*pool->threads));
> +      pool->threads[0] = nvptx_thrs;
> +      pool->threads_size = ntids;
> +      pool->threads_used = ntids;
> +      gomp_barrier_init (&pool->threads_dock, ntids);
> +
> +      nvptx_thrs[0].thread_pool = pool;
> +      asm ("bar.sync 0;");
> +      fn (fn_data);
> +
> +      gomp_free_thread (nvptx_thrs);
> +      free (nvptx_thrs);
> +    }
> +  else
> +    {
> +      struct gomp_thread_start_data tsdata = {0};
> +      tsdata.ts.team_id = tid;
> +      asm ("bar.sync 0;");
> +      tsdata.thread_pool = pool;
> +      gomp_thread_start (&tsdata);
> +    }
> +}
> +#endif

If nvptx is going to use the toplevel team.c, then at least this should not
go in there, it is sufficiently large, and you can just stick it into
config/nvptx/team.c and include the toplevel team.c from there.
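
I.e. roughly (sketch; the include path is illustrative):

  /* config/nvptx/team.c */
  #include "../../team.c"

  void __attribute__((kernel))
  gomp_nvptx_main (void (*fn) (void *), void *fn_data)
  {
    /* ... body as in this patch ... */
  }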

> @@ -88,7 +138,8 @@ gomp_thread_start (void *xdata)
>    thr->task = data->task;
>    thr->place = data->place;
>  
> -  thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;
> +  if (thr->ts.team)
> +    thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;

Why this?  Lots of other places in gomp_thread_start assume thr->ts.team is
non-NULL.  And this isn't even guarded with __nvptx__; we don't want
to slow down host thread start.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 11/14] libgomp: avoid variable-length stack allocation in team.c
  2015-10-20 18:34 ` [gomp4 11/14] libgomp: avoid variable-length stack allocation in team.c Alexander Monakov
  2015-10-20 20:48   ` Bernd Schmidt
@ 2015-10-21  9:59   ` Jakub Jelinek
  1 sibling, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21  9:59 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:33PM +0300, Alexander Monakov wrote:
> NVPTX does not support alloca or variable-length stack allocations, thus
> heap allocation needs to be used instead.  I've opted to make this a generic
> change instead of guarding it with an #ifdef: libgomp usually leaves thread
> stack size up to libc, so avoiding unbounded stack allocation makes sense.
> 
> 	* task.c (GOMP_task): Use a fixed-size on-stack buffer or a heap
>         allocation instead of a variable-size on-stack allocation.

I don't like this unconditionally.
This really isn't unbounded; the buffer just contains the privatized
variables.  If one uses
int c[124];
void foo (void)
{
  int a, b[10];
  #pragma omp parallel firstprivate (a, b, c)
  {
    use (a, b, c);
  }
}
then the private copies of the variables are allocated on the stack too,
a, b already in addition to the original non-privatized vars a and b,
c, which has been above a global var, is automatic just in the private copy.
Now, for #pragma omp task firstprivate (a, b, c) if there are copy constructors
involved, the copy ctors for the firstprivate vars need to be run before
GOMP_task returns, and therefore we let those variables live in the heap
rather than on the stack; but if we know the task needs to execute immediately,
with the alloca we do pretty much the same thing as parallel does, all the
privatized variables are allocated on the stack (just all of them together
using alloca instead of individually by the compiler).

I'm fine with temporarily having some #ifdef HAVE_BROKEN_ALLOCA or similar,
but as nvptx, I think, doesn't support setjmp/longjmp or computed goto,
I think just supporting alloca by using malloc instead, and freeing at the
end of the function if any allocations happened in the function, is the right
thing.
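
I.e. as a stop-gap, something along these lines (HAVE_BROKEN_ALLOCA being a
hypothetical macro, sketch only):

  #ifdef HAVE_BROKEN_ALLOCA
    char *buf = gomp_malloc (arg_size + arg_align - 1);
  #else
    char buf[arg_size + arg_align - 1];
  #endif
    ...
  #ifdef HAVE_BROKEN_ALLOCA
    free (buf);
  #endif
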
> ---
>  libgomp/task.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/libgomp/task.c b/libgomp/task.c
> index 74920d5..ffb7ed2 100644
> --- a/libgomp/task.c
> +++ b/libgomp/task.c
> @@ -162,11 +162,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
>        thr->task = &task;
>        if (__builtin_expect (cpyfn != NULL, 0))
>  	{
> -	  char buf[arg_size + arg_align - 1];
> +	  long buf_size = arg_size + arg_align - 1;
> +	  char buf_fixed[2048], *buf = buf_fixed;
> +	  if (sizeof(buf_fixed) < buf_size)
> +	    buf = gomp_malloc (buf_size);
>  	  char *arg = (char *) (((uintptr_t) buf + arg_align - 1)
>  				& ~(uintptr_t) (arg_align - 1));
>  	  cpyfn (arg, data);
>  	  fn (arg);
> +	  if (buf != buf_fixed)
> +	    free (buf);
>  	}
>        else
>  	fn (data);

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 12/14] libgomp: fixup error.c on nvptx
  2015-10-20 18:34 ` [gomp4 12/14] libgomp: fixup error.c on nvptx Alexander Monakov
@ 2015-10-21 10:03   ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21 10:03 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:34PM +0300, Alexander Monakov wrote:
> NVPTX provides vprintf, but there's no stream separation: everything is
> printed as if into stdout.  This is the minimal change to get error.c working.
> 
> 	* error.c [__nvptx__]: Replace vfprintf, fputs, fputc with [v]printf.

I'd guess it would be cleaner to have this in config/nvptx/error.c.
Have there all the includes copied from error.c, then
#undef the 3, redefine them and finally #include the toplevel error.c.
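
I.e. something like this (sketch, untested):

  /* config/nvptx/error.c */
  #include <stdio.h>

  #undef vfprintf
  #undef fputs
  #undef fputc
  #define vfprintf(stream, fmt, list) vprintf (fmt, list)
  #define fputs(s, stream) printf ("%s", s)
  #define fputc(c, stream) printf ("%c", c)

  #include "../../error.c"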

> ---
>  libgomp/error.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/libgomp/error.c b/libgomp/error.c
> index 094c24a..009efdc 100644
> --- a/libgomp/error.c
> +++ b/libgomp/error.c
> @@ -35,6 +35,11 @@
>  #include <stdio.h>
>  #include <stdlib.h>
>  
> +#ifdef __nvptx__
> +#define vfprintf(stream, fmt, list) vprintf(fmt, list)
> +#define fputs(s, stream) printf("%s", s)
> +#define fputc(c, stream) printf("%c", c)
> +#endif
>  
>  #undef gomp_vdebug
>  void

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 13/14] libgomp: provide minimal GOMP_teams
  2015-10-20 18:52 ` [gomp4 13/14] libgomp: provide minimal GOMP_teams Alexander Monakov
@ 2015-10-21 10:12   ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21 10:12 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:35PM +0300, Alexander Monakov wrote:
> On NVPTX, we don't need most of target.c functionality, except for GOMP_teams.
> Provide it as a copy of the generic implementation for now (it most likely
> will need to change down the line: on NVPTX we do need to spawn several
> thread blocks for #pragma omp teams).
> 
> Alternatively, it might make sense to split GOMP_teams out of target.c into
> its own file (teams.c?), leaving target.c a 0-size stub in config/nvptx.
> 
> 	* config/nvptx/target.c: New.

This is fine.  I bet it will need changes on the compiler side too,
GOMP_teams has been written with a rough idea of dynamic parallelism,
we'll need to do something with private (per-CTA) vars, shared (global) vars
etc., dunno if with tweaks just in the late omp lowering pass after it
is lowered/expanded as OpenMP already, or if we'll need Martin's HSA
approach.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 14/14] libgomp: use more generic implementations on nvptx
  2015-10-20 18:34 ` [gomp4 14/14] libgomp: use more generic implementations on nvptx Alexander Monakov
@ 2015-10-21 10:17   ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21 10:17 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Oct 20, 2015 at 09:34:36PM +0300, Alexander Monakov wrote:
> This patch removes 0-size libgomp stubs where generic implementations can be
> compiled for the NVPTX target.
> 
> It also removes non-stub critical.c, which contains assembly implementations
> for GOMP_atomic_{start,end}, but does not contain implementations for
> GOMP_critical_*.  My understanding is that OpenACC offloading uses
> GOMP_atomic_* routines (by virtue of OpenMP lowering using them).  Linking in
> GOMP_critical_* and dependencies would be pointless for OpenACC.
> 
> If OpenACC indeed uses GOMP_atomic_*, then it makes sense to split them out
> into a separate file (atomic.c?).

Splitting atomic stuff into atomic.c and keeping critical in critical.c is
fine.

Note, when you actually start supporting multiple teams, likely
both GOMP_critical_* and GOMP_atomic_* will need to change for NVPTX
- the locks should be per-CTA variables in that case, rather than global
vars.  In OpenMP 4.5, 8, 16, 32 and 64-bit atomics (on aligned vars) are
required to be per-device, but that should be fine for PTX, which (I'd hope)
should expand all those inline using atomic instructions.  GOMP_atomic_*
is then used either for other types, where it should be private to the
contention group (CTA for PTX), or for reductions if HW atomics aren't
used (weird type or many different reductions); but for parallel/for/sections
reductions different CTAs shouldn't really fight for the same lock, each
should have its own.  Critical is a similar thing.
I bet for PTX we'll need to adjust the compiler side for named critical.

And for teams reduction, PTX will need a different approach.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-21  9:09     ` Alexander Monakov
  2015-10-21  9:24       ` Jakub Jelinek
@ 2015-10-21 10:42       ` Bernd Schmidt
  2015-10-21 14:06         ` Alexander Monakov
  1 sibling, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-21 10:42 UTC (permalink / raw)
  To: Alexander Monakov, Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

On 10/21/2015 11:07 AM, Alexander Monakov wrote:

> In PTX, stack storage is in .local address space -- and that memory is
> thread-private.  A thread can make a pointer to its own stack memory and
> successfully dereference it, but dereferencing that pointer from other threads
> does not work (I observed it returning garbage values).
>
> The reason for .local addresses being private like that, I think, is that
> references to .local memory undergo address translation to make simultaneous
> accesses to stack slots from threads in a warp form a coalesced memory
> transaction.  So .local memory that looks consecutive from an individual
> thread's point of view is actually strided in physical memory.

This sounds a little odd. You can convert a .local pointer to a generic 
one and dereference the latter. Do you think there is such 
behind-the-scenes magic going on for accesses through generic pointers?


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
                   ` (15 preceding siblings ...)
  2015-10-21  8:56 ` Jakub Jelinek
@ 2015-10-21 12:06 ` Bernd Schmidt
  2015-10-21 15:48   ` Alexander Monakov
  16 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-21 12:06 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> This patch series ports enough of libgomp.c to get warp-level parallelism
> working for OpenMP offloading.  The overall approach is as follows.

Could you elaborate a bit what you mean by this just so we understand 
each other in terms of terminology? "Warp-level" sounds to me like you 
have all threads in a warp executing in lockstep at all times. If 
individual threads can take different paths, I'd expect it to be called 
thread-level parallelism or something like that.

What is your end goal in terms of mapping GPU parallelism onto OpenMP?


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-21 10:42       ` Bernd Schmidt
@ 2015-10-21 14:06         ` Alexander Monakov
  0 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21 14:06 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jakub Jelinek, gcc-patches, Dmitry Melnik

On Wed, 21 Oct 2015, Bernd Schmidt wrote:

> On 10/21/2015 11:07 AM, Alexander Monakov wrote:
> 
> > In PTX, stack storage is in .local address space -- and that memory is
> > thread-private.  A thread can make a pointer to its own stack memory and
> > successfully dereference it, but dereferencing that pointer from other
> > threads
> > does not work (I observed it returning garbage values).
> >
> > The reason for .local addresses being private like that, I think, is that
> > references to .local memory undergo address translation to make simultaneous
> > accesses to stack slots from threads in a warp form a coalesced memory
> > transaction.  So .local memory that looks consecutive from an individual
> > thread's point of view is actually strided in physical memory.
> 
> This sounds a little odd. You can convert a .local pointer to a generic one
> and dereference the latter. Do you think there is such behind-the-scenes magic
> going on for accesses through generic pointers?

Yes.  It's fun: if you retrieve a generic pointer for a stack slot in
different threads, you get the same pointer.  If you dump cubin, you'll see
that local->generic conversion is a bitwise OR with a value in constant
memory, and generic->local conversion is a bitwise AND with immediate
0xffffff.
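
In C-like pseudocode (the base symbol is a placeholder for whatever the
constant-memory slot is actually called):

  generic = local | local_window_base;  /* base kept in constant memory */
  local   = generic & 0xffffff;         /* immediate mask */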

CUDA Programming Guide says on this matter: "Local memory is however organized
such that consecutive 32-bit words are accessed by consecutive thread IDs",
confirming presence of address scrambling.

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main
  2015-10-21  9:49   ` Jakub Jelinek
@ 2015-10-21 14:41     ` Alexander Monakov
  2015-10-21 15:02       ` Jakub Jelinek
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21 14:41 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> > -#if defined HAVE_TLS || defined USE_EMUTLS
> > +#if defined __nvptx__
> > +extern struct gomp_thread *nvptx_thrs;
> 
> What kind of address space is this variable?  It should be
> a per-CTA var, so that different teams have different, and
> simultaneous target regions have different too.

As written it's in global accelerator memory.  Indeed it's broken with
simultaneous target regions, and to unbreak that I'd like to place it in
shared memory (but that would require expanding address-space support
a bit more, exposing shared-memory space to C source code).

> I'm surprised that for team.c you chose to adjust the shared source,
> rather than copy and remove all the cruft you don't need/want.
> 
> That includes the LIBGOMP_USE_PTHREADS guarded parts, all the thread binding
> stuff etc.  I'd like to see at least for comparison how much actually
> remained in there.

Diffstat for the copy/remove patch is 66+/474-, almost all of removed 470
lines are in gomp_team_start, which counts only ~150 lines after removals.

> > @@ -88,7 +138,8 @@ gomp_thread_start (void *xdata)
> >    thr->task = data->task;
> >    thr->place = data->place;
> >  
> > -  thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;
> > +  if (thr->ts.team)
> > +    thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;
> 
> Why this?  Lots of other places in gomp_thread_start assume thr->ts.team is
> non-NULL.  And this isn't even guarded with __nvptx__; we don't want
> to slow down host thread start.

thr->ts.team is NULL when entering from gomp_nvptx_main, and should be set up
by the master thread from gomp_team_start while we are waiting on the first
threads_dock barrier in gomp_thread_start (currently I don't expect
data->nested to be true on nvptx).

Should I drop the "if" and instead simply guard the statement with #ifndef
__nvptx__?

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main
  2015-10-21 14:41     ` Alexander Monakov
@ 2015-10-21 15:02       ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-21 15:02 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Wed, Oct 21, 2015 at 05:40:24PM +0300, Alexander Monakov wrote:
> On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> > > -#if defined HAVE_TLS || defined USE_EMUTLS
> > > +#if defined __nvptx__
> > > +extern struct gomp_thread *nvptx_thrs;
> > 
> > What kind of address space is this variable?  It should be
> > a per-CTA var, so that different teams have different ones, and
> > simultaneous target regions have different ones too.
> 
> As written it's in global accelerator memory.  Indeed it's broken with
> simultaneous target regions, and to unbreak that I'd like to place it in
> shared memory (but that would require expanding address-space support
> a bit more, exposing shared-memory space to C source code).

Or declare the pointer in inline asm and read from it in inline asm
(and ditto for initialization)?
Perhaps at least short term.
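
E.g. (very rough sketch, untested; the .shared declaration and the variable
name are just illustrative):

  asm (".shared .u64 nvptx_thrs_shared;");

  static inline struct gomp_thread *gomp_thread (void)
  {
    struct gomp_thread *thrs;
    int tid;
    asm ("ld.shared.u64 %0, [nvptx_thrs_shared];" : "=r" (thrs));
    asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
    return thrs + tid;
  }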

> > I'm surprised that for team.c you chose to adjust the shared source,
> > rather than copy and remove all the cruft you don't need/want.
> > 
> > That includes the LIBGOMP_USE_PTHREADS guarded parts, all the thread binding
> > stuff etc.  I'd like to see at least for comparison how much actually
> > remained in there.
> 
Diffstat for the copy/remove patch is 66+/474-; almost all of the removed 470
lines are in gomp_team_start, which counts only ~150 lines after the removals.

I think I prefer config/nvptx/team.c including the toplevel team.c, where
you ifdef out all of gomp_thread_start and gomp_team_start in there,
and define it yourself in config/nvptx/team.c.  After removing non-PTX
related stuff, gomp_thread_start is like 65 lines including comments/blank
lines, and gomp_team_start is 85 lines.  Note I've also removed the nested
parallelism handling from there, you can't support that without dynamic
parallelism.  And you'll surely want to tweak it even more.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-21 12:06 ` Bernd Schmidt
@ 2015-10-21 15:48   ` Alexander Monakov
  2015-10-21 16:10     ` Bernd Schmidt
  2015-10-22  9:55     ` Jakub Jelinek
  0 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-21 15:48 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Wed, 21 Oct 2015, Bernd Schmidt wrote:

> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > This patch series ports enough of libgomp.c to get warp-level parallelism
> > working for OpenMP offloading.  The overall approach is as follows.
> 
> Could you elaborate a bit what you mean by this just so we understand each
> other in terms of terminology? "Warp-level" sounds to me like you have all
> threads in a warp executing in lockstep at all times. If individual threads
> can take different paths, I'd expect it to be called thread-level parallelism
> or something like that.

Sorry, that was unclear.  What I meant is that there is a degree of
parallelism available across different warps, but not across different teams
(because only 1 team is spawned), nor across threads in a warp (because all
threads in a warp except one exit immediately -- later on we'd need to
keep them converged so they can enter a simd region together).
 
> What is your end goal in terms of mapping GPU parallelism onto OpenMP?

OpenMP team is mapped to a CUDA thread block, OpenMP thread is mapped to a
warp, OpenMP simd lane is mapped to a CUDA thread.  So, follow the OpenACC
model.  Like in OpenACC, we'd need to artificially deactivate/reactivate warp
members on simd region boundaries.

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-21 15:48   ` Alexander Monakov
@ 2015-10-21 16:10     ` Bernd Schmidt
  2015-10-22  9:55     ` Jakub Jelinek
  1 sibling, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-21 16:10 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/21/2015 05:18 PM, Alexander Monakov wrote:
> On Wed, 21 Oct 2015, Bernd Schmidt wrote:
>
>> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
>>> This patch series ports enough of libgomp.c to get warp-level parallelism
>>> working for OpenMP offloading.  The overall approach is as follows.
>>
>> Could you elaborate a bit what you mean by this just so we understand each
>> other in terms of terminology? "Warp-level" sounds to me like you have all
>> threads in a warp executing in lockstep at all times. If individual threads
>> can take different paths, I'd expect it to be called thread-level parallelism
>> or something like that.
>
> Sorry, that was unclear.  What I meant is that there is a degree of
> parallelism available across different warps, but not across different teams
> (because only 1 team is spawned), nor across threads in a warp (because all
> threads in a warp except one exit immediately -- later on we'd need to
> keep them converged so they can enter a simd region together).

I see. That ensures that you wouldn't see any problems related to 
reconvergence yet. I think I would like to see something with actual 
thread parallelism to demonstrate, at least as a proof of concept, that
the synchronization code is roughly sane.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-21 15:48   ` Alexander Monakov
  2015-10-21 16:10     ` Bernd Schmidt
@ 2015-10-22  9:55     ` Jakub Jelinek
  2015-10-22 16:42       ` Alexander Monakov
  1 sibling, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-22  9:55 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Bernd Schmidt, gcc-patches, Dmitry Melnik

On Wed, Oct 21, 2015 at 06:18:25PM +0300, Alexander Monakov wrote:
> On Wed, 21 Oct 2015, Bernd Schmidt wrote:
> 
> > On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > > This patch series ports enough of libgomp.c to get warp-level parallelism
> > > working for OpenMP offloading.  The overall approach is as follows.
> > 
> > Could you elaborate a bit what you mean by this just so we understand each
> > other in terms of terminology? "Warp-level" sounds to me like you have all
> > threads in a warp executing in lockstep at all times. If individual threads
> > can take different paths, I'd expect it to be called thread-level parallelism
> > or something like that.
> 
> Sorry, that was unclear.  What I meant is that there is a degree of
> parallelism available across different warps, but not across different teams
> (because only 1 team is spawned), nor across threads in a warp (because all
> threads in a warp except one exit immediately -- later on we'd need to
> keep them converged so they can enter a simd region together).
>  
> > What is your end goal in terms of mapping GPU parallelism onto OpenMP?
> 
> OpenMP team is mapped to a CUDA thread block, OpenMP thread is mapped to a
> warp, OpenMP simd lane is mapped to a CUDA thread.  So, follow the OpenACC
> model.  Like in OpenACC, we'd need to artificially deactivate/reactivate warp
> members on simd region boundaries.

Does that apply also to threads within a warp?  I.e. is .local local to each
thread in the warp, or to the whole warp, and if the former, how can, say at
the start of a SIMD region or at its end, the local vars be broadcast to
other threads and collected back?  One thing is scalar vars, another
pointers, or references to various types, or even bigger indirection.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22  9:55     ` Jakub Jelinek
@ 2015-10-22 16:42       ` Alexander Monakov
  2015-10-22 17:16         ` Julian Brown
  2015-10-22 17:17         ` Bernd Schmidt
  0 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-22 16:42 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Bernd Schmidt, gcc-patches, Dmitry Melnik

On Thu, 22 Oct 2015, Jakub Jelinek wrote:
> Does that apply also to threads within a warp?  I.e. is .local local to each
> thread in the warp, or to the whole warp, and if the former, how can say at
> the start of a SIMD region or at its end the local vars be broadcast to
> other threads and collected back?  One thing is scalar vars, another
> pointers, or references to various types, or even bigger indirection.

.local is indeed local to each warp member, not the warp as a whole.  What
the OpenACC/PTX implementation does is to copy the whole stack frame, plus live
registers: the implementation is in nvptx.c:nvptx_propagate.
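
For a scalar live in a register, the propagation boils down to a warp
shuffle broadcasting lane 0's value, roughly (sketch):

  /* All lanes receive lane 0's copy of v.  */
  asm ("shfl.idx.b32 %0, %1, 0, 31;" : "=r" (v) : "r" (v));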

I see two possible alternative approaches for OpenMP/PTX.

The first approach is to try and follow the OpenACC scheme.  In OpenMP that
will be more complicated.  First, we won't have a single stack frame, so we'll
need to emit stack propagation at call sites.  Second, we'll have to ensure
that each libgomp function that can appear in the call chain from target region
entry to simd loop runs in "vector-neutered" mode, that is, threads 1-31 in
each warp follow branches that thread 0 executes.

The second approach is to run all threads in the warp all the time, making
sure they execute the same code with the same data, and thus build up the same
local state.  In this case we'd need to ensure this invariant: if threads in
the warp have the same state prior to executing an instruction, they also have
the same state after executing that instruction (plus global state changes as
if only one thread executed that instruction).

Most instructions are safe w.r.t. this invariant.  Atomics break it, so to
maintain the invariant for atomics we need to conditionally execute them in only
one thread, and then copy the register holding the result to other threads.
Apart from atomics, I see only two more hazards: calls and user asm.

For calls, I think the solution is to execute the call in all threads,
demanding that callees hold up the invariant.  To ensure that, we'd need to
recompile newlib and other libs in that mode.  Finally, a few callees are out
of our control since they are provided by the driver: malloc, free, vprintf.
Those we can treat like atomics.
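
For an atomic that produces a result, the treatment could look roughly like
this (sketch; ptr and val are placeholders):

  unsigned laneid, res = 0;
  asm ("mov.u32 %0, %%laneid;" : "=r" (laneid));
  if (laneid == 0)
    res = __atomic_fetch_add (ptr, val, __ATOMIC_RELAXED);
  /* Lanes reconverge here; broadcast lane 0's result to the warp.  */
  asm ("shfl.idx.b32 %0, %1, 0, 31;" : "=r" (res) : "r" (res));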

What do you think?  Does that sound correct?

Was something like this considered (and rejected?) for OpenACC?

Thanks.

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 16:42       ` Alexander Monakov
@ 2015-10-22 17:16         ` Julian Brown
  2015-10-22 18:19           ` Alexander Monakov
  2015-10-22 17:17         ` Bernd Schmidt
  1 sibling, 1 reply; 99+ messages in thread
From: Julian Brown @ 2015-10-22 17:16 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Jakub Jelinek, Bernd Schmidt, gcc-patches, Dmitry Melnik

On Thu, 22 Oct 2015 19:41:51 +0300
Alexander Monakov <amonakov@ispras.ru> wrote:

> On Thu, 22 Oct 2015, Jakub Jelinek wrote:
> > Does that apply also to threads within a warp?  I.e. is .local
> > local to each thread in the warp, or to the whole warp, and if the
> > former, how can, say at the start of a SIMD region or at its end, the
> > local vars be broadcast to other threads and collected back?  One
> > thing is scalar vars, another pointers, or references to various
> > types, or even bigger indirection.  
> 
> .local is indeed local to each warp member, not the warp as a whole.
> What the OpenACC/PTX implementation does is to copy the whole stack
> frame, plus live registers: the implementation is in
> nvptx.c:nvptx_propagate.
> 
> I see two possible alternative approaches for OpenMP/PTX.

> The second approach is to run all threads in the warp all the time,
> making sure they execute the same code with the same data, and thus
> build up the same local state.  In this case we'd need to ensure this
> invariant: if threads in the warp have the same state prior to
> executing an instruction, they also have the same state after
> executing that instruction (plus global state changes as if only one
> thread executed that instruction).
> 
> Most instructions are safe w.r.t. this invariant.

> Was something like this considered (and rejected?) for OpenACC?

I'm not sure we understood the "global state changes as if only one
thread executed that instruction" bit (do you have a citation?). But
anyway, even if that works for threads within a warp, it doesn't work
for warps within a CTA, so we'd still need some broadcast mechanism for
those.

Julian

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 16:42       ` Alexander Monakov
  2015-10-22 17:16         ` Julian Brown
@ 2015-10-22 17:17         ` Bernd Schmidt
  2015-10-22 18:10           ` Alexander Monakov
                             ` (3 more replies)
  1 sibling, 4 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-22 17:17 UTC (permalink / raw)
  To: Alexander Monakov, Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

I'm not really familiar with OpenMP and what it allows, so take all my 
comments with a grain of salt.

On 10/22/2015 06:41 PM, Alexander Monakov wrote:
> The second approach is to run all threads in the warp all the time, making
> sure they execute the same code with the same data, and thus build up the same
> local state.

But is that equivalent? If each thread takes the address of a variable 
on its own stack, that's not the same as taking an address once and 
broadcasting it.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 17:17         ` Bernd Schmidt
@ 2015-10-22 18:10           ` Alexander Monakov
  2015-10-22 18:27             ` Bernd Schmidt
  2015-10-23  8:23           ` Jakub Jelinek
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-22 18:10 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jakub Jelinek, gcc-patches, Dmitry Melnik

On Thu, 22 Oct 2015, Bernd Schmidt wrote:

> I'm not really familiar with OpenMP and what it allows, so take all my
> comments with a grain of salt.
> 
> On 10/22/2015 06:41 PM, Alexander Monakov wrote:
> > The second approach is to run all threads in the warp all the time, making
> > sure they execute the same code with the same data, and thus build up the
> > same
> > local state.
> 
> But is that equivalent? If each thread takes the address of a variable on its
> own stack, that's not the same as taking an address once and broadcasting it.

Taking the address yields the same pointer in all threads in PTX.  Even if it
didn't, broadcasting the pointer is pointless, as stacks are thread-private.

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 17:16         ` Julian Brown
@ 2015-10-22 18:19           ` Alexander Monakov
  0 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-22 18:19 UTC (permalink / raw)
  To: Julian Brown; +Cc: Jakub Jelinek, Bernd Schmidt, gcc-patches, Dmitry Melnik

On Thu, 22 Oct 2015, Julian Brown wrote:
> > The second approach is to run all threads in the warp all the time,
> > making sure they execute the same code with the same data, and thus
> > build up the same local state.  In this case we'd need to ensure this
> > invariant: if threads in the warp have the same state prior to
> > executing an instruction, they also have the same state after
> > executing that instruction (plus global state changes as if only one
> > thread executed that instruction).
> > 
> > Most instructions are safe w.r.t. this invariant.
> 
> > Was something like this considered (and rejected?) for OpenACC?
> 
> I'm not sure we understood the "global state changes as if only one
> thread executed that instruction" bit (do you have a citation?).

Not sure what kind of citation you want.  It's an invariant I need to
satisfy myself about.

Take a store to memory, for example.  I want to ensure that if all threads in
a warp store the same value to the same location, the effect on memory is the
same as if only one thread performed the store (and does not write garbage or
invoke undefined behavior).  PTX gives me that guarantee automatically:

  "If a non-atomic instruction executed by a warp writes to the same location in
  global or shared memory for more than one of the threads of the warp, the
  number of serialized writes that occur to that location and the order in which
  they occur is undefined, but one of the writes is guaranteed to succeed"

> But anyway, even if that works for threads within a warp, it doesn't work
> for warps within a CTA, so we'd still need some broadcast mechanism for
> those.

Yes.  In OpenMP that corresponds to #omp parallel/GOMP_parallel, which was
discussed in relation to the patch where I want to store omp_data_o in shared
memory.

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 18:10           ` Alexander Monakov
@ 2015-10-22 18:27             ` Bernd Schmidt
  2015-10-22 19:28               ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-22 18:27 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jakub Jelinek, gcc-patches, Dmitry Melnik

On 10/22/2015 08:08 PM, Alexander Monakov wrote:
> On Thu, 22 Oct 2015, Bernd Schmidt wrote:
>
>> I'm not really familiar with OpenMP and what it allows, so take all my
>> comments with a grain of salt.
>>
>> On 10/22/2015 06:41 PM, Alexander Monakov wrote:
>>> The second approach is to run all threads in the warp all the time, making
>>> sure they execute the same code with the same data, and thus build up the
>>> same
>>> local state.
>>
>> But is that equivalent? If each thread takes the address of a variable on its
>> own stack, that's not the same as taking an address once and broadcasting it.
>
> Taking the address yields the same pointer in all threads in PTX.  Even if it
> didn't, broadcasting the pointer is pointless, as stacks are thread-private.

It doesn't yield a pointer pointing to the same location in memory, 
which is the point. If you then have code operating on the location 
being pointed to, behaviour would be different from what you'd expect 
if you were on the host and broadcast the pointer.

The problem is that you may get user programs which have this 
behaviour and which may not be supportable. I think that is what Jakub 
was trying to say (correct me if I'm wrong).


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 18:27             ` Bernd Schmidt
@ 2015-10-22 19:28               ` Alexander Monakov
  0 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-22 19:28 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jakub Jelinek, gcc-patches, Dmitry Melnik



On Thu, 22 Oct 2015, Bernd Schmidt wrote:

> On 10/22/2015 08:08 PM, Alexander Monakov wrote:
> > On Thu, 22 Oct 2015, Bernd Schmidt wrote:
> >
> > > I'm not really familiar with OpenMP and what it allows, so take all my
> > > comments with a grain of salt.
> > >
> > > On 10/22/2015 06:41 PM, Alexander Monakov wrote:
> > > > The second approach is to run all threads in the warp all the time,
> > > > making
> > > > sure they execute the same code with the same data, and thus build up
> > > > the
> > > > same
> > > > local state.
> > >
> > > But is that equivalent? If each thread takes the address of a variable on
> > > its
> > > own stack, that's not the same as taking an address once and broadcasting
> > > it.
> >
> > Taking the address yields the same pointer in all threads in PTX.  Even if
> > it
> > didn't, broadcasting the pointer is pointless, as stacks are thread-private.
> 
> It doesn't yield a pointer pointing to the same location in memory, which is
> the point. If you then have code operating on the location being pointed to,
> behaviour would be different than what you'd expect if you were on the host
> and broadcasted the pointer.

The value in that location would be the same, by construction.  You'd only
diverge on an attempt to perform an atomic on that location.
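
For example (a sketch, not code from the patches), this is the point where
the warp would have to be narrowed back to a single thread:

  #pragma omp atomic
  v += 2;   /* if executed by all 32 threads, this adds 64, not 2;
               atomics are where "same state in, same state out" stops
               implying "global effect as if one thread" */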

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 17:17         ` Bernd Schmidt
  2015-10-22 18:10           ` Alexander Monakov
@ 2015-10-23  8:23           ` Jakub Jelinek
  2015-10-23  8:25           ` Jakub Jelinek
  2015-10-23 10:24           ` Jakub Jelinek
  3 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-23  8:23 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On Thu, Oct 22, 2015 at 07:16:49PM +0200, Bernd Schmidt wrote:
> I'm not really familiar with OpenMP and what it allows, so take all my
> comments with a grain of salt.
> 
> On 10/22/2015 06:41 PM, Alexander Monakov wrote:
> >The second approach is to run all threads in the warp all the time, making
> >sure they execute the same code with the same data, and thus build up the same
> >local state.
> 
> But is that equivalent? If each thread takes the address of a variable on
> its own stack, that's not the same as taking an address once and
> broadcasting it.

Does PTX allow function scope .shared variables (rather than just file
scope)?  If yes, then perhaps all the automatic vars that in theory could be
passed to other threads (i.e. addressable vars) could be then .shared and
the non-addressable ones .local.  In target constructs directly embedded
into host code you can know what variables are shared (which are shared
between teams, then .global, but that is primarily about mapped variables
which are heap allocated and firstprivate vars in target but not teams;
which are shared between threads, then .shared).  In separate functions it
is unknown whether they are called from within teams context (where the code
is run by just the first thread in the first warp), or from within parallel
context (where it is run by one or more warps and thus privatized vars need
to be .local or ideally warp-local), and it is also unclear what to do for
the SIMD broadcasts.

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 17:17         ` Bernd Schmidt
  2015-10-22 18:10           ` Alexander Monakov
  2015-10-23  8:23           ` Jakub Jelinek
@ 2015-10-23  8:25           ` Jakub Jelinek
  2015-10-23 10:24           ` Jakub Jelinek
  3 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-23  8:25 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On Thu, Oct 22, 2015 at 07:16:49PM +0200, Bernd Schmidt wrote:
> I'm not really familiar with OpenMP and what it allows, so take all my
> comments with a grain of salt.
> 
> On 10/22/2015 06:41 PM, Alexander Monakov wrote:
> >The second approach is to run all threads in the warp all the time, making
> >sure they execute the same code with the same data, and thus build up the same
> >local state.
> 
> But is that equivalent? If each thread takes the address of a variable on
> its own stack, that's not the same as taking an address once and
> broadcasting it.

BTW, does it consume more energy if all threads in the warp in lockstep
do the same thing, vs. just the first one doing something and all the others
neutered?  What about stores to global or shared memory if done in lockstep by
multiple threads in the warp?

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-22 17:17         ` Bernd Schmidt
                             ` (2 preceding siblings ...)
  2015-10-23  8:25           ` Jakub Jelinek
@ 2015-10-23 10:24           ` Jakub Jelinek
  2015-10-23 10:48             ` Bernd Schmidt
  2015-10-23 17:36             ` Alexander Monakov
  3 siblings, 2 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-23 10:24 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On Thu, Oct 22, 2015 at 07:16:49PM +0200, Bernd Schmidt wrote:
> I'm not really familiar with OpenMP and what it allows, so take all my
> comments with a grain of salt.

The OpenMP execution/data sharing model for the target regions
is very roughly that variables referenced in the various constructs
are either private (lots of different kinds like {,first,last}private,
linear, reduction) or shared in the specific construct.  The teams
construct splits work among the league of teams (CTAs in PTX case),
the parallel construct splits work among threads and simd construct
says that a loop can be performed using SIMD instructions (which, for PTX,
lockstep execution within a warp is).  The teams construct must appear right
below target, with no intervening code in between; therefore it generally
allows both dynamic parallelism and CTA/thread preallocation, except that in
some cases it is unknown at the time of the target construct how many CTAs or
what maximum number of threads you want.

So
#pragma omp declare target
int v;
void
foo ()
{
  // Please see the comments in main first.
  // This function shows that for functions containing orphaned OpenMP
  // constructs, but even for functions not containing any OpenMP
  // constructs, but just e.g. declaring variables where e.g. C++ reference
  // is initialized to them and that reference passed around to functions
  // containing orphaned OpenMP constructs, things are more complex,
  // as it is not possible to determine at compile time from which context
  // it might be called.  All that the compiler knows is that this routine
  // should be compiled for both the host and the offloading device(s).
  int u = 5, w = 6;
  // The u and w variables are here private to whatever construct
  // encountered the function.  The main below shows that it is called
  // both from within the teams region, in that case the code in the
  // function is executed by the 1st thread in each team, and u and w
  // variables are private to each team (i.e. ideally .shared).
  // Or it is executed from within the parallel region, in which case the
  // body is executed by each thread in each team (warp in CTA for PTX?).
  u++; w++;
  // Global variables in orphaned constructs are shared, so v is per-device.
  #pragma omp atomic
  v += 6;
  #pragma omp parallel num_threads (17) shared (u, v) firstprivate (w)
  {
    // As we won't be supporting nested parallelism, if foo is executed
    // from within parallel, this will not split the work further,
    // the body of the parallel will just run in the thread that encountered
    // the parallel, just privatized variables will get yet another private
    // copy in there.  When foo is executed from within teams, this will
    // split the work among up to 17 threads, u will be local to each
    // team (.shared), v will be global for device and w will be private
    // to each thread (warp).
    #pragma omp atomic
    u++;
    #pragma omp atomic
    v += 2;
    w++;
  }
}
#pragma omp end declare target
int
main ()
{
  int a = 4, b = 5, c = 6, d = 7;
  #pragma omp target map(tofrom: a, c) firstprivate (b, d)
  {
    // Nothing can really be executed here, so just the teams body could
    // be run immediately on the first thread of each team.
    // a and c are mapped vars, b and d are private to the target region.
    #pragma omp teams num_teams (6) thread_limit (33) shared(a, b) firstprivate (c, d)
    {
      // This region is executed by the 1st thread in each team (up to 6 teams).
      // a and b refer to the same variable in all teams, e.g. you can do
      // atomics on them (the requirement is that only 8/16/32/64-bit vars
      // can be used in atomics across the device).
      #pragma omp atomic
      a++;
      #pragma omp atomic
      b += 2;
      // c and d are private to each team (so ideally .shared vars or
      // global with each team having its own set).
      c++; d++;
      int e = 8, f = 9, g = 10;
      // Local variables declared in the construct are private to that
      // construct, so e and f are ideally .shared vars or global with
      // each team having its own set.
      // Similarly g is private, but if the compiler can find out it is
      // never accessed by parallel region's body, it could very well be
      // .local too.
      g++;
      #pragma omp parallel num_threads (24) shared (a, c, e) firstprivate (b, d, f)
      {
        // This region is executed by each of the threads (so can be
	// say the first thread in a warp, or maybe all threads in the warp
        // in a lockstep doing the same thing).
	// a is shared by all threads in all teams.
	// c and e are private to each team, but shared by all threads in
	// that team.
	// b, d, f are private to each thread.
	#pragma omp atomic
	a++;
	#pragma omp atomic
	c++;
	#pragma omp atomic
	d++;
	b++; d++; f++;
	int h = 11, i = 12;
	// h and i declared in the parallel construct are private to each
	// thread.
	h++; i++;
	#pragma omp simd private (h) safelen(32) simdlen(32)
	for (int j = 0; j < 64; j++)
	  {
	    // h and j are private to each SIMD lane (so for PTX
	    // supposedly to each thread in a warp), the stmts are executed
	    // in lockstep, all other vars referenced in the construct
	    // are shared.
	    h = i + j;
	    h++;
	  }
	#pragma omp parallel num_threads (5)
	{
	  // We are probably not going to support nested parallelism on
	  // PTX, so this parallel will just run the body as a single
	  // thread, the one that encountered the parallel (times the
	  // number of threads that encountered it times number of teams
	  // of course).
	}
	foo ();
      }
      foo ();
    }
    // Nothing can really be executed here.
  }
  return 0;
}

Thus, if .shared function local is allowed, we'd need to emit two copies of
foo, one which assumes it is run in the teams context and one which assumes
it is run in the parallel context.  If automatic vars can be only .local,
we are just in big trouble and I guess we really want to investigate what
others supporting PTX/Cuda are trying to do here.
I can certainly cook up testcases which will verify all the required
properties (and using atomics really makes sure the vars are indeed shared
rather than e.g. copied etc.).
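
E.g. something along these lines (just a sketch of the idea, not one of
the actual testcases):

#include <omp.h>

int
main ()
{
  int u = 0, n = 0;
  #pragma omp target map(tofrom: u, n)
  #pragma omp teams num_teams (1)
  #pragma omp parallel num_threads (8) shared (u, n)
  {
    #pragma omp atomic
    u++;
    #pragma omp master
    n = omp_get_num_threads ();
  }
  /* If u were silently privatized or copied instead of shared, the
     atomic increments would land in separate copies.  */
  if (u != n)
    __builtin_abort ();
  return 0;
}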

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-23 10:24           ` Jakub Jelinek
@ 2015-10-23 10:48             ` Bernd Schmidt
  2015-10-23 17:36             ` Alexander Monakov
  1 sibling, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-23 10:48 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On 10/23/2015 12:16 PM, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 07:16:49PM +0200, Bernd Schmidt wrote:
>> I'm not really familiar with OpenMP and what it allows, so take all my
>> comments with a grain of salt.
>

> So

[snip - really good example]
Thanks!

So what I was trying to describe as a problem would be something along 
the lines of

int
main ()
{
  int a = 4, b = 5, c = 6, d = 7;
  #pragma omp target map(tofrom: a, c) firstprivate (b, d)
  {
    #pragma omp teams num_teams (6) thread_limit (33) shared(a, b)
    {
      int e = 8, h = 11;
      #pragma omp parallel num_threads (24) shared (a, c, e)
      {
        int x[64], *xp = x;
        #pragma omp simd private (h) safelen(32) simdlen(32)
        for (int j = 0; j < 64; j++)
          {
            // if the assignment of xp was executed in lockstep by
            // everything, then each thread stores into its own local
            // array rather than the one owned by the controlling thread
            xp[j] = j;
          }
        #pragma omp parallel num_threads (5)
          ;
      }
    }
  }
  return 0;
}

> Thus, if .shared function local is allowed, we'd need to emit two copies of
> foo, one which assumes it is run in the teams context and one which assumes
> it is run in the parallel context.

Well, I suppose you could keep track of a second stack pointer manually, 
or disallow recursion and just use a static block. Then you can put your 
data anywhere you like. There isn't very much space in .shared though, 
but the normal ptx stack isn't large either.
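
A sketch of the "static block" variant (all names invented here; no
recursion, no freeing):

static char team_scratch[1024];   /* would be placed in .shared */
static unsigned scratch_top;

static void *
team_alloc (unsigned bytes)
{
  void *p = team_scratch + scratch_top;
  scratch_top += (bytes + 7) & ~7u;   /* keep 8-byte alignment */
  return p;
}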


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-23 10:24           ` Jakub Jelinek
  2015-10-23 10:48             ` Bernd Schmidt
@ 2015-10-23 17:36             ` Alexander Monakov
  1 sibling, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-23 17:36 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Bernd Schmidt, gcc-patches, Dmitry Melnik

On Fri, 23 Oct 2015, Jakub Jelinek wrote:
> Thus, if .shared function local is allowed, we'd need to emit two copies of
> foo, one which assumes it is run in the teams context and one which assumes
> it is run in the parallel context.  If automatic vars can be only .local,
> we are just in big trouble and I guess we really want to investigate what
> others supporting PTX/Cuda are trying to do here.

.shared is statically allocated.  There's an implementation of nvptx
offloading in Clang/LLVM at https://github.com/clang-omp ; they put data
that can be shared either in .shared or in global memory (user-configurable,
I think).  Not sure how they deal with the recursion or the uncertainty that
you describe with regard to the 'foo' function in your example.

Can you point me to other compilers implementing OpenMP offloading for PTX?

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-20 21:04     ` Alexander Monakov
@ 2015-10-28 16:56       ` Alexander Monakov
  2015-10-28 17:01         ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-28 16:56 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Tue, 20 Oct 2015, Alexander Monakov wrote:
> On Tue, 20 Oct 2015, Bernd Schmidt wrote:
> 
> > On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > > Due to special treatment of types, emitting variables of type _Bool in
> > > global scope is impossible: extern references are emitted with .u8, but
> > > definitions use .u64.  This patch fixes the issue by treating boolean type
> > > as
> > > integer types.
> > >
> > >  * config/nvptx/nvptx.c (init_output_initializer): Also accept
> > >          BOOLEAN_TYPE.
> > 
> > Interesting, what was the testcase? I didn't stumble over this one. In any
> > case, I think this patch is ok for trunk.
> 
> libgomp has 'bool gomp_cancel_var' in global scope, and since it is not
> compiled with -ffunction-sections, GOMP_parallel pulls in
> GOMP_cancel (same TU, parallel.c), which references the variable.  Anything
> with "#pragma omp parallel" would fail to link.

Hi Bernd,

There's another issue with handling of types in init_output_initializer.
Test libgomp.c++/examples-4/declare_target-2.C fails with "libgomp: Can't map
target variables (size mismatch)".

Variables varX and varY of struct/class type have size 4, but their definition
in PTX code has size 8 because init_output_initializer changes 'type' to
ptr_type_node and emits them in size-8 chunks.  Could you please explain why
the chunking is in a pointer type rather than bytewise?  It seems the current
scheme is problematic for all types with sizes non divisible by 8, except
those mapped to PTX types already.

With the trivial patch changing ptr_type_node to char_type_node (pasted below)
I get the desired behavior with no libgomp testsuite regressions.

While looking at relevant code I found what appears to be a bad assert in
nvptx.c:

  gcc_assert (size = decl_chunk_size);

and also above that:

  gcc_assert (vec_pos = XVECLEN (pat, 0));

Those should have been equality checks rather than assignments, correct?
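
i.e. presumably the intent was

  gcc_assert (size == decl_chunk_size);

and

  gcc_assert (vec_pos == XVECLEN (pat, 0));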

Apart from those, I found another assignment-in-assertion in GCC in tree-eh.c:

  gcc_assert (ri = (int)ri);

I'll send a separate mail about that shortly.

Thanks.
Alexander

--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1893,7 +1893,7 @@ init_output_initializer (FILE *file, const char *name, const_tree type,
        && TREE_CODE (type) != REAL_TYPE)
       || sz < 0
       || sz > HOST_BITS_PER_WIDE_INT)
-    type = ptr_type_node;
+    type = char_type_node;
   decl_chunk_size = int_size_in_bytes (type);
   decl_chunk_mode = int_mode_for_mode (TYPE_MODE (type));
   decl_offset = 0;

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 16:56       ` Alexander Monakov
@ 2015-10-28 17:01         ` Bernd Schmidt
  2015-10-28 17:38           ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-28 17:01 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/28/2015 05:54 PM, Alexander Monakov wrote:
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -1893,7 +1893,7 @@ init_output_initializer (FILE *file, const char *name, const_tree type,
>          && TREE_CODE (type) != REAL_TYPE)
>         || sz < 0
>         || sz > HOST_BITS_PER_WIDE_INT)
> -    type = ptr_type_node;
> +    type = char_type_node;
>     decl_chunk_size = int_size_in_bytes (type);
>     decl_chunk_mode = int_mode_for_mode (TYPE_MODE (type));
>     decl_offset = 0;

The idea here was that if you have a struct with a pointer field, and an 
initialization of it that uses a symbolic address, you'd be able to 
output the initializer. I don't quite see how that would still work 
after your patch.

You say "with no libgomp testsuite regressions", did you run any other 
tests?


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-21  9:29     ` Jakub Jelinek
@ 2015-10-28 17:22       ` Alexander Monakov
  2015-10-29  8:54         ` Jakub Jelinek
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-28 17:22 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik



On Wed, 21 Oct 2015, Jakub Jelinek wrote:

> On Wed, Oct 21, 2015 at 12:16:35PM +0300, Alexander Monakov wrote:
> > > Of course that doesn't help the thread-limit-2.c testcase.
> > 
> > Why not?
> 
> Because the compiler can be configured for multiple offloading devices,
> and PTX might not be the first device.  So, you'd need to have a tcl
> test whether PTX is enabled at all rather than whether it is the default
> device.

My previous response was a bit confused so let me correct myself.

Checking whether PTX is enabled as an offload target is relatively easy in
libgomp DejaGNU harness: just inspect $offload_targets_s.  This helps to XFAIL
the test that would fail at compile time, but such XFAIL'ing is, like you
said, undesirable because it drops the test for all offload targets.  I'd
rather provide a dummy 'usleep' under #ifdef __nvptx__.  WDYT?

On the other hand, checking whether PTX will be the default device when
running the compiled test seems non-trivial.

Here's how OpenACC seems to handle it: they loop over all configured offload
targets (plus "disable"), and set a single offload target explicitly with
-foffload=$offload_target_openacc.  Thus each test is compiled once or twice
(for host-only, and for nvptx if applicable).  This allows XFAILing OpenACC
tests on a per-offload-target basis.

I've updated my patches locally in response to reviews to the last series,
except where shared memory is involved.  Should I resend?  

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 17:01         ` Bernd Schmidt
@ 2015-10-28 17:38           ` Alexander Monakov
  2015-10-28 17:39             ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-28 17:38 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Wed, 28 Oct 2015, Bernd Schmidt wrote:

> On 10/28/2015 05:54 PM, Alexander Monakov wrote:
> > --- a/gcc/config/nvptx/nvptx.c
> > +++ b/gcc/config/nvptx/nvptx.c
> > @@ -1893,7 +1893,7 @@ init_output_initializer (FILE *file, const char *name,
> > const_tree type,
> >          && TREE_CODE (type) != REAL_TYPE)
> >       || sz < 0
> >       || sz > HOST_BITS_PER_WIDE_INT)
> > -    type = ptr_type_node;
> > +    type = char_type_node;
> >     decl_chunk_size = int_size_in_bytes (type);
> >     decl_chunk_mode = int_mode_for_mode (TYPE_MODE (type));
> >     decl_offset = 0;
> 
> The idea here was that if you have a struct with a pointer field, and an
> initialization of it that uses a symbolic address, you'd be able to output the
> initializer. I don't quite see how that would still work after your patch.
> 
> You say "with no libgomp testsuite regressions", did you run any other tests?

I didn't; a simple test exercising pointer fields appears to fail, indeed.  So
what's the way forward here?  Unless packed, a structure with pointer fields
will have size divisible by sizeof(void*), so can we simply pick the largest
PTX type evenly dividing the size of the original type?

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 17:38           ` Alexander Monakov
@ 2015-10-28 17:39             ` Bernd Schmidt
  2015-10-28 17:51               ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-28 17:39 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/28/2015 06:34 PM, Alexander Monakov wrote:
> I didn't; a simple test exercising pointer fields appears to fail, indeed.  So
> what's the way forward here?

I didn't quite get why you felt a need to change this? Your previous 
mail didn't seem to include a testcase, just that you get "the desired 
behaviour".

Running the full testsuite is required for each change; for 
nvptx-specific code it's probably sufficient to just run check-gcc C 
tests and maybe C++ tests if you're feeling adventurous.
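
E.g. something like (the board name is just an example of an nvptx run
setup, adjust as needed):

  make -k check-gcc RUNTESTFLAGS=--target_board=nvptx-none-run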

> Unless packed, a structure with pointer fields
> will have size divisible by sizeof(void*), so can we simply pick the largest
> PTX type evenly dividing the size of the original type?

Or just keep ptr_type_node? What's wrong with this?


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 17:39             ` Bernd Schmidt
@ 2015-10-28 17:51               ` Alexander Monakov
  2015-10-28 18:06                 ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-28 17:51 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Wed, 28 Oct 2015, Bernd Schmidt wrote:

> On 10/28/2015 06:34 PM, Alexander Monakov wrote:
> > I didn't; a simple test exercising pointer fields appears to fail, indeed.
> > So
> > what's the way forward here?
> 
> I didn't quite get why you felt a need to change this? Your previous mail
> didn't seem to include a testcase, just that you get "the desired behaviour".

Hm?  I said,

  There's another issue with handling of types in init_output_initializer.
  Test libgomp.c++/examples-4/declare_target-2.C fails with "libgomp: Can't map
  target variables (size mismatch)".

  Variables varX and varY of struct/class type have size 4, but their definition
  in PTX code has size 8 because init_output_initializer changes 'type' to
  ptr_type_node and emits them in size-8 chunks.


So I did mention the testcase, and the issue: structs with size 4 on host are
emitted with size 8 into PTX code.

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 17:51               ` Alexander Monakov
@ 2015-10-28 18:06                 ` Bernd Schmidt
  2015-10-28 18:07                   ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-28 18:06 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/28/2015 06:51 PM, Alexander Monakov wrote:
> On Wed, 28 Oct 2015, Bernd Schmidt wrote:
>> I didn't quite get why you felt a need to change this? Your previous mail
>> didn't seem to include a testcase, just that you get "the desired behaviour".
>
> Hm?  I said,
>
>    There's another issue with handling of types in init_output_initializer.
>    Test libgomp.c++/examples-4/declare_target-2.C fails with "libgomp: Can't map
>    target variables (size mismatch)".
>
>    Variables varX and varY of struct/class type have size 4, but their definition
>    in PTX code has size 8 because init_output_initializer changes 'type' to
>    ptr_type_node and emits them in size-8 chunks.

Oops, sorry, somehow that didn't show up. Mailer user interface trouble.

> So I did mention the testcase, and the issue: structs with size 4 on host are
> emitted with size 8 into PTX code.

Ok, so adjust the if condition for non-integral types - make it false if 
the size of the struct is smaller than the pointer type.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 18:06                 ` Bernd Schmidt
@ 2015-10-28 18:07                   ` Alexander Monakov
  2015-10-28 18:33                     ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-28 18:07 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Wed, 28 Oct 2015, Bernd Schmidt wrote:
> Ok, so adjust the if condition for non-integral types - make it false if the
> size of the struct is smaller than the pointer type.

I'm afraid it's an insufficient fix: it would remain broken for size-12
structs (containing 3 int fields, for example): they would be emitted with
size 16 instead.

(and as far as I see I can't make the condition false anyway, we still need
to pick some PTX type when emitting a struct)

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 18:07                   ` Alexander Monakov
@ 2015-10-28 18:33                     ` Bernd Schmidt
  2015-10-28 19:37                       ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-28 18:33 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/28/2015 07:06 PM, Alexander Monakov wrote:
> On Wed, 28 Oct 2015, Bernd Schmidt wrote:
>> Ok, so adjust the if condition for non-integral types - make it false if the
>> size of the struct is smaller than the pointer type.
>
> I'm afraid it's an insufficient fix: it would remain broken for size-12
> structs (containing 3 int fields, for example): they would be emitted with
> size 16 instead.

Maybe check TYPE_ or DECL_ALIGN as well then. But I think in general the 
problem cannot be avoided, let's say if you have a size-12 struct with a 
pointer field.

> (and as far as I see I can't make the condition false anyway, we still need
> to pick some PTX type when emitting a struct)

Well I thought you could rely on the int_size_in_bytes thing, but the 
size-12 struct is a counterexample.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 18:33                     ` Bernd Schmidt
@ 2015-10-28 19:37                       ` Alexander Monakov
  2015-10-29 11:13                         ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-28 19:37 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Wed, 28 Oct 2015, Bernd Schmidt wrote:

> On 10/28/2015 07:06 PM, Alexander Monakov wrote:
> > On Wed, 28 Oct 2015, Bernd Schmidt wrote:
> > > Ok, so adjust the if condition for non-integral types - make it false if
> > > the
> > > size of the struct is smaller than the pointer type.
> >
> > I'm afraid it's an insufficient fix: it would remain broken for size-12
> > structs (containing 3 int fields, for example): they would be emitted with
> > size 16 instead.
> 
> Maybe check TYPE_ or DECL_ALIGN as well then. But I think in general the
> problem cannot be avoided, let's say if you have a size-12 struct with a
> pointer field.

Only packed structs might have their size indivisible by pointer size while
containing a pointer field, so the problematic category is narrow.

Packed structs aside, checking alignment doesn't bring new information: size
will be divisible by alignment.

Anything wrong with the simple fix: pick an integer type with the largest size
dividing the original struct type size?
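
i.e. roughly (the selection logic only):

  /* Chunk size in bytes for emitting a struct of size SZ.  */
  int chunk = sz % 2 ? 1 : sz % 4 ? 2 : sz % 8 ? 4 : 8;

which keeps a size-12 struct in 4-byte chunks while still using 8-byte
chunks whenever the size allows.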

Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-28 17:22       ` Alexander Monakov
@ 2015-10-29  8:54         ` Jakub Jelinek
  2015-10-29 11:38           ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-10-29  8:54 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Wed, Oct 28, 2015 at 08:19:19PM +0300, Alexander Monakov wrote:
> 
> 
> On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> 
> > On Wed, Oct 21, 2015 at 12:16:35PM +0300, Alexander Monakov wrote:
> > > > Of course that doesn't help the thread-limit-2.c testcase.
> > > 
> > > Why not?
> > 
> > Because the compiler can be configured for multiple offloading devices,
> > and PTX might not be the first device.  So, you'd need to have a tcl
> > test whether PTX is enabled at all rather than whether it is the default
> > device.
> 
> My previous response was a bit confused so let me correct myself.
> 
> Checking whether PTX is enabled as an offload target is relatively easy in
> libgomp DejaGNU harness: just inspect $offload_targets_s.  This helps to XFAIL
> the test that would fail at compile time, but such XFAIL'ing is, like you
> said, undesirable because it drops the test for all offload targets.  I'd
> rather provide a dummy 'usleep' under #ifdef __nvptx__.  WDYT?

Such ifdefs aren't really easily possible in OpenMP right now: the
preprocessing is done with the host compiler only, so you'd need to arrange
for usleep to be defined only in the PTX path and nowhere else.
> 
> On the other hand, checking whether PTX will be the default device when
> running the compiled test seems non-trivial.

The OpenMP standard indeed does not have a function which would return you
an enum saying what offloading device it is run on; it would need to be an
extension (so a header different from omp.h and perhaps a gomp_*-prefixed
function), or we could just use some OpenACC function for that?
What is easily possible to distinguish is offloading vs. host fallback
(omp_is_initial_device ()), and whether it has a shared address space or a
separate address space (so HSA vs. PTX + XeonPhi + XeonPhi emul), and
several tests already check those two properties.
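
For example, a test can already do (sketch):

  int on_device = 0;
  #pragma omp target map(from: on_device)
  on_device = !omp_is_initial_device ();

and learn whether it was offloaded at all, just not which offloading
target it ended up on.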

> I've updated my patches locally in response to reviews to the last series,
> except where shared memory is involved.  Should I resend?

Are there any dependencies in your patch series against stuff still in
gomp-4_0-branch rather than already on the trunk?

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-28 19:37                       ` Alexander Monakov
@ 2015-10-29 11:13                         ` Bernd Schmidt
  2015-10-30 13:27                           ` Alexander Monakov
  0 siblings, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-29 11:13 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On 10/28/2015 08:29 PM, Alexander Monakov wrote:

> Anything wrong with the simple fix: pick an integer type with the largest size
> dividing the original struct type size?

Try it and run it through the testsuite.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 00/14] NVPTX: further porting
  2015-10-29  8:54         ` Jakub Jelinek
@ 2015-10-29 11:38           ` Alexander Monakov
  0 siblings, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-10-29 11:38 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

On Thu, 29 Oct 2015, Jakub Jelinek wrote:
> > rather provide a dummy 'usleep' under #ifdef __nvptx__.  WDYT?
> 
> Such ifdefs aren't really easily possible in OpenMP right now, the
> preprocessing is done with the host compiler only, you'd need to arrange for
> usleep being defined only in the PTX path and nowhere else.

Right, I should have remembered that.  Providing it as a weak symbol doesn't
work either.  Providing it in a separate static library should work I think
(if the library goes after -lc in host linking, the linker should pick up the
dynamic symbol from libc).
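
i.e. roughly this in the extra library (a sketch; POSIX-style prototype):

  /* Dummy usleep for nvptx: there is nothing to sleep on, so just
     claim success.  */
  int
  usleep (unsigned usec __attribute__ ((unused)))
  {
    return 0;
  }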

What do you think about building tests for each offload target separately,
like OpenACC does?

> > I've updated my patches locally in response to reviews to the last series,
> > except where shared memory is involved.  Should I resend?
> 
> Are there any dependencies in your patch series against stuff still in
> gomp-4_0-branch rather than already on the trunk?

From reviewing the diff I've found one dependency: code to place 'omp_data_o'
in shared memory needs pass_late_lower_omp.

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-29 11:13                         ` Bernd Schmidt
@ 2015-10-30 13:27                           ` Alexander Monakov
  2015-10-30 13:38                             ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-30 13:27 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik



On Thu, 29 Oct 2015, Bernd Schmidt wrote:

> On 10/28/2015 08:29 PM, Alexander Monakov wrote:
> 
> > Anything wrong with the simple fix: pick an integer type with the largest
> > size
> > dividing the original struct type size?
> 
> Try it and run it through the testsuite.

The following patch passes testing with

make -k check-c DEJAGNU=.../dejagnu.exp RUNTESTFLAGS=--target_board=nvptx-none-run

with no new regressions, and fixes 1 test: 
-FAIL: gcc.dg/compat/struct-align-1 c_compat_x_tst.o-c_compat_y_tst.o execute

OK?

Thanks.
Alexander

nvptx: fix chunk size selection for structure types

	* config/nvptx/nvptx.c (nvptx_ptx_type_for_output): New.  Handle
	COMPLEX_TYPE like ARRAY_TYPE.  Drop special handling of scalar types.
	Fix handling of structure types by choosing integer type that divides
	original size evenly.  Split out from and use it...
	(init_output_initializer): ...here.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b541666..3a0cac2 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1692,6 +1692,29 @@ nvptx_assemble_decl_end (void)
   fprintf (asm_out_file, ";\n");
 }
 
+/* Return a type suitable to output initializers for TYPE.  */
+static const_tree
+nvptx_ptx_type_for_output (const_tree type)
+{
+  /* Avoid picking a larger type than the underlying type.  */
+  if (TREE_CODE (type) == ARRAY_TYPE
+      || TREE_CODE (type) == COMPLEX_TYPE)
+    type = TREE_TYPE (type);
+  int sz = int_size_in_bytes (type);
+  if (sz < 0)
+    return char_type_node;
+  /* Size of the output type must divide that of original type.  Initializers
+     with pointers to objects need a pointer-sized type.  These requirements
+     may be contradictory for packed structs, but giving priority to the first
+     at least allows outputting some initializers correctly.  Here we pick the
+     largest suitable integer type without deeper inspection.  */
+  return (sz % 8 || !TARGET_ABI64
+         ? (sz % 4
+            ? (sz % 2 ? char_type_node : short_integer_type_node)
+            : integer_type_node)
+         : long_integer_type_node);
+}
+
 /* Start a declaration of a variable of TYPE with NAME to
    FILE.  IS_PUBLIC says whether this will be externally visible.
    Here we just write the linker hint and decide on the chunk size
@@ -1705,15 +1728,7 @@ init_output_initializer (FILE *file, const char *name, const_tree type,
   assemble_name_raw (file, name);
   fputc ('\n', file);
 
-  if (TREE_CODE (type) == ARRAY_TYPE)
-    type = TREE_TYPE (type);
-  int sz = int_size_in_bytes (type);
-  if ((TREE_CODE (type) != INTEGER_TYPE
-       && TREE_CODE (type) != ENUMERAL_TYPE
-       && TREE_CODE (type) != REAL_TYPE)
-      || sz < 0
-      || sz > HOST_BITS_PER_WIDE_INT)
-    type = ptr_type_node;
+  type = nvptx_ptx_type_for_output (type);
   decl_chunk_size = int_size_in_bytes (type);
   decl_chunk_mode = int_mode_for_mode (TYPE_MODE (type));
   decl_offset = 0;

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 04/14] nvptx: fix output of _Bool global variables
  2015-10-30 13:27                           ` Alexander Monakov
@ 2015-10-30 13:38                             ` Bernd Schmidt
  0 siblings, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-10-30 13:38 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

> The following patch passes testing with
>
> make -k check-c DEJAGNU=.../dejagnu.exp RUNTESTFLAGS=--target_board=nvptx-none-run
>
> with no new regressions, and fixes 1 test:
> -FAIL: gcc.dg/compat/struct-align-1 c_compat_x_tst.o-c_compat_y_tst.o execute

Ok. Thanks!


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entypoints
  2015-10-21  8:20   ` Jakub Jelinek
@ 2015-10-30 16:58     ` Alexander Monakov
  2015-11-06 14:05       ` Bernd Schmidt
  0 siblings, 1 reply; 99+ messages in thread
From: Alexander Monakov @ 2015-10-30 16:58 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> > At the moment the attribute setting logic in omp-low.c is such that if a
> > function that should be present in target code does not already have 'omp
> > declare target' attribute, it receives 'omp target entrypoint'.  That is
> > wasteful: clearly not all user-declared target functions will be target region
> > entry points in OpenMP.
> > 
> > The motivating example for this change is OpenMP parallel target regions.  The
> > 'parallel' part is outlined into its own function.  We don't want that
> > function be an 'entrypoint' on PTX (but only as a matter of optimality rather
> > than correctness).
> > 
> > 	* omp-low.c (create_omp_child_function): Set "omp target entrypoint"
> >         or "omp declare target" attribute based on is_gimple_omp_offloaded.
> 
> This is principally ok, but you want to change it for 01/14.
> After that I think it is ready for trunk.

As you suggested in patch 01/14, I'll want to use different attributes to
distinguish between OpenACC and OpenMP entrypoints.  I've tried your
suggestion "acc target entrypoint", and unfortunately it doesn't work because
ipa-icf does: 

sem_function::parse (cgraph_node *node, bitmap_obstack *stack)
{
  ...
  if (lookup_attribute_by_prefix ("omp ", DECL_ATTRIBUTES (node->decl)) != NULL)
    return NULL;
  ...
}

so changing the prefix affects code generation.  I don't see why IPA-ICF needs
that (it's there from when the file was added to trunk, and there's no comment).
I went with "omp acc target entrypoint" instead, as below.

OK? (for trunk?)

Thanks.
Alexander

omp-low: set more precise attributes on offloaded functions

At the moment the attribute setting logic in omp-low.c is such that if a
function that should be present in target code does not already have 'omp
declare target' attribute, it receives 'omp target entrypoint'.  However, that
is inaccurate for OpenMP: functions outlined for e.g. omp-parallel regions do
not have 'omp declare target' attribute, but they cannot be entrypoints either.

Detect entrypoints using 'is_gimple_omp_offloaded'.  Additionally, distinguish
between OpenACC and OpenMP entrypoints, and use 'omp acc target entrypoint'
for the former.

        * omp-low.c (create_omp_child_function): Set "omp target entrypoint",
        "omp acc target entrypoint" or "omp declare target" attribute based on
        is_gimple_omp_offloaded and is_gimple_omp_oacc.
        * config/nvptx/nvptx.c (write_as_kernel): Test OpenACC-specific
        attribute "omp acc target entrypoint".  Add a comment about the OpenMP
        attribute handling.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 21c59ef..ac021fc 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -402,7 +402,10 @@ static bool
 write_as_kernel (tree attrs)
 {
   return (lookup_attribute ("kernel", attrs) != NULL_TREE
-         || lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE);
+         || lookup_attribute ("omp acc target entrypoint", attrs) != NULL_TREE);
+  /* Ignore "omp target entrypoint" here: OpenMP target region functions are
+     called from gomp_nvptx_main.  The corresponding kernel entry is emitted
+     from write_omp_entry.  */
 }
 
 /* Write a function decl for DECL to S, where NAME is the name to be used.
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 06b4a5e..696889d 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -2245,9 +2245,16 @@ create_omp_child_function (omp_context *ctx, bool task_copy)
   if (cgraph_node::get_create (decl)->offloadable
       && !lookup_attribute ("omp declare target",
                            DECL_ATTRIBUTES (current_function_decl)))
-    DECL_ATTRIBUTES (decl)
-      = tree_cons (get_identifier ("omp target entrypoint"),
-                   NULL_TREE, DECL_ATTRIBUTES (decl));
+    {
+      const char *target_attr = (is_gimple_omp_offloaded (ctx->stmt)
+                                ? (is_gimple_omp_oacc (ctx->stmt)
+                                   ? "omp acc target entrypoint"
+                                   : "omp target entrypoint")
+                                : "omp declare target");
+      DECL_ATTRIBUTES (decl)
+       = tree_cons (get_identifier (target_attr),
+                    NULL_TREE, DECL_ATTRIBUTES (decl));
+    }
 
   t = build_decl (DECL_SOURCE_LOCATION (decl),
                  RESULT_DECL, NULL_TREE, void_type_node);

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-10-20 18:34 ` [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX Alexander Monakov
  2015-10-21  0:07   ` Bernd Schmidt
  2015-10-21  8:48   ` Jakub Jelinek
@ 2015-11-03 14:25   ` Alexander Monakov
  2015-11-06 14:00     ` Bernd Schmidt
  2015-11-10 10:39     ` Jakub Jelinek
  2 siblings, 2 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-11-03 14:25 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

Hello,

Here's an alternative patch that does not depend on exposure of shared-memory
address space, and does not try to use pass_late_lower_omp.  It's based on
Bernd's suggestion to transform

  (use .omp_data_o)
  GOMP_parallel (fn, &omp_data_o, ...);
  .omp_data_o = {CLOBBER};

to

  .omp_data_o_ptr = __internal_omp_alloc_shared (&.omp_data_o, sizeof ...);
  (use (*.omp_data_o_ptr) instead of .omp_data_o)
  GOMP_parallel (fn, .omp_data_o_ptr, ...);
  __internal_omp_free_shared (.omp_data_o_ptr);
  .omp_data_o = {CLOBBER};

Every target except nvptx can lower free_shared to nothing and alloc_shared to
just returning the first argument, and nvptx can select storage in shared
memory or global memory.  For now it simply uses malloc/free.
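
In other words, the non-nvptx lowering is morally just (a sketch with
invented names; the internal functions expand inline to this):

  void *
  gomp_alloc_shared (void *stack_copy, unsigned long size)
  {
    (void) size;
    return stack_copy;   /* keep using the original stack object */
  }

  void
  gomp_free_shared (void *ptr)
  {
    (void) ptr;          /* nothing was allocated */
  }

so the transformed IL collapses back to passing &.omp_data_o directly.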

Sanity-checked by running the libgomp testsuite.  I realize the #ifdef in
internal-fn.c is not appropriate: it's there to make the patch smaller, I'll
replace it with a target hook if otherwise this approach is ok.

Thanks.
Alexander

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index bf0f23e..3145a8d 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -175,6 +175,38 @@ expand_GOMP_SIMD_LAST_LANE (gcall *)
   gcc_unreachable ();
 }
 
+static void
+expand_GOMP_ALLOC_SHARED (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+
+  /* XXX PoC only, needs to be a target hook.  */
+#ifdef GCC_NVPTX_H
+  tree fndecl = builtin_decl_explicit (BUILT_IN_MALLOC);
+  tree t = build_call_expr (fndecl, 1, gimple_call_arg (stmt, 1));
+
+  expand_call (t, target, 0);
+#else
+  tree rhs = gimple_call_arg (stmt, 0);
+
+  rtx src = expand_normal (rhs);
+
+  emit_move_insn (target, src);
+#endif
+}
+
+static void
+expand_GOMP_FREE_SHARED (gcall *stmt)
+{
+#ifdef GCC_NVPTX_H
+  tree fndecl = builtin_decl_explicit (BUILT_IN_FREE);
+  tree t = build_call_expr (fndecl, 1, gimple_call_arg (stmt, 0));
+
+  expand_call (t, NULL_RTX, 1);
+#endif
+}
+
 /* This should get expanded in the sanopt pass.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 0db03f1..0c8e76a 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -44,6 +44,8 @@ DEF_INTERNAL_FN (STORE_LANES, ECF_CONST | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_ALLOC_SHARED, ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_FREE_SHARED, ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (LOOP_VECTORIZED, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (MASK_LOAD, ECF_PURE | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (MASK_STORE, ECF_LEAF, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 696889d..225bf20 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -5870,7 +5870,8 @@ expand_omp_taskreg (struct omp_region *region)
         a function call that has been inlined, the original PARM_DECL
         .OMP_DATA_I may have been converted into a different local
         variable.  In which case, we need to keep the assignment.  */
-      if (gimple_omp_taskreg_data_arg (entry_stmt))
+      tree data_arg = gimple_omp_taskreg_data_arg (entry_stmt);
+      if (data_arg)
        {
          basic_block entry_succ_bb
            = single_succ_p (entry_bb) ? single_succ (entry_bb)
@@ -5894,9 +5895,10 @@ expand_omp_taskreg (struct omp_region *region)
                  /* We're ignore the subcode because we're
                     effectively doing a STRIP_NOPS.  */
 
-                 if (TREE_CODE (arg) == ADDR_EXPR
-                     && TREE_OPERAND (arg, 0)
-                       == gimple_omp_taskreg_data_arg (entry_stmt))
+                 if ((TREE_CODE (arg) == ADDR_EXPR
+                      && TREE_OPERAND (arg, 0) == data_arg)
+                     || (TREE_CODE (data_arg) == INDIRECT_REF
+                         && TREE_OPERAND (data_arg, 0) == arg))
                    {
                      parcopy_stmt = stmt;
                      break;
@@ -11835,27 +11837,44 @@ lower_omp_taskreg (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   record_vars_into (ctx->block_vars, child_fn);
   record_vars_into (gimple_bind_vars (par_bind), child_fn);
 
+  ilist = NULL;
+  tree sender_decl = NULL_TREE;
+
   if (ctx->record_type)
     {
-      ctx->sender_decl
+      sender_decl
        = create_tmp_var (ctx->srecord_type ? ctx->srecord_type
                          : ctx->record_type, ".omp_data_o");
-      DECL_NAMELESS (ctx->sender_decl) = 1;
-      TREE_ADDRESSABLE (ctx->sender_decl) = 1;
+      DECL_NAMELESS (sender_decl) = 1;
+      TREE_ADDRESSABLE (sender_decl) = 1;
+
+      /* Instead of using the automatic variable .omp_data_o directly, build
+         .omp_data_o_ptr = GOMP_ALLOC_SHARED (&.omp_data_o, sizeof .omp_data_o)
+         ... and replace SENDER_DECL with indirect ref *.omp_data_o_ptr.  */
+      tree ae = build_fold_addr_expr (sender_decl);
+      tree sz = TYPE_SIZE_UNIT (TREE_TYPE (sender_decl));
+      gimple g = gimple_build_call_internal (IFN_GOMP_ALLOC_SHARED, 2, ae, sz);
+      gimple_seq_add_stmt (&ilist, g);
+      tree result = create_tmp_var (TREE_TYPE (ae), ".omp_data_o_ptr");
+      gimple_call_set_lhs (g, result);
+      ctx->sender_decl = build_fold_indirect_ref (result);
       gimple_omp_taskreg_set_data_arg (stmt, ctx->sender_decl);
     }
 
   olist = NULL;
-  ilist = NULL;
   lower_send_clauses (clauses, &ilist, &olist, ctx);
   lower_send_shared_vars (&ilist, &olist, ctx);
 
   if (ctx->record_type)
     {
-      tree clobber = build_constructor (TREE_TYPE (ctx->sender_decl), NULL);
+      /* GOMP_FREE_SHARED (.omp_data_o_ptr).  */
+      tree ae = build_fold_addr_expr (ctx->sender_decl);
+      gimple g = gimple_build_call_internal (IFN_GOMP_FREE_SHARED, 1, ae);
+      gimple_seq_add_stmt (&olist, g);
+      /* Clobber the original stack variable.  */
+      tree clobber = build_constructor (TREE_TYPE (sender_decl), NULL);
       TREE_THIS_VOLATILE (clobber) = 1;
-      gimple_seq_add_stmt (&olist, gimple_build_assign (ctx->sender_decl,
-                                                       clobber));
+      gimple_seq_add_stmt (&olist, gimple_build_assign (sender_decl, clobber));
     }
 
   /* Once all the expansions are done, sequence all the different

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-11-03 14:25   ` Alexander Monakov
@ 2015-11-06 14:00     ` Bernd Schmidt
  2015-11-06 14:06       ` Jakub Jelinek
  2015-11-10 10:39     ` Jakub Jelinek
  1 sibling, 1 reply; 99+ messages in thread
From: Bernd Schmidt @ 2015-11-06 14:00 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

> Sanity-checked by running the libgomp testsuite.  I realize the #ifdef in
> internal-fn.c is not appropriate: it's there to make the patch smaller, I'll
> replace it with a target hook if otherwise this approach is ok.

FWIW, no objections from me regarding the approach.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entypoints
  2015-10-30 16:58     ` Alexander Monakov
@ 2015-11-06 14:05       ` Bernd Schmidt
  2015-11-06 14:08         ` Jakub Jelinek
  2015-11-06 17:16         ` Alexander Monakov
  0 siblings, 2 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-11-06 14:05 UTC (permalink / raw)
  To: Alexander Monakov, Jakub Jelinek; +Cc: gcc-patches, Dmitry Melnik

On 10/30/2015 05:44 PM, Alexander Monakov wrote:
> +  /* Ignore "omp target entrypoint" here: OpenMP target region functions are
> +     called from gomp_nvptx_main.  The corresponding kernel entry is emitted
> +     from write_omp_entry.  */
>   }

I'm probably confused, but didn't we agree that this should be changed 
so that the entry point isn't gomp_nvptx_main but instead something that 
wraps a call to that function?

This patch creates a new "omp target entrypoint" annotation that appears 
not to be used - it would be better to just not annotate a function if 
it's not going to need entrypoint treatment. IMO a single type of 
attribute should be sufficient for that.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-11-06 14:00     ` Bernd Schmidt
@ 2015-11-06 14:06       ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-11-06 14:06 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On Fri, Nov 06, 2015 at 03:00:30PM +0100, Bernd Schmidt wrote:
> >Sanity-checked by running the libgomp testsuite.  I realize the #ifdef in
> >internal-fn.c is not appropriate: it's there to make the patch smaller, I'll
> >replace it with a target hook if otherwise this approach is ok.
> 
> FWIW, no objections from me regarding the approach.

As I said on IRC, I fear it is not a general solution, and will try to write
a testcase that demonstrates that soon.
That said, as a temporary partial workaround it might be acceptable, but
1) there really should be a target hook
2) it should never be emitted when not in target regions (declare target
   functions or when inside of OpenMP target region)
3) it should be folded away as soon as possible for the non-PTX
   targets (both host and say XeonPhi), so in the openacc pass shortly post 
   IPA (or look at whether something that will come with the HSA merge could help there)

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entypoints
  2015-11-06 14:05       ` Bernd Schmidt
@ 2015-11-06 14:08         ` Jakub Jelinek
  2015-11-06 14:12           ` Bernd Schmidt
  2015-11-06 17:16         ` Alexander Monakov
  1 sibling, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-11-06 14:08 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On Fri, Nov 06, 2015 at 03:05:05PM +0100, Bernd Schmidt wrote:
> This patch creates a new "omp target entrypoint" annotation that appears not
> to be used - it would be better to just not annotate a function if it's not
> going to need entrypoint treatment. IMO a single type of attribute should be
> sufficient for that.

But NVPTX is just one backend, perhaps other backends need different
treatment of the entry points?

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entypoints
  2015-11-06 14:08         ` Jakub Jelinek
@ 2015-11-06 14:12           ` Bernd Schmidt
  0 siblings, 0 replies; 99+ messages in thread
From: Bernd Schmidt @ 2015-11-06 14:12 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Alexander Monakov, gcc-patches, Dmitry Melnik

On 11/06/2015 03:08 PM, Jakub Jelinek wrote:
> On Fri, Nov 06, 2015 at 03:05:05PM +0100, Bernd Schmidt wrote:
>> This patch creates a new "omp target entrypoint" annotation that appears not
>> to be used - it would be better to just not annotate a function if it's not
>> going to need entrypoint treatment. IMO a single type of attribute should be
>> sufficient for that.
>
> But NVPTX is just one backend, perhaps other backends need different
> treatment of the entry points?

If we don't know, then it's not a problem we have to solve now. We can 
change it at any time later. For now, let's just keep it simple - no 
need to invent special annotations that end up unused.


Bernd

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints
  2015-11-06 14:05       ` Bernd Schmidt
  2015-11-06 14:08         ` Jakub Jelinek
@ 2015-11-06 17:16         ` Alexander Monakov
  1 sibling, 0 replies; 99+ messages in thread
From: Alexander Monakov @ 2015-11-06 17:16 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jakub Jelinek, gcc-patches, Dmitry Melnik

On Fri, 6 Nov 2015, Bernd Schmidt wrote:
> On 10/30/2015 05:44 PM, Alexander Monakov wrote:
> > +  /* Ignore "omp target entrypoint" here: OpenMP target region functions
> > are
> > +     called from gomp_nvptx_main.  The corresponding kernel entry is
> > emitted
> > +     from write_omp_entry.  */
> >   }
> 
> I'm probably confused, but didn't we agree that this should be changed so that
> the entry point isn't gomp_nvptx_main but instead something that wraps a call
> to that function?

Yes, we did agree to that, and I've implemented it locally.  I haven't
resent the patch series yet, but for clarity I'm pasting the corresponding
patch below.  As you'll see, there's no contradiction because...

> This patch creates a new "omp target entrypoint" annotation that appears not
> to be used - it would be better to just not annotate a function if it's not
> going to need entrypoint treatment. IMO a single type of attribute should be
> sufficient for that.

... I need to examine "omp target entrypoint" in a different place -- not in
write_as_kernel -- to invoke write_omp_entry (new function).

To clarify, when a function 'main$_omp_fn$0' has "omp target entrypoint", the
following patch renames it to 'main$_omp_fn$0$impl' and emits an entry-point
wrapper with the original name that invokes the original -- now renamed --
function via gomp_nvptx_main.

Alexander

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 15efc26..b17e5a9 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -596,6 +596,45 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }

+/* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
+
+   extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
+   void __attribute__((kernel)) NAME(void *arg)
+   {
+     gomp_nvptx_main (ORIG, arg);
+   }
+   ORIG itself should not be emitted as a PTX .entry function.  */
+
+static void
+write_omp_entry (std::stringstream &s, const char *name, const char *orig)
+{
+  /* Pointer-sized PTX integer type, .u32 or .u64 depending on target ABI.  */
+  const char *sfx = nvptx_ptx_type_from_mode (Pmode, false);
+
+  /* OpenMP target regions are entered via gomp_nvptx_main.  */
+  static bool gomp_nvptx_main_declared;
+  if (!gomp_nvptx_main_declared)
+    {
+      gomp_nvptx_main_declared = true;
+      s << "// BEGIN GLOBAL FUNCTION DECL: gomp_nvptx_main\n";
+      s << ".extern .func gomp_nvptx_main";
+      s << "(.param" << sfx << " %in_ar1, .param" << sfx << " %in_ar2);\n";
+    }
+  s << ".visible .entry " << name << "(.param" << sfx << " %in_ar1)\n";
+  s << "{\n";
+  s << "\t.reg" << sfx << " %ar1;\n";
+  s << "\tld.param" << sfx << " %ar1, [%in_ar1];\n";
+  s << "\t{\n";
+  s << "\t\t.param" << sfx << " %out_arg0;\n";
+  s << "\t\t.param" << sfx << " %out_arg1;\n";
+  s << "\t\tst.param" << sfx << " [%out_arg0], " << orig << ";\n";
+  s << "\t\tst.param" << sfx << " [%out_arg1], %ar1;\n";
+  s << "\t\tcall.uni gomp_nvptx_main, (%out_arg0, %out_arg1);\n";
+  s << "\t}\n";
+  s << "\tret;\n";
+  s << "}\n";
+}
+
 /* Implement ASM_DECLARE_FUNCTION_NAME.  Writes the start of a ptx
    function, including local var decls and copies from the arguments to
    local regs.  */
@@ -609,6 +648,14 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   name = nvptx_name_replacement (name);

   std::stringstream s;
+  if (flag_openmp
+      && lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)))
+    {
+      char *buf = (char *) alloca (strlen (name) + sizeof ("$impl"));
+      sprintf (buf, "%s$impl", name);
+      write_omp_entry (s, name, buf);
+      name = buf;
+    }
   write_function_decl_and_comment (s, name, decl);
   s << "// BEGIN";
   if (TREE_PUBLIC (decl))
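
For illustration, here is the wrapper text write_omp_entry would emit for
an entry point named main$_omp_fn$0 on a 64-bit target (sfx being ".u64").
This is hand-traced from the patch above, not actual compiler output:

// BEGIN GLOBAL FUNCTION DECL: gomp_nvptx_main
.extern .func gomp_nvptx_main(.param.u64 %in_ar1, .param.u64 %in_ar2);
.visible .entry main$_omp_fn$0(.param.u64 %in_ar1)
{
	.reg.u64 %ar1;
	ld.param.u64 %ar1, [%in_ar1];
	{
		.param.u64 %out_arg0;
		.param.u64 %out_arg1;
		st.param.u64 [%out_arg0], main$_omp_fn$0$impl;
		st.param.u64 [%out_arg1], %ar1;
		call.uni gomp_nvptx_main, (%out_arg0, %out_arg1);
	}
	ret;
}

The renamed main$_omp_fn$0$impl is then emitted as an ordinary .func, so
only the wrapper appears as a launchable PTX .entry kernel.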

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-11-03 14:25   ` Alexander Monakov
  2015-11-06 14:00     ` Bernd Schmidt
@ 2015-11-10 10:39     ` Jakub Jelinek
  2015-11-26  9:51       ` Jakub Jelinek
  1 sibling, 1 reply; 99+ messages in thread
From: Jakub Jelinek @ 2015-11-10 10:39 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Nov 03, 2015 at 05:25:53PM +0300, Alexander Monakov wrote:
> Here's an alternative patch that does not depend on exposure of shared-memory
> address space, and does not try to use pass_late_lower_omp.  It's based on
> Bernd's suggestion to transform

FYI, I've committed a new testcase to gomp-4_5-branch that covers various
target data sharing/team sharing/privatization parallel
sharing/privatization offloading cases.

2015-11-10  Jakub Jelinek  <jakub@redhat.com>

	* testsuite/libgomp.c/target-31.c: New test.

--- libgomp/testsuite/libgomp.c/target-31.c.jj	2015-11-09 19:05:50.439644694 +0100
+++ libgomp/testsuite/libgomp.c/target-31.c	2015-11-10 11:12:12.930286760 +0100
@@ -0,0 +1,163 @@
+#include <omp.h>
+#include <stdlib.h>
+
+int a = 1, b = 2, c = 3, d = 4;
+int e[2] = { 5, 6 }, f[2] = { 7, 8 }, g[2] = { 9, 10 }, h[2] = { 11, 12 };
+
+__attribute__((noinline, noclone)) void
+use (int *k, int *l, int *m, int *n, int *o, int *p, int *q, int *r)
+{
+  asm volatile ("" : : "r" (k) : "memory");
+  asm volatile ("" : : "r" (l) : "memory");
+  asm volatile ("" : : "r" (m) : "memory");
+  asm volatile ("" : : "r" (n) : "memory");
+  asm volatile ("" : : "r" (o) : "memory");
+  asm volatile ("" : : "r" (p) : "memory");
+  asm volatile ("" : : "r" (q) : "memory");
+  asm volatile ("" : : "r" (r) : "memory");
+}
+
+#pragma omp declare target to (use)
+
+int
+main ()
+{
+  int err = 0, r = -1, t[4];
+  long s[4] = { -1, -2, -3, -4 };
+  int j = 13, k = 14, l[2] = { 15, 16 }, m[2] = { 17, 18 };
+  #pragma omp target private (a, b, e, f) firstprivate (c, d, g, h) map(from: r, s, t) \
+		     map(tofrom: err, j, l) map(to: k, m)
+  #pragma omp teams num_teams (4) thread_limit (8) private (b, f) firstprivate (d, h, k, m)
+  {
+    int u1 = k, u2[2] = { m[0], m[1] };
+    int u3[64];
+    int i;
+    for (i = 0; i < 64; i++)
+      u3[i] = k + i;
+    #pragma omp parallel num_threads (1)
+    {
+      if (c != 3 || d != 4 || g[0] != 9 || g[1] != 10 || h[0] != 11 || h[1] != 12 || k != 14 || m[0] != 17 || m[1] != 18)
+	#pragma omp atomic write
+	  err = 1;
+      b = omp_get_team_num ();
+      if (b >= 4)
+	#pragma omp atomic write
+	  err = 1;
+      if (b == 0)
+	{
+	  a = omp_get_num_teams ();
+	  e[0] = 2 * a;
+	  e[1] = 3 * a;
+	}
+      f[0] = 2 * b;
+      f[1] = 3 * b;
+      #pragma omp atomic update
+	c++;
+      #pragma omp atomic update
+	g[0] += 2;
+      #pragma omp atomic update
+	g[1] += 3;
+      d++;
+      h[0] += 2;
+      h[1] += 3;
+      k += b;
+      m[0] += 2 * b;
+      m[1] += 3 * b;
+    }
+    use (&a, &b, &c, &d, e, f, g, h);
+    #pragma omp parallel firstprivate (u1, u2)
+    {
+      int w = omp_get_thread_num ();
+      int x = 19;
+      int y[2] = { 20, 21 };
+      int v = 24;
+      int ll[64];
+      if (u1 != 14 || u2[0] != 17 || u2[1] != 18)
+	#pragma omp atomic write
+	  err = 1;
+      u1 += w;
+      u2[0] += 2 * w;
+      u2[1] += 3 * w;
+      use (&u1, u2, &t[b], l, &k, m, &j, h);
+      #pragma omp master
+	t[b] = omp_get_num_threads ();
+      #pragma omp atomic update
+	j++;
+      #pragma omp atomic update
+	l[0] += 2;
+      #pragma omp atomic update
+	l[1] += 3;
+      #pragma omp atomic update
+	k += 4;
+      #pragma omp atomic update
+	m[0] += 5;
+      #pragma omp atomic update
+	m[1] += 6;
+      x += w;
+      y[0] += 2 * w;
+      y[1] += 3 * w;
+      #pragma omp simd safelen(32) private (v)
+      for (i = 0; i < 64; i++)
+	{
+	  v = 3 * i;
+	  ll[i] = u1 + v * u2[0] + u2[1] + x + y[0] + y[1] + v + h[0] + u3[i];
+	}
+      #pragma omp barrier
+      use (&u1, u2, &t[b], l, &k, m, &x, y);
+      if (w < 0 || w > 8 || w != omp_get_thread_num () || u1 != 14 + w
+	  || u2[0] != 17 + 2 * w || u2[1] != 18 + 3 * w
+	  || x != 19 + w || y[0] != 20 + 2 * w || y[1] != 21 + 3 * w
+	  || v != 24)
+	#pragma omp atomic write
+	  err = 1;
+      for (i = 0; i < 64; i++)
+	if (ll[i] != u1 + 3 * i * u2[0] + u2[1] + x + y[0] + y[1] + 3 * i + 13 + 14 + i)
+	  #pragma omp atomic write
+	    err = 1;
+    }
+    #pragma omp parallel num_threads (1)
+    {
+      if (b == 0)
+	{
+	  r = a;
+	  if (a != omp_get_num_teams ()
+	      || e[0] != 2 * a
+	      || e[1] != 3 * a)
+	    #pragma omp atomic write
+	      err = 1;
+	}
+      int v1, v2, v3;
+      #pragma omp atomic read
+	v1 = c;
+      #pragma omp atomic read
+	v2 = g[0];
+      #pragma omp atomic read
+	v3 = g[1];
+      s[b] = v1 * 65536L + v2 * 256L + v3;
+      if (d != 5 || h[0] != 13 || h[1] != 15
+	  || k != 14 + b + 4 * t[b]
+	  || m[0] != 17 + 2 * b + 5 * t[b]
+	  || m[1] != 18 + 3 * b + 6 * t[b]
+	  || b != omp_get_team_num ()
+	  || f[0] != 2 * b || f[1] != 3 * b)
+	#pragma omp atomic write
+	  err = 1;
+    }
+  }
+  if (err != 0) abort ();
+  if (r < 1 || r > 4) abort ();
+  if (a != 1 || b != 2 || c != 3 || d != 4) abort ();
+  if (e[0] != 5 || e[1] != 6 || f[0] != 7 || f[1] != 8) abort ();
+  if (g[0] != 9 || g[1] != 10 || h[0] != 11 || h[1] != 12) abort ();
+  int i, cnt = 0;
+  for (i = 0; i < r; i++)
+    if ((s[i] >> 16) < 3 + 1 || (s[i] >> 16) > 3 + 4
+	|| ((s[i] >> 8) & 0xff) < 9 + 2 * 1 || ((s[i] >> 8) & 0xff) > 9 + 2 * 4
+	|| (s[i] & 0xff) < 10 + 3 * 1 || (s[i] & 0xff) > 10 + 3 * 4
+	|| t[i] < 1 || t[i] > 8)
+      abort ();
+    else
+      cnt += t[i];
+  if (j != 13 + cnt || l[0] != 15 + 2 * cnt || l[1] != 16 + 3 * cnt) abort ();
+  return 0;
+}

	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
  2015-11-10 10:39     ` Jakub Jelinek
@ 2015-11-26  9:51       ` Jakub Jelinek
  0 siblings, 0 replies; 99+ messages in thread
From: Jakub Jelinek @ 2015-11-26  9:51 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Dmitry Melnik

On Tue, Nov 10, 2015 at 11:39:36AM +0100, Jakub Jelinek wrote:
> On Tue, Nov 03, 2015 at 05:25:53PM +0300, Alexander Monakov wrote:
> > Here's an alternative patch that does not depend on exposure of shared-memory
> > address space, and does not try to use pass_late_lower_omp.  It's based on
> > Bernd's suggestion to transform
> 
> FYI, I've committed a new testcase to gomp-4_5-branch that covers various
> target data sharing/team sharing/privatization parallel
> sharing/privatization offloading cases.

And another testcase, this time using only OpenMP 4.0 features; it tries to
test the behavior of addressable vars in declare target functions where it
is not clear whether they are executed in teams, distribute or parallel for
contexts.

I wanted to see what LLVM generates here (I tried LLVM trunk), but it is
unable to parse #pragma omp distribute or #pragma omp declare target,
so it is hard to guess anything.

Tested with XeonPhi offloading as well as host fallback, committed to trunk.

2015-11-26  Jakub Jelinek  <jakub@redhat.com>

	* testsuite/libgomp.c/target-35.c: New test.

--- libgomp/testsuite/libgomp.c/target-35.c	(revision 0)
+++ libgomp/testsuite/libgomp.c/target-35.c	(working copy)
@@ -0,0 +1,129 @@
+#include <omp.h>
+#include <stdlib.h>
+
+#pragma omp declare target
+__attribute__((noinline))
+void
+foo (int x, int y, int z, int *a, int *b)
+{
+  if (x == 0)
+    {
+      int i, j;
+      for (i = 0; i < 64; i++)
+	#pragma omp parallel for shared (a, b)
+	for (j = 0; j < 32; j++)
+	  foo (3, i, j, a, b);
+    }
+  else if (x == 1)
+    {
+      int i, j;
+      #pragma omp distribute dist_schedule (static, 1)
+      for (i = 0; i < 64; i++)
+	#pragma omp parallel for shared (a, b)
+	for (j = 0; j < 32; j++)
+	  foo (3, i, j, a, b);
+    }
+  else if (x == 2)
+    {
+      int j;
+      #pragma omp parallel for shared (a, b)
+      for (j = 0; j < 32; j++)
+	foo (3, y, j, a, b);
+    }
+  else
+    {
+      #pragma omp atomic
+      b[y] += z;
+      #pragma omp atomic
+      *a += 1;
+    }
+}
+
+__attribute__((noinline))
+int
+bar (int x, int y, int z)
+{
+  int a, b[64], i;
+  a = 8;
+  for (i = 0; i < 64; i++)
+    b[i] = i;
+  foo (x, y, z, &a, b);
+  if (x == 0)
+    {
+      if (a != 8 + 64 * 32)
+	return 1;
+      for (i = 0; i < 64; i++)
+	if (b[i] != i + 31 * 32 / 2)
+	  return 1;
+    }
+  else if (x == 1)
+    {
+      int c = omp_get_num_teams ();
+      int d = omp_get_team_num ();
+      int e = d;
+      int f = 0;
+      for (i = 0; i < 64; i++)
+	if (i == e)
+	  {
+	    if (b[i] != i + 31 * 32 / 2)
+	      return 1;
+	    f++;
+	    e = e + c;
+	  }
+	else if (b[i] != i)
+	  return 1;
+      if (a < 8 || a > 8 + f * 32)
+	return 1;
+    }
+  else if (x == 2)
+    {
+      if (a != 8 + 32)
+	return 1;
+      for (i = 0; i < 64; i++)
+	if (b[i] != i + (i == y ? 31 * 32 / 2 : 0))
+	  return 1;
+    }
+  else if (x == 3)
+    {
+      if (a != 8 + 1)
+	return 1;
+      for (i = 0; i < 64; i++)
+	if (b[i] != i + (i == y ? z : 0))
+	  return 1;
+    }
+  return 0;
+}
+#pragma omp end declare target
+
+int
+main ()
+{
+  int i, j, err = 0;
+  #pragma omp target map(tofrom:err)
+  #pragma omp teams reduction(+:err)
+  err += bar (0, 0, 0);
+  if (err)
+    abort ();
+  #pragma omp target map(tofrom:err)
+  #pragma omp teams reduction(+:err)
+  err += bar (1, 0, 0);
+  if (err)
+    abort ();
+  #pragma omp target map(tofrom:err)
+  #pragma omp teams reduction(+:err)
+  #pragma omp distribute
+  for (i = 0; i < 64; i++)
+    err += bar (2, i, 0);
+  if (err)
+    abort ();
+  #pragma omp target map(tofrom:err)
+  #pragma omp teams reduction(+:err)
+  #pragma omp distribute
+  for (i = 0; i < 64; i++)
+  #pragma omp parallel for reduction(+:err)
+    for (j = 0; j < 32; j++)
+      err += bar (3, i, j);
+  if (err)
+    abort ();
+  return 0;
+}


	Jakub

^ permalink raw reply	[flat|nested] 99+ messages in thread

end of thread, other threads:[~2015-11-26  9:50 UTC | newest]

Thread overview: 99+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-20 18:34 [gomp4 00/14] NVPTX: further porting Alexander Monakov
2015-10-20 18:34 ` [gomp4 12/14] libgomp: fixup error.c on nvptx Alexander Monakov
2015-10-21 10:03   ` Jakub Jelinek
2015-10-20 18:34 ` [gomp4 07/14] libgomp nvptx plugin: launch target functions via gomp_nvptx_main Alexander Monakov
2015-10-20 21:12   ` Bernd Schmidt
2015-10-20 21:19     ` Alexander Monakov
2015-10-20 21:27       ` Bernd Schmidt
2015-10-21  9:07         ` Jakub Jelinek
2015-10-20 18:34 ` [gomp4 14/14] libgomp: use more generic implementations on nvptx Alexander Monakov
2015-10-21 10:17   ` Jakub Jelinek
2015-10-20 18:34 ` [gomp4 08/14] libgomp nvptx: populate proc.c Alexander Monakov
2015-10-21  9:15   ` Jakub Jelinek
2015-10-20 18:34 ` [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX Alexander Monakov
2015-10-21  0:07   ` Bernd Schmidt
2015-10-21  6:49     ` Alexander Monakov
2015-10-21  8:48   ` Jakub Jelinek
2015-10-21  9:09     ` Alexander Monakov
2015-10-21  9:24       ` Jakub Jelinek
2015-10-21 10:42       ` Bernd Schmidt
2015-10-21 14:06         ` Alexander Monakov
2015-11-03 14:25   ` Alexander Monakov
2015-11-06 14:00     ` Bernd Schmidt
2015-11-06 14:06       ` Jakub Jelinek
2015-11-10 10:39     ` Jakub Jelinek
2015-11-26  9:51       ` Jakub Jelinek
2015-10-20 18:34 ` [gomp4 11/14] libgomp: avoid variable-length stack allocation in team.c Alexander Monakov
2015-10-20 20:48   ` Bernd Schmidt
2015-10-20 21:41     ` Alexander Monakov
2015-10-20 21:46       ` Bernd Schmidt
2015-10-21  9:59   ` Jakub Jelinek
2015-10-20 18:34 ` [gomp4 05/14] omp-low: set 'omp target entrypoint' only on entrypoints Alexander Monakov
2015-10-20 23:57   ` Bernd Schmidt
2015-10-21  8:20   ` Jakub Jelinek
2015-10-30 16:58     ` Alexander Monakov
2015-11-06 14:05       ` Bernd Schmidt
2015-11-06 14:08         ` Jakub Jelinek
2015-11-06 14:12           ` Bernd Schmidt
2015-11-06 17:16         ` Alexander Monakov
2015-10-20 18:34 ` [gomp4 03/14] nvptx: expand support for address spaces Alexander Monakov
2015-10-20 20:56   ` Bernd Schmidt
2015-10-20 21:06     ` Alexander Monakov
2015-10-20 21:13       ` Bernd Schmidt
2015-10-20 21:41         ` Cesar Philippidis
2015-10-20 21:51           ` Bernd Schmidt
2015-10-20 18:34 ` [gomp4 04/14] nvptx: fix output of _Bool global variables Alexander Monakov
2015-10-20 20:51   ` Bernd Schmidt
2015-10-20 21:04     ` Alexander Monakov
2015-10-28 16:56       ` Alexander Monakov
2015-10-28 17:01         ` Bernd Schmidt
2015-10-28 17:38           ` Alexander Monakov
2015-10-28 17:39             ` Bernd Schmidt
2015-10-28 17:51               ` Alexander Monakov
2015-10-28 18:06                 ` Bernd Schmidt
2015-10-28 18:07                   ` Alexander Monakov
2015-10-28 18:33                     ` Bernd Schmidt
2015-10-28 19:37                       ` Alexander Monakov
2015-10-29 11:13                         ` Bernd Schmidt
2015-10-30 13:27                           ` Alexander Monakov
2015-10-30 13:38                             ` Bernd Schmidt
2015-10-20 18:34 ` [gomp4 01/14] nvptx: emit kernels for 'omp target entrypoint' only for OpenACC Alexander Monakov
2015-10-20 23:48   ` Bernd Schmidt
2015-10-21  5:40     ` Alexander Monakov
2015-10-21  8:11   ` Jakub Jelinek
2015-10-21  8:36     ` Alexander Monakov
2015-10-20 18:52 ` [gomp4 13/14] libgomp: provide minimal GOMP_teams Alexander Monakov
2015-10-21 10:12   ` Jakub Jelinek
2015-10-20 18:52 ` [gomp4 10/14] libgomp: arrange a team of pre-started threads via gomp_nvptx_main Alexander Monakov
2015-10-21  9:49   ` Jakub Jelinek
2015-10-21 14:41     ` Alexander Monakov
2015-10-21 15:02       ` Jakub Jelinek
2015-10-20 18:53 ` [gomp4 09/14] libgomp: provide barriers on NVPTX Alexander Monakov
2015-10-20 20:56   ` Bernd Schmidt
2015-10-20 22:00     ` Alexander Monakov
2015-10-21  2:23       ` Bernd Schmidt
2015-10-21  9:39   ` Jakub Jelinek
2015-10-20 19:01 ` [gomp4 02/14] nvptx: emit pointers to OpenMP target region entry points Alexander Monakov
2015-10-21  7:55 ` [gomp4 00/14] NVPTX: further porting Martin Jambor
2015-10-21  8:56 ` Jakub Jelinek
2015-10-21  9:17   ` Alexander Monakov
2015-10-21  9:29     ` Jakub Jelinek
2015-10-28 17:22       ` Alexander Monakov
2015-10-29  8:54         ` Jakub Jelinek
2015-10-29 11:38           ` Alexander Monakov
2015-10-21 12:06 ` Bernd Schmidt
2015-10-21 15:48   ` Alexander Monakov
2015-10-21 16:10     ` Bernd Schmidt
2015-10-22  9:55     ` Jakub Jelinek
2015-10-22 16:42       ` Alexander Monakov
2015-10-22 17:16         ` Julian Brown
2015-10-22 18:19           ` Alexander Monakov
2015-10-22 17:17         ` Bernd Schmidt
2015-10-22 18:10           ` Alexander Monakov
2015-10-22 18:27             ` Bernd Schmidt
2015-10-22 19:28               ` Alexander Monakov
2015-10-23  8:23           ` Jakub Jelinek
2015-10-23  8:25           ` Jakub Jelinek
2015-10-23 10:24           ` Jakub Jelinek
2015-10-23 10:48             ` Bernd Schmidt
2015-10-23 17:36             ` Alexander Monakov
