public inbox for gcc-patches@gcc.gnu.org
* [gomp-nvptx 7/9] nvptx mkoffload: pass -mgomp for OpenMP offloading
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
  2015-12-01 15:28 ` [gomp-nvptx 5/9] new target hook: TARGET_SIMT_VF Alexander Monakov
  2015-12-01 15:28 ` [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets Alexander Monakov
@ 2015-12-01 15:28 ` Alexander Monakov
  2015-12-01 15:28 ` [gomp-nvptx 6/9] nvptx libgcc: rewrite in C Alexander Monakov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

This patch wires up use of the alternative -mgomp multilib for OpenMP offloading
via nvptx mkoffload.  It makes OpenACC and OpenMP incompatible in a single
offloading compilation, so I've added a diagnostic for that.

	* config/nvptx/mkoffload.c (main): Check that either OpenACC or OpenMP
	is selected.  Pass -mgomp to offload compiler in OpenMP case.
---
 gcc/config/nvptx/mkoffload.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index 7aa6f09..9a5d36d 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -460,6 +460,7 @@ main (int argc, char **argv)
 
   /* Scan the argument vector.  */
   bool fopenmp = false;
+  bool fopenacc = false;
   for (int i = 1; i < argc; i++)
     {
 #define STR "-foffload-abi="
@@ -476,11 +477,15 @@ main (int argc, char **argv)
 #undef STR
       else if (strcmp (argv[i], "-fopenmp") == 0)
 	fopenmp = true;
+      else if (strcmp (argv[i], "-fopenacc") == 0)
+	fopenacc = true;
       else if (strcmp (argv[i], "-save-temps") == 0)
 	save_temps = true;
       else if (strcmp (argv[i], "-v") == 0)
 	verbose = true;
     }
+  if (!(fopenacc ^ fopenmp))
+    fatal_error (input_location, "either -fopenacc or -fopenmp must be set");
 
   struct obstack argv_obstack;
   obstack_init (&argv_obstack);
@@ -501,6 +506,8 @@ main (int argc, char **argv)
     default:
       gcc_unreachable ();
     }
+  if (fopenmp)
+    obstack_ptr_grow (&argv_obstack, "-mgomp");
 
   for (int ix = 1; ix != argc; ix++)
     {

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [gomp-nvptx 6/9] nvptx libgcc: rewrite in C
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
                   ` (2 preceding siblings ...)
  2015-12-01 15:28 ` [gomp-nvptx 7/9] nvptx mkoffload: pass -mgomp for OpenMP offloading Alexander Monakov
@ 2015-12-01 15:28 ` Alexander Monakov
  2015-12-01 23:52   ` Bernd Schmidt
  2015-12-01 15:28 ` [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib Alexander Monakov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

To easily build libgcc for the -mgomp multilib, I've rewritten the libgcc
routines from asm to C.

En passant, I've fixed a bug in the malloc and realloc wrappers, where they
failed to handle out-of-memory conditions.  I'm assuming it wasn't intentional.

I also use a patch for Newlib that rewrites its nvptx-specific 'printf'
implementation in C.

	* config/nvptx/crt0.c: New, rewritten in C from ...
	* config/nvptx/crt0.s: ...this.  Delete.
	* config/nvptx/free.c: New, rewritten in C from ...
	* config/nvptx/free.asm: ...this.  Delete.
	* config/nvptx/malloc.c: New, rewritten in C from ...
	* config/nvptx/malloc.asm: ...this.  Delete.
	* config/nvptx/realloc.c: Handle out-of-memory condition.
	* config/nvptx/nvptx-malloc.h (__nvptx_real_free,
	__nvptx_real_malloc): Declare.
	* config/nvptx/stacks.c: New.
	* config/nvptx/t-nvptx: Adjust.
---
 libgcc/config/nvptx/crt0.c         | 61 ++++++++++++++++++++++++++++++++++++++
 libgcc/config/nvptx/crt0.s         | 54 ---------------------------------
 libgcc/config/nvptx/free.asm       | 50 -------------------------------
 libgcc/config/nvptx/free.c         | 34 +++++++++++++++++++++
 libgcc/config/nvptx/malloc.asm     | 55 ----------------------------------
 libgcc/config/nvptx/malloc.c       | 35 ++++++++++++++++++++++
 libgcc/config/nvptx/nvptx-malloc.h |  5 ++++
 libgcc/config/nvptx/realloc.c      |  2 ++
 libgcc/config/nvptx/stacks.c       | 30 +++++++++++++++++++
 libgcc/config/nvptx/t-nvptx        | 11 +++----
 10 files changed, 173 insertions(+), 164 deletions(-)
 create mode 100644 libgcc/config/nvptx/crt0.c
 delete mode 100644 libgcc/config/nvptx/crt0.s
 delete mode 100644 libgcc/config/nvptx/free.asm
 create mode 100644 libgcc/config/nvptx/free.c
 delete mode 100644 libgcc/config/nvptx/malloc.asm
 create mode 100644 libgcc/config/nvptx/malloc.c
 create mode 100644 libgcc/config/nvptx/stacks.c

diff --git a/libgcc/config/nvptx/crt0.c b/libgcc/config/nvptx/crt0.c
new file mode 100644
index 0000000..74483c4
--- /dev/null
+++ b/libgcc/config/nvptx/crt0.c
@@ -0,0 +1,61 @@
+/* Startup routine for standalone execution.
+
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+void exit (int);
+void abort (void);
+void __attribute__((kernel)) __main (int *, int, char *[]);
+
+static int *__exitval;
+
+void
+exit (int arg)
+{
+  *__exitval = arg;
+  asm volatile ("exit;");
+  __builtin_unreachable ();
+}
+
+void
+abort (void)
+{
+  exit (255);
+}
+
+asm ("// BEGIN GLOBAL VAR DECL: __nvptx_stacks");
+asm (".extern .shared .u64 __nvptx_stacks[32];");
+asm ("// BEGIN GLOBAL VAR DECL: __nvptx_uni");
+asm (".extern .shared .u32 __nvptx_uni[32];");
+
+extern int main (int argc, char *argv[]);
+
+void __attribute__((kernel))
+__main (int *__retval, int __argc, char *__argv[])
+{
+  __exitval = __retval;
+
+  static char gstack[131072] __attribute__((aligned(8)));
+  asm ("st.shared.u64 [__nvptx_stacks], %0;" : : "r" (gstack + sizeof gstack));
+  asm ("st.shared.u32 [__nvptx_uni], %0;" : : "r" (0));
+
+  exit (main (__argc, __argv));
+}
diff --git a/libgcc/config/nvptx/crt0.s b/libgcc/config/nvptx/crt0.s
deleted file mode 100644
index 1ac69a5..0000000
--- a/libgcc/config/nvptx/crt0.s
+++ /dev/null
@@ -1,54 +0,0 @@
-	.version 3.1
-	.target	sm_30
-	.address_size 64
-
-.global .u64 %__exitval;
-// BEGIN GLOBAL FUNCTION DEF: abort
-.visible .func abort
-{
-        .reg .u64 %rd1;
-        ld.global.u64   %rd1,[%__exitval];
-        st.u32   [%rd1], 255;
-        exit;
-}
-// BEGIN GLOBAL FUNCTION DEF: exit
-.visible .func exit (.param .u32 %arg)
-{
-        .reg .u64 %rd1;
-	.reg .u32 %val;
-	ld.param.u32 %val,[%arg];
-        ld.global.u64   %rd1,[%__exitval];
-        st.u32   [%rd1], %val;
-        exit;
-}
-
-.visible .shared .u64 __nvptx_stacks[1];
-.global .align 8 .u8 %__softstack[131072];
-
-.extern .func (.param.u32 retval) main (.param.u32 argc, .param.u64 argv);
-
-.visible .entry __main (.param .u64 __retval, .param.u32 __argc, .param.u64 __argv)
-{
-        .reg .u32 %r<3>;
-        .reg .u64 %rd<3>;
-	.param.u32 %argc;
-	.param.u64 %argp;
-	.param.u32 %mainret;
-        ld.param.u64    %rd0, [__retval];
-        st.global.u64   [%__exitval], %rd0;
-
-        .reg .u64 %stackptr;
-        mov.u64	%stackptr, %__softstack;
-        cvta.global.u64	%stackptr, %stackptr;
-        add.u64	%stackptr, %stackptr, 131072;
-        st.shared.u64	[__nvptx_stacks], %stackptr;
-
-	ld.param.u32	%r1, [__argc];
-	ld.param.u64	%rd1, [__argv];
-	st.param.u32	[%argc], %r1;
-	st.param.u64	[%argp], %rd1;
-        call.uni        (%mainret), main, (%argc, %argp);
-	ld.param.u32	%r1,[%mainret];
-        st.s32   [%rd0], %r1;
-        exit;
-}
diff --git a/libgcc/config/nvptx/free.asm b/libgcc/config/nvptx/free.asm
deleted file mode 100644
index 251d733..0000000
--- a/libgcc/config/nvptx/free.asm
+++ /dev/null
@@ -1,50 +0,0 @@
-// A wrapper around free to enable a realloc implementation.
-
-// Copyright (C) 2014-2015 Free Software Foundation, Inc.
-
-// This file is free software; you can redistribute it and/or modify it
-// under the terms of the GNU General Public License as published by the
-// Free Software Foundation; either version 3, or (at your option) any
-// later version.
-
-// This file is distributed in the hope that it will be useful, but
-// WITHOUT ANY WARRANTY; without even the implied warranty of
-// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-// General Public License for more details.
-
-// Under Section 7 of GPL version 3, you are granted additional
-// permissions described in the GCC Runtime Library Exception, version
-// 3.1, as published by the Free Software Foundation.
-
-// You should have received a copy of the GNU General Public License and
-// a copy of the GCC Runtime Library Exception along with this program;
-// see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-// <http://www.gnu.org/licenses/>.
-
-        .version        3.1
-        .target sm_30
-        .address_size 64
-
-.extern .func free(.param.u64 %in_ar1);
-
-// BEGIN GLOBAL FUNCTION DEF: __nvptx_free
-.visible .func __nvptx_free(.param.u64 %in_ar1)
-{
-	.reg.u64 %ar1;
-	.reg.u64 %hr10;
-	.reg.u64 %r23;
-	.reg.pred %r25;
-	.reg.u64 %r27;
-	ld.param.u64 %ar1, [%in_ar1];
-		mov.u64	%r23, %ar1;
-		setp.eq.u64 %r25,%r23,0;
-	@%r25	bra	$L1;
-		add.u64	%r27, %r23, -8;
-	{
-		.param.u64 %out_arg0;
-		st.param.u64 [%out_arg0], %r27;
-		call free, (%out_arg0);
-	}
-$L1:
-	ret;
-	}
diff --git a/libgcc/config/nvptx/free.c b/libgcc/config/nvptx/free.c
new file mode 100644
index 0000000..90699c7
--- /dev/null
+++ b/libgcc/config/nvptx/free.c
@@ -0,0 +1,34 @@
+/* Implement free wrapper to help support realloc.
+
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stddef.h>
+#include "nvptx-malloc.h"
+
+void
+__nvptx_free (void *ptr)
+{
+  if (ptr == NULL)
+    return;
+
+  __nvptx_real_free ((char *)ptr - 8);
+}
diff --git a/libgcc/config/nvptx/malloc.asm b/libgcc/config/nvptx/malloc.asm
deleted file mode 100644
index 9f36715..0000000
--- a/libgcc/config/nvptx/malloc.asm
+++ /dev/null
@@ -1,55 +0,0 @@
-// A wrapper around malloc to enable a realloc implementation.
-
-// Copyright (C) 2014-2015 Free Software Foundation, Inc.
-
-// This file is free software; you can redistribute it and/or modify it
-// under the terms of the GNU General Public License as published by the
-// Free Software Foundation; either version 3, or (at your option) any
-// later version.
-
-// This file is distributed in the hope that it will be useful, but
-// WITHOUT ANY WARRANTY; without even the implied warranty of
-// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-// General Public License for more details.
-
-// Under Section 7 of GPL version 3, you are granted additional
-// permissions described in the GCC Runtime Library Exception, version
-// 3.1, as published by the Free Software Foundation.
-
-// You should have received a copy of the GNU General Public License and
-// a copy of the GCC Runtime Library Exception along with this program;
-// see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-// <http://www.gnu.org/licenses/>.
-
-        .version        3.1
-        .target sm_30
-        .address_size 64
-
-.extern .func (.param.u64 %out_retval) malloc(.param.u64 %in_ar1);
-
-// BEGIN GLOBAL FUNCTION DEF: __nvptx_malloc
-.visible .func (.param.u64 %out_retval) __nvptx_malloc(.param.u64 %in_ar1)
-{
-        .reg.u64 %ar1;
-.reg.u64 %retval;
-        .reg.u64 %hr10;
-        .reg.u64 %r26;
-        .reg.u64 %r28;
-        .reg.u64 %r29;
-        .reg.u64 %r31;
-        ld.param.u64 %ar1, [%in_ar1];
-		mov.u64 %r26, %ar1;
-		add.u64 %r28, %r26, 8;
-        {
-		.param.u64 %retval_in;
-		.param.u64 %out_arg0;
-		st.param.u64 [%out_arg0], %r28;
-		call (%retval_in), malloc, (%out_arg0);
-		ld.param.u64    %r29, [%retval_in];
-        }
-		st.u64  [%r29], %r26;
-		add.u64 %r31, %r29, 8;
-		mov.u64 %retval, %r31;
-		st.param.u64    [%out_retval], %retval;
-		ret;
-}
diff --git a/libgcc/config/nvptx/malloc.c b/libgcc/config/nvptx/malloc.c
new file mode 100644
index 0000000..2de995c
--- /dev/null
+++ b/libgcc/config/nvptx/malloc.c
@@ -0,0 +1,35 @@
+/* Implement malloc wrapper to help support realloc.
+
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stddef.h>
+#include "nvptx-malloc.h"
+
+void *
+__nvptx_malloc (size_t sz)
+{
+  size_t *ptr = __nvptx_real_malloc (sz + 8);
+  if (!ptr)
+    return NULL;
+  *ptr = sz;
+  return ptr + 1;
+}
diff --git a/libgcc/config/nvptx/nvptx-malloc.h b/libgcc/config/nvptx/nvptx-malloc.h
index d0ce65a..437f8b3 100644
--- a/libgcc/config/nvptx/nvptx-malloc.h
+++ b/libgcc/config/nvptx/nvptx-malloc.h
@@ -21,6 +21,11 @@
    see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
    <http://www.gnu.org/licenses/>.  */
 
+/* malloc/realloc/free are remapped to these by the NVPTX backend.  */
 extern void __nvptx_free (void *);
 extern void *__nvptx_malloc (size_t);
 extern void *__nvptx_realloc (void *, size_t);
+
+/* And these are remapped back to "real" malloc/free.  */
+extern void __nvptx_real_free (void *);
+extern void *__nvptx_real_malloc (size_t);
diff --git a/libgcc/config/nvptx/realloc.c b/libgcc/config/nvptx/realloc.c
index 136f010..dba429e 100644
--- a/libgcc/config/nvptx/realloc.c
+++ b/libgcc/config/nvptx/realloc.c
@@ -33,6 +33,8 @@ __nvptx_realloc (void *ptr, size_t newsz)
       return NULL;
     }
   void *newptr = __nvptx_malloc (newsz);
+  if (!newptr)
+    return NULL;
 
   size_t oldsz;
   if (ptr == NULL)
diff --git a/libgcc/config/nvptx/stacks.c b/libgcc/config/nvptx/stacks.c
new file mode 100644
index 0000000..c597cd1
--- /dev/null
+++ b/libgcc/config/nvptx/stacks.c
@@ -0,0 +1,30 @@
+/* Define shared memory arrays for -msoft-stack and -munified-simt.
+
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* __shared__ char *__nvptx_stacks[32];  */
+asm ("// BEGIN GLOBAL VAR DEF: __nvptx_stacks");
+asm (".visible .shared .u64 __nvptx_stacks[32];");
+
+/* __shared__ unsigned __nvptx_uni[32];  */
+asm ("// BEGIN GLOBAL VAR DEF: __nvptx_uni");
+asm (".visible .shared .u32 __nvptx_uni[32];");
diff --git a/libgcc/config/nvptx/t-nvptx b/libgcc/config/nvptx/t-nvptx
index 34d68cc..e302494 100644
--- a/libgcc/config/nvptx/t-nvptx
+++ b/libgcc/config/nvptx/t-nvptx
@@ -1,12 +1,13 @@
-LIB2ADD=$(srcdir)/config/nvptx/malloc.asm \
-	$(srcdir)/config/nvptx/free.asm \
-	$(srcdir)/config/nvptx/realloc.c
+LIB2ADD=$(srcdir)/config/nvptx/malloc.c \
+	$(srcdir)/config/nvptx/free.c \
+	$(srcdir)/config/nvptx/realloc.c \
+	$(srcdir)/config/nvptx/stacks.c
 
 LIB2ADDEH=
 LIB2FUNCS_EXCLUDE=__main
 
-crt0.o: $(srcdir)/config/nvptx/crt0.s
-	cp $< $@
+crt0.o: $(srcdir)/config/nvptx/crt0.c
+	$(gcc_compile) -c $<
 
 # Prevent building "advanced" stuff (for example, gcov support).  We don't
 # support it, and it may cause the build to fail, because of alloca usage, for


* [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
                   ` (3 preceding siblings ...)
  2015-12-01 15:28 ` [gomp-nvptx 6/9] nvptx libgcc: rewrite in C Alexander Monakov
@ 2015-12-01 15:28 ` Alexander Monakov
  2015-12-02 10:56   ` Jakub Jelinek
  2015-12-01 15:28 ` [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp Alexander Monakov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

Since OpenMP offloading requires both soft-stacks and "uniform SIMT" (both
non-traditional codegen variants), I'm building a multilib variant with those
enabled.  This patch adds the option -mgomp, which enables -msoft-stack plus
-muniform-simt, and builds a multilib with it.

	* config/nvptx/nvptx.c (nvptx_option_override): Handle TARGET_GOMP.
	* config/nvptx/nvptx.opt (mgomp): New option.
	* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
	* doc/invoke.texi (mgomp): Document.
---
 gcc/config/nvptx/nvptx.c   | 3 +++
 gcc/config/nvptx/nvptx.opt | 4 ++++
 gcc/config/nvptx/t-nvptx   | 2 ++
 gcc/doc/invoke.texi        | 5 +++++
 4 files changed, 14 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 3bd3cf7..48ee96e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -153,6 +153,9 @@ nvptx_option_override (void)
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, worker_red_name);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
+  if (TARGET_GOMP)
+    target_flags |= MASK_SOFT_STACK | MASK_UNIFORM_SIMT;
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 47e811e..8826659 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -36,3 +36,7 @@ Use custom stacks instead of local memory for automatic storage.
 muniform-simt
 Target Report Mask(UNIFORM_SIMT)
 Generate code that executes all threads in a warp as if one was active.
+
+mgomp
+Target Report Mask(GOMP)
+Generate code for OpenMP offloading: enables -msoft-stack and -muniform-simt.
diff --git a/gcc/config/nvptx/t-nvptx b/gcc/config/nvptx/t-nvptx
index e2580c9..6c1010d 100644
--- a/gcc/config/nvptx/t-nvptx
+++ b/gcc/config/nvptx/t-nvptx
@@ -8,3 +8,5 @@ ALL_HOST_OBJS += mkoffload.o
 mkoffload$(exeext): mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBDEPS)
 	+$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \
 	  mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBS)
+
+MULTILIB_OPTIONS = mgomp
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 46cd2e9..7e7f3b4 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18956,6 +18956,11 @@ all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD
 regions).  Each thread can bitwise-and the bitmask at position @code{tid.y}
 with current lane index to compute the master lane index.
 
+@item -mgomp
+@opindex mgomp
+Generate code for use in OpenMP offloading: enables @option{-msoft-stack} and
+@option{-muniform-simt} options, and selects corresponding multilib variant.
+
 @end table
 
 @node PDP-11 Options


* [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
                   ` (5 preceding siblings ...)
  2015-12-01 15:28 ` [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp Alexander Monakov
@ 2015-12-01 15:28 ` Alexander Monakov
  2015-12-01 16:02   ` Bernd Schmidt
  2015-12-02 10:40   ` Jakub Jelinek
  2015-12-01 15:46 ` [gomp-nvptx 3/9] nvptx backend: add two more identifier maps Alexander Monakov
  2015-12-01 15:47 ` [gomp-nvptx 1/9] nvptx backend: allow emitting COND_EXEC insns Alexander Monakov
  8 siblings, 2 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

This patch introduces a code generation variant for NVPTX that I'm using for
SIMD work in OpenMP offloading.  Let me try to explain the idea behind it...

In place of SIMD vectorization, NVPTX uses SIMT (single
instruction/multiple threads) execution: groups of 32 threads execute the same
instruction, with some threads possibly masked off if under a divergent branch.
So we are mapping OpenMP threads to such thread groups ("warps"), and hardware
threads are then mapped to OpenMP SIMD lanes.

We need to reach heads of SIMD regions with all hw threads active, because
there's no way to "resurrect" them once masked off: they need to follow the
same control flow, and reach the SIMD region entry with the same local state
(registers, and stack too for OpenACC).

The approach in OpenACC is, outside of "vector" loops, to 1) make threads 1-31
"slaves" that just follow branches without doing any computation (which requires
extra jumps and broadcasting of branch predicates), and 2) broadcast register
and stack state from the master to the slaves when entering "vector" regions.

I'm taking a different approach.  I want to execute all insns in all warp
members, while ensuring that the effect (on global and local state) is the same
as if a single thread were executing that instruction.  Most instructions
automatically satisfy that: if threads have the same state, then executing an
arithmetic instruction, a normal memory load/store, etc. keeps local state the
same in all threads.

The two exception insn categories are atomics and calls.  For calls, we can
demand recursively that they uphold this execution model, until we reach
runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
atomics.

To handle atomics, we
  1) execute the atomic conditionally only in one warp member -- so its side
  effect happens once;
  2) copy the register that was set from that warp member to others -- so
  local state is kept synchronized:

    atom.op dest, ...

becomes

    /* pred = (current_lane == 0);  */
    @pred atom.op dest, ...
    shuffle.idx dest, dest, /*srclane=*/0

So the overhead is one shuffle insn following each atomic, plus predicate
setup in the prologue.

OK, so the above handles execution outside of SIMD regions nicely, but we also
need to run code inside of SIMD regions, where we need to turn off this
syncing effect.  It turns out we can keep atomics decorated almost as before:

    @pred atom.op dest, ...
    shuffle.idx dest, dest, master_lane

and compute 'pred' and 'master_lane' accordingly: outside of SIMD regions we
need (master_lane == 0 && pred == (current_lane == 0)), and inside we need
(master_lane == current_lane && pred == true) (so that shuffle is no-op, and
predicate is 'true' for all lanes).  Then, (pred = (current_lane ==
master_lane) works in both cases, and we just need to set up master_lane
accordingly: master_lane = current_lane & mask, where mask is all-0 outside of
SIMD regions, and all-1 inside.  To store these per-warp masks, I've
introduced another shared memory array, __nvptx_uni.

	* config/nvptx/nvptx.c (need_unisimt_decl): New variable.  Set it...
	(nvptx_init_unisimt_predicate): ...here (new function) and use it...
	(nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
	(nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New.  Transform code for -muniform-simt.
	(nvptx_get_axis_predicate): New helper function, factored out from...
	(nvptx_single): ...here.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_unisimt__ when -muniform-simt option is active.
	(struct machine_function): Add unisimt_master, unisimt_predicate
	rtx fields.
	* config/nvptx/nvptx.md (divergent): New attribute.
	(atomic_compare_and_swap<mode>_1): Mark as divergent.
	(atomic_exchange<mode>): Ditto.
	(atomic_fetch_add<mode>): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_<logic><mode>): Ditto.
	* config/nvptx/nvptx.opt (muniform-simt): New option.
	* doc/invoke.texi (-muniform-simt): Document.
---
 gcc/config/nvptx/nvptx.c   | 138 ++++++++++++++++++++++++++++++++++++++++++---
 gcc/config/nvptx/nvptx.h   |   4 ++
 gcc/config/nvptx/nvptx.md  |  18 ++++--
 gcc/config/nvptx/nvptx.opt |   4 ++
 gcc/doc/invoke.texi        |  14 +++++
 5 files changed, 165 insertions(+), 13 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2dad3e2..9209b47 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -117,6 +117,9 @@ static GTY(()) rtx worker_red_sym;
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 
+/* True if any function references __nvptx_uni.  */
+static bool need_unisimt_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -599,6 +602,33 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize predicate and master lane index registers for
+   -muniform-simt code generation variant.  */
+
+static void
+nvptx_init_unisimt_predicate (FILE *file)
+{
+  int bits = BITS_PER_WORD;
+  int master = REGNO (cfun->machine->unisimt_master);
+  int pred = REGNO (cfun->machine->unisimt_predicate);
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32 %%ustmp0;\n");
+  fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits);
+  fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n");
+  fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n",
+	   bits == 64 ? ".wide" : "");
+  fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits);
+  fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits);
+  fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n");
+  /* rNN = tid.x & __nvptx_uni[tid.y];  */
+  fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master);
+  fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master);
+  fprintf (file, "\t}\n");
+  need_unisimt_decl = true;
+}
+
 /* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
 
    extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
@@ -811,6 +841,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->axis_predicate[1])
     nvptx_init_axis_predicate (file,
 			       REGNO (cfun->machine->axis_predicate[1]), "x");
+  if (cfun->machine->unisimt_predicate)
+    nvptx_init_unisimt_predicate (file);
 }
 
 /* Output a return instruction.  Also copy the return value to its outgoing
@@ -2394,6 +2426,86 @@ nvptx_reorg_subreg (void)
     }
 }
 
+/* Return a SImode "master lane index" register for uniform-simt, allocating on
+   first use.  */
+
+static rtx
+nvptx_get_unisimt_master ()
+{
+  rtx &master = cfun->machine->unisimt_master;
+  return master ? master : master = gen_reg_rtx (SImode);
+}
+
+/* Return a BImode "predicate" register for uniform-simt, similar to above.  */
+
+static rtx
+nvptx_get_unisimt_predicate ()
+{
+  rtx &pred = cfun->machine->unisimt_predicate;
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
+/* Return true if given call insn references one of the functions provided by
+   the CUDA runtime: malloc, free, vprintf.  */
+
+static bool
+nvptx_call_insn_is_syscall_p (rtx_insn *insn)
+{
+  rtx pat = PATTERN (insn);
+  gcc_checking_assert (GET_CODE (pat) == PARALLEL);
+  pat = XVECEXP (pat, 0, 0);
+  if (GET_CODE (pat) == SET)
+    pat = SET_SRC (pat);
+  gcc_checking_assert (GET_CODE (pat) == CALL
+		       && GET_CODE (XEXP (pat, 0)) == MEM);
+  rtx addr = XEXP (XEXP (pat, 0), 0);
+  if (GET_CODE (addr) != SYMBOL_REF)
+    return false;
+  const char *name = XSTR (addr, 0);
+  return (!strcmp (name, "vprintf")
+	  || !strcmp (name, "__nvptx_real_malloc")
+	  || !strcmp (name, "__nvptx_real_free"));
+}
+
+/* If SET subexpression of INSN sets a register, emit a shuffle instruction to
+   propagate its value from lane MASTER to current lane.  */
+
+static void
+nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
+{
+  rtx reg;
+  if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
+    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+}
+
+/* Adjust code for uniform-simt code generation variant by making atomics and
+   "syscalls" conditionally executed, and inserting shuffle-based propagation
+   for registers being set.  */
+
+static void
+nvptx_reorg_uniform_simt ()
+{
+  rtx_insn *insn, *next;
+
+  for (insn = get_insns (); insn; insn = next)
+    {
+      next = NEXT_INSN (insn);
+      if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn))
+	  && !(NONJUMP_INSN_P (insn)
+	       && GET_CODE (PATTERN (insn)) == PARALLEL
+	       && get_attr_divergent (insn)))
+	continue;
+      rtx pat = PATTERN (insn);
+      rtx master = nvptx_get_unisimt_master ();
+      for (int i = 0; i < XVECLEN (pat, 0); i++)
+	nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+      rtx pred = nvptx_get_unisimt_predicate ();
+      pred = gen_rtx_NE (BImode, pred, const0_rtx);
+      pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
+      validate_change (insn, &PATTERN (insn), pat, false);
+    }
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -2872,6 +2984,15 @@ nvptx_wsync (bool after)
   return gen_nvptx_barsync (GEN_INT (after));
 }
 
+/* Return a BImode "axis predicate" register, allocating on first use.  */
+
+static rtx
+nvptx_get_axis_predicate (int axis)
+{
+  rtx &pred = cfun->machine->axis_predicate[axis];
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
 /* Single neutering according to MASK.  FROM is the incoming block and
    TO is the outgoing block.  These may be the same block. Insert at
    start of FROM:
@@ -2956,14 +3077,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
     if (GOMP_DIM_MASK (mode) & skip_mask)
       {
 	rtx_code_label *label = gen_label_rtx ();
-	rtx pred = cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER];
-
-	if (!pred)
-	  {
-	    pred = gen_reg_rtx (BImode);
-	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
-	  }
-	
+	rtx pred = nvptx_get_axis_predicate (mode - GOMP_DIM_WORKER);
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
@@ -3202,6 +3316,9 @@ nvptx_reorg (void)
   /* Replace subregs.  */
   nvptx_reorg_subreg ();
 
+  if (TARGET_UNIFORM_SIMT)
+    nvptx_reorg_uniform_simt ();
+
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -3379,6 +3496,11 @@ nvptx_file_end (void)
       fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n;",
 	       BITS_PER_WORD);
     }
+  if (need_unisimt_decl)
+    {
+      fprintf (asm_out_file, "// BEGIN GLOBAL VAR DECL: __nvptx_uni\n");
+      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n;");
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index db8e201..1c605df 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -33,6 +33,8 @@
       builtin_define ("__nvptx__");		\
       if (TARGET_SOFT_STACK)			\
         builtin_define ("__nvptx_softstack__");	\
+      if (TARGET_UNIFORM_SIMT)			\
+        builtin_define ("__nvptx_unisimt__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -234,6 +236,8 @@ struct GTY(()) machine_function
   int ret_reg_mode; /* machine_mode not defined yet. */
   int punning_buffer_size;
   rtx axis_predicate[2];
+  rtx unisimt_master;
+  rtx unisimt_predicate;
 };
 #endif
 \f
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 5ce7a89..f0fc02c 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -75,6 +75,9 @@ (define_c_enum "unspecv" [
 (define_attr "subregs_ok" "false,true"
   (const_string "false"))
 
+(define_attr "divergent" "false,true"
+  (const_string "false"))
+
 (define_predicate "nvptx_register_operand"
   (match_code "reg,subreg")
 {
@@ -1519,7 +1522,8 @@ (define_insn "atomic_compare_and_swap<mode>_1"
    (set (match_dup 1)
 	(unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))]
   ""
-  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;")
+  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_exchange<mode>"
   [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")	;; output
@@ -1530,7 +1534,8 @@ (define_insn "atomic_exchange<mode>"
    (set (match_dup 1)
 	(match_operand:SDIM 2 "nvptx_register_operand" "R"))]	;; input
   ""
-  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;")
+  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_add<mode>"
   [(set (match_operand:SDIM 1 "memory_operand" "+m")
@@ -1542,7 +1547,8 @@ (define_insn "atomic_fetch_add<mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_addsf"
   [(set (match_operand:SF 1 "memory_operand" "+m")
@@ -1554,7 +1560,8 @@ (define_insn "atomic_fetch_addsf"
    (set (match_operand:SF 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_code_iterator any_logic [and ior xor])
 (define_code_attr logic [(and "and") (ior "or") (xor "xor")])
@@ -1570,7 +1577,8 @@ (define_insn "atomic_fetch_<logic><mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   "0"
-  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "nvptx_barsync"
   [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 7ab09b9..47e811e 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@ Link in code for a __main kernel.
 msoft-stack
 Target Report Mask(SOFT_STACK)
 Use custom stacks instead of local memory for automatic storage.
+
+muniform-simt
+Target Report Mask(UNIFORM_SIMT)
+Generate code that executes all threads in a warp as if one was active.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 6e45fb6..46cd2e9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18942,6 +18942,20 @@ in shared memory array @code{char *__nvptx_stacks[]} at position @code{tid.y}
 as the stack pointer.  This is for placing automatic variables into storage
 that can be accessed from other threads, or modified with atomic instructions.
 
+@item -muniform-simt
+@opindex muniform-simt
+Switch to a code generation variant that executes all threads in each
+warp, while maintaining memory state and side effects as if only one thread
+in each warp were active outside of OpenMP SIMD regions.  All atomic operations
+and calls to runtime (malloc, free, vprintf) are conditionally executed (iff
+current lane index equals the master lane index), and the register being
+assigned is copied via a shuffle instruction from the master lane.  Outside of
+SIMD regions lane 0 is the master; inside, each thread sees itself as the
+master.  Shared memory array @code{int __nvptx_uni[]} stores all-zeros or
+all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD
+regions).  Each thread can bitwise-and the bitmask at position @code{tid.y}
+with current lane index to compute the master lane index.
+
 @end table
 
 @node PDP-11 Options

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [gomp-nvptx 5/9] new target hook: TARGET_SIMT_VF
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
@ 2015-12-01 15:28 ` Alexander Monakov
  2015-12-01 15:28 ` [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets Alexander Monakov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

This patch adds a new target hook and implements it in a straightforward
manner on NVPTX to indicate that the target runs in SIMT fashion with 32
threads in a synchronous group ("warp").  It is intended for use in OpenMP
transforms.
---
 gcc/config/nvptx/nvptx.c | 12 ++++++++++++
 gcc/doc/tm.texi          |  4 ++++
 gcc/doc/tm.texi.in       |  2 ++
 gcc/target.def           | 12 ++++++++++++
 4 files changed, 30 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 48ee96e..eb3b67e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3684,10 +3684,19 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
     }
 }
 \f
+
 /* Define dimension sizes for known hardware.  */
 #define PTX_VECTOR_LENGTH 32
 #define PTX_WORKER_LENGTH 32
 
+/* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
+
+static int
+nvptx_simt_vf ()
+{
+  return PTX_VECTOR_LENGTH;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  */
@@ -4258,6 +4267,9 @@ nvptx_goacc_reduction (gcall *call)
 #undef  TARGET_BUILTIN_DECL
 #define TARGET_BUILTIN_DECL nvptx_builtin_decl
 
+#undef TARGET_SIMT_VF
+#define TARGET_SIMT_VF nvptx_simt_vf
+
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index f394db7..e54944d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5765,6 +5765,10 @@ usable.  In that case, the smaller the number is, the more desirable it is
 to use it.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_SIMT_VF (void)
+Return number of threads in SIMT thread group on the target.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
 This hook should check the launch dimensions provided for an OpenACC
 compute region, or routine.  Defaulted values are represented as -1
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index d188c57..44ba697c 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4260,6 +4260,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_SIMD_CLONE_USABLE
 
+@hook TARGET_SIMT_VF
+
 @hook TARGET_GOACC_VALIDATE_DIMS
 
 @hook TARGET_GOACC_DIM_LIMIT
diff --git a/gcc/target.def b/gcc/target.def
index c7ec292..f5a03d6 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1639,6 +1639,18 @@ int, (struct cgraph_node *), NULL)
 
 HOOK_VECTOR_END (simd_clone)
 
+/* Functions relating to OpenMP SIMT vectorization transform.  */
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_SIMT_"
+HOOK_VECTOR (TARGET_SIMT, simt)
+
+DEFHOOK
+(vf,
+"Return number of threads in SIMT thread group on the target.",
+int, (void), NULL)
+
+HOOK_VECTOR_END (simt)
+
 /* Functions relating to openacc.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_GOACC_"

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
                   ` (4 preceding siblings ...)
  2015-12-01 15:28 ` [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib Alexander Monakov
@ 2015-12-01 15:28 ` Alexander Monakov
  2015-12-01 15:56   ` Bernd Schmidt
  2015-12-02 11:02   ` Jakub Jelinek
  2015-12-01 15:28 ` [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant Alexander Monakov
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

Here's how I've updated gomp_nvptx_main to set up the shared memory arrays
__nvptx_stacks and __nvptx_uni for -mgomp.  Since it makes sense only for the
-mgomp multilib, I've wrapped the whole file in an #ifdef that checks the
corresponding built-in macros.

Reaching those shared memory arrays is awkward.  I cannot declare them with
toplevel asms because the compiler implicitly declares them too, and ptxas
does not handle duplicate declarations.  Ideally I'd like to be able to say:

    extern char *__shared __nvptx_stacks[32];

Bernd, is your position on exposing shared memory as first-class address space
on NVPTX subject to change?  Do you remember what middle-end issues you've
encountered when trying that?

	* config/nvptx/team.c (gomp_nvptx_main): Rename to...
	(gomp_nvptx_main_1): ... this and mark noinline.
	(gomp_nvptx_main): Wrap the above, set up __nvptx_uni and
	__nvptx_stacks.
---
 libgomp/config/nvptx/team.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
index 88d1d34..deb0860 100644
--- a/libgomp/config/nvptx/team.c
+++ b/libgomp/config/nvptx/team.c
@@ -24,6 +24,8 @@
 
 /* This file handles the maintainence of threads on NVPTX.  */
 
+#if defined __nvptx_softstack__ && defined __nvptx_unisimt__
+
 #include "libgomp.h"
 #include <stdlib.h>
 
@@ -31,15 +33,9 @@ struct gomp_thread *nvptx_thrs;
 
 static void gomp_thread_start (struct gomp_thread_pool *);
 
-void
-gomp_nvptx_main (void (*fn) (void *), void *fn_data)
+static void __attribute__((noinline))
+gomp_nvptx_main_1 (void (*fn) (void *), void *fn_data, int ntids, int tid)
 {
-  int ntids, tid, laneid;
-  asm ("mov.u32 %0, %%laneid;" : "=r" (laneid));
-  if (laneid)
-    return;
-  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
-  asm ("mov.u32 %0, %%ntid.y;" : "=r"(ntids));
   if (tid == 0)
     {
       gomp_global_icv.nthreads_var = ntids;
@@ -72,6 +68,30 @@ gomp_nvptx_main (void (*fn) (void *), void *fn_data)
     }
 }
 
+void
+gomp_nvptx_main (void (*fn) (void *), void *fn_data)
+{
+  int tid, ntids;
+  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
+  asm ("mov.u32 %0, %%ntid.y;" : "=r"(ntids));
+  char *stacks = 0;
+  int *__nvptx_uni;
+  asm ("cvta.shared.u64 %0, __nvptx_uni;" : "=r" (__nvptx_uni));
+  __nvptx_uni[tid] = 0;
+  if (tid == 0)
+    {
+      size_t stacksize = 131072;
+      stacks = gomp_malloc (stacksize * ntids);
+      char **__nvptx_stacks = 0;
+      asm ("cvta.shared.u64 %0, __nvptx_stacks;" : "=r" (__nvptx_stacks));
+      for (int i = 0; i < ntids; i++)
+	__nvptx_stacks[i] = stacks + stacksize * (i + 1);
+    }
+  asm ("bar.sync 0;");
+  gomp_nvptx_main_1 (fn, fn_data, ntids, tid);
+  free (stacks);
+}
+
 /* This function is a pthread_create entry point.  This contains the idle
    loop in which a thread waits to be called up to become part of a team.  */
 
@@ -160,3 +180,4 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 }
 
 #include "../../team.c"
+#endif

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD
@ 2015-12-01 15:28 Alexander Monakov
  2015-12-01 15:28 ` [gomp-nvptx 5/9] new target hook: TARGET_SIMT_VF Alexander Monakov
                   ` (8 more replies)
  0 siblings, 9 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

Hello!

This patch series shows how I'm approaching OpenMP SIMD for NVPTX.  It looks
good in both check-c and libgomp testing, including the new target-3x.c
cases (but for-5.c fails to run due to resource exhaustion; maybe it should be
split for NVPTX -- I will investigate more later).

The previously posted patch to handle 'omp_data_o' is no longer necessary with
soft-stacks.

Looking forward to your comments.

Alexander

  nvptx backend: allow emitting COND_EXEC insns
  nvptx backend: new "uniform SIMT" codegen variant
  nvptx backend: add two more identifier maps
  nvptx backend: add -mgomp option and multilib
  new target hook: TARGET_SIMT_VF
  nvptx libgcc: rewrite in C
  nvptx mkoffload: pass -mgomp for OpenMP offloading
  libgomp: update gomp_nvptx_main for -mgomp
  adjust SIMD loop lowering for SIMT targets

 gcc/config/nvptx/mkoffload.c       |   7 ++
 gcc/config/nvptx/nvptx.c           | 181 ++++++++++++++++++++++++++++++++-----
 gcc/config/nvptx/nvptx.h           |   4 +
 gcc/config/nvptx/nvptx.md          |  61 +++++++++----
 gcc/config/nvptx/nvptx.opt         |   8 ++
 gcc/config/nvptx/t-nvptx           |   2 +
 gcc/doc/invoke.texi                |  19 ++++
 gcc/doc/tm.texi                    |   4 +
 gcc/doc/tm.texi.in                 |   2 +
 gcc/internal-fn.c                  |  22 +++++
 gcc/internal-fn.def                |   2 +
 gcc/omp-low.c                      | 138 ++++++++++++++++++++++++++--
 gcc/passes.def                     |   1 +
 gcc/target.def                     |  12 +++
 gcc/tree-pass.h                    |   2 +
 libgcc/config/nvptx/crt0.c         |  61 +++++++++++++
 libgcc/config/nvptx/crt0.s         |  54 -----------
 libgcc/config/nvptx/free.asm       |  50 ----------
 libgcc/config/nvptx/free.c         |  34 +++++++
 libgcc/config/nvptx/malloc.asm     |  55 -----------
 libgcc/config/nvptx/malloc.c       |  35 +++++++
 libgcc/config/nvptx/nvptx-malloc.h |   5 +
 libgcc/config/nvptx/realloc.c      |   2 +
 libgcc/config/nvptx/stacks.c       |  30 ++++++
 libgcc/config/nvptx/t-nvptx        |  11 ++-
 libgomp/config/nvptx/team.c        |  37 ++++++--
 26 files changed, 622 insertions(+), 217 deletions(-)
 create mode 100644 libgcc/config/nvptx/crt0.c
 delete mode 100644 libgcc/config/nvptx/crt0.s
 delete mode 100644 libgcc/config/nvptx/free.asm
 create mode 100644 libgcc/config/nvptx/free.c
 delete mode 100644 libgcc/config/nvptx/malloc.asm
 create mode 100644 libgcc/config/nvptx/malloc.c
 create mode 100644 libgcc/config/nvptx/stacks.c

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
  2015-12-01 15:28 ` [gomp-nvptx 5/9] new target hook: TARGET_SIMT_VF Alexander Monakov
@ 2015-12-01 15:28 ` Alexander Monakov
  2015-12-01 22:40   ` Alexander Monakov
  2015-12-02 11:48   ` Jakub Jelinek
  2015-12-01 15:28 ` [gomp-nvptx 7/9] nvptx mkoffload: pass -mgomp for OpenMP offloading Alexander Monakov
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

This is incomplete.

This handles OpenMP SIMD for NVPTX in simple cases, partly by punting on
anything unusual such as simduid loops, partly by getting lucky, as testcases
do not expose the missing bits.

What it currently does is transform SIMD loop

  for (V = N1; V cmp N2; V + STEP) BODY;

into

  for (V = N1 + (STEP * LANE); V cmp N2; V + (STEP * VF)) BODY;

and then folding LANE/VF to 0/1 on non-NVPTX post-ipa.

To make it proper, I'll need to handle SIMDUID loops (still thinking how best
to approach that) and SAFELEN (which simply needs a conditional jump around
the loop, "if (LANE >= SAFELEN)").  Handling collapsed loops eventually would
be nice too.

Also, it needs something like __nvptx_{enter/exit}_simd() calls around the
loop, to switch from uniform to non-uniform SIMT execution (set bitmask in
__nvptx_uni from 0 to -1, and back on exit), and to switch from per-warp
soft-stacks to per-hwthread hard-stacks (by reserving a small area in .local
memory, and setting __nvptx_stacks[] pointer to top of that area).

Also, since SIMD regions should run on per-hwthread stacks, I'm thinking I'll
have to outline the loop into its own function.  Can I do that post-ipa
easily?
---
 gcc/internal-fn.c   |  22 +++++++++
 gcc/internal-fn.def |   2 +
 gcc/omp-low.c       | 138 +++++++++++++++++++++++++++++++++++++++++++++++++---
 gcc/passes.def      |   1 +
 gcc/tree-pass.h     |   2 +
 5 files changed, 158 insertions(+), 7 deletions(-)

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index a3c4a90..3189e96 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -142,6 +142,28 @@ expand_ANNOTATE (gcall *)
   gcc_unreachable ();
 }
 
+/* Lane index on SIMT targets: thread index in the warp on NVPTX.  On targets
+   without SIMT execution this should be expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_LANE (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  /* FIXME: use a separate pattern for OpenMP?  */
+  gcc_assert (targetm.have_oacc_dim_pos ());
+  emit_insn (targetm.gen_oacc_dim_pos (target, const2_rtx));
+}
+
+/* This should get expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_VF (gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* This should get expanded in adjust_simduid_builtins.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1cb14a8..66c7422 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
 
 DEF_INTERNAL_FN (LOAD_LANES, ECF_CONST | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (STORE_LANES, ECF_CONST | ECF_LEAF, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_VF, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index cc0435e..51ac0e5 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10173,7 +10173,7 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
 				  OMP_CLAUSE_SAFELEN);
   tree simduid = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 				  OMP_CLAUSE__SIMDUID_);
-  tree n1, n2;
+  tree n1, n2, step;
 
   type = TREE_TYPE (fd->loop.v);
   entry_bb = region->entry;
@@ -10218,12 +10218,37 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
 
   n1 = fd->loop.n1;
   n2 = fd->loop.n2;
+  step = fd->loop.step;
+  bool do_simt_transform
+    = (cgraph_node::get (current_function_decl)->offloadable
+       && !broken_loop
+       && !safelen
+       && !simduid
+       && !(fd->collapse > 1));
+  if (do_simt_transform)
+    {
+      tree simt_lane
+	= build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_LANE,
+					integer_type_node, 0);
+      simt_lane = fold_convert (TREE_TYPE (step), simt_lane);
+      simt_lane = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_lane);
+      cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+    }
+
   if (gimple_omp_for_combined_into_p (fd->for_stmt))
     {
       tree innerc = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 				     OMP_CLAUSE__LOOPTEMP_);
       gcc_assert (innerc);
       n1 = OMP_CLAUSE_DECL (innerc);
+      if (do_simt_transform)
+	{
+	  n1 = fold_convert (type, n1);
+	  if (POINTER_TYPE_P (type))
+	    n1 = fold_build_pointer_plus (n1, simt_lane);
+	  else
+	    n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, simt_lane));
+	}
       innerc = find_omp_clause (OMP_CLAUSE_CHAIN (innerc),
 				OMP_CLAUSE__LOOPTEMP_);
       gcc_assert (innerc);
@@ -10239,8 +10264,15 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
     }
   else
     {
-      expand_omp_build_assign (&gsi, fd->loop.v,
-			       fold_convert (type, fd->loop.n1));
+      if (do_simt_transform)
+	{
+	  n1 = fold_convert (type, n1);
+	  if (POINTER_TYPE_P (type))
+	    n1 = fold_build_pointer_plus (n1, simt_lane);
+	  else
+	    n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, simt_lane));
+	}
+      expand_omp_build_assign (&gsi, fd->loop.v, fold_convert (type, n1));
       if (fd->collapse > 1)
 	for (i = 0; i < fd->collapse; i++)
 	  {
@@ -10262,10 +10294,18 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
       stmt = gsi_stmt (gsi);
       gcc_assert (gimple_code (stmt) == GIMPLE_OMP_CONTINUE);
 
+      if (do_simt_transform)
+	{
+	  tree simt_vf
+	    = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_VF,
+					    integer_type_node, 0);
+	  simt_vf = fold_convert (TREE_TYPE (step), simt_vf);
+	  step = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_vf);
+	}
       if (POINTER_TYPE_P (type))
-	t = fold_build_pointer_plus (fd->loop.v, fd->loop.step);
+	t = fold_build_pointer_plus (fd->loop.v, step);
       else
-	t = fold_build2 (PLUS_EXPR, type, fd->loop.v, fd->loop.step);
+	t = fold_build2 (PLUS_EXPR, type, fd->loop.v, step);
       expand_omp_build_assign (&gsi, fd->loop.v, t);
 
       if (fd->collapse > 1)
@@ -12960,7 +13000,6 @@ expand_omp (struct omp_region *region)
     }
 }
 
-
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
    block BB.  PARENT is the region that contains BB.  If SINGLE_TREE is
    true, the function ends once a single tree is built (otherwise, whole
@@ -16235,7 +16274,7 @@ const pass_data pass_data_lower_omp =
   OPTGROUP_NONE, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_gimple_any, /* properties_required */
-  PROP_gimple_lomp, /* properties_provided */
+  PROP_gimple_lomp | PROP_gimple_lomp_dev, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
   0, /* todo_flags_finish */
@@ -19470,5 +19509,90 @@ make_pass_oacc_device_lower (gcc::context *ctxt)
 {
   return new pass_oacc_device_lower (ctxt);
 }
+\f
+
+/* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
+   VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
+   LANE is kept to be expanded to RTL later on.  */
+
+static unsigned int
+execute_omp_device_lower ()
+{
+  int vf = 1;
+  if (targetm.simt.vf)
+    vf = targetm.simt.vf ();
+  tree vf_tree = build_int_cst (integer_type_node, vf);
+  basic_block bb;
+  gimple_stmt_iterator gsi;
+  FOR_EACH_BB_FN (bb, cfun)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	if (!is_gimple_call (stmt) || !gimple_call_internal_p (stmt))
+	  continue;
+	tree lhs = gimple_call_lhs (stmt), rhs = NULL_TREE;
+	switch (gimple_call_internal_fn (stmt))
+	  {
+	  case IFN_GOMP_SIMT_LANE:
+	    rhs = vf == 1 ? integer_zero_node : NULL_TREE;
+	    break;
+	  case IFN_GOMP_SIMT_VF:
+	    rhs = vf_tree;
+	    break;
+	  default:
+	    break;
+	  }
+	if (!rhs)
+	  continue;
+	stmt = gimple_build_assign (lhs, rhs);
+	gsi_replace (&gsi, stmt, false);
+      }
+  if (vf != 1)
+    cfun->has_force_vectorize_loops = false;
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_omp_device_lower =
+{
+  GIMPLE_PASS, /* type */
+  "ompdevlow", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  PROP_gimple_lomp_dev, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_omp_device_lower : public gimple_opt_pass
+{
+public:
+  pass_omp_device_lower (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_omp_device_lower, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fun)
+    {
+      /* FIXME: inlining does not propagate the lomp_dev property.  */
+      return 1 || !(fun->curr_properties & PROP_gimple_lomp_dev);
+    }
+  virtual unsigned int execute (function *)
+    {
+      return execute_omp_device_lower ();
+    }
+
+}; // class pass_omp_device_lower
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_omp_device_lower (gcc::context *ctxt)
+{
+  return new pass_omp_device_lower (ctxt);
+}
 
 #include "gt-omp-low.h"
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..ec049f8 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -151,6 +151,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
   NEXT_PASS (pass_oacc_device_lower);
+  NEXT_PASS (pass_omp_device_lower);
   NEXT_PASS (pass_all_optimizations);
   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations)
       NEXT_PASS (pass_remove_cgraph_callee_edges);
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..71b2561 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -226,6 +226,7 @@ protected:
 						   of math functions; the
 						   current choices have
 						   been optimized.  */
+#define PROP_gimple_lomp_dev	(1 << 16)	/* done omp_device_lower */
 
 #define PROP_trees \
   (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
@@ -414,6 +415,7 @@ extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_device_lower (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_omp_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_strlen (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_fold_builtins (gcc::context *ctxt);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [gomp-nvptx 3/9] nvptx backend: add two more identifier maps
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
                   ` (6 preceding siblings ...)
  2015-12-01 15:28 ` [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant Alexander Monakov
@ 2015-12-01 15:46 ` Alexander Monakov
  2015-12-01 15:47 ` [gomp-nvptx 1/9] nvptx backend: allow emitting COND_EXEC insns Alexander Monakov
  8 siblings, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

This allows rewriting the libgcc wrappers in C by adding the back-maps
__nvptx_real_malloc -> malloc and __nvptx_real_free -> free.  While I was at
it, I've made the implementation leaner.

	* config/nvptx/nvptx.c (nvptx_name_replacement): Rewrite.  Add
	__nvptx_real_malloc -> malloc and __nvptx_real_free -> free
	replacements.
---
 gcc/config/nvptx/nvptx.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9209b47..3bd3cf7 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -538,14 +538,14 @@ write_function_decl_and_comment (std::stringstream &s, const char *name, const_t
 static const char *
 nvptx_name_replacement (const char *name)
 {
-  if (strcmp (name, "call") == 0)
-    return "__nvptx_call";
-  if (strcmp (name, "malloc") == 0)
-    return "__nvptx_malloc";
-  if (strcmp (name, "free") == 0)
-    return "__nvptx_free";
-  if (strcmp (name, "realloc") == 0)
-    return "__nvptx_realloc";
+  static const char *const replacements[] = {
+    "malloc", "__nvptx_malloc", "free", "__nvptx_free",
+    "realloc", "__nvptx_realloc", "call", "__nvptx_call",
+    "__nvptx_real_malloc", "malloc", "__nvptx_real_free", "free"
+  };
+  for (size_t i = 0; i < ARRAY_SIZE (replacements) / 2; i++)
+    if (!strcmp (name, replacements[2 * i]))
+      return replacements[2 * i + 1];
   return name;
 }
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [gomp-nvptx 1/9] nvptx backend: allow emitting COND_EXEC insns
  2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
                   ` (7 preceding siblings ...)
  2015-12-01 15:46 ` [gomp-nvptx 3/9] nvptx backend: add two more identifier maps Alexander Monakov
@ 2015-12-01 15:47 ` Alexander Monakov
  2015-12-02 13:31   ` Bernd Schmidt
  8 siblings, 1 reply; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 15:47 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

This allows using COND_EXEC patterns on nvptx.  The backend is mostly ready
for that, although I had to fix nvptx_print_operand slightly.  I've also opted
to make calls predicable to simplify the uniform-simt patch, and to that end
I need a small fixup in nvptx_output_call_insn.

RTL if-conversion won't emit COND_EXEC insns, because it is done only after
reload, and nvptx does not perform register allocation.  I need this patch to
create COND_EXEC patterns in the backend during reorg.

	* config/nvptx/nvptx.c (nvptx_output_call_insn): Handle COND_EXEC
	patterns.  Emit instruction predicate.
	(nvptx_print_operand): Unbreak handling of instruction predicates.
	* config/nvptx/nvptx.md (predicable): New attribute.  Generate
	predicated forms via define_cond_exec.
	(br_true): Mark as not predicable.
	(br_false): Ditto.
	(br_true_uni): Ditto.
	(br_false_uni): Ditto.
	(return): Ditto.
	(trap_if_true): Ditto.
	(trap_if_false): Ditto.
	(nvptx_fork): Ditto.
	(nvptx_forked): Ditto.
	(nvptx_joining): Ditto.
	(nvptx_join): Ditto.
	(nvptx_barsync): Ditto.
---
 gcc/config/nvptx/nvptx.c  | 12 +++++++-----
 gcc/config/nvptx/nvptx.md | 43 +++++++++++++++++++++++++++++++------------
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 19445ad..2dad3e2 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1905,6 +1905,8 @@ nvptx_assemble_undefined_decl (FILE *file, const char *name, const_tree decl)
   fprintf (file, ";\n\n");
 }
 
+static void nvptx_print_operand (FILE *, rtx, int);
+
 /* Output INSN, which is a call to CALLEE with result RESULT.  For ptx, this
    involves writing .param declarations and in/out copies into them.  For
    indirect calls, also write the .callprototype.  */
@@ -1916,6 +1918,8 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee)
   static int labelno;
   bool needs_tgt = register_operand (callee, Pmode);
   rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) == COND_EXEC)
+    pat = COND_EXEC_CODE (pat);
   int arg_end = XVECLEN (pat, 0);
   tree decl = NULL_TREE;
 
@@ -1975,6 +1979,7 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee)
 	}
     }
 
+  nvptx_print_operand (asm_out_file, NULL_RTX, '.');
   fprintf (asm_out_file, "\t\tcall ");
   if (result != NULL_RTX)
     fprintf (asm_out_file, "(%%retval_in), ");
@@ -2032,8 +2037,6 @@ nvptx_print_operand_punct_valid_p (unsigned char c)
   return c == '.' || c== '#';
 }
 
-static void nvptx_print_operand (FILE *, rtx, int);
-
 /* Subroutine of nvptx_print_operand; used to print a memory reference X to FILE.  */
 
 static void
@@ -2098,11 +2101,10 @@ nvptx_print_operand (FILE *file, rtx x, int code)
       if (x)
 	{
 	  unsigned int regno = REGNO (XEXP (x, 0));
-	  fputs ("[", file);
+	  fputs ("@", file);
 	  if (GET_CODE (x) == EQ)
 	    fputs ("!", file);
-	  fputs (reg_names [regno], file);
-	  fputs ("]", file);
+	  fprintf (file, "%%r%d", regno);
 	}
       return;
     }
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 7930f8d..5ce7a89 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -226,6 +226,17 @@ (define_predicate "call_operation"
   return true;
 })
 
+(define_attr "predicable" "false,true"
+  (const_string "true"))
+
+(define_cond_exec
+  [(match_operator 0 "predicate_operator"
+      [(match_operand:BI 1 "nvptx_register_operand" "")
+       (match_operand:BI 2 "const0_operand" "")])]
+  ""
+  ""
+  )
+
 (define_constraint "P0"
   "An integer with the value 0."
   (and (match_code "const_int")
@@ -821,7 +832,8 @@ (define_insn "br_true"
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra\\t%l1;")
+  "%j0\\tbra\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 (define_insn "br_false"
   [(set (pc)
@@ -830,7 +842,8 @@ (define_insn "br_false"
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra\\t%l1;")
+  "%J0\\tbra\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 ;; unified conditional branch
 (define_insn "br_true_uni"
@@ -839,7 +852,8 @@ (define_insn "br_true_uni"
 		       UNSPEC_BR_UNIFIED) (const_int 0))
         (label_ref (match_operand 1 "" "")) (pc)))]
   ""
-  "%j0\\tbra.uni\\t%l1;")
+  "%j0\\tbra.uni\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 (define_insn "br_false_uni"
   [(set (pc) (if_then_else
@@ -847,7 +861,8 @@ (define_insn "br_false_uni"
 		       UNSPEC_BR_UNIFIED) (const_int 0))
         (label_ref (match_operand 1 "" "")) (pc)))]
   ""
-  "%J0\\tbra.uni\\t%l1;")
+  "%J0\\tbra.uni\\t%l1;"
+  [(set_attr "predicable" "false")])
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1239,7 +1254,8 @@ (define_insn "return"
   ""
 {
   return nvptx_output_return ();
-})
+}
+  [(set_attr "predicable" "false")])
 
 (define_expand "epilogue"
   [(clobber (const_int 0))]
@@ -1319,14 +1335,16 @@ (define_insn "trap_if_true"
 		(const_int 0))
 	    (const_int 0))]
   ""
-  "%j0 trap;")
+  "%j0 trap;"
+  [(set_attr "predicable" "false")])
 
 (define_insn "trap_if_false"
   [(trap_if (eq (match_operand:BI 0 "nvptx_register_operand" "R")
 		(const_int 0))
 	    (const_int 0))]
   ""
-  "%J0 trap;")
+  "%J0 trap;"
+  [(set_attr "predicable" "false")])
 
 (define_expand "ctrap<mode>4"
   [(trap_if (match_operator 0 "nvptx_comparison_operator"
@@ -1375,28 +1393,28 @@ (define_insn "nvptx_fork"
 		       UNSPECV_FORK)]
   ""
   "// fork %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_forked"
   [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
 		       UNSPECV_FORKED)]
   ""
   "// forked %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_joining"
   [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
 		       UNSPECV_JOINING)]
   ""
   "// joining %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_insn "nvptx_join"
   [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
 		       UNSPECV_JOIN)]
   ""
   "// join %0;"
-)
+  [(set_attr "predicable" "false")])
 
 (define_expand "oacc_fork"
   [(set (match_operand:SI 0 "nvptx_nonmemory_operand" "")
@@ -1558,4 +1576,5 @@ (define_insn "nvptx_barsync"
   [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;")
+  "\\tbar.sync\\t%0;"
+  [(set_attr "predicable" "false")])

* Re: [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp
  2015-12-01 15:28 ` [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp Alexander Monakov
@ 2015-12-01 15:56   ` Bernd Schmidt
  2015-12-02 11:02   ` Jakub Jelinek
  1 sibling, 0 replies; 49+ messages in thread
From: Bernd Schmidt @ 2015-12-01 15:56 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> Bernd, is your position on exposing shared memory as first-class address space
> on NVPTX subject to change?  Do you remember what middle-end issues you've
> encountered when trying that?

TYPE_ADDR_SPACE does not reliably contain the address space.  Patches to
deal with that (rather than fixing it, which Joseph doesn't like) got really
ugly, and I gave up on them.  So please use the patch I sent, which deals
with .shared inside the ptx backend (although I think it may have to be
reworked a little, since Nathan changed the code around recently).


Bernd

* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-01 15:28 ` [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant Alexander Monakov
@ 2015-12-01 16:02   ` Bernd Schmidt
  2015-12-01 16:20     ` Alexander Monakov
  2015-12-07 15:09     ` Nathan Sidwell
  2015-12-02 10:40   ` Jakub Jelinek
  1 sibling, 2 replies; 49+ messages in thread
From: Bernd Schmidt @ 2015-12-01 16:02 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is the same
> as if any single thread was executing that instruction.  Most instructions
> automatically satisfy that: if threads have the same state, then executing an
> arithmetic instruction, normal memory load/store, etc. keep local state the
> same in all threads.
>
> The two exception insn categories are atomics and calls.  For calls, we can
> demand recursively that they uphold this execution model, until we reach
> runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
> atomics.

Didn't we also conclude that address-taking (let's say for stack 
addresses) is also an operation that does not result in the same state?

Have you tried to use the mechanism used for OpenACC? IMO that would be 
a good first step - get things working with fewer changes, and then look 
into optimizing them (ideally for OpenMP and OpenACC both).


Bernd

* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-01 16:02   ` Bernd Schmidt
@ 2015-12-01 16:20     ` Alexander Monakov
  2015-12-07 15:09     ` Nathan Sidwell
  1 sibling, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 16:20 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Tue, 1 Dec 2015, Bernd Schmidt wrote:
> 
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?

This is intended to be used with soft-stacks in OpenMP offloading, and
soft-stacks are per-warp outside of SIMD regions, not private to hwthread.  So
no such problem arises.

(also, I wouldn't phrase it that way -- I wouldn't say that taking address of
a classic .local stack slot desyncs state)

> Have you tried to use the mechanism used for OpenACC? IMO that would be a good
> first step - get things working with fewer changes, and then look into
> optimizing them (ideally for OpenMP and OpenACC both).

I don't think I would have as much success trying to apply the OpenACC
mechanism with the overall direction I'm taking, that is, running with a
slightly modified libgomp port.  The way parallel regions are activated in the
guts of libgomp via GOMP_parallel/gomp_team_start makes things different, for
example.

Alexander

* Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets
  2015-12-01 15:28 ` [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets Alexander Monakov
@ 2015-12-01 22:40   ` Alexander Monakov
  2015-12-02 11:48   ` Jakub Jelinek
  1 sibling, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-01 22:40 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Bernd Schmidt, Dmitry Melnik

Apologies -- a last-minute attempt to clean up and enhance this patch broke it;
a fixed version is below.  The main difference is checking whether we're
transforming a loop that might be executed on the target: checking
decl->offloadable isn't enough, because target region outlining might not have
happened yet; in that case, we need to walk the region tree upwards to check
if any containing region is a target region.

Alexander

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index a3c4a90..3189e96 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -142,6 +142,28 @@ expand_ANNOTATE (gcall *)
   gcc_unreachable ();
 }
 
+/* Lane index on SIMT targets: thread index in the warp on NVPTX.  On targets
+   without SIMT execution this should be expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_LANE (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  /* FIXME: use a separate pattern for OpenMP?  */
+  gcc_assert (targetm.have_oacc_dim_pos ());
+  emit_insn (targetm.gen_oacc_dim_pos (target, const2_rtx));
+}
+
+/* This should get expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_VF (gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* This should get expanded in adjust_simduid_builtins.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1cb14a8..66c7422 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
 
 DEF_INTERNAL_FN (LOAD_LANES, ECF_CONST | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (STORE_LANES, ECF_CONST | ECF_LEAF, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_VF, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index cc0435e..0478b2a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10173,7 +10173,7 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
 				  OMP_CLAUSE_SAFELEN);
   tree simduid = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 				  OMP_CLAUSE__SIMDUID_);
-  tree n1, n2;
+  tree n1, n2, step, simt_lane;
 
   type = TREE_TYPE (fd->loop.v);
   entry_bb = region->entry;
@@ -10218,12 +10218,36 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
 
   n1 = fd->loop.n1;
   n2 = fd->loop.n2;
+  step = fd->loop.step;
+  bool offloaded = cgraph_node::get (current_function_decl)->offloadable;
+  for (struct omp_region *reg = region; !offloaded && reg; reg = reg->outer)
+    offloaded = reg->type == GIMPLE_OMP_TARGET;
+  bool do_simt_transform
+    = offloaded && !broken_loop && !safelen && !simduid && !(fd->collapse > 1);
+  if (do_simt_transform)
+    {
+      simt_lane
+	= build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_LANE,
+					integer_type_node, 0);
+      simt_lane = fold_convert (TREE_TYPE (step), simt_lane);
+      simt_lane = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_lane);
+      cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+    }
+
   if (gimple_omp_for_combined_into_p (fd->for_stmt))
     {
       tree innerc = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 				     OMP_CLAUSE__LOOPTEMP_);
       gcc_assert (innerc);
       n1 = OMP_CLAUSE_DECL (innerc);
+      if (do_simt_transform)
+	{
+	  n1 = fold_convert (type, n1);
+	  if (POINTER_TYPE_P (type))
+	    n1 = fold_build_pointer_plus (n1, simt_lane);
+	  else
+	    n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, simt_lane));
+	}
       innerc = find_omp_clause (OMP_CLAUSE_CHAIN (innerc),
 				OMP_CLAUSE__LOOPTEMP_);
       gcc_assert (innerc);
@@ -10239,8 +10263,15 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
     }
   else
     {
-      expand_omp_build_assign (&gsi, fd->loop.v,
-			       fold_convert (type, fd->loop.n1));
+      if (do_simt_transform)
+	{
+	  n1 = fold_convert (type, n1);
+	  if (POINTER_TYPE_P (type))
+	    n1 = fold_build_pointer_plus (n1, simt_lane);
+	  else
+	    n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, simt_lane));
+	}
+      expand_omp_build_assign (&gsi, fd->loop.v, fold_convert (type, n1));
       if (fd->collapse > 1)
 	for (i = 0; i < fd->collapse; i++)
 	  {
@@ -10262,10 +10293,18 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
       stmt = gsi_stmt (gsi);
       gcc_assert (gimple_code (stmt) == GIMPLE_OMP_CONTINUE);
 
+      if (do_simt_transform)
+	{
+	  tree simt_vf
+	    = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_VF,
+					    integer_type_node, 0);
+	  simt_vf = fold_convert (TREE_TYPE (step), simt_vf);
+	  step = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_vf);
+	}
       if (POINTER_TYPE_P (type))
-	t = fold_build_pointer_plus (fd->loop.v, fd->loop.step);
+	t = fold_build_pointer_plus (fd->loop.v, step);
       else
-	t = fold_build2 (PLUS_EXPR, type, fd->loop.v, fd->loop.step);
+	t = fold_build2 (PLUS_EXPR, type, fd->loop.v, step);
       expand_omp_build_assign (&gsi, fd->loop.v, t);
 
       if (fd->collapse > 1)
@@ -12960,7 +12999,6 @@ expand_omp (struct omp_region *region)
     }
 }
 
-
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
    block BB.  PARENT is the region that contains BB.  If SINGLE_TREE is
    true, the function ends once a single tree is built (otherwise, whole
@@ -16235,7 +16273,7 @@ const pass_data pass_data_lower_omp =
   OPTGROUP_NONE, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_gimple_any, /* properties_required */
-  PROP_gimple_lomp, /* properties_provided */
+  PROP_gimple_lomp | PROP_gimple_lomp_dev, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
   0, /* todo_flags_finish */
@@ -19470,5 +19508,90 @@ make_pass_oacc_device_lower (gcc::context *ctxt)
 {
   return new pass_oacc_device_lower (ctxt);
 }
+\f
+
+/* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
+   VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
+   LANE is kept to be expanded to RTL later on.  */
+
+static unsigned int
+execute_omp_device_lower ()
+{
+  int vf = 1;
+  if (targetm.simt.vf)
+    vf = targetm.simt.vf ();
+  tree vf_tree = build_int_cst (integer_type_node, vf);
+  basic_block bb;
+  gimple_stmt_iterator gsi;
+  FOR_EACH_BB_FN (bb, cfun)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	if (!is_gimple_call (stmt) || !gimple_call_internal_p (stmt))
+	  continue;
+	tree lhs = gimple_call_lhs (stmt), rhs = NULL_TREE;
+	switch (gimple_call_internal_fn (stmt))
+	  {
+	  case IFN_GOMP_SIMT_LANE:
+	    rhs = vf == 1 ? integer_zero_node : NULL_TREE;
+	    break;
+	  case IFN_GOMP_SIMT_VF:
+	    rhs = vf_tree;
+	    break;
+	  default:
+	    break;
+	  }
+	if (!rhs)
+	  continue;
+	stmt = gimple_build_assign (lhs, rhs);
+	gsi_replace (&gsi, stmt, false);
+      }
+  if (vf != 1)
+    cfun->has_force_vectorize_loops = false;
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_omp_device_lower =
+{
+  GIMPLE_PASS, /* type */
+  "ompdevlow", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  PROP_gimple_lomp_dev, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_omp_device_lower : public gimple_opt_pass
+{
+public:
+  pass_omp_device_lower (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_omp_device_lower, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fun)
+    {
+      /* FIXME: inlining does not propagate the lomp_dev property.  */
+      return 1 || !(fun->curr_properties & PROP_gimple_lomp_dev);
+    }
+  virtual unsigned int execute (function *)
+    {
+      return execute_omp_device_lower ();
+    }
+
+}; // class pass_omp_device_lower
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_omp_device_lower (gcc::context *ctxt)
+{
+  return new pass_omp_device_lower (ctxt);
+}
 
 #include "gt-omp-low.h"
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..ec049f8 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -151,6 +151,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
   NEXT_PASS (pass_oacc_device_lower);
+  NEXT_PASS (pass_omp_device_lower);
   NEXT_PASS (pass_all_optimizations);
   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations)
       NEXT_PASS (pass_remove_cgraph_callee_edges);
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..71b2561 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -226,6 +226,7 @@ protected:
 						   of math functions; the
 						   current choices have
 						   been optimized.  */
+#define PROP_gimple_lomp_dev	(1 << 16)	/* done omp_device_lower */
 
 #define PROP_trees \
   (PROP_gimple_any | PROP_gimple_lcf | PROP_gimple_leh | PROP_gimple_lomp)
@@ -414,6 +415,7 @@ extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_device_lower (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_omp_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_strlen (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_fold_builtins (gcc::context *ctxt);

* Re: [gomp-nvptx 6/9] nvptx libgcc: rewrite in C
  2015-12-01 15:28 ` [gomp-nvptx 6/9] nvptx libgcc: rewrite in C Alexander Monakov
@ 2015-12-01 23:52   ` Bernd Schmidt
  2015-12-02  0:23     ` Alexander Monakov
  2015-12-07 15:13     ` Nathan Sidwell
  0 siblings, 2 replies; 49+ messages in thread
From: Bernd Schmidt @ 2015-12-01 23:52 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

What exactly is the problem with having asm files? I'm asking because 
this...

On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> +/* __shared__ char *__nvptx_stacks[32];  */
> +asm ("// BEGIN GLOBAL VAR DEF: __nvptx_stacks");
> +asm (".visible .shared .u64 __nvptx_stacks[32];");
> +
> +/* __shared__ unsigned __nvptx_uni[32];  */
> +asm ("// BEGIN GLOBAL VAR DEF: __nvptx_uni");
> +asm (".visible .shared .u32 __nvptx_uni[32];");

... doesn't look great to me. This is better done in assembly directly IMO.


Bernd

* Re: [gomp-nvptx 6/9] nvptx libgcc: rewrite in C
  2015-12-01 23:52   ` Bernd Schmidt
@ 2015-12-02  0:23     ` Alexander Monakov
  2015-12-07 15:13     ` Nathan Sidwell
  1 sibling, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02  0:23 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches, Jakub Jelinek, Dmitry Melnik

On Wed, 2 Dec 2015, Bernd Schmidt wrote:

> What exactly is the problem with having asm files? I'm asking because this...

Wrappers for malloc and free need different code under -muniform-simt.

> 
> On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> > +/* __shared__ char *__nvptx_stacks[32];  */
> > +asm ("// BEGIN GLOBAL VAR DEF: __nvptx_stacks");
> > +asm (".visible .shared .u64 __nvptx_stacks[32];");
> > +
> > +/* __shared__ unsigned __nvptx_uni[32];  */
> > +asm ("// BEGIN GLOBAL VAR DEF: __nvptx_uni");
> > +asm (".visible .shared .u32 __nvptx_uni[32];");
> 
> ... doesn't look great to me. This is better done in assembly directly IMO.

Hm.  I can convert it to asm, but then if/when I start using attribute-based
shared memory, I'd have to move it back to C again, I think.

Thanks.
Alexander

* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-01 15:28 ` [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant Alexander Monakov
  2015-12-01 16:02   ` Bernd Schmidt
@ 2015-12-02 10:40   ` Jakub Jelinek
  2015-12-02 13:02     ` Nathan Sidwell
  1 sibling, 1 reply; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 10:40 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik, Nathan Sidwell,
	Thomas Schwinge

On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is to, outside of "vector" loops, 1) make threads 1-31
> "slaves" which just follow branches without any computation -- that requires
> extra jumps and broadcasting branch predicates, -- and 2) broadcast register
> state and stack state from master to slaves when entering "vector" regions.
> 
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is the same
> as if any single thread was executing that instruction.  Most instructions
> automatically satisfy that: if threads have the same state, then executing an
> arithmetic instruction, normal memory load/store, etc. keep local state the
> same in all threads.

I don't know the HW well enough: is there any power consumption, heat, etc.
difference between the two approaches?  I mean, does the HW consume a different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just jump around it, vs. having all threads busy?

If it is the same, then I think your approach is reasonable, but my
understanding of PTX is limited.

How exactly does OpenACC copy the stack?  At least for OpenMP, one could
have automatic vars whose addresses are passed to simd regions in different
functions, say like:

void
baz (int x, int *arr)
{
  int i;
  #pragma omp simd
  for (i = 0; i < 128; i++)
    arr[i] *= arr[i] + i + x; // Replace with something useful and expensive
}

void
bar (int x)
{
  int arr[128], i;
  for (i = 0; i < 128; i++)
    arr[i] = i + x;
  baz (x, arr);
}
#pragma omp declare target to (bar, baz)

void
foo ()
{
  int i;
  #pragma omp target teams distribute parallel for
  for (i = 0; i < 131072; i++)
    bar (i);
}
and without inlining you don't know if the arr in bar above will be shared
by all SIMD lanes (SIMT in PTX case) or not.

	Jakub

* Re: [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib
  2015-12-01 15:28 ` [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib Alexander Monakov
@ 2015-12-02 10:56   ` Jakub Jelinek
  2015-12-02 14:18     ` Alexander Monakov
  2015-12-03 10:42     ` Alexander Monakov
  0 siblings, 2 replies; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 10:56 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Tue, Dec 01, 2015 at 06:28:22PM +0300, Alexander Monakov wrote:
> Since OpenMP offloading requires both soft-stacks and "uniform SIMT", both
> non-traditional codegen variants, I'm building a multilib variant with those
> enabled.  This patch adds option -mgomp which enables -msoft-stack plus
> -muniform-simt, and builds a multilib with it.
> 
> 	* config/nvptx/nvptx.c (nvptx_option_override): Handle TARGET_GOMP.
> 	* config/nvptx/nvptx.opt (mgomp): New option.
> 	* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
> 	* doc/invoke.texi (mgomp): Document.

I thought the MULTILIB* vars allow you to multilib on none of
-msoft-stack/-muniform-simt and both -msoft-stack/-muniform-simt, without
building other variants, so you wouldn't need this.
Furthermore, as I said, I believe that for e.g. most of newlib libc / libm
it is enough if they are built as -muniform-simt -mno-soft-stack,
if those functions are leaf or don't call user routines that could have
#pragma omp parallel.  -msoft-stack would unnecessarily slow the routines
down.
So perhaps just multilib on -muniform-simt, and document that -muniform-simt
built code requires also that the soft-stack var is set up and thus
-msoft-stack can be used when needed?

Can you post sample code with assembly for -msoft-stack and -muniform-simt
showing how short interesting cases are expanded?
Is there really no way even in direct PTX assembly to have .local file scope
vars (rather than the global arrays indexed by %tid)?

	Jakub

* Re: [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp
  2015-12-01 15:28 ` [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp Alexander Monakov
  2015-12-01 15:56   ` Bernd Schmidt
@ 2015-12-02 11:02   ` Jakub Jelinek
  1 sibling, 0 replies; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 11:02 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Tue, Dec 01, 2015 at 06:28:26PM +0300, Alexander Monakov wrote:
> +void
> +gomp_nvptx_main (void (*fn) (void *), void *fn_data)
> +{
> +  int tid, ntids;
> +  asm ("mov.u32 %0, %%tid.y;" : "=r" (tid));
> +  asm ("mov.u32 %0, %%ntid.y;" : "=r"(ntids));

Formatting (missing space before '(').

	Jakub

* Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets
  2015-12-01 15:28 ` [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets Alexander Monakov
  2015-12-01 22:40   ` Alexander Monakov
@ 2015-12-02 11:48   ` Jakub Jelinek
  2015-12-02 13:54     ` Alexander Monakov
  1 sibling, 1 reply; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 11:48 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Tue, Dec 01, 2015 at 06:28:27PM +0300, Alexander Monakov wrote:
> @@ -10218,12 +10218,37 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
>  
>    n1 = fd->loop.n1;
>    n2 = fd->loop.n2;
> +  step = fd->loop.step;
> +  bool do_simt_transform
> +    = (cgraph_node::get (current_function_decl)->offloadable
> +       && !broken_loop
> +       && !safelen
> +       && !simduid
> +       && !(fd->collapse > 1));

expand_omp expands regions depth-first, so in the case where the simd
region is lexically (directly or indirectly) nested inside of a
target region, the above will not trigger.  You'd need to
use cgraph_node::get (current_function_decl)->offloadable or
just walk up through the outer fields of the region and see if this isn't in
a target region.

Also, please consider privatized variables in the simd loops.
int
foo (int *p)
{
  int r = 0, i;
  #pragma omp simd reduction(+:r)
  for (i = 0; i < 32; i++)
    {
      p[i] += i;
      r += i;
    }
  return r;
}
#pragma omp declare target to (foo)

int
main ()
{
  int p[32], err, i;
  for (i = 0; i < 32; i++)
    p[i] = i;
  #pragma omp target map(tofrom:p) map(from:err)
  {
    int r = 0;
    #pragma omp simd reduction(+:r)
    for (i = 0; i < 32; i++)
    {
      p[i] += i;
      r += i;
    }
    err = r != 31 * 32 / 2;
    err |= foo (p) != 31 * 32 / 2;
  }
  if (err)
    __builtin_abort ();
  for (i = 0; i < 32; i++)
    if (p[i] != 3 * i)
      __builtin_abort ();
  return 0;
}

Here, it would be nice to extend omp_max_vf in the host compiler,
such that if PTX offloading is enabled, and optimize && !optimize_debug
(and vectorizer on the host not disabled, otherwise it won't be cleaned up
on the host), it returns MIN (32, whatever it would return otherwise).
And then arrange for the stores to, and other operations on, the "omp simd
array" attributed arrays before/after the simd loop to be handled specially for
SIMT: basically you want those to be .local; if non-addressable, handled like
any other scalars; the loop up to GOMP_SIMD_LANES run exactly once; and the
various reductions or lastprivate selections reduced the SIMT way, or the value
picked from the thread in the warp that had the last SIMT lane, etc.

> +  if (do_simt_transform)
> +    {
> +      tree simt_lane
> +	= build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_LANE,
> +					integer_type_node, 0);
> +      simt_lane = fold_convert (TREE_TYPE (step), simt_lane);
> +      simt_lane = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_lane);
> +      cfun->curr_properties &= ~PROP_gimple_lomp_dev;

How does this even compile?  simt_lane is a local var in the if
(do_simt_transform) body.
> +    }
> +
>    if (gimple_omp_for_combined_into_p (fd->for_stmt))
>      {
>        tree innerc = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
>  				     OMP_CLAUSE__LOOPTEMP_);
>        gcc_assert (innerc);
>        n1 = OMP_CLAUSE_DECL (innerc);
> +      if (do_simt_transform)
> +	{
> +	  n1 = fold_convert (type, n1);
> +	  if (POINTER_TYPE_P (type))
> +	    n1 = fold_build_pointer_plus (n1, simt_lane);

And then you use it here, outside of its scope.

BTW, again, it would help if you post a simple *.ompexp dump on what exactly
you want to look it up.

	Jakub

* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 10:40   ` Jakub Jelinek
@ 2015-12-02 13:02     ` Nathan Sidwell
  2015-12-02 13:10       ` Jakub Jelinek
  2015-12-02 14:41       ` Alexander Monakov
  0 siblings, 2 replies; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 13:02 UTC (permalink / raw)
  To: Jakub Jelinek, Alexander Monakov
  Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik, Thomas Schwinge

On 12/02/15 05:40, Jakub Jelinek wrote:
>  Don't know the HW good enough, is there any power consumption, heat etc.
> difference between the two approaches?  I mean does the HW consume different
> amount of power if only one thread in a warp executes code and the other
> threads in the same warp just jump around it, vs. having all threads busy?

Having all threads busy will increase power consumption.  It's also bad if the 
other vectors are executing memory access instructions.  However, for small 
blocks, it is probably a win over the jump around approach.  One of the 
optimizations for the future of the neutering algorithm is to add such 
predication for small blocks and keep branching for the larger blocks.

> How exactly does OpenACC copy the stack?  At least for OpenMP, one could
> have automatic vars whose addresses are passed to simd regions in different
> functions, say like:

The stack frame of the current function is copied when entering a partitioned 
region.  (There is no visibility of the caller's frame and such.)  Again, an 
optimization would be to copy only the part of the stack that's used in the 
partitioned region.

nathan


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 13:02     ` Nathan Sidwell
@ 2015-12-02 13:10       ` Jakub Jelinek
  2015-12-02 13:39         ` Nathan Sidwell
  2015-12-02 14:54         ` Alexander Monakov
  2015-12-02 14:41       ` Alexander Monakov
  1 sibling, 2 replies; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 13:10 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> >difference between the two approaches?  I mean does the HW consume different
> >amount of power if only one thread in a warp executes code and the other
> >threads in the same warp just jump around it, vs. having all threads busy?
> 
> Having all threads busy will increase power consumption.  It's also bad if
> the other vectors are executing memory access instructions.  However, for

Then the uniform SIMT approach might not be that good idea.

> small blocks, it is probably a win over the jump around approach.  One of
> the optimizations for the future of the neutering algorithm is to add such
> predication for small blocks and keep branching for the larger blocks.
> 
> >How exactly does OpenACC copy the stack?  At least for OpenMP, one could
> >have automatic vars whose addresses are passed to simd regions in different
> >functions, say like:
> 
> The stack frame of the current function is copied when entering a
> partitioned region.  (There is no visibility of caller's frame and such.)
> Again, optimization would be trying to only copy the stack that's used in
> the partitioned region.

Always the whole stack, from the current stack pointer up to top of the
stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?

	Jakub


* Re: [gomp-nvptx 1/9] nvptx backend: allow emitting COND_EXEC insns
  2015-12-01 15:47 ` [gomp-nvptx 1/9] nvptx backend: allow emitting COND_EXEC insns Alexander Monakov
@ 2015-12-02 13:31   ` Bernd Schmidt
  0 siblings, 0 replies; 49+ messages in thread
From: Bernd Schmidt @ 2015-12-02 13:31 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Jakub Jelinek, Dmitry Melnik

On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> This allows using COND_EXEC patterns on nvptx.  The backend is mostly ready
> for that, although I had to slightly fix nvptx_print_operand.  I've also opted
> to make calls predicable to make the uniform-simt patch simpler, and to that
> end I need a small fixup in nvptx_output_call_insn.
>
> RTL optimization won't emit COND_EXEC insns, because it's done only after
> reload, and register allocation is not done.  I need this patch to create
> COND_EXEC patterns in the backend during reorg.

This looks OK to me (in general, not in the sense of OK for trunk in 
stage 3).


Bernd


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 13:10       ` Jakub Jelinek
@ 2015-12-02 13:39         ` Nathan Sidwell
  2015-12-02 13:46           ` Jakub Jelinek
  2015-12-02 14:54         ` Alexander Monakov
  1 sibling, 1 reply; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 13:39 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On 12/02/15 08:10, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:

> Always the whole stack, from the current stack pointer up to top of the
> stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?

The frame of the current function.  Not the whole stack.  As I said, there's no 
visibility of the stack beyond the current function.  (one could implement some 
kind of chaining, I guess)

PTX does not expose the concept of a stack at all.  No stack pointer, no link 
register, no argument pushing.

It does expose 'local' memory, which is private to a thread and only live during 
a function (not like function-scope 'static').  From that we construct stack frames.

The rules of PTX are such that one can (almost) determine the call graph 
statically.  I don't know whether the JIT implements .local as a stack or 
statically allocates it (and perhaps uses a liveness algorithm to determine 
which pieces may overlap).  Perhaps it depends on the physical device capabilities.

The 'almost' fails with indirect calls, except that
1) at an indirect call, you may specify the static set of fns you know it'll 
resolve to
2) if you don't know that, you have to specify the function prototype anyway. 
So the static set would be 'all functions of that type'.

I don't know if the JIT makes use of that information.
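For reference, PTX expresses those two options with the .calltargets and
.callprototype directives (a sketch following the PTX ISA manual; register
and parameter setup is elided):

```
// 1) known static set of possible targets:
tgt: .calltargets fn_a, fn_b;
     call (%r_ret), %rd_fn, (%r_arg), tgt;

// 2) unknown targets: give only the prototype; the static set is
//    then "all functions of that type":
proto: .callprototype (.param .u32 _) _ (.param .u32 _);
       call (retval0), %rd_fn, (param0), proto;
```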

nathan


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 13:39         ` Nathan Sidwell
@ 2015-12-02 13:46           ` Jakub Jelinek
  2015-12-02 14:00             ` Bernd Schmidt
  2015-12-02 14:14             ` Nathan Sidwell
  0 siblings, 2 replies; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 13:46 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On Wed, Dec 02, 2015 at 08:38:56AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:10, Jakub Jelinek wrote:
> >On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> 
> >Always the whole stack, from the current stack pointer up to top of the
> >stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?
> 
> The frame of the current function.  Not the whole stack.  As I said, there's
> no visibility of the stack beyond the current function.  (one could
> implement some kind of chaining, I guess)

So, how does OpenACC cope with this?

Or does the OpenACC execution model not allow anything like that, i.e.
have some function with an automatic variable pass the address of that
variable to some other function and that other function use #acc loop kind
that expects the caller to be at the worker level and splits the work among
the threads in the warp, on the array section pointed by that passed in
pointer?  See the OpenMP testcase I've posted in this thread.
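(The testcase itself is not quoted here; a minimal hypothetical reduction of
the pattern described -- an automatic variable in one function whose address
is consumed by a simd region in another -- would be:)

```c
#include <assert.h>

/* Hypothetical reduction of the pattern: bar's simd loop works on an
   array that is automatic in foo's frame, so the pointer must refer
   to the same storage for every SIMD lane / thread in the warp.  */
static void
bar (int *p, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)
    p[i] *= 3;
}

static int
foo (void)
{
  int a[64];            /* automatic var; its address escapes below */
  for (int i = 0; i < 64; i++)
    a[i] = i;
  bar (a, 64);
  int sum = 0;
  for (int i = 0; i < 64; i++)
    sum += a[i];
  return sum;           /* 3 * (0 + 1 + ... + 63) */
}
```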

	Jakub


* Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets
  2015-12-02 11:48   ` Jakub Jelinek
@ 2015-12-02 13:54     ` Alexander Monakov
  2015-12-02 14:02       ` Jakub Jelinek
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02 13:54 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> expand_omp is depth-first expansion, so for the case where the simd
> region is in lexically (directly or indirectly) nested inside of a
> target region, the above will not trigger.  You'd need to
> use cgraph_node::get (current_function_decl)->offloadable or
> just walk through outer fields of region up and see if this isn't in
> a target region.

I've addressed this in my follow-up response to this patch.  Again, sorry for
the mishap, I was overconfident when adjusting the patch just before sending.

> Here, it would be nice to extend omp_max_vf in the host compiler,
> such that if PTX offloading is enabled, and optimize && !optimize_debug
> (and vectorizer on the host not disabled, otherwise it won't be cleaned up
> on the host), it returns MIN (32, whatever it would return otherwise).

Did you mean MAX (32, host_vf), not MIN?

> How does this even compile?  simt_lane is a local var in the if
> (do_simt_transform) body.

I addressed this in the reposted patch too, a few hours after posting this
broken code.

> BTW, again, it would help if you post a simple *.ompexp dump on what exactly
> you want to look it up.

Sorry, I'm not following you here -- can you rephrase what I should post?

Thanks.
Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 13:46           ` Jakub Jelinek
@ 2015-12-02 14:00             ` Bernd Schmidt
  2015-12-02 14:14             ` Nathan Sidwell
  1 sibling, 0 replies; 49+ messages in thread
From: Bernd Schmidt @ 2015-12-02 14:00 UTC (permalink / raw)
  To: Jakub Jelinek, Nathan Sidwell
  Cc: Alexander Monakov, gcc-patches, Dmitry Melnik, Thomas Schwinge

On 12/02/2015 02:46 PM, Jakub Jelinek wrote:
> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer?  See the OpenMP testcase I've posted in this thread.

I believe you're making a mistake if you think that the OpenACC 
"specification" considers such cases.


Bernd


* Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets
  2015-12-02 13:54     ` Alexander Monakov
@ 2015-12-02 14:02       ` Jakub Jelinek
  2015-12-02 14:26         ` Alexander Monakov
  0 siblings, 1 reply; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 14:02 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Wed, Dec 02, 2015 at 04:54:39PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > expand_omp is depth-first expansion, so for the case where the simd
> > region is in lexically (directly or indirectly) nested inside of a
> > target region, the above will not trigger.  You'd need to
> > use cgraph_node::get (current_function_decl)->offloadable or
> > just walk through outer fields of region up and see if this isn't in
> > a target region.
> 
> I've addressed this in my follow-up response to this patch.  Again, sorry for
> the mishap, I was overconfident when adjusting the patch just before sending.
> 
> > Here, it would be nice to extend omp_max_vf in the host compiler,
> > such that if PTX offloading is enabled, and optimize && !optimize_debug
> > (and vectorizer on the host not disabled, otherwise it won't be cleaned up
> > on the host), it returns MIN (32, whatever it would return otherwise).
> 
> Did you mean MAX (32, host_vf), not MIN?

Sure, MAX.  Though, if the SIMTification treats "omp simd array" arrays
specially, it probably only cares whether it is > 1 (because 1 disables the
"omp simd array" handling).  If all we want to achieve is that those arrays
in PTX ACCEL_COMPILER become again scalars (or aggregates or whatever they
were before) with each thread in warp writing their own, it doesn't really
care about their size that much.

> > How does this even compile?  simt_lane is a local var in the if
> > (do_simt_transform) body.
> 
> I addressed this in the reposted patch too, a few hours after posting this
> broken code.
> 
> > BTW, again, it would help if you post a simple *.ompexp dump on what exactly
> > you want to look it up.
> 
> Sorry, I'm not following you here -- can you rephrase what I should post?

Just wanted to see the -fdump-tree-ompexp dump, say from the testcase I've
posted.  Does your patchset have any dependencies that aren't on the trunk?
If not, I guess I could just apply the patchset and look at the results, but
if there are, more would need to be applied first.

	Jakub


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 13:46           ` Jakub Jelinek
  2015-12-02 14:00             ` Bernd Schmidt
@ 2015-12-02 14:14             ` Nathan Sidwell
  2015-12-02 14:22               ` Jakub Jelinek
  1 sibling, 1 reply; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 14:14 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On 12/02/15 08:46, Jakub Jelinek wrote:

> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer?  See the OpenMP testcase I've posted in this thread.

There are two cases to consider

1) the caller (& address taker) is already partitioned.  Thus the callers' 
frames are already copied.  The caller takes the address of the object in its 
own frame.

An example would be calling say __mulcd3 where the return value location is 
passed by pointer.

2) the caller is not partitioned and calls a function containing a partitioned 
loop.  The caller takes the address of its instance of the variable.  As part of 
the RTL expansion we have to convert addresses (to be stored in registers) to 
the generic address space.  That conversion creates a pointer that may be used 
by any thread (on the same CTA)[*].  The function call is  executed by all 
threads (they're partially un-neutered before the call).  In the partitioned 
loop, each thread ends up accessing the location in the frame of the original 
calling active thread.

[*]  although .local is private to each thread, it's placed in memory that is 
reachable from anywhere, provided a generic address is used.  Essentially it's 
like TLS and genericization is simply adding the thread pointer to the local 
memory offset to create a generic address.
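That footnote can be modeled with a toy address computation (illustrative
constants only; the real layout is a device/driver detail and is not public):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the .local -> generic conversion described above: each
   thread's local window lives somewhere in generic address space, and
   genericization adds the per-thread base to the local offset.  All
   constants here are made up for illustration.  */
#define LOCAL_WINDOW_BASE 0x10000000u
#define LOCAL_WINDOW_SIZE 0x1000u

static uint32_t
genericize_local (uint32_t tid, uint32_t local_offset)
{
  uint32_t thread_base = LOCAL_WINDOW_BASE + tid * LOCAL_WINDOW_SIZE;
  return thread_base + local_offset;
}
```

Under this model two threads genericizing the same local offset see different
generic addresses; later messages in the thread report that cvta does not in
fact behave this way, with the per-thread adjustment apparently happening at
the load/store instead.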

nathan


* Re: [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib
  2015-12-02 10:56   ` Jakub Jelinek
@ 2015-12-02 14:18     ` Alexander Monakov
  2015-12-03 10:42     ` Alexander Monakov
  1 sibling, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02 14:18 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> I thought the MULTILIB* vars allow you to multilib on none of
> -msoft-stack/-muniform-simt and both -msoft-stack/-muniform-simt, without
> building other variants, so you wouldn't need this.

The nice effect of having -mgomp is better factorization: if I need to change
what OpenMP needs, e.g. for going with your suggestion below and dropping
-msoft-stack, I need to only change one line.  Otherwise I'd have to change
mkoffload too.

> Furthermore, as I said, I believe for e.g. most of newlib libc / libm
> I think it is enough if they are built as -muniform-simt -mno-soft-stack,
> if those functions are leaf or don't call user routines that could have
> #pragma omp parallel.  -msoft-stack would unnecessarily slow the routines
> down.

Not obviously so.  Outside of SIMD regions, running on hard stacks pointlessly
amplifies cache/memory traffic for stack references, so there would have to be
some evaluation before deciding.

> So perhaps just multilib on -muniform-simt, and document that -muniform-simt
> built code requires also that the soft-stack var is set up and thus
> -msoft-stack can be used when needed?

It's an interesting point, but I have doubts.  Is that something you'd want me
to address short-term?

> Can you post sample code with assembly for -msoft-stack and -muniform-simt
> showing how are short interesting cases expanded?
> Is there really no way even in direct PTX assembly to have .local file scope
> vars (rather than the global arrays indexed by %tid)?

Allow me to post samples a bit later; as for .local, the PTX documentation
explicitly states it must not be done:

    5.1.5. Local State Space
    [...]
    When compiling to use the Application Binary Interface (ABI), .local
    state-space variables must be declared within function scope and are
    allocated on the stack. In implementations that do not support a stack,
    all local memory variables are stored at fixed addresses, recursive
    function calls are not supported, and .local variables may be declared at
    module scope. When compiling legacy PTX code (ISA versions prior to 3.0)
    containing module-scoped .local variables, the compiler silently disables
    use of the ABI.

(While I'm unsure how exactly "compiling to use the ABI" is defined,
I'm assuming that's what we want in GCC, as otherwise linking may not work.)

Thanks.
Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 14:14             ` Nathan Sidwell
@ 2015-12-02 14:22               ` Jakub Jelinek
  2015-12-02 14:23                 ` Nathan Sidwell
  0 siblings, 1 reply; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 14:22 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On Wed, Dec 02, 2015 at 09:14:03AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:46, Jakub Jelinek wrote:
> 
> >Or does the OpenACC execution model not allow anything like that, i.e.
> >have some function with an automatic variable pass the address of that
> >variable to some other function and that other function use #acc loop kind
> >that expects the caller to be at the worker level and splits the work among
> >the threads in the warp, on the array section pointed by that passed in
> >pointer?  See the OpenMP testcase I've posted in this thread.
> 
> There are two cases to consider
> 
> 1) the caller (& address taker) is already partitioned.  Thus the callers'
> frames are already copied.  The caller takes the address of the object in
> its own frame.
> 
> An example would be calling say __mulcd3 where the return value location is
> passed by pointer.
> 
> 2) the caller is not partitioned and calls a function containing a
> partitioned loop.  The caller takes the address of its instance of the
> variable.  As part of the RTL expansion we have to convert addresses (to be
> stored in registers) to the generic address space.  That conversion creates
> a pointer that may be used by any thread (on the same CTA)[*].  The function
> call is  executed by all threads (they're partially un-neutered before the
> call).  In the partitioned loop, each thread ends up accessing the location
> in the frame of the original calling active thread.
> 
> [*]  although .local is private to each thread, it's placed in memory that
> is reachable from anywhere, provided a generic address is used.  Essentially
> it's like TLS and genericization is simply adding the thread pointer to the
> local memory offset to create a generic address.

I believe Alex' testing revealed that if you take address of the same .local
objects in several threads, the addresses are the same, and therefore you
refer to your own .local space rather than the other thread's.  Which is why
the -msoft-stack stuff has been added.
Perhaps we need to use it everywhere, at least for OpenMP, and do it
selectively, non-addressable vars can stay .local, addressable vars proven
not to escape to other threads (or other functions that could access them
from other threads) would go to soft stack.
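A hypothetical example of how such an analysis could partition one function's
automatic variables (the classification in the comments is illustrative, not
the output of any existing pass):

```c
#include <assert.h>

/* Callee through which another thread might observe the pointee.  */
static void
use_ptr (int *p)
{
  *p = 7;
}

/* Hypothetical illustration of the proposed split between .local and
   the soft stack.  */
static int
classify_example (int x)
{
  int a = x + 1;        /* non-addressable: stays in a register / .local */
  int b[4] = { 0 };     /* addressable but provably non-escaping:
                           could stay .local */
  b[x & 3] = a;
  int c = 0;            /* address escapes to another function that other
                           threads could run: goes on the soft stack */
  use_ptr (&c);
  return a + b[0] + c;
}
```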

	Jakub


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 14:22               ` Jakub Jelinek
@ 2015-12-02 14:23                 ` Nathan Sidwell
  2015-12-02 14:24                   ` Jakub Jelinek
  0 siblings, 1 reply; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 14:23 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On 12/02/15 09:22, Jakub Jelinek wrote:

> I believe Alex' testing revealed that if you take address of the same .local
> objects in several threads, the addresses are the same, and therefore you
> refer to your own .local space rather than the other thread's.

Before or after applying cvta?

nathan


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 14:23                 ` Nathan Sidwell
@ 2015-12-02 14:24                   ` Jakub Jelinek
  2015-12-02 14:34                     ` Alexander Monakov
  2015-12-02 14:39                     ` Nathan Sidwell
  0 siblings, 2 replies; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 14:24 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> On 12/02/15 09:22, Jakub Jelinek wrote:
> 
> >I believe Alex' testing revealed that if you take address of the same .local
> >objects in several threads, the addresses are the same, and therefore you
> >refer to your own .local space rather than the other thread's.
> 
> Before or after applying cvta?

I'll let Alex answer that.

	Jakub


* Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets
  2015-12-02 14:02       ` Jakub Jelinek
@ 2015-12-02 14:26         ` Alexander Monakov
  0 siblings, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02 14:26 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> Just wanted to see -fdump-tree-ompexp dump say from the testcase I've
> posted.  Does your patchset have any dependencies that aren't on the trunk?
> If not, I guess I just could apply the patchset and look at the results, but
> if there are, it would need applying more.

Hm, the testcase has a reduction, which would cause the loop to have a _SIMDUID
clause, which would in turn make my patch give up, setting do_simt_transform
to false.  So I'm using the presence of SIMDUID to see whether the loop has any
reduction/lastprivate data, which I'm not handling for SIMT yet.

(I should really start a branch)

Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 14:24                   ` Jakub Jelinek
@ 2015-12-02 14:34                     ` Alexander Monakov
  2015-12-02 14:39                     ` Nathan Sidwell
  1 sibling, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02 14:34 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge



On Wed, 2 Dec 2015, Jakub Jelinek wrote:

> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 09:22, Jakub Jelinek wrote:
> > 
> > >I believe Alex' testing revealed that if you take address of the same .local
> > >objects in several threads, the addresses are the same, and therefore you
> > >refer to your own .local space rather than the other thread's.
> > 
> > Before or after applying cvta?
> 
> I'll let Alex answer that.

Both before and after, see this email:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg02081.html

Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 14:24                   ` Jakub Jelinek
  2015-12-02 14:34                     ` Alexander Monakov
@ 2015-12-02 14:39                     ` Nathan Sidwell
  1 sibling, 0 replies; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 14:39 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Alexander Monakov, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On 12/02/15 09:24, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
>> On 12/02/15 09:22, Jakub Jelinek wrote:
>>
>>> I believe Alex' testing revealed that if you take address of the same .local
>>> objects in several threads, the addresses are the same, and therefore you
>>> refer to your own .local space rather than the other thread's.
>>
>> Before or after applying cvta?
>
> I'll let Alex answer that.

Never mind, I've run an experiment, and it appears that local addresses converted 
to generic do give the same value regardless of the executing thread.  I guess that 
means that genericization of local addresses to physical memory is done late, at 
the load/store insn, rather than in the cvta insn.

When I added routine support, I did wonder whether the calling routine would 
need to clone its stack frame, but decided against it using the logic I wrote 
earlier.

nathan


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 13:02     ` Nathan Sidwell
  2015-12-02 13:10       ` Jakub Jelinek
@ 2015-12-02 14:41       ` Alexander Monakov
  2015-12-02 14:43         ` Nathan Sidwell
  1 sibling, 1 reply; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02 14:41 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Jakub Jelinek, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On Wed, 2 Dec 2015, Nathan Sidwell wrote:

> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> > difference between the two approaches?  I mean does the HW consume different
> > amount of power if only one thread in a warp executes code and the other
> > threads in the same warp just jump around it, vs. having all threads busy?
> 
> Having all threads busy will increase power consumption. >

Is that from general principles (i.e. "if it doesn't increase power
consumption, the GPU is poorly optimized"), or is that based on specific
knowledge on how existing GPUs operate (presumably reverse-engineered or
privately communicated -- I've never seen any public statements on this
point)?

The only certain case I imagine is instructions that go to SFU rather than
normal SPs -- but those are relatively rare.

> It's also bad if the other vectors are executing memory access instructions.

How so?  The memory accesses are the same independent of whether you are reading
the same data from 1 thread or from 32 synchronous threads.

Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 14:41       ` Alexander Monakov
@ 2015-12-02 14:43         ` Nathan Sidwell
  0 siblings, 0 replies; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 14:43 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Jakub Jelinek, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On 12/02/15 09:41, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Nathan Sidwell wrote:
>
>> On 12/02/15 05:40, Jakub Jelinek wrote:
>>> Don't know the HW good enough, is there any power consumption, heat etc.
>>> difference between the two approaches?  I mean does the HW consume different
>>> amount of power if only one thread in a warp executes code and the other
>>> threads in the same warp just jump around it, vs. having all threads busy?
>>
>> Having all threads busy will increase power consumption. >
>
> Is that from general principles (i.e. "if it doesn't increase power
> consumption, the GPU is poorly optimized"), or is that based on specific
> knowledge on how existing GPUs operate (presumably reverse-engineered or
> privately communicated -- I've never seen any public statements on this
> point)?

Nvidia told me.

> The only certain case I imagine is instructions that go to SFU rather than
> normal SPs -- but those are relatively rare.
>
>> It's also bad if the other vectors are executing memory access instructions.
>
> How so?  The memory accesses are the same independent of whether you are reading
> the same data from 1 thread or from 32 synchronous threads.

Nvidia told me.


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 13:10       ` Jakub Jelinek
  2015-12-02 13:39         ` Nathan Sidwell
@ 2015-12-02 14:54         ` Alexander Monakov
  2015-12-02 15:12           ` Jakub Jelinek
  1 sibling, 1 reply; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02 14:54 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On Wed, 2 Dec 2015, Jakub Jelinek wrote:

> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > Don't know the HW good enough, is there any power consumption, heat etc.
> > >difference between the two approaches?  I mean does the HW consume different
> > >amount of power if only one thread in a warp executes code and the other
> > >threads in the same warp just jump around it, vs. having all threads busy?
> > 
> > Having all threads busy will increase power consumption.  It's also bad if
> > the other vectors are executing memory access instructions.  However, for
> 
> Then the uniform SIMT approach might not be that good idea.

Why?  Remember that the tradeoff is copying registers (and in OpenACC, stacks
too).  We don't know how the costs balance.  My intuition is that copying is
worse compared to what I'm doing.

Anyhow, for good performance the offloaded code needs to be running in vector
regions most of the time, where the concern doesn't apply.

Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 14:54         ` Alexander Monakov
@ 2015-12-02 15:12           ` Jakub Jelinek
  2015-12-02 15:18             ` Nathan Sidwell
       [not found]             ` <CABtfrpAyUtWub2CBHKYqN0aLNTZ1QspmxyQzOU6Gr+3ogZpSNA@mail.gmail.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 15:12 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Nathan Sidwell, gcc-patches, Bernd Schmidt, Dmitry Melnik,
	Thomas Schwinge

On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> 
> > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > > Don't know the HW good enough, is there any power consumption, heat etc.
> > > >difference between the two approaches?  I mean does the HW consume different
> > > >amount of power if only one thread in a warp executes code and the other
> > > >threads in the same warp just jump around it, vs. having all threads busy?
> > > 
> > > Having all threads busy will increase power consumption.  It's also bad if
> > > the other vectors are executing memory access instructions.  However, for
> > 
> > Then the uniform SIMT approach might not be that good idea.
> 
> Why?  Remember that the tradeoff is copying registers (and in OpenACC, stacks
> too).  We don't know how the costs balance.  My intuition is that copying is
> worse compared to what I'm doing.
> 
> Anyhow, for good performance the offloaded code needs to be running in vector
> regions most of the time, where the concern doesn't apply.

But you never know if people actually use #pragma omp simd regions or not,
sometimes they will, sometimes they won't, and if the uniform SIMT increases
power consumption, it might not be desirable.

If we have a reasonable IPA pass to discover which addressable variables can
be shared by multiple threads and which can't, then we could use soft-stack
for those that can be shared by multiple PTX threads (different warps, or
same warp, different threads in it), then we shouldn't need to copy any
stack, just broadcast the scalar vars.

	Jakub


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 15:12           ` Jakub Jelinek
@ 2015-12-02 15:18             ` Nathan Sidwell
       [not found]             ` <CABtfrpAyUtWub2CBHKYqN0aLNTZ1QspmxyQzOU6Gr+3ogZpSNA@mail.gmail.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 15:18 UTC (permalink / raw)
  To: Jakub Jelinek, Alexander Monakov
  Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik, Thomas Schwinge

On 12/02/15 10:12, Jakub Jelinek wrote:

> If we have a reasonable IPA pass to discover which addressable variables can
> be shared by multiple threads and which can't, then we could use soft-stack
> for those that can be shared by multiple PTX threads (different warps, or
> same warp, different threads in it), then we shouldn't need to copy any
> stack, just broadcast the scalar vars.

Note that the current scalar (.reg) broadcasting uses the live register set, not 
the subset of it that is actually read within the partitioned region.  Narrowing 
that would be a relatively straightforward optimization, I think.

nathan


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
       [not found]             ` <CABtfrpAyUtWub2CBHKYqN0aLNTZ1QspmxyQzOU6Gr+3ogZpSNA@mail.gmail.com>
@ 2015-12-02 16:36               ` Jakub Jelinek
  2015-12-02 17:09                 ` Nathan Sidwell
  2015-12-02 17:09                 ` Alexander Monakov
  0 siblings, 2 replies; 49+ messages in thread
From: Jakub Jelinek @ 2015-12-02 16:36 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Nathan Sidwell, Alexander Monakov, Thomas Schwinge,
	Bernd Schmidt, gcc-patches, Dmitry Melnik

On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
> > But you never know if people actually use #pragma omp simd regions or not,
> > sometimes they will, sometimes they won't, and if the uniform SIMT increases
> > power consumption, it might not be desirable.
> 
> It's easy to address: just terminate threads 1-31 if the linked image has
> no SIMD regions, like my pre-simd libgomp was doing.

Well, can't the linked image in one shared library call a function
in another linked image in another shared library?  Or is that just not
supported for PTX?  I believe XeonPhi supports that.

If each linked image is self-contained, then that is probably a good idea,
but still you could have a single simd region somewhere and lots of other
target regions that don't use simd, or cases where only a small amount of time
is spent in a simd region, and this wouldn't help in that case.

If the addressables are handled through soft stack, then the rest is mostly
just SSA_NAMEs you can see on the edges of the SIMT region, that really
shouldn't be that expensive to broadcast or reduce back.

	Jakub


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 16:36               ` Jakub Jelinek
  2015-12-02 17:09                 ` Nathan Sidwell
@ 2015-12-02 17:09                 ` Alexander Monakov
  2015-12-02 17:20                   ` Nathan Sidwell
  1 sibling, 1 reply; 49+ messages in thread
From: Alexander Monakov @ 2015-12-02 17:09 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, Thomas Schwinge, Bernd Schmidt, gcc-patches,
	Dmitry Melnik

On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > It's easy to address: just terminate threads 1-31 if the linked image has
> > no SIMD regions, like my pre-simd libgomp was doing.
> 
> Well, can't the linked image in one shared library call a function
> in another linked image in another shared library?  Or is that just not
> supported for PTX?  I believe XeonPhi supports that.

I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
__nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.

> If each linked image is self-contained, then that is probably a good idea,
> but still you could have a single simd region somewhere and lots of other
> target regions that don't use simd, or cases where only a small amount of time
> is spent in a simd region, and this wouldn't help in that case.

Should we actually be much concerned about optimizing this case, which
is unlikely to run faster than the host CPU in the first place?

> If the addressables are handled through soft stack, then the rest is mostly
> just SSA_NAMEs you can see on the edges of the SIMT region, that really
> shouldn't be that expensive to broadcast or reduce back.

That's not enough: you have to reach the SIMD region entry in threads 1-31,
which means they need to execute all preceding control flow like thread 0,
which means they need to compute controlling predicates like thread 0.
(OpenACC broadcasts controlling predicates at branches)

Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 16:36               ` Jakub Jelinek
@ 2015-12-02 17:09                 ` Nathan Sidwell
  2015-12-02 17:09                 ` Alexander Monakov
  1 sibling, 0 replies; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 17:09 UTC (permalink / raw)
  To: Jakub Jelinek, Alexander Monakov
  Cc: Alexander Monakov, Thomas Schwinge, Bernd Schmidt, gcc-patches,
	Dmitry Melnik

On 12/02/15 11:35, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
>>> But you never know if people actually use #pragma omp simd regions or not,
>>> sometimes they will, sometimes they won't, and if the uniform SIMT increases
>>> power consumption, it might not be desirable.
>>
>> It's easy to address: just terminate threads 1-31 if the linked image has
>> no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function
> in another linked image in another shared library?  Or is that just not
> supported for PTX?  I believe XeonPhi supports that.

I don't believe PTX supports such dynamic loading within the PTX program 
currently being executed.  The JIT compiler can have several PTX 'objects' 
loaded into it before you tell it to go link everything.  At that point all 
symbols must be resolved.  I have no idea how passing a pointer to a 
function in some other 'executable' and calling it might behave -- my suspicion 
is 'badly'.



* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 17:09                 ` Alexander Monakov
@ 2015-12-02 17:20                   ` Nathan Sidwell
  2015-12-03 13:57                     ` Alexander Monakov
  0 siblings, 1 reply; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-02 17:20 UTC (permalink / raw)
  To: Alexander Monakov, Jakub Jelinek
  Cc: Thomas Schwinge, Bernd Schmidt, gcc-patches, Dmitry Melnik

On 12/02/15 12:09, Alexander Monakov wrote:

> I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
> it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
> __nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.

Note I found a bug in .weak support.  See the comment in  gcc.dg/special/weak-2.c

/* NVPTX's implementation of weak is broken when a strong symbol is in
    a later object file than the weak definition.   */

> That's not enough: you have to reach the SIMD region entry in threads 1-31,
> which means they need to execute all preceding control flow like thread 0,
> which means they need to compute controlling predicates like thread 0.
> (OpenACC broadcasts controlling predicates at branches)

Indeed.  Hence the partial 'forking' before a call to a function with 
internal partitioned execution.

nathan


* Re: [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib
  2015-12-02 10:56   ` Jakub Jelinek
  2015-12-02 14:18     ` Alexander Monakov
@ 2015-12-03 10:42     ` Alexander Monakov
  1 sibling, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-03 10:42 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Bernd Schmidt, Dmitry Melnik

On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> Can you post sample code with assembly for -msoft-stack and -muniform-simt
> showing how are short interesting cases expanded?

Here are short examples; please let me know if I've misunderstood and you
wanted something else.

First, the effect of -muniform-simt on this input:

int f (int *p, int v)
{
  return __atomic_exchange_n (p, v, __ATOMIC_SEQ_CST);
}

leads to this assembly (shown as a diff: '-' without, '+' with the option):

 .visible .func (.param.u32 %out_retval)f(.param.u64 %in_ar1, .param.u32 %in_ar2)
 {
        .reg.u64 %ar1;
        .reg.u32 %ar2;
        .reg.u32 %retval;
        .reg.u64 %hr10;
        .reg.u32 %r23;
        .reg.u64 %r25;
        .reg.u32 %r26;
+       .reg.u32 %r28;
+       .reg.pred %r29;
        ld.param.u64 %ar1, [%in_ar1];
        ld.param.u32 %ar2, [%in_ar2];
+       {
+               .reg.u32 %ustmp0;
+               .reg.u64 %ustmp1;
+               .reg.u64 %ustmp2;
+               mov.u32 %ustmp0, %tid.y;
+               mul.wide.u32 %ustmp1, %ustmp0, 4;
+               mov.u64 %ustmp2, __nvptx_uni;
+               add.u64 %ustmp2, %ustmp2, %ustmp1;
+               ld.shared.u32 %r28, [%ustmp2];
+               mov.u32 %ustmp0, %tid.x;
+               and.b32 %r28, %r28, %ustmp0;
+               setp.eq.u32 %r29, %r28, %ustmp0;
+       }
                mov.u64 %r25, %ar1;
                mov.u32 %r26, %ar2;
-               atom.exch.b32   %r23, [%r25], %r26;
+       @%r29   atom.exch.b32   %r23, [%r25], %r26;
+               shfl.idx.b32    %r23, %r23, %r28, 31;
                mov.u32 %retval, %r23;
        st.param.u32    [%out_retval], %retval;
        ret;
        }
+// BEGIN GLOBAL VAR DECL: __nvptx_uni
+.extern .shared .u32 __nvptx_uni[32];

And, -msoft-stack for this input:

void g(void *);
void f()
{
  char a[42] __attribute__((aligned(64)));
  g(a);
}

leads to:

 .visible .func f
 {
        .reg.u64 %hr10;
        .reg.u64 %r22;
        .reg.u64 %frame;
-       .local.align 64 .b8 %farray[48];
-       cvta.local.u64 %frame, %farray;
+       .reg.u32 %fstmp0;
+       .reg.u64 %fstmp1;
+       .reg.u64 %fstmp2;
+       mov.u32 %fstmp0, %tid.y;
+       mul.wide.u32 %fstmp1, %fstmp0, 8;
+       mov.u64 %fstmp2, __nvptx_stacks;
+       add.u64 %fstmp2, %fstmp2, %fstmp1;
+       ld.shared.u64 %fstmp1, [%fstmp2];
+       sub.u64 %frame, %fstmp1, 48;
+       and.b64 %frame, %frame, -64;
+       st.shared.u64 [%fstmp2], %frame;
                mov.u64 %r22, %frame;
        {
                .param.u64 %out_arg0;
                st.param.u64 [%out_arg0], %r22;
                call g, (%out_arg0);
        }
+       st.shared.u64 [%fstmp2], %fstmp1;
        ret;
        }
 // BEGIN GLOBAL FUNCTION DECL: g
 .extern .func g(.param.u64 %in_ar1);
+// BEGIN GLOBAL VAR DECL: __nvptx_stacks
+.extern .shared .u64 __nvptx_stacks[32];


Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-02 17:20                   ` Nathan Sidwell
@ 2015-12-03 13:57                     ` Alexander Monakov
  0 siblings, 0 replies; 49+ messages in thread
From: Alexander Monakov @ 2015-12-03 13:57 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Jakub Jelinek, Thomas Schwinge, Bernd Schmidt, gcc-patches,
	Dmitry Melnik

On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> On 12/02/15 12:09, Alexander Monakov wrote:
> 
> > I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
> > it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
> > __nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.
> 
> Note I found a bug in .weak support.  See the comment in
> gcc.dg/special/weak-2.c
> 
> /* NVPTX's implementation of weak is broken when a strong symbol is in
>    a later object file than the weak definition.   */

Thanks for the warning.  However, the issue seems limited to function symbols:
I've made a test for data symbols, and they appear to work fine -- which
suffices in this context.

Alexander


* Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
  2015-12-01 16:02   ` Bernd Schmidt
  2015-12-01 16:20     ` Alexander Monakov
@ 2015-12-07 15:09     ` Nathan Sidwell
  1 sibling, 0 replies; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-07 15:09 UTC (permalink / raw)
  To: Bernd Schmidt, Alexander Monakov, gcc-patches
  Cc: Jakub Jelinek, Dmitry Melnik

On 12/01/15 11:01, Bernd Schmidt wrote:
> On 12/01/2015 04:28 PM, Alexander Monakov wrote:
>> I'm taking a different approach.  I want to execute all insns in all warp
>> members, while ensuring that effect (on global and local state) is that same
>> as if any single thread was executing that instruction.  Most instructions
>> automatically satisfy that: if threads have the same state, then executing an
>> arithmetic instruction, normal memory load/store, etc. keeps local state the
>> same in all threads.
>>
>> The two exception insn categories are atomics and calls.  For calls, we can
>> demand recursively that they uphold this execution model, until we reach
>> runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
>> atomics.
>
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?
>
> Have you tried to use the mechanism used for OpenACC? IMO that would be a good
> first step - get things working with fewer changes, and then look into
> optimizing them (ideally for OpenMP and OpenACC both).

I would have thought the right approach would be to augment the existing 
neutering code to insert predication (instead of branch-around) using a 
heuristic as to which is the better choice.

nathan


* Re: [gomp-nvptx 6/9] nvptx libgcc: rewrite in C
  2015-12-01 23:52   ` Bernd Schmidt
  2015-12-02  0:23     ` Alexander Monakov
@ 2015-12-07 15:13     ` Nathan Sidwell
  1 sibling, 0 replies; 49+ messages in thread
From: Nathan Sidwell @ 2015-12-07 15:13 UTC (permalink / raw)
  To: Bernd Schmidt, Alexander Monakov, gcc-patches
  Cc: Jakub Jelinek, Dmitry Melnik

On 12/01/15 18:52, Bernd Schmidt wrote:
> What exactly is the problem with having asm files? I'm asking because this...
>
> On 12/01/2015 04:28 PM, Alexander Monakov wrote:
>> +/* __shared__ char *__nvptx_stacks[32];  */
>> +asm ("// BEGIN GLOBAL VAR DEF: __nvptx_stacks");
>> +asm (".visible .shared .u64 __nvptx_stacks[32];");
>> +
>> +/* __shared__ unsigned __nvptx_uni[32];  */
>> +asm ("// BEGIN GLOBAL VAR DEF: __nvptx_uni");
>> +asm (".visible .shared .u32 __nvptx_uni[32];");
>
> ... doesn't look great to me. This is better done in assembly directly IMO.

the decl reworking I recently committed has a 'TODO: this would be a good place 
to check for a .shared section' in it.  That would seem a better place to 
augment things, allowing the above via a regular __attribute__((section ...)).

nathan


end of thread, other threads:[~2015-12-07 15:13 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-01 15:28 [gomp-nvptx 0/9] Codegen bits for NVPTX OpenMP SIMD Alexander Monakov
2015-12-01 15:28 ` [gomp-nvptx 5/9] new target hook: TARGET_SIMT_VF Alexander Monakov
2015-12-01 15:28 ` [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets Alexander Monakov
2015-12-01 22:40   ` Alexander Monakov
2015-12-02 11:48   ` Jakub Jelinek
2015-12-02 13:54     ` Alexander Monakov
2015-12-02 14:02       ` Jakub Jelinek
2015-12-02 14:26         ` Alexander Monakov
2015-12-01 15:28 ` [gomp-nvptx 7/9] nvptx mkoffload: pass -mgomp for OpenMP offloading Alexander Monakov
2015-12-01 15:28 ` [gomp-nvptx 6/9] nvptx libgcc: rewrite in C Alexander Monakov
2015-12-01 23:52   ` Bernd Schmidt
2015-12-02  0:23     ` Alexander Monakov
2015-12-07 15:13     ` Nathan Sidwell
2015-12-01 15:28 ` [gomp-nvptx 4/9] nvptx backend: add -mgomp option and multilib Alexander Monakov
2015-12-02 10:56   ` Jakub Jelinek
2015-12-02 14:18     ` Alexander Monakov
2015-12-03 10:42     ` Alexander Monakov
2015-12-01 15:28 ` [gomp-nvptx 8/9] libgomp: update gomp_nvptx_main for -mgomp Alexander Monakov
2015-12-01 15:56   ` Bernd Schmidt
2015-12-02 11:02   ` Jakub Jelinek
2015-12-01 15:28 ` [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant Alexander Monakov
2015-12-01 16:02   ` Bernd Schmidt
2015-12-01 16:20     ` Alexander Monakov
2015-12-07 15:09     ` Nathan Sidwell
2015-12-02 10:40   ` Jakub Jelinek
2015-12-02 13:02     ` Nathan Sidwell
2015-12-02 13:10       ` Jakub Jelinek
2015-12-02 13:39         ` Nathan Sidwell
2015-12-02 13:46           ` Jakub Jelinek
2015-12-02 14:00             ` Bernd Schmidt
2015-12-02 14:14             ` Nathan Sidwell
2015-12-02 14:22               ` Jakub Jelinek
2015-12-02 14:23                 ` Nathan Sidwell
2015-12-02 14:24                   ` Jakub Jelinek
2015-12-02 14:34                     ` Alexander Monakov
2015-12-02 14:39                     ` Nathan Sidwell
2015-12-02 14:54         ` Alexander Monakov
2015-12-02 15:12           ` Jakub Jelinek
2015-12-02 15:18             ` Nathan Sidwell
     [not found]             ` <CABtfrpAyUtWub2CBHKYqN0aLNTZ1QspmxyQzOU6Gr+3ogZpSNA@mail.gmail.com>
2015-12-02 16:36               ` Jakub Jelinek
2015-12-02 17:09                 ` Nathan Sidwell
2015-12-02 17:09                 ` Alexander Monakov
2015-12-02 17:20                   ` Nathan Sidwell
2015-12-03 13:57                     ` Alexander Monakov
2015-12-02 14:41       ` Alexander Monakov
2015-12-02 14:43         ` Nathan Sidwell
2015-12-01 15:46 ` [gomp-nvptx 3/9] nvptx backend: add two more identifier maps Alexander Monakov
2015-12-01 15:47 ` [gomp-nvptx 1/9] nvptx backend: allow emitting COND_EXEC insns Alexander Monakov
2015-12-02 13:31   ` Bernd Schmidt
